
Mini-Fold — Protein Structure Diffusion Model

Full architecture history, training tables, loss functions, and analysis for the protein backbone diffusion model. View Structures Gallery →

Last updated: Mar 18, 2026 at 9:30 PM PST

FrameDiff v3: Paper-Correct SE(3) Backbone Diffusion

FrameDiff v3 — 24.9M params, 8 layers, CATH 4.3. TRAINING

Training Progress (selected epochs)

| Epoch | L_x | L_r | Rot Err | RMSD | Val Total | Notes |
|---|---|---|---|---|---|---|
| 0 | 2.044 | 0.908 | 77.8° | 12.73 Å | 3.228 | |
| 10 | 1.300 | 0.681 | 68.1° | 9.80 Å | 1.817 | |
| 20 | 1.280 | 0.656 | 66.7° | 9.64 Å | 1.935 | |
| 30 | 1.288 | 0.650 | 66.4° | 9.63 Å | 1.974 | |
| 40 | 1.264 | 0.642 | 65.9° | 9.53 Å | 1.869 | |
| 50 | 1.242 | 0.637 | 65.6° | 9.43 Å | 1.762 | |
| 60 | 1.223 | 0.634 | 65.3° | 9.35 Å | 1.733 | |
| 70 | 1.219 | 0.632 | 65.3° | 9.34 Å | 1.960 | |
| 80 | 1.212 | 0.629 | 65.0° | 9.29 Å | 1.797 | |
| 90 | 1.214 | 0.626 | 64.8° | 9.26 Å | 1.782 | |
| 100 | 1.197 | 349 | 64.8° | 9.18 Å | 1.855 | L_r spike |
| 110 | 1.188 | 0.620 | 64.5° | 9.14 Å | 1.866 | |
| 120 | 1.209 | 0.622 | 65.1° | 9.24 Å | 1.701 | |
| 130 | 1.189 | 0.618 | 64.5° | 9.12 Å | 1.770 | |
| 140 | 1.184 | 0.613 | 64.0° | 9.05 Å | 1.782 | |
| 150 | 1.190 | 0.613 | 64.0° | 9.06 Å | 1.662 | |
| 152 | 1.184 | 0.610 | 63.7° | 8.98 Å | 1.602 | BEST |
| 160 | 1.177 | 0.610 | 63.7° | 9.02 Å | 1.767 | |
| 170 | 1.179 | 0.612 | 64.0° | 9.02 Å | 1.840 | |
| 180 | 1.175 | 0.607 | 63.5° | 8.96 Å | 1.959 | |
| 190 | 1.180 | 0.610 | 63.8° | 9.00 Å | 1.753 | |
| 200 | 1.204 | 0.617 | 64.2° | 9.12 Å | 1.768 | |
| 210 | 1.194 | 0.620 | 64.6° | 9.17 Å | 1.808 | |
| 220 | 1.185 | 0.616 | 64.2° | 9.07 Å | 1.840 | |
| 230 | 1.195 | 0.620 | 64.7° | 9.15 Å | 1.705 | |
| 233 | 1.208 | 0.619 | 64.6° | 9.16 Å | 1.693 | |
| 234 | 1.180 | 0.612 | 63.6° | 9.01 Å | 1.803 | |
| 235 | 1.196 | 0.618 | 64.4° | 9.13 Å | 1.801 | Latest |

Last updated: Mar 20, 2026 at 08:36 AM PST. Epoch 235. Training ongoing (~13 min/epoch). 3 L_r spikes in 29 epochs (IGSO3 score at extreme angles).

Training Curves

Note: L_r spikes at epochs 2, 6, 7 (IGSO3 score instability at extreme angles) are excluded from charts. Val epoch 8 spike also excluded. Model converges stably — gradient clipping + NaN safety prevent divergence.

Loss Functions (Paper Sec. 4.2)

Total = L_x + L_r + 0.25 · 𝟙{t < 0.25} · (L_bb + L_2D)

| Loss | Formula | Description |
|---|---|---|
| L_x | ‖x̂(0) − x(0)‖² | Translation MSE (Cα x₀-prediction) |
| L_r | λ_r(t) · ‖s_pred − s_true‖² | IGSO3 denoising score matching; s = (df/dω)/f · axis; λ_r = 1/E[‖score‖²] |
| L_bb | (1/4N) ∑ ‖â − a‖² | Backbone atom MSE over {N, Cα, C, O}, gated at t < 0.25 |
| L_2D | (1/Z) ∑ 𝟙{d < 6 Å} · (d̂ − d)² | All-atom pairwise distance MSE within a 6 Å cutoff, gated at t < 0.25 |
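As a sanity check, the gated combination above can be sketched in a few lines (function and argument names are ours, not from the training code):

```python
def total_loss(l_x, l_r, l_bb, l_2d, t, t_gate=0.25, w_aux=0.25):
    # Auxiliary backbone / 2D-distance losses only contribute near t = 0
    # (t < 0.25), where atom-level geometry is meaningful.
    gate = 1.0 if t < t_gate else 0.0
    return l_x + l_r + w_aux * gate * (l_bb + l_2d)

assert total_loss(1.0, 0.5, 0.2, 0.1, t=0.1) == 1.575   # aux terms active
assert total_loss(1.0, 0.5, 0.2, 0.1, t=0.5) == 1.5     # aux terms gated off
```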

Model Configuration

Parameters
24.9M (8 layers, d_node=256, d_edge=128)
IPA
8 heads, 8 query points, 12 value points, d_head=16, AF2 w_C constant
Transformer
4 heads × 2 layers, 1024 FFN, Pre-LN, operates on (d_node + d_skip)
Rotation Noise
IGSO3 with log sigma schedule (σmin=0.1, σmax=1.5)
Translation Noise
VP-SDE linear beta (βmin=0.1, βmax=20.0)
Optimizer
Adam, lr=1e-4, warmup 1000 steps, cosine decay → 1e-6
Training
EMA 0.999, grad clip 1.0, batch_size 8, self-conditioning 50%, stop_grad_frames
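The two noise schedules in the table can be written out explicitly. This is a sketch under the usual conventions (log-linear σ(t) for the IGSO3 log-sigma schedule, linear β(t) for the VP-SDE); the exact parameterization in the code may differ:

```python
import math

def igso3_sigma(t, sigma_min=0.1, sigma_max=1.5):
    # Log-linear interpolation: sigma(0) = sigma_min, sigma(1) = sigma_max.
    return sigma_min ** (1.0 - t) * sigma_max ** t

def vp_alpha(t, beta_min=0.1, beta_max=20.0):
    # alpha(t) = exp(-0.5 * integral_0^t beta(s) ds) for linear beta(s).
    integral = beta_min * t + 0.5 * t * t * (beta_max - beta_min)
    return math.exp(-0.5 * integral)

# Forward perturbation: x_t = alpha(t) * x_0 + sqrt(1 - alpha(t)^2) * noise
assert abs(igso3_sigma(0.0) - 0.1) < 1e-12 and abs(igso3_sigma(1.0) - 1.5) < 1e-12
assert abs(vp_alpha(0.0) - 1.0) < 1e-12    # clean data at t = 0
assert vp_alpha(1.0) < 0.01                # nearly pure noise at t = 1
```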

Architecture Breakdown (Paper App. I.2, Fig. 2)

Each of L=8 refinement layers follows: IPA → LN → concat(hipa, Linear(h0)) → Transformer → Linear + residual → MLP → EdgeUpdate → BackboneUpdate

1. Node Init
sin(res_idx, 128) + sin(t, 128) → 3-layer MLP(256→256) with ReLU + LayerNorm
2. Edge Init
φ(n) + φ(m) + φ(m−n) + φ(t) + φ(disp_sc) = 320 → 3-layer MLP → d_edge=128. Distogram: 22-bin cumulative indicator (2.0nm max)
3. IPA
Invariant Point Attention (AF2 Alg. 22). Points transformed through per-residue SE(3) frames. Output concatenates: scalar values (H×D) + 3D local-frame points (H×Vp×3) + point norms (H×Vp) + pair-weighted values (H×D) = 640 → Linear → 256
4. Transformer
Input: concat(h_ipa, Linear(h_0)) of dim (256+64). 2× Pre-LN self-attention (4 heads) + FFN(1024). Output projected back to d_node via post_trans_proj + h_ipa residual
5. EdgeUpdate
Separate src/tgt projections to D_h/2, concat with z → 3-layer MLP + LayerNorm + residual + trailing Linear
6. BackboneUpdate
Single Linear → (b,c,d, dx). Quaternion q=(1,b,c,d)/||·||, R_new = R × dR (Gram-Schmidt re-orthog), x_new = x + R×dx (re-centered)
7. TorsionHead
3-layer MLP(d_node) + h_L skip connection → Linear → (sinψ, cosψ) normalized to unit circle. Feeds directly into R_x(ψ) for O atom placement
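The BackboneUpdate step (item 6) maps an unconstrained 3-vector (b, c, d) to a valid rotation via quaternion normalization, so the zero vector is the identity update. A minimal pure-Python sketch (helper name is ours):

```python
import math

def backbone_update_rot(b, c, d):
    # q = (1, b, c, d) / ||q||, then quaternion -> 3x3 rotation matrix.
    n = math.sqrt(1.0 + b*b + c*c + d*d)
    w, x, y, z = 1.0 / n, b / n, c / n, d / n
    return [
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ]

R = backbone_update_rot(0.1, -0.2, 0.05)
# A valid rotation: orthogonal with determinant +1.
det = (R[0][0]*(R[1][1]*R[2][2] - R[1][2]*R[2][1])
       - R[0][1]*(R[1][0]*R[2][2] - R[1][2]*R[2][0])
       + R[0][2]*(R[1][0]*R[2][1] - R[1][1]*R[2][0]))
assert abs(det - 1.0) < 1e-12
```

The new frame is then R_new = R · dR with the translation x_new = x + R · dx, as in the table.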

Paper Compliance Review

Automated review against Yim et al. 2302.02277v3. All core algorithms verified correct.

✓ IGSO3 density, score, sampling
✓ IPA (all 4 output components)
✓ Rotation DSM loss + λr
✓ VP-SDE translation diffusion
✓ Reverse SDE sampling (Alg. 1)
✓ Self-conditioning + aux losses
⚠ Missing MLP residual (model.py:773)
⚠ Intermittent L_r spikes (3/29 epochs)

Designability Evaluation (100 backbones)

100× L=100 backbones, 500 reverse SDE steps, ζ=0.1. Each backbone sequence-designed with ProteinMPNN (8 seqs, temp=0.1, rm_aa=C) then validated with AlphaFold2 (model_1_ptm, 3 recycles). Metrics computed on best-of-8 AF2 predictions per backbone.
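The summary metrics below are simple order statistics over the per-backbone best-of-8 scRMSD values. A sketch of how they might be computed (record format hypothetical):

```python
import statistics

def designability_summary(sc_rmsds):
    # Fraction of backbones under each scRMSD threshold, plus the median.
    n = len(sc_rmsds)
    return {
        "frac_lt_2A": sum(r < 2.0 for r in sc_rmsds) / n,
        "frac_lt_5A": sum(r < 5.0 for r in sc_rmsds) / n,
        "median_A": statistics.median(sc_rmsds),
    }

s = designability_summary([4.04, 8.70, 15.4, 18.0])
assert s["frac_lt_5A"] == 0.25
assert abs(s["median_A"] - 12.05) < 1e-9
```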

0.0%
scRMSD < 2Å
1.0%
scRMSD < 5Å
37.0%
scTM > 0.5
26.0%
pLDDT > 0.7
15.4Å
Median scRMSD
scRMSD distribution

Best Designable Backbones (scRMSD < 10Å)

| Rank | MPNN | pLDDT | pTM | PAE | scRMSD (Å) | Best Sequence |
|---|---|---|---|---|---|---|
| 1 | 0.916 | 0.948 | 0.536 | 8.6 | 4.04 | LSGLLLLLLELLLLLLLLLLLELLLLLLLLLQLLLELLQLLLLLELLLLLLLLQLLLLLLLLLLLLLLLQLLLQLLLLLLLLQQQQLQQLLQLQLLLQLE |
| 2 | 1.452 | 0.875 | 0.649 | 5.5 | 8.70 | LLAGLLGLLLLLLLPLLLLLLLLLLAALLPELLLELLLLAALLLLLLLLLPLLLLLALLLLLLLLGLLLLLDPALALLLLLPLLLLLLLALLLLLELLLL |

Top: Generated backbone (poly-GLY). Bottom: AF2 prediction of best MPNN sequence. Drag to rotate, scroll to zoom.

Backbone — scRMSD=4.04Å

Backbone — scRMSD=8.70Å

AF2 prediction — pLDDT=0.948

AF2 prediction — pLDDT=0.875

Sequence Logos (MPNN-designed sequences)

Sequence logo best 1
Sequence logo best 2

Both backbones show strong leucine dominance with positional conservation of charged residues (E, K, Q). The best design (4.04Å) is almost entirely helical (L-rich), while the 2nd (8.70Å) shows more structural diversity with proline kinks and glycine turns.

Hourly Tracking (auto-generated)

17 samples (epoch numbers not recorded). Showing 5 of 17 runs. One random sample per run, ProteinMPNN → AF2 designability.

15.1Å
Med RMSD
10.9Å
Best RMSD
0.446
Med pLDDT
0.823
Best pLDDT
0.257
Med pTM
0.532
Best pTM
| Time | Epoch | MPNN | pLDDT | pTM | PAE | RMSD |
|---|---|---|---|---|---|---|
| mar 18 10:23pm | ? | 1.311 | 0.823 | 0.532 | 9.7 | 29.0 |
| mar 19 12:23am | ? | 0.833 | 0.391 | 0.224 | 18.0 | 14.4 |
| mar 19 3:29am | ? | 0.748 | 0.357 | 0.176 | 19.8 | 15.1 |
| mar 19 5:31am | ? | 0.940 | 0.649 | 0.420 | 14.4 | 16.5 |
| mar 19 8:29am | ? | 1.162 | 0.446 | 0.257 | 17.7 | 14.9 |

mar 18 10:23pm

mar 19 12:23am

mar 19 3:29am

mar 19 5:31am

mar 19 8:29am

AF2

AF2

AF2

AF2

AF2

FrameDiff v4: Expanded Data Ablation

Same 24.9M model as v3, same schedule. max_len=512 → 17,547 train chains (vs v3's 10,125). Tests whether data size is the bottleneck. PENDING

Comparison: v3 vs v4

| | v3 | v4 |
|---|---|---|
| Model | 8L, 24.9M, d=256 | 8L, 24.9M, d=256 (identical) |
| Noise | σmax=1.5, βmax=20 | σmax=1.5, βmax=20 (identical) |
| Max length | 256 | 512 |
| Train chains | 10,125 | 17,547 (+73%) |
| Val chains | 448 | 580 |
| Hardware | 2× A40 | 2× A40 |

Training Progress

| Epoch | L_x | L_r | Rot Err | RMSD | Val Total | Notes |
|---|---|---|---|---|---|---|
| 0 | 2.109 | 9.461 | 76.0° | 12.75 Å | 2.542 | |
| 5 | 1.560 | 0.682 | 68.1° | 10.58 Å | 2.106 | |
| 10 | 1.523 | 0.660 | 66.8° | 10.41 Å | 1.980 | |
| 12 | 1.524 | 0.655 | 66.5° | 10.36 Å | 1.929 | BEST |
| 15 | 1.485 | 0.648 | 66.1° | 10.25 Å | 2.096 | |
| 20 | 1.517 | 0.648 | 66.4° | 10.35 Å | 2.013 | |
| 24 | 1.495 | 0.642 | 65.9° | 10.24 Å | 2.044 | |
| 25 | 1.492 | 0.641 | 65.9° | 10.21 Å | 2.030 | |
| 26 | 1.487 | 0.640 | 65.7° | 10.19 Å | 2.160 | Latest |

Last updated: Mar 20, 2026 at 08:36 AM PST. Epoch 26. Training ongoing.

Training Curves

Charts will appear once training starts.

Designability Evaluation (96 backbones, epoch 0)

96× L=100 backbones, 500 reverse SDE steps, ζ=0.1. Each backbone sequence-designed with ProteinMPNN (8 seqs, temp=0.1, rm_aa=C) then validated with AlphaFold2 (model_1_ptm, 3 recycles). Metrics computed on best-of-8 AF2 predictions per backbone.

0.0%
scRMSD < 2Å
0.0%
scRMSD < 5Å
49.0%
scTM > 0.5
45.8%
pLDDT > 0.7
18.0Å
Median scRMSD
scRMSD distribution

vs v3: v4 has higher AF2 confidence (49% scTM>0.5 vs 37%, 46% pLDDT>0.7 vs 26%) but worse self-consistency (median 18.0Å vs 15.4Å, 0% <10Å vs 3%). The expanded dataset (17.5K chains) produces more protein-like backbones that AF2 folds confidently — but into different structures. At epoch 0, the model hasn't converged enough for self-consistency.

Top 2 Designs (by scRMSD)

| Rank | MPNN | pLDDT | pTM | PAE | scRMSD (Å) | Best Sequence |
|---|---|---|---|---|---|---|
| 1 | 1.261 | 0.614 | 0.367 | 15.0 | 10.83 | AAEGAAAAALLAAGAAAAGAAAAEVGGGAAAAAAAGAGAGAAAALEAELAGKAAEEGGVAEAAEEEKELKEKVKEAELEEEILKKKKGGLAAGAGGGGGL |
| 2 | 1.355 | 0.624 | 0.459 | 13.0 | 11.24 | GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG |

Top: Generated backbone. Bottom: AF2 prediction. Drag to rotate, scroll to zoom.

Backbone — scRMSD=10.83Å

Backbone — scRMSD=11.24Å

AF2 prediction — pLDDT=0.614

AF2 prediction — pLDDT=0.624

Sequence Logos (MPNN-designed sequences)

Sequence logo best 1
Sequence logo best 2

Hourly Tracking (auto-generated)

17 samples over epochs 12–?. Showing 5 of 17 runs. One random sample per run, ProteinMPNN → AF2 designability.

17.5Å
Med RMSD
13.2Å
Best RMSD
0.653
Med pLDDT
0.819
Best pLDDT
0.343
Med pTM
0.642
Best pTM
| Time | Epoch | MPNN | pLDDT | pTM | PAE | RMSD |
|---|---|---|---|---|---|---|
| mar 18 10:23pm | ? | 0.880 | 0.401 | 0.159 | 20.6 | 15.3 |
| mar 19 2:30am | ? | 1.534 | 0.754 | 0.397 | 16.6 | 13.7 |
| mar 19 6:29am | ? | 1.691 | 0.796 | 0.550 | 9.5 | 20.5 |
| mar 19 12:28pm | ? | 1.668 | 0.681 | 0.444 | 14.0 | 20.4 |
| mar 19 5:23pm | 12 | 1.890 | 0.607 | 0.347 | 14.6 | 20.5 |

mar 18 10:23pm

mar 19 2:30am

mar 19 6:29am

mar 19 12:28pm

mar 19 5:23pm

AF2

AF2

AF2

AF2

AF2

FrameDiff-Mini: Halved-Parameter SE(3) Diffusion

11.1M params, 6 layers, d_node=192, d_edge=96, d_ffn=768. Same CATH 4.3 data and losses as v3. 4× GPU. TRAINING

Training Progress

| Epoch | L_x | L_r | Rot Err | RMSD | Val Total | Notes |
|---|---|---|---|---|---|---|
| 0 | 2.063 | 0.919 | 78.2° | 12.85 Å | 3.290 | |
| 10 | 1.315 | 0.686 | 68.2° | 9.78 Å | 1.884 | |
| 20 | 1.285 | 3428 | 66.9° | 9.67 Å | 1.981 | L_r spike |
| 30 | 1.254 | 79720 | 65.8° | 9.52 Å | 1.889 | L_r spike |
| 40 | 1.243 | 0.644 | 66.0° | 9.47 Å | 1.812 | |
| 50 | 1.262 | 0.645 | 66.1° | 9.51 Å | 1.873 | |
| 60 | 1.223 | 0.636 | 65.2° | 9.36 Å | 1.844 | |
| 70 | 1.234 | 0.636 | 65.6° | 9.39 Å | 1.947 | |
| 76 | 1.220 | 0.633 | 65.3° | 9.31 Å | 1.678 | BEST |
| 80 | 1.232 | 0.633 | 65.1° | 9.37 Å | 1.893 | |
| 90 | 1.217 | 0.631 | 65.2° | 9.30 Å | 1.920 | |
| 100 | 1.212 | 0.626 | 64.7° | 9.27 Å | 1.856 | |
| 110 | 1.216 | 0.628 | 65.0° | 9.27 Å | 1.754 | |
| 120 | 1.195 | 0.626 | 64.9° | 9.21 Å | 1.811 | |
| 130 | 1.211 | 0.628 | 65.2° | 9.26 Å | 1.969 | |
| 140 | 1.198 | 0.624 | 64.8° | 9.19 Å | 1.819 | |
| 150 | 1.197 | 0.622 | 64.7° | 9.17 Å | 1.794 | |
| 159 | 1.169 | 0.619 | 64.3° | 9.07 Å | 1.918 | |
| 160 | 1.174 | 0.615 | 63.8° | 9.03 Å | 1.956 | |
| 161 | 1.182 | 0.622 | 64.7° | 9.13 Å | 1.880 | Latest |

Last updated: Mar 20, 2026 at 08:36 AM PST. Epoch 161. Training ongoing (~17 min/epoch).

Training Curves

Configuration

Parameters
11.1M (6 layers, d_node=192, d_edge=96, d_skip=48)
IPA
8 heads, 8 query points, 12 value points (same as v3)
Transformer
4 heads × 2 layers, 768 FFN (v3: 1024)
Training
Same as v3: Adam 1e-4, cosine decay, EMA 0.999, stop_grad, self-cond 50%
Hardware
4× GPU (savio_lowprio), ~17 min/epoch

Generated Structures & Designability (epoch 21 checkpoint)

100-residue backbones, 500 steps, ζ=0.1. ProteinMPNN (8 seqs, temp=0.1) → AlphaFold2 (model_1_ptm, 3 recycles). Top row: generated backbone. Bottom row: AF2 prediction of best MPNN sequence.

| Sample | Best MPNN | pLDDT | pTM | PAE | RMSD (Å) | Best Sequence |
|---|---|---|---|---|---|---|
| 0 | 1.572 | 0.857 | 0.593 | 8.48 | 21.12 | ALGLLLLLLGLLALLLLLLLLPLLEGAELLEAAGLRLLAALLLLHRLLALLELLLELLLLALLLLGGLLLLLALAAALGELGLLLLLLPLLALLLLLLLG |
| 1 | 1.035 | 0.463 | 0.270 | 18.72 | 15.24 | AEALAAGAGGGLALGAALELLAAAGALGGGGGAAAAAGAGGGGGGGAAAGGGLAGGGGGGGAGGSLGAGTGGLGGGAAGAAGGGGGLGGLLGGAAAGAAG |

Backbone 0

Backbone 1

AF2 best 0

AF2 best 1

Hourly Tracking (auto-generated)

17 samples (epoch numbers not recorded). Showing 5 of 17 runs. One random sample per run, ProteinMPNN → AF2 designability.

16.4Å
Med RMSD
13.7Å
Best RMSD
0.392
Med pLDDT
0.527
Best pLDDT
0.200
Med pTM
0.314
Best pTM
| Time | Epoch | MPNN | pLDDT | pTM | PAE | RMSD |
|---|---|---|---|---|---|---|
| mar 18 10:23pm | ? | 0.757 | 0.384 | 0.199 | 21.4 | 19.2 |
| mar 18 11:23pm | ? | 0.833 | 0.404 | 0.241 | 18.7 | 16.5 |
| mar 19 1:33am | ? | 0.801 | 0.375 | 0.200 | 20.2 | 16.2 |
| mar 19 3:29am | ? | 0.839 | 0.399 | 0.167 | 20.2 | 15.8 |
| mar 19 5:31am | ? | 0.798 | 0.453 | 0.278 | 21.1 | 17.3 |

mar 18 10:23pm

mar 18 11:23pm

mar 19 1:33am

mar 19 3:29am

mar 19 5:31am

AF2

AF2

AF2

AF2

AF2

FrameDiff-Tiny: Minimal SE(3) Diffusion

3.7M params, 4 layers, d_node=128, d_edge=64, d_ffn=512. Reduced noise: σmax=1.0, βmax=10.0. 1× GPU. TRAINING

Training Progress

| Epoch | L_x | L_r | Rot Err | RMSD | Val Total | Notes |
|---|---|---|---|---|---|---|
| 0 | 1.499 | 0.842 | 50.6° | 10.63 Å | 2.720 | |
| 10 | 0.556 | 0.457 | 36.5° | 6.29 Å | 1.001 | |
| 20 | 0.510 | 0.420 | 34.1° | 5.96 Å | 0.923 | |
| 30 | 0.483 | 0.410 | 33.7° | 5.83 Å | 0.868 | |
| 40 | 0.450 | 0.397 | 32.9° | 5.64 Å | 0.885 | |
| 50 | 0.435 | 0.392 | 32.7° | 5.53 Å | 0.843 | |
| 60 | 0.432 | 0.388 | 32.5° | 5.48 Å | 0.829 | |
| 70 | 0.417 | 0.383 | 32.2° | 5.37 Å | 0.883 | |
| 80 | 0.419 | 0.382 | 32.3° | 5.36 Å | 0.843 | |
| 90 | 0.400 | 0.375 | 31.7° | 5.22 Å | 0.875 | |
| 100 | 0.403 | 0.375 | 31.9° | 5.26 Å | 0.856 | |
| 110 | 0.399 | 0.371 | 31.7° | 5.21 Å | 0.871 | |
| 120 | 0.392 | 0.369 | 31.6° | 5.16 Å | 0.814 | |
| 130 | 0.376 | 0.364 | 31.2° | 5.04 Å | 0.832 | |
| 140 | 0.380 | 0.364 | 31.3° | 5.08 Å | 0.824 | |
| 150 | 0.390 | 0.367 | 31.6° | 5.13 Å | 0.832 | |
| 160 | 0.366 | 0.360 | 31.0° | 4.97 Å | 0.859 | |
| 163 | 0.375 | 0.361 | 31.1° | 5.01 Å | 0.689 | BEST |
| 165 | 0.383 | 0.365 | 31.4° | 5.08 Å | 0.781 | |
| 166 | 0.373 | 0.361 | 31.1° | 5.02 Å | 0.812 | |
| 167 | 0.378 | 0.360 | 31.1° | 5.03 Å | 0.710 | Latest |

Last updated: Mar 20, 2026 at 08:36 AM PST. Epoch 167. Training ongoing.

Training Curves

Configuration

Parameters
3.7M (4 layers, d_node=128, d_edge=64, d_skip=32)
Transformer
4 heads × 2 layers, 512 FFN
Noise (reduced)
σmax=1.0 (vs 1.5), βmax=10.0 (vs 20.0) — easier denoising, potentially less diverse samples
Hardware
1× GPU (savio_lowprio)

Generated Structures & Designability (best checkpoint)

100-residue backbones, 500 steps, ζ=0.1. ProteinMPNN (8 seqs, temp=0.1) → AlphaFold2 (model_1_ptm, 3 recycles). Top row: generated backbone. Bottom row: AF2 prediction of best MPNN sequence.

| Sample | Best MPNN | pLDDT | pTM | PAE | RMSD (Å) | Notes |
|---|---|---|---|---|---|---|
| 0 | 0.191 | 0.344 | 0.156 | 21.38 | 22.19 | >95% glycine — failed |
| 1 | 0.298 | 0.369 | 0.149 | 21.95 | 20.48 | >95% glycine — failed |
| 2 | 0.217 | 0.385 | 0.166 | 22.25 | 20.56 | >95% glycine — failed |
| 3 | 0.184 | 0.359 | 0.154 | 21.33 | 20.91 | >95% glycine — failed |
| 4 | 0.243 | 0.361 | 0.151 | 21.39 | 19.80 | >95% glycine — failed |

All 5 samples produce >95% glycine sequences. MPNN cannot find valid sequences for these backbones — the geometry is not protein-like. AF2 predicts random coil (pLDDT ~0.35, pTM ~0.15).

Backbone 0

Backbone 1

Backbone 2

Backbone 3

Backbone 4

AF2 best 0

AF2 best 1

AF2 best 2

AF2 best 3

AF2 best 4

Hourly Tracking (auto-generated)

17 runs, all produced >95% glycine sequences (MPNN < 0.5). Backbones not designable.

Training Progress

v13b COMPLETE
CA-only · ~35M params
Best structural: 2.324 at E60
60/60 epochs · final epoch best!
v14 run2 EARLY STOP
Full backbone · ~38M params
Best structural: 2.425 at E15
Stopped at E30 · pat 15/15
v14 run3 STOPPED
From E15 · BB_LR 2x + L2 reg
Best structural: 2.335 at E24
Stopped E37 · pat 13 (preemption)
v15 EARLY STOP
Best pLDDT: 2.925 (E23)
Stopped E38 · pat 15/15
Frozen trunk · 156K trainable params

v13b — CA-only (from scratch, CLIP_PS=5.0, dist_head in optimizer)

| Epoch | Val Total | Structural | FAPE | Frame Rot | Dist MSE | Bond | Aux Dist | Chirality | Angle | Rg | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 2.608 | 2.599 | 1.675 | 0.841 | 0.357 | 0.008 | 0.302 | 0.375 | 0.137 | 0.038 | NEW BEST |
| 13 | 2.588 | 2.579 | 1.649 | 0.829 | 0.370 | 0.007 | 0.301 | 0.371 | 0.138 | 0.037 | NEW BEST |
| 17 | 2.551 | 2.542 | 1.641 | 0.826 | 0.352 | 0.006 | 0.306 | 0.356 | 0.129 | 0.036 | NEW BEST |
| 24 | 2.507 | 2.498 | 1.628 | 0.814 | 0.335 | 0.005 | 0.302 | 0.356 | 0.117 | 0.035 | NEW BEST |
| 28 | 2.487 | 2.475 | 1.604 | 0.810 | 0.339 | 0.006 | 0.423 | 0.345 | 0.113 | 0.035 | NEW BEST |
| 31 | 2.471 | 2.461 | 1.597 | 0.801 | 0.337 | 0.005 | 0.345 | 0.359 | 0.116 | 0.035 | NEW BEST |
| 40 | 2.465 | 2.456 | 1.597 | 0.797 | 0.335 | 0.006 | 0.309 | 0.352 | 0.111 | 0.034 | NEW BEST |
| 42 | 2.384 | 2.375 | 1.553 | 0.759 | 0.325 | 0.005 | 0.309 | 0.339 | 0.104 | 0.034 | NEW BEST |
| 54 | 2.380 | 2.371 | 1.564 | 0.773 | 0.322 | 0.005 | 0.288 | 0.339 | 0.106 | 0.034 | NEW BEST |
| 60 | 2.333 | 2.324 | 1.524 | 0.747 | 0.323 | 0.005 | 0.284 | 0.335 | 0.103 | 0.034 | NEW BEST · FINAL |

Last 5 epochs

| Epoch | Val Total | Structural | FAPE | Frame Rot | Dist MSE | Bond | Aux Dist | Chirality | Angle | Rg | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 56 | 2.405 | 2.371 | 1.552 | 0.778 | 0.342 | 0.005 | 0.286 | 0.338 | 0.106 | 0.034 | pat 2/15 |
| 57 | 2.388 | 2.371 | 1.563 | 0.767 | 0.328 | 0.005 | 0.286 | 0.337 | 0.105 | 0.034 | pat 3/15 |
| 58 | 2.396 | 2.371 | 1.557 | 0.775 | 0.333 | 0.005 | 0.285 | 0.337 | 0.105 | 0.034 | pat 4/15 |
| 59 | 2.389 | 2.371 | 1.562 | 0.776 | 0.325 | 0.005 | 0.285 | 0.337 | 0.105 | 0.034 | pat 5/15 |
| 60 | 2.333 | 2.324 | 1.524 | 0.747 | 0.323 | 0.005 | 0.284 | 0.335 | 0.103 | 0.034 | NEW BEST · FINAL |

v14 run2 — Full backbone, from scratch (early stopped)

| Epoch | Val Total | CA Structural | FAPE | Frame Rot | Dist MSE | BB FAPE | BB Bond | Omega | Status |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 4.671 | 2.840 | 1.756 | 0.933 | 0.432 | 1.747 | 0.007 | 0.036 | NEW BEST |
| 8 | 4.311 | 2.623 | 1.645 | 0.843 | 0.410 | 1.635 | 0.005 | 0.022 | NEW BEST |
| 10 | 4.257 | 2.587 | 1.631 | 0.846 | 0.394 | 1.621 | 0.004 | 0.020 | NEW BEST |
| 15 | 4.017 | 2.425 | 1.558 | 0.785 | 0.350 | 1.549 | 0.004 | 0.017 | NEW BEST |

Final epochs

| Epoch | Val Total | CA Structural | FAPE | Frame Rot | Dist MSE | BB FAPE | BB Bond | Omega | Status |
|---|---|---|---|---|---|---|---|---|---|
| 26 | 4.132 | 2.523 | 1.579 | 0.817 | 0.415 | 1.569 | 0.004 | 0.015 | pat 11/15 |
| 27 | 4.108 | 2.509 | 1.571 | 0.808 | 0.417 | 1.560 | 0.003 | 0.014 | pat 12/15 |
| 28 | 4.160 | 2.535 | 1.598 | 0.824 | 0.407 | 1.587 | 0.003 | 0.013 | pat 13/15 |
| 29 | 4.219 | 2.578 | 1.614 | 0.837 | 0.428 | 1.603 | 0.004 | 0.013 | pat 14/15 |
| 30 | 4.241 | 2.594 | 1.621 | 0.836 | 0.438 | 1.610 | 0.003 | 0.013 | EARLY STOP |

v14 run3 — Restarted from E15, BB_LR 2x + L2 offset reg (w=0.1)

| Epoch | Val Total | CA Structural | FAPE | Frame Rot | Dist MSE | BB FAPE | BB Bond | Omega | Status |
|---|---|---|---|---|---|---|---|---|---|
| 24 | 3.865 | 2.335 | 1.500 | 0.743 | 0.350 | 1.492 | 0.003 | 0.013 | NEW BEST |

Recent epochs

| Epoch | Val Total | CA Structural | FAPE | Frame Rot | Dist MSE | BB FAPE | BB Bond | Omega | Status |
|---|---|---|---|---|---|---|---|---|---|
| 30 | 4.043 | 2.466 | 1.548 | 0.787 | 0.409 | 1.539 | 0.003 | 0.013 | pat 6/15 |
| 31 | 3.998 | 2.425 | 1.544 | 0.784 | 0.375 | 1.535 | 0.003 | 0.013 | pat 7/15 |
| 32 | 3.983 | 2.426 | 1.530 | 0.784 | 0.392 | 1.521 | 0.003 | 0.013 | pat 8/15 |
| 33 | 3.955 | 2.402 | 1.525 | 0.768 | 0.390 | 1.516 | 0.003 | 0.013 | pat 9/15 |
| 34 | 4.001 | 2.429 | 1.544 | 0.796 | 0.378 | 1.535 | 0.003 | 0.013 | pat 10/15 |
| 35 | 3.990 | 2.514 | 1.566 | 0.820 | 0.390 | 1.557 | 0.003 | 0.013 | pat 11/15 |
| 36 | 4.022 | 2.469 | 1.557 | 0.793 | 0.401 | 1.548 | 0.003 | 0.013 | pat 12/15 |
| 37 | 4.045 | 2.502 | 1.552 | 0.796 | 0.414 | 1.543 | 0.003 | 0.013 | pat 13/15 |

v15 — Full backbone + pLDDT + pAE confidence heads

Frozen trunk (v14 run3 E24 weights). Only pLDDT + pAE heads trainable (156K params). LR 1e-4, w_plddt=0.2, w_pae=0.2. Structural quality preserved from v14 run3.

Structural losses are logged but not optimized.

| Epoch | Val Total | CA Structural | FAPE | Frame Rot | Dist MSE | BB FAPE | pLDDT | pAE | DDIM TM | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.322 | 2.335 | 1.623 | 0.839 | 0.431 | 1.613 | 3.130 | 3.818 | 0.100 | NEW BEST |
| 2 | 4.205 | 2.335 | 1.600 | 0.827 | 0.421 | 1.590 | 3.023 | 3.781 | — | NEW BEST |
| 3 | 4.143 | 2.335 | 1.597 | 0.821 | 0.415 | 1.587 | 2.983 | 3.757 | — | NEW BEST |
| 4 | 4.095 | 2.335 | 1.601 | 0.826 | 0.410 | 1.591 | 2.993 | 3.744 | — | NEW BEST |
| 5 | 4.075 | 2.335 | 1.598 | 0.820 | 0.412 | 1.588 | 2.989 | 3.743 | 0.102 | NEW BEST |
| 6 | 4.036 | 2.335 | 1.595 | 0.818 | 0.408 | 1.586 | 2.954 | 3.756 | — | NEW BEST |
| 7 | 4.015 | 2.335 | 1.597 | 0.822 | 0.405 | 1.587 | 2.938 | 3.757 | — | pat 1/15 |
| 8 | 3.994 | 2.335 | 1.599 | 0.821 | 0.403 | 1.589 | 2.963 | 3.736 | — | pat 2/15 |
| 9 | 3.986 | 2.335 | 1.601 | 0.823 | 0.401 | 1.591 | 2.953 | 3.754 | — | pat 3/15 |
| 10 | 3.972 | 2.335 | 1.598 | 0.819 | 0.403 | 1.588 | 2.944 | 3.752 | 0.101 | NEW BEST |
| 11 | 3.958 | 2.335 | 1.600 | 0.820 | 0.400 | 1.590 | 2.933 | 3.759 | — | NEW BEST |
| 12 | 3.945 | 2.335 | 1.598 | 0.818 | 0.402 | 1.588 | 2.929 | 3.750 | — | NEW BEST |
| 13 | 3.940 | 2.335 | 1.599 | 0.819 | 0.401 | 1.589 | 2.945 | 3.747 | — | pat 1/15 |
| 14 | 3.936 | 2.335 | 1.600 | 0.820 | 0.400 | 1.590 | 2.940 | 3.749 | — | pat 2/15 |
| 15 | 3.934 | 2.335 | 1.601 | 0.821 | 0.399 | 1.591 | 2.962 | 3.737 | 0.102 | pat 3/15 |
| 16 | 3.930 | 2.335 | 1.600 | 0.819 | 0.400 | 1.590 | 2.956 | 3.736 | — | pat 4/15 |

Loss Curves — All Models

All models loss comparison

• v13b (blue)   • v14 run2 (gray, stopped)   • v14 run3 (green)   • v15 (purple, when available)   Dotted line = theoretical floor.


Ramachandran Plots (latest per model)

v13b E40 pseudo-Ramachandran (structural 2.456)

v13b pseudo-Ramachandran E40

v14 run3 E24 true Ramachandran (structural 2.335)

v14 run3 true Ramachandran E24

v13 (original) — superseded by v13b

Early stopped at E19 (pat 15/15, best E4 structural 2.373). Two bugs: CLIP_PAIR_STACK=0.5 strangled pair stack learning, dist_head frozen at random init. Full table in Archive tab.

Side Chain Prediction (Future)

Side Chain Reconstruction

Planned
Goal: Extend v14’s full backbone prediction to include Cβ atoms and eventually full side chain reconstruction, enabling the model to produce all-atom protein structures directly from sequence.

Planned Approach

  • Chi angle prediction: Predict side chain dihedral angles (χ1–χ4) conditioned on backbone frames, residue type, and pair representation.
  • Hierarchical placement: Backbone atoms are placed first (from v14 frames), then side chain atoms are built outward using predicted chi angles and ideal bond geometry.
  • Rotamer-aware loss: Side chain loss will account for rotamer distributions — penalizing chi angles that fall outside known rotamer basins for each residue type.
  • Clash penalty: Steric clash loss between predicted side chain atoms to enforce physically valid packing.
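As a sketch of the hierarchical placement step, the standard internal-coordinate (NeRF-style) construction places each new atom from three previously placed atoms plus a bond length, bond angle, and dihedral — a predicted χ angle would supply the dihedral. All names and the example geometry here are illustrative, not from the planned implementation:

```python
import numpy as np

def place_atom(a, b, c, bond_len, bond_angle, dihedral):
    # Place the next atom given three placed atoms (a, b, c), an ideal bond
    # length and angle (radians) to c, and a dihedral about the b-c axis.
    bc = (c - b) / np.linalg.norm(c - b)
    n = np.cross(b - a, bc)
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)
    d = np.array([-bond_len * np.cos(bond_angle),
                  bond_len * np.sin(bond_angle) * np.cos(dihedral),
                  bond_len * np.sin(bond_angle) * np.sin(dihedral)])
    return c + d[0] * bc + d[1] * m + d[2] * n

a = np.array([0.0, 0.0, 0.0])
b = np.array([1.5, 0.0, 0.0])
c = np.array([2.2, 1.3, 0.0])
x = place_atom(a, b, c, 1.52, np.deg2rad(111.0), np.deg2rad(60.0))
assert abs(np.linalg.norm(x - c) - 1.52) < 1e-9   # bond length respected
```

Because the local frame (bc, m, n) is orthonormal, errors in the backbone frame translate directly into side-chain positions — which is exactly the prerequisite concern below.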

Prerequisites

This module depends on v14’s full backbone prediction being stable and well-converged first. Accurate backbone frames are essential — side chain atoms are placed relative to the backbone frame, so errors in backbone geometry propagate directly to side chain positions. Work will begin once v14 demonstrates consistent bond/angle geometry and competitive FAPE scores.

FrameDiff v2 Archive Archived — superseded by v3

FrameDiff v2: IGSO3 Backbone Diffusion (14.4M params, 111 epochs) — click to expand

FrameDiff v2: IGSO3 Backbone Diffusion

Unconditional protein backbone generation via SE(3) diffusion with proper IGSO3 rotation noise and score matching loss. 14.4M params, 8 layers, 2×A40. Cosine LR schedule. Training on CATH 4.3 (10,600 chains). [Full Report PDF — 69 pages with source code]

Current Model

  • 14.4M params · 8× IPA (12 heads) + Transformer + 2-layer MLP backbone updates
  • IGSO3 rotation noise (σmin=0.1, σmax=1.5) + IGSO3 score matching loss
  • VP-SDE translation diffusion + geodesic reverse SDE for sampling
  • Cosine LR decay (1e-4 → 1e-6) — added after E81 explosion with constant LR
  • Self-conditioning (50%) + auxiliary losses (L_bb, L_2d at t<0.25)
  • Stop-gradient on intermediate frames (numerical stability for 8 layers)

Structural Metrics

v2 metrics

Ca RMSD: 8.45 → 4.50 A (v1 best: 4.48 A, dashed). Rotation error: 66.5 → 46.0 deg (v1 models: ~166 deg, mirror-flipped). Orange dotted line marks cosine LR restart at E79.

All Loss Components

v2 all losses

Top-left: Translation MSE (L_x). Top-right: IGSO3 rotation score matching (L_r). Bottom-left: Backbone atom loss (L_bb, t<0.25). Bottom-right: Pairwise Ca distance (L_2d).

Ca RMSD

v2 RMSD

Rotation Angle Error

v2 rotation error

v2 rotation error monotonically decreases (66 → 46 deg) without the π-flip seen in v1 models. The IGSO3 score provides correct directional supervision. For reference: random rotation ~126 deg, v1 Frobenius ~166 deg (mirror-flipped).

Translation Loss (L_x)

v2 L_x

Rotation Loss (L_r, IGSO3 Score Matching)

v2 L_r

Learning Rate Schedule

v2 LR schedule

Constant LR caused catastrophic divergence at E81 (L_r jumped 6.4 → 612). Resumed from best checkpoint (E78) with cosine decay. E79+ trains stably past the explosion zone.
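The decay applied here can be sketched as a generic warmup-plus-cosine schedule of the kind listed in the v3 config (lr 1e-4 → 1e-6, warmup 1000 steps); step counts and helper names are ours:

```python
import math

def lr_at(step, total_steps, warmup=1000, lr_max=1e-4, lr_min=1e-6):
    # Linear warmup to lr_max, then cosine decay down to lr_min.
    if step < warmup:
        return lr_max * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

assert lr_at(0, 100_000) == 0.0
assert abs(lr_at(1_000, 100_000) - 1e-4) < 1e-12    # peak after warmup
assert abs(lr_at(100_000, 100_000) - 1e-6) < 1e-12  # floor at the end
```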

Total Loss

v2 total loss

Training Progress

| Epoch | L_x | L_r (IGSO3) | Rot Err (deg) | RMSD (Å) | Val Total | Notes |
|---|---|---|---|---|---|---|
| 0 | 0.879 | 7.60 | 66.5 | 8.45 | 7.69 | |
| 9 | 0.467 | 6.68 | 52.3 | 5.98 | 6.90 | RMSD < 6 Å |
| 18 | 0.321 | 6.56 | 50.0 | 5.13 | 6.45 | 2× A40 starts |
| 28 | 0.330 | 6.57 | 48.9 | 5.00 | 6.19 | RMSD < 5 Å |
| 40 | 0.309 | 6.38 | 47.6 | 4.85 | 6.94 | |
| 51 | 0.280 | 6.41 | 47.3 | 4.73 | 6.18 | |
| 63 | 0.271 | 6.35 | 46.6 | 4.61 | 6.09 | Best val (pre-explosion) |
| 71 | 0.272 | 6.24 | 46.0 | 4.55 | 6.84 | RMSD 4.55 |
| 81 | — | 612 | — | — | — | Explosion — constant LR too aggressive; resumed from E78 with cosine decay |
| 79* | 0.260 | 6.32 | 46.0 | 4.50 | 6.81 | Cosine LR, resumed from best.pt |
| 80* | 0.276 | 6.39 | 46.6 | 4.60 | 6.09 | val=6.09, stable |
| 83* | 0.253 | 6.39 | 46.6 | 4.54 | 6.41 | Stable through E81+ zone |
| 88* | 0.267 | 6.34 | 46.6 | 4.56 | 6.30 | |
| 91* | 0.260 | 6.33 | 46.3 | 4.55 | 6.64 | |
| 94* | 0.277 | 6.28 | 45.8 | 4.54 | 6.36 | |
| 97* | 0.275 | 6.27 | 45.9 | 4.52 | 6.57 | |
| 105* | 0.242 | 6.27 | 45.7 | 4.46 | 6.21 | RMSD 4.46, val=6.21 — beat v1 8L |
| 108* | 0.265 | 6.42 | 46.2 | 4.49 | 6.61 | Early stop triggered (disabled) |
| 111* | 0.244 | 6.22 | 45.5 | 4.43 | 6.45 | All metrics at new lows |

Last updated: Mar 17, 2026. Training ongoing with cosine LR. * = post-restart epoch numbering.

Archived Models

Earlier model variants using Frobenius geodesic loss (not IGSO3). Both exploded around E78-81 due to constant LR.

4-Layer v1 (6.4M params) — best RMSD 4.26 A, exploded E78

Frobenius geodesic rotation loss. Best val=0.298 (E43). Rotation error ~166 deg (pi-flipped). Diverged catastrophically at E78 (L_x=34).

8-Layer v1 (12.8M params) — best RMSD 4.48 A, stopped E36

Frobenius geodesic rotation loss with stop-gradient. Best val=0.327 (E33). Rotation error ~166 deg (pi-flipped). Hit 6h wall clock at E36.


v13 Training Archive Archived — superseded by v13b

v13 early stopped at E19 (patience 15/15, best E4). Two critical bugs prevented further improvement: CLIP_PAIR_STACK=0.5 strangled pair stack learning, and dist_head was missing from the optimizer. Superseded by v13b which fixes both bugs and trains from scratch.

| Epoch | Val Total | FAPE | Frame Rot | Dist MSE | Bond | Aux Dist | Chirality | Angle | Rg | DDIM TM | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2.571 | 1.657 | 0.846 | 0.353 | 0.009 | 0.323 | 0.347 | 0.130 | 0.036 | 0.114 | NEW BEST |
| 2 | 2.494 | 1.610 | 0.792 | 0.359 | 0.007 | 0.324 | 0.336 | 0.115 | 0.036 | — | NEW BEST |
| 3 | 2.533 | 1.611 | 0.829 | 0.374 | 0.008 | 0.324 | 0.336 | 0.120 | 0.036 | — | pat 1 |
| 4 | 2.373 | 1.546 | 0.771 | 0.312 | 0.008 | 0.323 | 0.319 | 0.113 | 0.036 | — | NEW BEST |
| 5–18 | — | — | — | — | — | — | — | — | — | — | No improvement (pat 1–14); pair stack strangled by CLIP_PS=0.5, dist_head frozen |
| 19 | 2.425 | 1.553 | 0.791 | 0.346 | 0.005 | 0.399 | 0.343 | 0.102 | 0.035 | — | EARLY STOP |

v13 Loss Curves (with v12/v12b overlay)

v12/v12b/v13 loss curves

v12b Training Archive Archived — superseded by v13

Protein Backbone Diffusion Model v12b

Complete
Goal: Generate realistic protein backbone structures (Cα coordinates) conditioned on sequence, using contact-aware embeddings from Stage 1. v12 scales IPA capacity 4.1x over v11b (31.1M params) to resolve the gradient competition collapse that limited v11b — same validated loss design, wider single representation, deeper frame update MLP.
Architecture IPA denoiser (8 layers, 8 heads, 8 query points) + independent aux pair stack (64-dim) + frozen ContactClassifier encoder — 31.1M params (29.9M trainable, 1.2M frozen); d_ipa_hidden=512, d_ipa_ffn=1024, 2-layer FrameUpdate MLP
Training Single RTX 2080 Ti (11.5 GB), batch_size=4, grad_accum=4 (eff=16), LR=2e-5, gradient checkpointing, T=1000, DDIM-50 eval, SC=0.25, w_frame_rot=0.5
Dataset CATH 4.2: 18,024 train / 608 val proteins, max 125 residues

Why IPA? The v10 Ceiling

v10 used an 8-layer EGNN denoiser that learned pairwise distance statistics well (dist_mse 60% below random) but could not learn protein topology. FAPE stayed at its random baseline (~1.31) across 21 epochs and TM-score peaked at 0.131 (random ~0.10). EGNN has no concept of local reference frames — it reasons about distances, not backbone geometry. IPA solves this by maintaining and refining per-residue rigid-body frames (rotation + translation) through 3D point attention in local coordinate systems.

v11b Key Changes (from v11)

  • Frame rotation loss (w=0.5) — direct angular distance: 1 - cos(angle) between learned R and Gram-Schmidt ground truth. Gives rotation quaternions direct gradient for the first time.
  • FAPE with learned frames — fape_loss_with_frames uses R_pred from the IPA layers instead of rebuilding frames from coordinates. The R → FAPE gradient path is now intact.
  • LR halved — 2e-5 (from 5e-5) to prevent overshooting now that rotations receive gradient
  • Self-conditioning reduced — SC probability 0.25 (from 0.5) to let the model learn from scratch more
  • Initialized from v11 E10 best — EMA weights only, fresh optimizer + scheduler
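The angular-distance loss in the first bullet can be written directly from the rotation matrices, using trace(R_predᵀ R_true) = 1 + 2 cos θ (pure-Python sketch, names ours):

```python
def frame_rot_loss(R_pred, R_true):
    # 1 - cos(theta) between two rotations; trace(R_pred^T R_true) = 1 + 2 cos(theta).
    tr = sum(R_pred[i][j] * R_true[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))
    return 1.0 - cos_theta

I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Rz90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
assert frame_rot_loss(I3, I3) == 0.0          # identical frames
assert frame_rot_loss(I3, Rz90) == 1.0        # 90° apart -> 1 - cos(90°) = 1
```

The loss is 0 for identical frames and rises to 2 at a 180° flip, which gives the quaternions a smooth, direct gradient.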

v12b Training Fixes (over v12)

  • Per-module gradient clipping: denoiser max_norm=1.0, pair_stack=0.5, aux_pair=0.5
  • Per-module LR groups: pair_stack gets 3x base LR (6e-5) to compensate gradient attenuation
  • Pair stack tripwire: if grad norm < 0.01 for 50 consecutive steps, inject gradient noise (scale=0.01)
  • Hard halt: if grad norm < 0.001 for 500 total steps, stop training
  • Atomic checkpoint saves: write to tmp file then rename to survive SLURM preemption
  • NaN loss batches: skipped entirely instead of poisoning gradient accumulator
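A framework-agnostic sketch of the per-module clipping in the first bullet — scale a module's gradients so their global L2 norm stays under that module's own max_norm (1.0 for the denoiser, 0.5 for the pair stacks); the helper is ours, not the training code:

```python
import math

def clip_grads(grads, max_norm):
    # grads: list of flat gradient vectors for one module.
    # Rescale all of them if the combined L2 norm exceeds max_norm.
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total > max_norm:
        scale = max_norm / total
        grads = [[g * scale for g in vec] for vec in grads]
    return grads

clipped = clip_grads([[3.0, 4.0]], max_norm=1.0)   # norm 5 -> rescaled to 1
assert abs(math.hypot(*clipped[0]) - 1.0) < 1e-12
```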

v12b Loss Weights

| Loss | v11 | v11b/v12b | Rationale |
|---|---|---|---|
| FAPE (w/ learned R) | 1.0 | 1.0 | v11b uses learned R_pred instead of Gram-Schmidt rebuilt frames |
| Frame Rot | — | 0.5 | NEW: direct angular loss on learned R vs ground truth. The v11b fix. |
| Bond | 3.0 | 3.0 | |
| Clash | 0.1 | 0.1 | |
| Aux dist | 0.03 | 0.03 | |
| Dist MSE | 1.0 | 1.0 | |
| Chirality | 0.1 | 0.1 | |
| Angle | 0.5 | 0.5 | |
| Rg | 0.5 | 0.5 | |

v12/v12b Loss Curves

V12 diffusion training loss curves

Pseudo-Ramachandran Analysis

Pseudo-Ramachandran plots

Pseudo-dihedrals computed from consecutive Cα positions. Ground truth (left) shows clear α-helix (~50°,50°) and β-sheet (~−120°,120°) clusters.

v12/v12b Full Training Table

| Epoch | Val Total | Val FAPE | Val Frame Rot | Val Dist | Val Bond | DDIM TM | DDIM RMSD | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | 4.928 | 2.074 | 1.199 | 0.857 | 0.135 | 0.101 | 15.28 Å | NEW BEST |
| 2 | 3.538 | 2.002 | 1.138 | 0.581 | 0.028 | 0.132 | 13.73 Å | NEW BEST |
| 3 | 3.041 | 1.859 | 1.022 | 0.449 | 0.016 | 0.118 | 15.33 Å | NEW BEST |
| 4 | 2.856 | 1.790 | 0.942 | 0.399 | 0.013 | 0.103 | 16.56 Å | NEW BEST |
| 5 | 2.852 | 1.763 | 0.929 | 0.428 | 0.013 | 0.101 | 16.88 Å | NEW BEST |
| 6 | 2.771 | 1.732 | 0.910 | 0.403 | 0.010 | 0.102 | 16.99 Å | NEW BEST |
| 7 | 2.707 | 1.749 | 0.881 | 0.343 | 0.010 | 0.099 | 17.26 Å | NEW BEST |
| 8 | 2.719 | 1.708 | 0.884 | 0.399 | 0.009 | 0.099 | 17.23 Å | pat 1 |
| 9 | 2.720 | 1.728 | 0.876 | 0.386 | 0.008 | 0.101 | 17.04 Å | pat 2 |
| 10 | 2.581 | 1.651 | 0.833 | 0.352 | 0.008 | 0.102 | 17.09 Å | NEW BEST |
| 11 | 2.685 | 1.684 | 0.864 | 0.391 | 0.009 | 0.098 | 17.41 Å | pat 1 |
| 12 | 2.831 | 1.804 | 0.905 | 0.391 | 0.009 | 0.087 | 18.46 Å | pat 2 |
| 13 | 2.822 | 1.814 | 0.913 | 0.370 | 0.009 | 0.078 | 19.54 Å | pat 3 |

— v12b rollback to E10 EMA — per-module grad clipping, pair_stack 3x LR, tripwire

| Epoch | Val Total | Val FAPE | Val Frame Rot | Val Dist | Val Bond | DDIM TM | DDIM RMSD | Status |
|---|---|---|---|---|---|---|---|---|
| b1 | 2.541 | 1.627 | 0.818 | 0.366 | 0.008 | 0.102 | 16.98 Å | NEW BEST |
| b2 | 2.577 | 1.662 | 0.831 | 0.358 | 0.008 | 0.105 | 16.89 Å | pat 1 |
| b3 | 2.555 | 1.652 | 0.837 | 0.339 | 0.009 | 0.103 | 17.09 Å | pat 2 |
| b4 | 2.648 | 1.680 | 0.870 | 0.386 | 0.009 | 0.107 | 16.88 Å | pat 3 |
| b5 | 2.565 | 1.644 | 0.835 | 0.356 | 0.008 | 0.104 | 17.02 Å | pat 4 |
| b6 | 2.540 | 1.638 | 0.827 | 0.343 | 0.007 | 0.108 | 16.80 Å | pat 5 |
| b7 | 2.641 | 1.668 | 0.852 | 0.396 | 0.008 | 0.113 | 16.47 Å | pat 6 |
| b8 | 2.607 | 1.651 | 0.851 | 0.379 | 0.008 | 0.113 | 16.39 Å | pat 7 |
| b9 | 2.564 | 1.632 | 0.829 | 0.374 | 0.007 | 0.112 | 16.40 Å | pat 8 |
| b10 | 2.521 | 1.630 | 0.813 | 0.341 | 0.006 | 0.113 | 16.42 Å | NEW BEST |
| b11 | 2.553 | 1.633 | 0.829 | 0.362 | 0.006 | 0.110 | 16.52 Å | pat 1 |
| b12 | 2.558 | 1.628 | 0.835 | 0.370 | 0.007 | 0.111 | 16.56 Å | pat 2 |
| b13 | 2.534 | 1.635 | 0.826 | 0.343 | 0.007 | 0.110 | 16.64 Å | pat 3 |
| b14 | 2.500 | 1.606 | 0.814 | 0.349 | 0.006 | 0.107 | 17.13 Å | NEW BEST |
| b15 | 2.480 | 1.586 | 0.807 | 0.391 | 0.006 | 0.112 | 16.62 Å | NEW BEST |
| b16 | 2.469 | 1.564 | 0.780 | 0.347 | 0.006 | 0.112 | 16.62 Å | NEW BEST |
| b17 | 2.506 | 1.608 | 0.815 | 0.355 | 0.006 | 0.110 | 16.88 Å | pat 1 |
| b18 | 2.497 | 1.626 | 0.809 | 0.335 | 0.005 | 0.108 | 16.83 Å | pat 2 |
| b19 | 2.517 | 1.607 | 0.820 | 0.365 | 0.006 | 0.110 | 16.92 Å | pat 3 |
| b20 | 2.446 | 1.578 | 0.802 | 0.334 | 0.005 | 0.103 | 17.53 Å | NEW BEST |
| b21 | 2.482 | 1.589 | 0.797 | 0.362 | 0.006 | 0.104 | 17.54 Å | pat 1 |
| b22 | 2.488 | 1.607 | 0.801 | 0.347 | 0.005 | 0.107 | 17.42 Å | pat 2 |
| b23 | 2.609 | 1.641 | 0.842 | 0.408 | 0.006 | 0.106 | 17.61 Å | pat 3 |
| b24 | 2.450 | 1.589 | 0.799 | 0.329 | 0.006 | 0.102 | 17.80 Å | pat 4 |
| b25 | 2.456 | 1.604 | 0.807 | 0.320 | 0.005 | 0.101 | 18.02 Å | pat 5 |
| b26 | 2.460 | 1.597 | 0.796 | 0.335 | 0.005 | 0.101 | 18.02 Å | pat 6 |
| b27 | 2.548 | 1.635 | 0.839 | 0.364 | 0.005 | 0.105 | 17.69 Å | pat 7 |
| b28 | 2.501 | 1.626 | 0.813 | 0.342 | 0.005 | 0.107 | 17.53 Å | pat 8 |
| b29 | 2.443 | 1.578 | 0.791 | 0.340 | 0.005 | 0.110 | 17.02 Å | NEW BEST |
| b30 | 2.413 | 1.584 | 0.790 | 0.312 | 0.005 | 0.107 | 17.38 Å | NEW BEST |
| b31 | 2.462 | 1.587 | 0.808 | 0.342 | 0.005 | 0.108 | 17.43 Å | pat 1 |
Gradient cosine similarity diagnostic (E10 vs E12 vs E13)

Per-loss gradient directions on shared parameters (denoiser + pair_stack). Negative cosine = direct competition. Positive = aligned.

| Loss Pair | E10 (best) | E12 | E13 | Verdict |
|---|---|---|---|---|
| FAPE vs frame_rot | +0.59 | +0.57 | +0.50 | Aligned |
| dist_mse vs FAPE | +0.16 | +0.14 | +0.41 | Near-orthogonal |
| dist_mse vs frame_rot | +0.24 | +0.00 | −0.02 | Near-orthogonal |
| FAPE vs bond_geom | −0.04 | −0.23 | +0.02 | Near-orthogonal |

Key finding: No gradient competition between any loss pair. FAPE and frame_rot are strongly aligned (+0.5 to +0.6). The E11-E13 regression was caused entirely by gradient starvation of the pair_stack module — not conflicting loss objectives.
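The diagnostic reduces to a cosine similarity between per-loss gradients flattened over the shared parameters (sketch, names ours):

```python
import math

def grad_cosine(g1, g2):
    # Cosine similarity between two flattened gradient vectors:
    # negative = direct competition, ~0 = orthogonal, positive = aligned.
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2)

assert grad_cosine([1.0, 0.0], [0.0, 1.0]) == 0.0             # orthogonal
assert abs(grad_cosine([1.0, 1.0], [2.0, 2.0]) - 1.0) < 1e-9  # aligned
```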

v11 post-mortem (different failure mode)
| Hypothesis | Result |
|---|---|
| Frame confidence starving updates | NO — 57.7% conf > 0.5 |
| Gradients dead at frame_update | NO — highest grad norm (4.017) |
| FAPE gradient reaches frame_update | YES — 13.03 (largest) |
| Learned R used in output | NO — R discarded |

Root cause (v11): x0_pred = t_vec discarded learned R. v11b fix: frame_rotation_loss + fape_loss_with_frames using learned R.

Loss Reference & Targets

| Metric | w | Type | v11b E1 | v11b E5 | v12 E1 | v12 E10 | Target / Interpretation |
|---|---|---|---|---|---|---|---|
| FAPE | 1.0 | L1 | 1.934 | 1.756 | 2.074 | 1.651 | Primary metric. Uses learned R_pred frames. v10 ceiling = 1.31, untrained > 2.0. < 1.0 = correct folds. |
| Frame Rot | 0.5 | 1−cos θ | 1.067 | 0.888 | 1.199 | 0.833 | Angular error of learned R vs Gram-Schmidt truth. 1.0 ≈ 90° (random), 0 = perfect. Target: < 0.5 by E15. |
| Dist MSE | 1.0 | MSE | 0.409 | 0.384 | 0.857 | 0.352 | Pairwise Cα distance error. < 0.1 = sub-Å accuracy. Plateauing at ~0.35–0.40. |
| Bond | 3.0* | MSE | 0.015 | 0.010 | 0.135 | 0.008 | Cα–Cα bond error. *Annealed 1→3. Solved. |
| Rg | 0.5 | MSE | 0.038 | 0.038 | 1.805 | 0.038 | Radius of gyration error. Converged by E3. |
| TM-score | — | DDIM | 0.094 | 0.091 | 0.101 | 0.102 | 50-step DDIM. Target: > 0.15 by E10, > 0.30 by E20. > 0.17 = recognizable folds. |
| RMSD | — | DDIM | 17.22 Å | 17.62 Å | 15.28 Å | 17.09 Å | < 10 Å = partial fold. < 5 Å = high quality. |
Last updated: 2026-03-10
v11b Historical Results (14 epochs, gradient competition collapse at E8 — superseded by v12)

v11b validated the IPA + frame rotation loss design. Best epoch E8: FAPE 1.655, frame_rot 0.830, TM 0.093. After E8, gradient competition between dist_mse and frame_rot through shared 128-dim single representation caused collapse — TM dropped to 0.061, frame_rot reverted to 0.926. Stopped at E14.

| Epoch | Val Total | Val FAPE | Frame Rot | Val Dist | Val Bond | DDIM TM | Status |
|---|---|---|---|---|---|---|---|
| 1 | 3.057 | 1.934 | 1.067 | 0.409 | 0.015 | 0.094 | BEST |
| 5 | 2.736 | 1.756 | 0.888 | 0.384 | 0.010 | 0.091 | BEST |
| 8 | 2.580 | 1.655 | 0.830 | 0.351 | 0.009 | 0.093 | BEST (peak) |
| 11 | 2.797 | 1.813 | 0.880 | 0.374 | 0.008 | 0.079 | pat 3 |
| 14 | 2.973 | 1.898 | 0.926 | 0.423 | 0.011 | 0.061 | pat 6 (stopped) |
v11 Historical Results (15 epochs, no frame rotation supervision — superseded)

v11 used the same IPA architecture but had a critical bug: x0_pred = t_vec discarded learned rotation matrices R. FAPE peaked at 1.818 (E10) then degraded to 2.052 (E15). DDIM TM-score collapsed from 0.095 to 0.046.

V11 diffusion training loss curves (historical)
| Epoch | Val Total | Val FAPE | Val Dist | Val Bond | DDIM TM | Status |
|---|---|---|---|---|---|---|
| 1 | 4.338 | 1.950 | 1.163 | 0.171 | 0.099 | BEST |
| 3 | 2.741 | 1.988 | 0.479 | 0.022 | 0.128 | BEST |
| 6 | 2.558 | 1.865 | 0.448 | 0.019 | 0.095 | BEST |
| 10 | 2.431 | 1.818 | 0.382 | 0.015 | 0.087 | BEST (peak) |
| 13 | 2.539 | 1.919 | 0.380 | 0.016 | 0.053 | pat 3 |
| 15 | 2.726 | 2.052 | 0.411 | 0.019 | 0.046 | pat 5 (stopped) |
v10 Historical Results (21 epochs, EGNN — superseded)

v10 used an 8-layer EGNN denoiser (14.6M params). After 21 epochs: dist_mse 60% below random (0.218), bond essentially solved (0.006), but FAPE stuck at random (~1.31) and TM-score peaked at 0.131. Best val structural = 0.805 (E16). DDIM best: TM=0.131, RMSD=14.53Å.

V10 diffusion training loss curves (historical)

Encoder Variants — Full Comparison

All Encoder Models

| Model | Params | SS Acc | Contact F1 | Contact P | Contact R | Dist Acc | MLM Acc | Class Acc | Val Loss |
|---|---|---|---|---|---|---|---|---|---|
| v1 original | 1.2M | 46.5% (probe) | 0.498 | 0.367 | 0.773 | | | 66.3% | 3.988 |
| v2 original | 13.3M | 78.7% | 0.508 | 0.380 | 0.770 | 62.5% | 19.2% | 69.2% | 6.055 |
| v2 small E15 | ~5M | 78.7% | 0.521 | 0.398 | 0.753 | 62.5% | 19.6% | 68.1% | 6.048 |
| v2 reg E8 | ~13.3M | 78.6% | 0.491 | 0.359 | 0.777 | 62.2% | 19.0% | 67.6% | 6.053 |
| v2 sm+reg E3 | ~5M | 78.6% | 0.523 | 0.409 | 0.724 | 61.6% | 17.2% | 50.8% | 6.676 |

Per-Task Best

Contact F1
0.508
v2 original (E16)
SS Accuracy
78.7%
v2 original — v1 baseline: 46.5%
CATH Class Acc
69.2%
v2 original (E20 equiv)
Distance Acc
62.5%
v2 original (v2 only)
MLM Accuracy
19.2%
v2 original (v2 only)
Best Val Loss
3.988
v1 (3 tasks) — v2: 6.055 (6 tasks)

V2 Architecture Comparison

Encoder v2 variant comparison

V2 variants: architecture sweep (size vs regularization). All hit SS=78.7%, F1~0.52 ceiling.


V3 Sweep — Breaking the Ceiling with Better Training Objectives

Same 5M architecture, 10 variants testing: span masking (mask 5-15 contiguous residues), contrastive learning (InfoNCE on CATH topology), distance regression (smooth L1 on real distances), SS segment masking (mask entire helices/sheets). span_dist broke through the v2 F1 ceiling!

V3 Sweep Results

V3 sweep comparison

Row 1: F1 curves, val loss, precision. Row 2: recall, best F1 bars, latest F1 bars. Row 3: improvement over v2, objective matrix heatmap (sorted by best F1), key findings.

Contrastive Learning = Key Ingredient
All top variants use contrastive loss (InfoNCE on CATH topology). It pulls proteins with the same fold closer in embedding space, giving the pair stack explicit structural signal that binary contacts alone cannot provide. Combined with distance regression, it produces the best F1 (0.567).
Distance Regression > Binned Distance
Predicting exact CA-CA distances (smooth L1) gives richer gradient signal than 32-bin ordinal classification. Every variant that includes dist_reg shows improved precision — the pair head learns to be more selective about which contacts to predict. Distance MAE dropped from 4.5Å to 2.5Å in 8 epochs.
SS Masking = Best Generalizer
Masking entire secondary structure elements (helices/sheets) forces the model to reconstruct spatial patterns from incomplete information. Best val loss (5.94) but F1 oscillates — the augmentation may be too aggressive for 18K proteins. Would likely shine with more data.
Span Masking Helps Loss, Not F1
Masking 5-15 contiguous residues (vs random tokens) produces the best generalizing model by val loss, but doesn’t directly improve contact prediction. The benefit is in richer single representations — useful when these embeddings feed into the diffusion model’s pair stack.

Encoder Analysis — UMAPs, Contact Maps, SS Probes

Encoder analysis: UMAPs, contact maps, SS probes

4 models × 4 panels: UMAP by CATH class, UMAP by SS composition, predicted vs true contact map, SS linear probe accuracy.

Are these encoders good enough for structure prediction?

Yes. All encoders achieve ~78.5–78.8% SS accuracy from a single linear probe — well above the 60% threshold. The SS information is encoded in the single representations. The real differentiator between encoders is contact map quality (F1) and UMAP fold clustering, which determine whether the pair stack can extract long-range structural relationships. The v3_contr_dist encoder (F1=0.567, contrastive-trained) provides the richest representations for v16.


| Variant | New Objectives | Epochs | Best F1 | Val Loss | vs V2 Ceiling | Status |
|---|---|---|---|---|---|---|
| 7. contr_dist | contrastive + dist regression | E4 | 0.567 | 6.764 | +4.7pp | LEADER |
| 6. span_dist | span mask + dist regression | E8 | 0.560 (E3) | 6.325 | +4.0pp | sustained |
| 10. full | span + contr + dist + ss_mask | E4 | 0.557 (E3) | 7.175 | +3.7pp | surging |
| 8. all_three | span + contr + dist reg | E4 | 0.542 (E4) | 6.857 | +2.2pp | improving |
| 9. ss_mask | SS segment masking | E8 | 0.590 (E1) | 5.938 | +7.0pp (E1) | best loss, F1 oscillates |
| 1. baseline | (control) | E8 | 0.563 (E1) | 6.277 | +4.3pp (E1) | declining |
| 3. contrastive | contrastive only | E4 | 0.504 (E4) | 6.696 | −1.6pp | improving |
| 4. dist_reg | distance regression only | E8 | 0.494 (E8) | 6.396 | −2.6pp | slow climb |
| 2. span | span masking only | E8 | 0.506 (E1) | 6.110 | −1.4pp | best generalizer |

Variant Training Progress (last 5 epochs each)

v2 small (5M, dim=192, 4 towers, drop=0.30)

| Epoch | Val Loss | SS Acc | Contact F1 | Dist Acc | MLM Acc | Class Acc | Status |
|---|---|---|---|---|---|---|---|
| 11 | 6.234 | 78.6% | 0.458 | 62.3% | 18.6% | 67.4% | pat 2 |
| 12 | 6.197 | 78.6% | 0.498 | 62.4% | 19.6% | 67.1% | pat 3 |
| 13 | 6.189 | 78.7% | 0.504 | 62.5% | 19.1% | 67.3% | pat 4 |
| 14 | 6.221 | 78.7% | 0.495 | 62.5% | 19.1% | 68.1% | pat 5 |
| 15 | 6.383 | 78.7% | 0.505 | 62.5% | 19.6% | 68.1% | pat 6 |

v2 regularized (13M, dim=256, 6 towers, drop=0.45, wd=0.10)

| Epoch | Val Loss | SS Acc | Contact F1 | Dist Acc | MLM Acc | Class Acc | Status |
|---|---|---|---|---|---|---|---|
| 4 | 6.332 | 78.6% | 0.426 | 62.2% | 19.0% | 63.2% | best |
| 5 | 6.286 | 78.3% | 0.491 | 62.1% | 18.2% | 63.8% | best |
| 6 | 6.089 | 78.7% | 0.471 | 62.3% | 19.1% | 67.3% | best |
| 7 | 6.098 | 78.7% | 0.462 | 62.2% | 19.0% | 68.1% | pat 1 |
| 8 | 6.053 | 78.6% | 0.462 | 62.2% | 18.7% | 67.6% | best |

v2 small+reg (5M, dim=192, 4 towers, drop=0.40, wd=0.08) — 1080ti

| Epoch | Val Loss | SS Acc | Contact F1 | Dist Acc | MLM Acc | Class Acc | Status |
|---|---|---|---|---|---|---|---|
| 1 | 7.104 | 78.6% | 0.488 | 61.4% | 12.7% | 41.8% | best |
| 2 | 6.780 | 78.6% | 0.523 | 61.6% | 17.2% | 50.8% | best |
| 3 | 6.676 | 78.6% | 0.467 | 62.0% | 18.0% | 56.1% | best |

New Variants

small TRAINING
dim=192, 4 towers, ~5M params
dropout=0.30
regularized TRAINING
dim=256, 6 towers, ~13M params
dropout=0.45, wd=0.10
small+reg TRAINING
dim=192, 4 towers, ~5M params
dropout=0.40, wd=0.08

Encoder V2 Original — 6-Task Multi-Scale Encoder

Encoder V2

Training
Architecture ContactClassifierV2 — 13.3M params, dim=256, 6 transformer towers, 8 heads, d_pair=128, 4 contact prediction blocks with outer product mean
Tasks (6) Contact BCE, CATH Class (4-way), CATH Architecture (35-way), SS Prediction (3-state), Distance Distribution (32 bins), Masked LM (15%)
Max Length 600 residues (was 256 in v1)

Training Progress

| Epoch | Total Loss | Class Acc | SS Acc | Contact F1 | Dist Acc | MLM Acc | Status |
|---|---|---|---|---|---|---|---|
| 1 | 7.029 | 46.5% | 78.6% | 0.392 | 61.8% | 16.4% | NEW BEST |
| 2 | 6.485 | 56.4% | 78.6% | 0.401 | 62.1% | 18.1% | NEW BEST |
| 3 | 6.055 | 67.4% | 78.6% | 0.442 | 62.3% | 18.3% | NEW BEST |
| 4 | 6.205 | 59.7% | 78.5% | 0.474 | 62.4% | 18.8% | NEW BEST |

Training Tasks (6)

1. Contact BCE
8Å Cα–Cα binary contact map prediction
2. CATH Class
4-way CE — protein structural class
3. CATH Architecture
35-way CE — fold architecture
4. SS Prediction
3-state H/E/C per-residue secondary structure from phi/psi
5. Distance Distribution
32 bins, 2–20Å — ordinal distance prediction between pairs
6. Masked LM
BERT-style, 15% masking — self-supervised sequence understanding

Encoder V1 — Contact Classifier (Stage 1)

Contact Classifier (Stage 1 — Multi-task Encoder)

Complete
Goal: Train a transformer encoder from scratch on CATH 4.2 dataset (18k proteins) to jointly predict CATH class/architecture labels AND inter-residue contact maps from sequence alone. The learned embeddings encode spatial proximity information needed for Stage 2.
Architecture ContactClassifier — 1.2M params, dim=128, 2 transformer towers, d_pair=64, 2 contact prediction blocks with outer product mean
Training Single GPU (1080 Ti), batch_size=24, lr=2e-4 with warmup cosine schedule, patience=15 early stopping
Resilience Per-epoch checkpoints with auto-resume, CSV loss logging, self-resubmitting watchdog system on SLURM
Training Progress (all 25 epochs — early stopped)
| Epoch | Val Total Loss | Train Class Acc | Val Class Acc | Train Arch Acc | Contact Recall (Val) | Contact BCE (Val) | LR | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | 4.846 | 47.5% | 41.8% | 16.1% | 69.6% | 0.759 | 3.33e-05 | NEW BEST |
| 2 | 4.626 | 58.2% | 54.9% | 25.4% | 72.3% | 0.725 | 6.67e-05 | NEW BEST |
| 3 | 4.454 | 65.9% | 53.8% | 33.8% | 73.9% | 0.702 | 1.00e-04 | NEW BEST |
| 4 | 4.356 | 70.0% | 58.4% | 39.2% | 73.8% | 0.693 | 1.33e-04 | NEW BEST |
| 5 | 4.354 | 72.9% | 60.7% | 42.4% | 73.7% | 0.691 | 1.67e-04 | NEW BEST |
| 6 | 4.220 | 74.4% | 65.1% | 43.7% | 75.5% | 0.668 | 2.00e-04 | NEW BEST |
| 7 | 4.321 | 75.9% | 64.8% | 44.5% | 75.9% | 0.686 | 2.00e-04 | pat 1 |
| 8 | 3.994 | 77.3% | 65.5% | 46.4% | 77.4% | 0.660 | 1.99e-04 | NEW BEST |
| 9 | 3.998 | 77.2% | 66.0% | 47.0% | 78.0% | 0.665 | 1.98e-04 | pat 1 |
| 10 | 3.988 | 78.1% | 66.3% | 47.8% | 77.3% | 0.659 | 1.97e-04 | BEST (final) |
| 11 | 4.199 | 78.9% | 66.1% | 48.5% | 78.0% | 0.655 | 1.96e-04 | pat 1 |
| 12 | 4.073 | 79.4% | 68.9% | 49.1% | 78.7% | 0.652 | 1.94e-04 | pat 2 |
| 13 | 4.128 | 79.5% | 66.8% | 49.9% | 77.4% | 0.652 | 1.92e-04 | pat 3 |
| 14 | 4.014 | 80.1% | 67.8% | 50.1% | 77.2% | 0.653 | 1.89e-04 | pat 4 |
| 15 | 4.083 | 80.4% | 65.6% | 50.7% | 78.1% | 0.652 | 1.87e-04 | pat 5 |
| 16 | 4.010 | 81.3% | 68.8% | 51.5% | 77.1% | 0.666 | 1.84e-04 | pat 6 |
| 17 | 4.139 | 80.9% | 66.3% | 52.1% | 77.8% | 0.659 | 1.80e-04 | pat 7 |
| 18 | 4.062 | 81.8% | 68.1% | 52.8% | 77.5% | 0.659 | 1.77e-04 | pat 8 |
| 19 | 4.134 | 81.7% | 66.0% | 53.4% | 77.8% | 0.654 | 1.73e-04 | pat 9 |
| 20 | 4.208 | 82.2% | 69.2% | 53.8% | 77.5% | 0.660 | 1.69e-04 | pat 10 |
| 21 | 4.112 | 82.4% | 68.9% | 54.5% | 77.5% | 0.653 | 1.64e-04 | pat 11 |
| 22 | 4.138 | 83.3% | 68.6% | 55.1% | 78.0% | 0.650 | 1.60e-04 | pat 12 |
| 23 | 4.208 | 82.8% | 67.6% | 55.2% | 77.3% | 0.649 | 1.55e-04 | pat 13 |
| 24 | 4.169 | 83.1% | 68.9% | 55.6% | 77.0% | 0.654 | 1.50e-04 | pat 14 |
| 25 | 4.358 | 83.8% | 65.1% | 56.4% | 78.0% | 0.652 | 1.45e-04 | EARLY STOP |

Final Results

  • Best val loss: 3.988 at epoch 10 (best weights saved)
  • Val class accuracy: 66.3% (4-way CATH class), Val architecture accuracy: 31.8% (38+ architectures)
  • Contact P=0.367, R=0.773, F1=0.498 at best checkpoint — model successfully learned spatial proximity from sequence
  • Early stopped at epoch 25 (patience 15) — val loss plateaued after epoch 10
  • 1.2M params, trained from scratch on CATH 4.2 (18k proteins)

Training Curves

Contact classifier training curves

Contact Map Predictions (3 test proteins)

Ground truth vs predicted contact maps

Each row shows a held-out test protein from a different CATH structural class. The left column is the ground truth contact map (binary: two Cα atoms < 8Å apart), and the right column is the model’s predicted probability of contact from sequence alone. Metrics (precision P, recall R, and Top-L long-range accuracy) are annotated on each prediction panel.

  • 1bf0.A (L=60, Few Secondary Structure): A small protein with sparse, irregular contacts. The model captures the overall topology despite limited structural regularity.
  • 3ggm.A (L=81, Mainly Beta): Beta-sheet proteins produce characteristic off-diagonal block patterns from strand–strand hydrogen bonding. The model recovers these long-range parallel and anti-parallel strand pairings well.
  • 1f9x.A (L=120, Mainly Alpha): Alpha-helical proteins show strong banded diagonal patterns from helix-internal i→i+4 contacts. The model reproduces both the local helical periodicity and inter-helix contacts at larger separations.

Learned Embedding Space (PCA & UMAP)

PCA and UMAP of learned protein embeddings colored by CATH class and architecture

Attention-pooled protein embeddings (128-dim) from the encoder’s val+test set, projected via PCA and UMAP. The encoder learns to separate CATH classes without explicit contrastive loss — mainly-alpha and mainly-beta proteins form distinct clusters, while alpha-beta proteins span the intermediate region. UMAP reveals finer sub-structure at the architecture level, with several CATH architectures forming tight, well-separated clusters (e.g., 3.40 Rossmann fold, 1.10 orthogonal bundle). This confirms the multi-task training objective (classification + contact prediction) produces structurally meaningful representations suitable for conditioning the downstream diffusion model.

Model Evolution Timeline

v8
Foundation: EGNN Denoiser + Contact Conditioning
First working diffusion model. 8-layer EGNN denoiser with contact-conditioned pair stack and outer product mean. Established the core pipeline: frozen ContactClassifier encoder → pair representation → coordinate denoiser. Learned basic protein compactness (dist_mse well below random) but struggled with topology.
Key learning: Contact conditioning works. Pair stack provides meaningful structural signal.
v9
Loss Function Engineering
Introduced chirality loss (signed volumes), bond angle loss, and steric clash penalty. Discovered clash loss was catastrophically miscalibrated in Rg-normalized coordinates (3.0A threshold mapped to 30A in real space, penalizing everything). Disabled clash, stabilized training.
Key learning: Loss calibration in normalized coordinate space is critical. Removed clash, added bond annealing.
v10
Scaling Up EGNN + Hitting the Ceiling
14.6M params, 8-layer EGNN. Fixed-scale coordinates (divide by 10A instead of per-protein Rg). Self-conditioning. After 21 epochs: dist_mse 60% below random, bonds solved, but FAPE stuck at random baseline (~1.31) and TM-score peaked at 0.131. Diagnosed the root cause: EGNN has no concept of local reference frames, so it cannot optimize frame-aligned metrics.
Key learning: Distance-based denoisers hit a hard ceiling. Local reference frames are required for topology.
v11
IPA Architecture (Bug: Rotations Discarded)
Replaced EGNN with Invariant Point Attention (AlphaFold2-style). 6 IPA layers, 4 heads, 4 query points. FAPE improved to 1.818 (first time below random). But discovered a critical bug: x0_pred = t_vec discarded learned rotation matrices R. FAPE rebuilt frames via Gram-Schmidt, so rotations got no direct gradient. After E10, FAPE degraded to 2.052, TM collapsed to 0.046.
Key learning: Learned rotations must participate in the loss. Gram-Schmidt rebuilding is not enough.
v11b
Frame Rotation Loss (The v11b Fix)
Added frame_rotation_loss: direct angular distance (1 - cos theta) between learned R and Gram-Schmidt ground truth. FAPE now uses learned R_pred instead of rebuilt frames. LR halved to 2e-5. Best at E8: FAPE 1.655, frame_rot 0.830. Then gradient competition between dist_mse and frame_rot through the shared 128-dim single representation caused collapse — TM dropped to 0.061 by E14.
Key learning: Frame rotation loss works, but 128-dim bottleneck causes gradient competition. Need more capacity.
v12
4.1x Capacity Scaling
31.1M params. d_ipa_hidden: 256→512, 8 heads, 8 query points, 2-layer FrameUpdate MLP. Resolved the gradient competition by giving the model enough capacity to represent both distance and frame information without interference. Best E10: val_total 2.581, FAPE 1.651, frame_rot 0.833. Then E11-E13 regressed due to gradient starvation.
Key learning: Capacity scaling works. But global grad clipping lets the denoiser (28.5M params) consume 99%+ of gradient budget, starving the pair stack.
v12b
Per-Module Gradient Clipping + Breakthrough
Rolled back to v12 E10 EMA weights. Applied per-module gradient clipping (denoiser=1.0, pair_stack=0.5), pair_stack 3x LR multiplier, gradient tripwire system, atomic checkpoint saves. 31 epochs trained with 6 NEW BESTs (b14, b15, b16, b20, b29, b30). Confirmed the "consolidate → breakthrough" pattern: long patience plateaus followed by sudden improvements.
Best result: b30 — val_total 2.413, FAPE 1.584, frame_rot 0.790, dist_mse 0.312. All-time record.
v13
Deeper Pair Stack + Triangle Updates for Beta-Sheet Learning
The highest-impact architectural change. v12 learns helices but not beta sheets because its pair stack (4 blocks, max_rel_pos=32) cannot propagate information between residues separated by >32 positions — exactly where sheet contacts live. v13 doubles the pair stack to 8 blocks with Evoformer-style triangle multiplicative updates, replaces linear-clipped RPE with 128 log-scaled bins covering 0-512 residues, and adds classifier-free guidance training (10% conditioning dropout). Initialized from v12b EMA weights.
Status: Training in progress (E12, pat 8/15). E4 best — val_total 2.373, FAPE 1.546, frame_rot 0.771. Surpasses v12b b16 frame_rot (0.780) after only 4 epochs. LR constant at 2e-5; decay starts at E20.

Architecture & Loss Function Details

How the Architecture Evolved

Mini-Fold’s current architecture (~35–38M parameters) is the result of fourteen iterations, each driven by a specific failure mode we needed to understand and fix. What follows is the story of those decisions — not a changelog, but the reasoning that shaped the model.

The Topology Ceiling: Why We Abandoned Equivariant Graph Networks

We started with an 8-layer SE(3)-equivariant graph neural network (EGNN) as the denoiser. The appeal was obvious: EGNN is equivariant by construction, meaning it respects physical symmetries without needing data augmentation. After months of training across several iterations, we hit a hard wall. The model learned that “proteins are compact blobs of the right size” — distance MSE dropped 60% below random baseline, bond lengths converged — but FAPE stayed stuck at ~1.31 (the random baseline) and TM-scores never exceeded 0.131.

The root cause was architectural: EGNN passes messages based on pairwise distances and updates coordinates through distance-weighted vectors. It has no concept of local reference frames. FAPE measures frame-aligned point error — how well the structure looks from each residue’s perspective — and EGNN has no inductive bias to optimize this. The model could get global shape roughly right but could never learn which helix packs against which sheet, which loop connects where.

The IPA Pivot: Learning to Think in Local Frames

The breakthrough came when we replaced EGNN with Invariant Point Attention (IPA), the structure module from AlphaFold2 (Jumper et al., Nature 2021). Each of the 8 IPA layers performs three operations:

  1. Invariant Point Attention — multi-head attention on the single representation, augmented with pair bias and 3D point attention. Each head generates query/key/value points in \(\mathbb{R}^3\) transformed into each residue’s local frame. Attention weights depend on geometric distances between learned points — invariant to global rotation/translation.
  2. Transition MLP — 2-layer feedforward on the single representation.
  3. Frame update — predicts a small quaternion + translation update per residue, composed in the local frame (right-multiplication for SE(3) equivariance). Initialized near-zero so frames are approximately preserved in early training.
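The frame-update step above can be sketched in a few lines of PyTorch. This is illustrative, not the actual Mini-Fold code: the function names, the `(b, t)` parameterization of the update vector, and the six-dimensional output layout are assumptions; the key ideas it demonstrates are the unnormalized quaternion `(1, bx, by, bz)` (so a zero-initialized linear layer yields the identity update) and the right-multiplication that composes the update in each residue's local frame.

```python
import torch

def quat_to_rotmat(q):
    """Convert quaternions (..., 4) in [w, x, y, z] order to rotation matrices (..., 3, 3)."""
    q = q / q.norm(dim=-1, keepdim=True)
    w, x, y, z = q.unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(*q.shape[:-1], 3, 3)

def frame_update(R, t, update_vec):
    """Compose a small predicted SE(3) update into each residue's local frame.

    R: (..., 3, 3) current rotations; t: (..., 3) translations.
    update_vec: (..., 6) = [bx, by, bz, tx, ty, tz] from a near-zero-init layer.
    Quaternion (1, bx, by, bz) is normalized, so zero output = identity update.
    """
    b, dt = update_vec[..., :3], update_vec[..., 3:]
    q = torch.cat([torch.ones_like(b[..., :1]), b], dim=-1)
    dR = quat_to_rotmat(q)
    R_new = R @ dR                                   # right-multiply: update in local frame
    t_new = t + (R @ dt.unsqueeze(-1)).squeeze(-1)   # translate along local axes
    return R_new, t_new
```

With zero input the frame is exactly preserved, which is why near-zero initialization keeps early training stable.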

This was the single most important architectural decision in Mini-Fold’s development. IPA gave the model the ability to reason about local geometry — the exact thing FAPE demands. Within a few epochs, FAPE began dropping below the random baseline for the first time.

The Hidden Bug That Blocked Topology Learning

But IPA alone was not enough. Our first IPA-based model improved distances and bonds, then collapsed around epoch 14. After extensive debugging, we found a single-line code error: x0_pred = t_vec was silently discarding the learned rotation matrices from IPA. The FAPE loss was being computed using Gram-Schmidt-rebuilt frames from predicted coordinates, so the rotation parameters received zero gradient. The model could learn where atoms should be, but never learned how residues should be oriented.

The fix was twofold: use the IPA-predicted rotations directly in FAPE computation, and add an explicit frame rotation loss (\(\mathcal{L}_{\text{rot}} = 1 - \cos\theta\)) that provides direct angular supervision. This single change — adding rotation loss — was the most important insight in Mini-Fold’s development. It unlocked real topology learning: FAPE dropped to 1.655 and frame_rot to 0.830 within 8 epochs.

Scaling Up: From 8M to 31M Parameters

With the rotation bug fixed, the next bottleneck was capacity. The model showed gradient competition in the shared 128-dimensional single representation — the denoiser, pair stack, and auxiliary heads were all fighting for the same limited bandwidth. We scaled the model roughly 4x (8.4M to 31.1M parameters): d_ipa_hidden from 256 to 512, 8 attention heads, 8 query points, and a 2-layer FrameUpdate MLP. This resolved the gradient competition and set a new record: validation total loss of 2.413, FAPE 1.584, frame_rot 0.790.

Getting there required two training innovations. First, per-module gradient clipping: the denoiser clips at max_norm=1.0, the pair stack at 0.5. Without this, the pair stack’s gradients starved as the larger denoiser dominated updates. Second, a 3x learning rate multiplier for the pair stack, ensuring this smaller module could keep pace with the denoiser.


The Beta-Sheet Problem and the Current Architecture

Why Alpha-Helices Were Easy and Beta-Sheets Were Impossible

After the scaling breakthrough, we had record structural metrics — but visual inspection of generated structures revealed a systematic failure. The model produced beautiful alpha-helices but could not form beta-sheets. The diagnosis pointed to one number: max_rel_pos=32.

Our pair stack encoded relative position as a linear-clipped embedding with 65 bins covering separations from -32 to +32 residues. Alpha-helix hydrogen bonds connect residues i and i+4 — well within this window. But beta-sheet hydrogen bonds typically connect residues 20 to 100+ positions apart in sequence. Any pair separated by more than 32 positions had zero relative position signal. The model was literally blind to the long-range contacts that define sheet topology.

Triangle Multiplicative Updates: Teaching the Model Transitivity

The current architecture (~35–38M params) makes three targeted changes to solve the beta-sheet problem. The proven IPA denoiser, auxiliary pair stack, R\(_g\) predictor, and distance head are all warm-started from the previous best checkpoint. The main pair stack is entirely new and trains from scratch.

The first change doubles the pair stack from 4 to 8 EnhancedPairBlock blocks, each now equipped with Evoformer-style triangle multiplicative updates (outgoing + incoming) before the existing row/column axial attention + FFN. The intuition behind triangle updates is transitivity: “if residue i contacts residue k, and residue j contacts residue k, then i and j are structurally related.” This is exactly how beta-sheet topology works — two strands share contacts through connecting loop residues.

Implementation: each triangle update projects the pair representation to gate/value tensors (tri_mul_dim=64), computes einsum('bikd,bjkd->bijd') (outgoing) or 'bkid,bkjd->bijd' (incoming), then projects back to d_pair=128. The einsum is forced to fp32 to prevent fp16 overflow from the L-dimensional accumulation. Output is clamped to [-1e4, 1e4] and returned as a residual delta to avoid catastrophic cancellation.
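A self-contained sketch of such a triangle multiplicative update follows. The class name, the gating layout, and the layer-norm placement are assumptions (loosely modeled on the Evoformer block); the einsum strings, the `tri_mul_dim`/`d_pair` sizes, the fp32 cast, and the output clamp match the description above.

```python
import torch
import torch.nn as nn

class TriangleMultiplicativeUpdate(nn.Module):
    """Evoformer-style triangle multiplicative update (illustrative sketch).

    mode='outgoing' combines edges i->k and j->k; 'incoming' combines k->i and k->j.
    Returns a residual delta to be added to the pair representation by the caller.
    """
    def __init__(self, d_pair=128, d_hidden=64, mode="outgoing"):
        super().__init__()
        self.eq = "bikd,bjkd->bijd" if mode == "outgoing" else "bkid,bkjd->bijd"
        self.norm = nn.LayerNorm(d_pair)
        self.a_proj = nn.Linear(d_pair, d_hidden)    # left edge values
        self.b_proj = nn.Linear(d_pair, d_hidden)    # right edge values
        self.a_gate = nn.Linear(d_pair, d_hidden)
        self.b_gate = nn.Linear(d_pair, d_hidden)
        self.out_norm = nn.LayerNorm(d_hidden)
        self.out_proj = nn.Linear(d_hidden, d_pair)
        self.out_gate = nn.Linear(d_pair, d_pair)

    def forward(self, z):                            # z: (B, L, L, d_pair)
        z = self.norm(z)
        a = torch.sigmoid(self.a_gate(z)) * self.a_proj(z)
        b = torch.sigmoid(self.b_gate(z)) * self.b_proj(z)
        # Force fp32: the sum over the L dimension can overflow fp16 (max 65504).
        x = torch.einsum(self.eq, a.float(), b.float())
        x = x.clamp(-1e4, 1e4).to(z.dtype)           # guard against blow-ups
        return torch.sigmoid(self.out_gate(z)) * self.out_proj(self.out_norm(x))
```

Note the update mixes information through every intermediate residue k, which is exactly the transitivity argument made above.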

Log-Scaled Position Encoding: Seeing the Whole Protein

The second change replaces the linear-clipped relative position encoding (65 bins, max separation 32) with 128 log-spaced bins covering separations from 0 to 512 residues. The encoding is sign-aware (separate embeddings for upstream and downstream). The bin spacing reflects a simple biological insight:

  • Bins 0–8: Linear spacing (1-residue resolution) for helix contacts (i+3, i+4)
  • Bins 8–128: Log-spaced for long-range sheet contacts (i+20 to i+512)

Local structure needs fine-grained resolution; long-range contacts just need to be distinguishable from each other. Additionally, 32-dimensional sinusoidal continuous RPE features (sin/cos encoding projected to d_pair) provide smooth interpolation between discrete bins.
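One way to implement the binning described above is sketched below. The function name and the exact bin boundaries are assumptions; what it shows is the hybrid scheme: one bin per residue up to the linear cutoff, then log-spaced bins out to the maximum separation, with the sign returned separately for the sign-aware embedding tables.

```python
import torch

def relpos_bins(L, n_bins=128, linear_cutoff=8, max_sep=512):
    """Map signed residue separations to hybrid linear/log bin indices (a sketch).

    |sep| <= linear_cutoff gets one bin per residue (helix resolution);
    larger separations share log-spaced bins up to max_sep. The sign is
    returned separately so upstream/downstream can use distinct embeddings.
    """
    idx = torch.arange(L)
    sep = idx[None, :] - idx[:, None]                 # (L, L) signed separation
    mag = sep.abs().clamp(max=max_sep).float()
    lin = mag.long()                                  # fine bins for local contacts
    # log-spaced fraction between linear_cutoff and max_sep
    lo = torch.tensor(float(linear_cutoff)).log()
    hi = torch.tensor(float(max_sep)).log()
    log_frac = (mag.clamp(min=linear_cutoff).log() - lo) / (hi - lo)
    log_bin = linear_cutoff + (log_frac * (n_bins - 1 - linear_cutoff)).long()
    bins = torch.where(mag <= linear_cutoff, lin, log_bin)
    return bins, sep.sign()
```

For a 600-residue protein this yields bin 4 for an i, i+4 helix contact and the top bin for any separation at or beyond 512, so no pair is ever outside the encoding's range.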

Classifier-Free Guidance: Conditional and Unconditional Generation

The third change adds classifier-free guidance (CFG) training. 10% of training batches (p_uncond=0.1) replace residue tokens with MASK tokens (token ID 1), preserving CLS/EOS/PAD structure so attention masks remain valid. This trains the model for both conditional and unconditional generation. At sampling time, CFG enables guided generation: \(\varepsilon = \varepsilon_{\text{uncond}} + w \cdot (\varepsilon_{\text{cond}} - \varepsilon_{\text{uncond}})\).

A subtle bug we caught early: our first implementation used PAD (token 0) as the null token. Since the attention mask is computed as ids.ne(PAD), this produced an all-False mask, creating degenerate pair representations that caused NaN. Using MASK (token 1) instead preserves valid masks.
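The two CFG pieces, conditioning dropout at train time and guided combination at sampling time, can be sketched as follows. Only PAD=0 and MASK=1 come from the text; the CLS/EOS ids and function names here are hypothetical.

```python
import torch

PAD, MASK = 0, 1          # from the text; CLS/EOS ids below are assumptions
CLS, EOS = 2, 3

def drop_conditioning(ids, p_uncond=0.1):
    """Replace residue tokens with MASK for a random fraction of batches.

    Using MASK (not PAD) keeps the attention mask ids.ne(PAD) valid;
    CLS/EOS/PAD positions are left untouched.
    """
    drop = torch.rand(ids.shape[0], device=ids.device) < p_uncond
    is_residue = (ids != PAD) & (ids != CLS) & (ids != EOS)
    return torch.where(drop[:, None] & is_residue, torch.full_like(ids, MASK), ids)

def cfg_combine(eps_cond, eps_uncond, w=2.0):
    """Guided prediction at sampling: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

With w=0 this reduces to unconditional sampling, w=1 to plain conditional sampling, and w>1 amplifies the conditioning signal.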


Key Design Patterns

Several design choices emerged through hard-won lessons across multiple iterations. They now form the foundation that all architecture variants build on.

SNR-Gated Frame Initialization

Per-residue rigid frames are built from noised C\(\alpha\) coordinates via Gram-Schmidt orthogonalization on consecutive backbone triplets. We discovered early on that at high noise (SNR < 1.0, roughly t > 700), the noised coordinates are near-isotropic and Gram-Schmidt is numerically unstable. Without intervention, the first IPA layer receives garbage frames, producing cascading errors through all 8 layers. Our solution smoothly blends toward identity frames as noise increases:

$$\text{conf}(t) = \text{clamp}\!\left(\frac{\text{SNR}(t) - 0.2}{1.0 - 0.2},\, 0,\, 1\right), \qquad R_{\text{init}} = \text{slerp}(I,\, R_{\text{GS}},\, \text{conf})$$

This was a major source of training instability before we identified and fixed it.
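Because one slerp endpoint is the identity, the blend above reduces to scaling the rotation angle: slerp(I, R, conf) = R raised to the power conf. A minimal fp32 sketch (function names are illustrative; axis-angle extraction plus Rodrigues' formula is one of several equivalent implementations):

```python
import torch

def snr_confidence(snr, lo=0.2, hi=1.0):
    """conf(t) = clamp((SNR - 0.2) / (1.0 - 0.2), 0, 1)."""
    return ((snr - lo) / (hi - lo)).clamp(0.0, 1.0)

def slerp_from_identity(R, conf, eps=1e-6):
    """slerp(I, R, conf): scale R's rotation angle by conf (i.e. R^conf).

    R: (..., 3, 3), conf broadcastable to (...). fp32 assumed: the small
    epsilons here underflow in fp16 (min positive normal ~6e-5).
    """
    trace = R.diagonal(dim1=-2, dim2=-1).sum(-1)
    theta = torch.acos(((trace - 1) / 2).clamp(-1 + eps, 1 - eps))
    # rotation axis from the skew-symmetric part of R
    skew = (R - R.transpose(-1, -2)) / (2 * torch.sin(theta)[..., None, None] + eps)
    w = torch.stack([skew[..., 2, 1], skew[..., 0, 2], skew[..., 1, 0]], dim=-1)
    a = theta * conf                                  # scaled rotation angle
    K = torch.zeros_like(R)
    K[..., 0, 1], K[..., 0, 2] = -w[..., 2], w[..., 1]
    K[..., 1, 0], K[..., 1, 2] = w[..., 2], -w[..., 0]
    K[..., 2, 0], K[..., 2, 1] = -w[..., 1], w[..., 0]
    I = torch.eye(3, dtype=R.dtype, device=R.device).expand_as(R)
    sin_a = torch.sin(a)[..., None, None]
    cos_a = torch.cos(a)[..., None, None]
    return I + sin_a * K + (1 - cos_a) * (K @ K)      # Rodrigues' formula
```

At conf=0 the frame is the identity; at conf=1 the Gram-Schmidt frame is recovered exactly, matching the gating behavior described above.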

Frame-Aware Self-Conditioning

On 25% of training steps (reduced from 50% in earlier iterations), we run a no-grad forward pass to get x\(_{0}^{\text{prev}}\), build clean frames from it (treated as t=0), and use those as the initial frames for the second pass. At high noise where x\(_t\) frames are identity, self-conditioning provides the model’s best guess at clean local geometry — the IPA layers refine good frames instead of building them from scratch.

We reduced SC probability from 50% to 25% after observing that higher rates meant fewer “cold start” training steps. The model needs enough cold-start experience to generalize at inference when no previous prediction exists.

Fixed-Scale Coordinates

All coordinates are divided by a fixed constant (10\(\text{\AA}\)) instead of per-protein R\(_g\). This seems like a minor choice, but it was one of the most impactful changes. R\(_g\) normalization made the noise schedule protein-size-dependent: a protein with R\(_g\)=5\(\text{\AA}\) had coordinate values ~1.0 while R\(_g\)=25\(\text{\AA}\) gave ~0.2–0.5, meaning the same noise level destroyed more signal for larger proteins. This silently capped TM-scores and looked like a “plateau” rather than a systematic bias. All successful protein diffusion models (FrameDiff, RFDiffusion, Genie) use fixed-scale coordinates — once we switched, the plateau disappeared.

Learning Rate Schedule and Numerical Stability

The current learning rate schedule uses three phases, motivated by the observation that breakthroughs happen in narrow LR windows:

  1. Warmup (3 epochs): Linear 0.01x to 1x base LR
  2. Constant (20 epochs): Hold at peak LR=2e-5 (pair_stack at 3x = 6e-5)
  3. Cosine decay (~37 epochs): Slow decay to eta_min=1e-6

During development we also uncovered three classes of fp16 instability (detailed in the NaN Debugging Story under the Structure Folding tab):

  • fp16 epsilon underflow: clamp(min=1e-6) underflows to 0 in fp16 (min positive ~6e-5). Fixed by forcing fp32 in Gram-Schmidt, slerp, IPA point attention, and all loss functions.
  • Triangle einsum overflow: L-dimensional accumulation exceeds fp16 max (65504). Forced fp32 with torch.amp.autocast("cuda", enabled=False).
  • Unsafe torch.cdist backward: Produces NaN when distance is exactly 0. Replaced with manual (diff.pow(2).sum(-1) + 1e-10).sqrt().
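The cdist replacement in the last bullet is short enough to show in full; this sketch follows the manual `(diff.pow(2).sum(-1) + 1e-10).sqrt()` recipe quoted above (the function name is ours):

```python
import torch

def safe_pairwise_dist(x, eps=1e-10):
    """NaN-safe pairwise distances.

    torch.cdist's backward produces NaN when a distance is exactly 0
    (d sqrt(u)/du = 1/(2 sqrt(u)) diverges at u=0); the epsilon inside
    the sqrt keeps the gradient finite. x: (..., L, 3).
    """
    diff = x.unsqueeze(-2) - x.unsqueeze(-3)      # (..., L, L, 3)
    return (diff.pow(2).sum(-1) + eps).sqrt()
```

The diagonal (self-distances) is exactly the failure case: every protein batch contains L zero distances, so without the epsilon a single backward pass poisons the whole gradient.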

Loss Functions

Mini-Fold trains with a composite loss that blends global fold accuracy, local backbone geometry, and auxiliary supervision signals. Each term has a specific role motivated by failures in earlier model versions and by lessons from the protein structure prediction literature. The total loss is:

$$\mathcal{L}_{\text{total}} = \underbrace{w_{\text{fape}} \cdot \mathcal{L}_{\text{fape}} + w_{\text{rot}} \cdot \mathcal{L}_{\text{rot}} + w_{\text{dist}} \cdot \mathcal{L}_{\text{dist}}}_{\text{global structure}} + \underbrace{\beta(e) \cdot \mathcal{L}_{\text{bond}} + w_{\chi} \cdot \mathcal{L}_{\chi} + w_{\theta} \cdot \mathcal{L}_{\theta}}_{\text{local geometry}} + \underbrace{w_{\text{rg}} \cdot \mathcal{L}_{\text{rg}} + w_{\text{aux}} \cdot \mathcal{L}_{\text{aux}}}_{\text{auxiliary}}$$

The grouping reflects a key design principle: global structure losses are hard and slow (they encode topology), while local geometry losses are easy and fast (they converge within a few epochs). Getting this balance wrong was the source of most training failures in v8–v10.

Global Structure Losses

These losses teach the model to produce correct protein folds — the right contacts, the right topology, the right orientations. They are the hardest to optimize and the primary drivers of structural quality.

FAPE — Frame-Aligned Point Error  \(w = 1.0\)

Introduced by Jumper et al. in AlphaFold2 (Nature, 2021), FAPE is the gold standard loss for protein structure models. It measures how well predicted coordinates match ground truth in each residue’s local reference frame:

$$\mathcal{L}_{\text{fape}} = \frac{1}{N_f \cdot L} \sum_{f=1}^{N_f} \sum_{j=1}^{L} \min\!\Big( \| R_f^{\top}(\hat{x}_j - o_f) - R_f^{*\top}(x_j - o_f^*) \|,\; d_{\text{clamp}} \Big)$$

Unlike RMSD, which is a single global average, FAPE evaluates structure from every residue’s perspective. A model can have low RMSD by getting the overall shape roughly right, but FAPE requires getting the local neighborhoods correct — which helix packs against which sheet, which loop connects where.

Random baseline: ~1.31. Drops below 1.0 only when the model learns correct fold topology. This is the loss that distinguishes a blob from a protein.
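The FAPE formula above translates almost directly into PyTorch. This is a simplified sketch (single frame/point set per residue, our own function names, an assumed 10-unit clamp); it keeps the two properties that matter: every point is measured in every residue's local frame, and per-pair errors are clamped before averaging.

```python
import torch

def to_local(R, t, x):
    """Express points x (B, P, 3) in every frame (R, t): returns (B, F, P, 3)."""
    rel = x[:, None, :, :] - t[:, :, None, :]                 # x_j - o_f
    return torch.einsum("bfij,bfpj->bfpi", R.transpose(-1, -2), rel)

def fape_loss(R_pred, t_pred, x_pred, R_true, t_true, x_true, d_clamp=10.0, eps=1e-8):
    """Frame-Aligned Point Error (sketch of the formula above).

    R_*: (B, L, 3, 3) frames; t_*: (B, L, 3) frame origins; x_*: (B, L, 3) points.
    eps inside the sqrt keeps the norm's backward finite at zero error.
    """
    p = to_local(R_pred, t_pred, x_pred)
    q = to_local(R_true, t_true, x_true)
    err = ((p - q).pow(2).sum(-1) + eps).sqrt()
    return err.clamp(max=d_clamp).mean()
```

A useful sanity check: applying one global rotation to the predicted frames, origins, and points leaves the loss unchanged, which is the invariance that lets FAPE score structures without pre-alignment.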

Frame Rotation Loss  \(w = 0.5\)

Direct supervision on per-residue backbone orientations, measuring the angular distance between predicted and ground-truth rotation matrices: \(\mathcal{L}_{\text{rot}} = 1 - \cos\theta\). This is conceptually similar to the auxiliary heads in AlphaFold2’s structure module, but applied directly to the IPA-predicted frames.

Why this loss exists (the v11 rotation bug): In v11, a code error (x0_pred = t_vec) silently discarded the learned rotations from IPA. FAPE was computed using Gram-Schmidt-rebuilt frames, so rotations received zero gradient. The model could reduce distance-based losses but never learned orientations. Adding frame_rot was the single fix that unlocked topology learning — the most important architectural insight in Mini-Fold’s development.

Random baseline: ~1.0 (frames pointing ~90° off). Target: <0.5. Currently at 0.86–0.88 — the model is learning orientations but still has significant room to improve.
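Since \(\cos\theta\) of the relative rotation \(R_{\text{pred}}^{\top} R_{\text{true}}\) equals \((\mathrm{tr} - 1)/2\), the loss is a few lines (the function name is ours; the formula is the \(1 - \cos\theta\) stated above):

```python
import torch

def frame_rotation_loss(R_pred, R_true, eps=1e-7):
    """L_rot = 1 - cos(theta) between predicted and ground-truth rotations.

    For the relative rotation R_pred^T R_true, cos(theta) = (trace - 1) / 2.
    R_pred, R_true: (..., 3, 3). Returns the mean loss, in [0, 2].
    """
    rel = R_pred.transpose(-1, -2) @ R_true
    trace = rel.diagonal(dim1=-2, dim2=-1).sum(-1)
    cos_theta = ((trace - 1.0) / 2.0).clamp(-1.0 + eps, 1.0 - eps)
    return (1.0 - cos_theta).mean()
```

The clamp before \(1 - \cos\theta\) matters in practice: accumulated fp error can push the trace marginally outside [-1, 3], and downstream code that recovers \(\theta\) via acos would otherwise produce NaN.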

Distance MSE  \(w = 1.0\)

MSE on all pairwise C\(\alpha\) distances. Pairwise distance matrices (also called contact maps when thresholded) were the original representation used in early structure prediction methods (e.g., trRosetta, Yang et al., PNAS 2020). They’re easy to predict but fundamentally limited — many different 3D folds can produce similar distance distributions.

$$\mathcal{L}_{\text{dist}} = \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} \left( \| \hat{x}_i - \hat{x}_j \| - \| x_i - x_j \| \right)^2$$

Random baseline: ~0.54. The easiest structural loss — even the EGNN models (v8–v10) could reduce this 60% below random. Useful as a stable gradient signal early in training before FAPE gradients become informative.

Local Geometry Losses

These losses enforce physical constraints on the backbone chain — correct bond lengths, angles, and handedness. They converge quickly (typically within 2–3 epochs) because they only require learning local patterns, not global topology. The key challenge is preventing them from dominating early training and blocking global structure learning.

Bond Geometry  \(\beta(e) = \min(3.0,\; 1.0 + 2.0 \cdot \min(e/10, 1))\)

MSE between consecutive C\(\alpha\)–C\(\alpha\) distances and the ideal value of 3.8Å (the standard peptide bond length from crystallography). The weight is annealed from 1.0 to 3.0 over 10 epochs.

Why anneal? This was learned the hard way in v9. Starting with high bond weight causes the model to produce “bead-on-a-string” chains — perfectly spaced at 3.8Å but with no tertiary structure. The model satisfies the easy loss and ignores the hard one (FAPE). Annealing lets the model first explore topology, then gradually tightens the physical constraints.

Random baseline: ~0.17. Converged: <0.02 (bonds within 0.1Å of ideal). Currently at 0.009 — essentially solved.
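The anneal schedule and loss are simple enough to sketch directly (numpy; function names are hypothetical):

```python
import numpy as np

def bond_weight(epoch):
    """beta(e) = min(3.0, 1.0 + 2.0 * min(e/10, 1)): ramps 1.0 -> 3.0 over 10 epochs."""
    return min(3.0, 1.0 + 2.0 * min(epoch / 10, 1.0))

def bond_loss(x_pred, ideal=3.8):
    """MSE of consecutive CA-CA distances against the 3.8 A ideal."""
    d = np.linalg.norm(np.diff(x_pred, axis=0), axis=-1)  # (L-1,) bond lengths
    return float(np.mean((d - ideal) ** 2))
```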

Chirality  \(w = 0.1\)

MSE on signed volumes (scalar triple products) of C\(\alpha\) quartets. Natural proteins are built exclusively from L-amino acids, which produces a consistent backbone handedness. Without explicit chirality supervision, diffusion models can generate mirror-image (D-amino acid) structures that score well on all other metrics — FAPE, RMSD, and distance MSE are all chirality-agnostic.

This is a known issue in the protein generation literature. FrameDiff (Yim et al., ICML 2023) addresses it through SE(3) equivariance; we use an explicit loss term instead, which is simpler and equally effective for our architecture.
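A numpy sketch of the signed-volume loss; mirroring a structure flips every signed volume, so this loss detects what RMSD and distance MSE cannot:

```python
import numpy as np

def signed_volumes(x):
    """Normalized scalar triple products over consecutive CA quartets."""
    v1 = x[1:-2] - x[:-3]   # residue i -> i+1
    v2 = x[2:-1] - x[1:-2]  # residue i+1 -> i+2
    v3 = x[3:] - x[2:-1]    # residue i+2 -> i+3
    num = np.einsum('ij,ij->i', v1, np.cross(v2, v3))
    den = (np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1)
           * np.linalg.norm(v3, axis=1))
    return num / den

def chirality_loss(x_pred, x_true):
    """MSE between predicted and true normalized signed volumes."""
    return float(np.mean((signed_volumes(x_pred) - signed_volumes(x_true)) ** 2))
```

A reflection negates the triple product but preserves all lengths, so a mirrored chain has identical bond, angle, and distance losses yet a nonzero chirality loss.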

Bond Angle  \(w = 0.5\)

MSE on cosines of C\(\alpha\)–C\(\alpha\)–C\(\alpha\) bond angles. The ideal angle is ~120° (\(\cos\theta \approx -0.5\)), reflecting the planar geometry of the peptide bond. Working in cosine space avoids the discontinuity at 0°/360° that plagues angular losses.

Random baseline: ~0.70. Converged: <0.1. Currently at 0.038 — solved.
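A numpy sketch (angle vectors taken from each middle C\(\alpha\) to its two neighbors; the exact training implementation is an assumption):

```python
import numpy as np

def cos_angles(x):
    """Cosines of consecutive CA-CA-CA angles for (L, 3) coordinates."""
    u = x[:-2] - x[1:-1]  # middle CA -> previous CA
    v = x[2:] - x[1:-1]   # middle CA -> next CA
    num = np.einsum('ij,ij->i', u, v)
    den = np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    return num / den

def angle_loss(x_pred, x_true):
    """MSE in cosine space; a 120-degree angle maps to cos = -0.5."""
    return float(np.mean((cos_angles(x_pred) - cos_angles(x_true)) ** 2))
```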

Auxiliary Losses

These losses provide indirect training signals that stabilize learning without directly measuring structural quality. They have low weights and primarily serve as regularizers.

Radius of Gyration  \(w = 0.5\)

A separate MLP predicts the protein’s absolute radius of gyration (R\(_g\)) from sequence embeddings, trained with MSE on log-transformed values. At inference, this recovers the correct physical scale when converting from normalized coordinates back to Angstroms.

Converges below 0.05 by epoch 2 and stays solved. The simplest loss — included for practical utility, not structural learning.
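The target quantity and log-space loss can be sketched as follows (the MLP itself is omitted; this only shows how the target and loss are formed):

```python
import numpy as np

def radius_of_gyration(x):
    """Rg: root-mean-square distance of CA atoms from their centroid."""
    centered = x - x.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum(centered ** 2, axis=1))))

def rg_loss(log_rg_pred, rg_true):
    """MSE on log-Rg, so the penalty is on relative (scale) error."""
    return float((log_rg_pred - np.log(rg_true)) ** 2)
```

Training in log space means a 2x over-estimate is penalized the same for a 20Å protein as for a 40Å one.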

Auxiliary Distance Cross-Entropy  \(w = 0.03\)

Ordinal regression on binned pairwise distances (32 bins, 2–40Å) from an independent lightweight pair stack (64-dim, ~500K params). Inspired by the distogram head in AlphaFold2, which predicts distance distributions as a byproduct of the pair representation.

Key design choice: The auxiliary pair stack is completely independent from the main pair stack — no shared parameters, no gradient flow between them. This was a hard-won lesson from v10, where the distance head read from the main pair stack via detach(). As the main stack's representations evolved during training, the distance head's input features drifted out of distribution, causing aux_dist_ce to explode (3.95 → 38.8 over 3 epochs). A fully independent stack avoids this coupling entirely.
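The 32-bin discretization can be sketched as follows (edge handling for out-of-range distances is an assumption):

```python
import numpy as np

def distance_bins(d, n_bins=32, d_min=2.0, d_max=40.0):
    """Map distances (in Angstroms) to uniform bin indices in [0, n_bins-1].

    Each bin covers (40 - 2) / 32 = 1.1875 A; distances outside
    [d_min, d_max] are clamped to the edge bins (an assumption).
    """
    idx = np.floor((d - d_min) / (d_max - d_min) * n_bins).astype(int)
    return np.clip(idx, 0, n_bins - 1)
```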


DDIM Evaluation Metrics

Every 5 epochs, we generate structures via 50-step DDIM sampling (Song et al., ICLR 2021) using EMA weights and evaluate against ground truth. These metrics measure generative quality — how good the model’s samples are — as opposed to the training losses which measure denoising accuracy.

| Metric | What it measures | Random | Good | Reference |
|---|---|---|---|---|
| TM-score | Global fold similarity [0, 1]. Topology-sensitive, length-independent. | ~0.10 | >0.5 | Zhang & Skolnick, Proteins 2004 |
| CA-RMSD | Average atomic displacement after optimal superposition. | ~15–16Å | <5Å | Kabsch, Acta Cryst 1976 |
| GDT-TS | Fraction of residues within 1/2/4/8Å cutoffs, averaged. | ~3–4% | >50% | Zemla, Nucleic Acids Res 2003 |

Current status: At E5, v14’s DDIM samples give TM=0.111 (near random). Training losses are improving steadily but generative quality typically lags — expect a nonlinear jump in DDIM metrics around E10–15 once the denoiser learns enough of the score function for iterative refinement to work. This “phase transition” is well-documented in score-based diffusion models (Song et al., ICLR 2021).
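Of the three metrics, CA-RMSD is compact enough to sketch. A minimal Kabsch superposition in numpy (reflection-corrected SVD):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between (L, 3) point sets after optimal rigid superposition.

    Centers both sets, solves for the optimal rotation via SVD of the
    covariance (with a determinant fix so reflections are excluded),
    then measures the residual RMSD.
    """
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # -1 if best fit is a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T         # rotation taking P onto Q
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1))))
```

The determinant fix matters for the same reason as the chirality loss above: without it, a mirror-image sample could superpose perfectly and report a misleadingly low RMSD.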


Previous Architecture Details

v11/v12 — IPA-Based Frame Denoising (superseded by v13)

Why the pivot from v10: EGNN has no concept of local reference frames — it passes messages based on pairwise distances and updates coordinates through distance-weighted vectors. FAPE measures frame-aligned point error, which EGNN has no inductive bias to optimize. v11 replaced EGNN with IPA (AlphaFold2-style), explicitly maintaining per-residue rigid-body frames (R ∈ SO(3) + t ∈ ℝ³).

v11 → v11b: v11 had a critical bug (x0_pred = t_vec silently discarded the learned R). v11b added frame_rotation_loss and used the learned R in FAPE. Best E8: FAPE 1.655, frame_rot 0.830. Collapsed at E14 due to gradient competition in the shared 128-dim single representation.

v12: 4.1x capacity scaling (8.4M → 31.1M params). d_ipa_hidden: 256→512, 8 heads, 8 query points, 2-layer FrameUpdate MLP. Resolved gradient competition. Best E10: val_total 2.581. Then regressed E11–E13 due to gradient starvation of the pair stack.

v12b: Per-module gradient clipping (denoiser=1.0, pair_stack=0.5), pair_stack 3x LR. 31 epochs, 6 NEW BESTs. All-time record b30: val_total 2.413, FAPE 1.584, frame_rot 0.790.

| Loss | v10 | v11 | v11b/v12b | Rationale |
|---|---|---|---|---|
| FAPE | 0.3 | 1.0 | 1.0 | IPA can optimize frame consistency |
| Frame Rot | — | — | 0.5 | v11b fix: direct angular loss on learned R |
| Bond | 5.0 | 3.0 | 3.0 | Gentler anneal avoids tug-of-war |
| Clash | 0.0 | 0.1 | 0.1 | Fixed-scale coords make threshold meaningful |
| Dist MSE | 1.0 | 1.0 | 1.0 | — |

Training: LR 5e-5 (v11) → 2e-5 (v11b+), cosine decay, 1000 diffusion timesteps, DDIM-50 eval.

v10 — EGNN Denoiser + Full Loss Function Derivations (superseded)

8-layer SE(3)-equivariant graph neural network (EGNN), 14.6M params. Operated in R\(_g\)-normalized coordinate space. After 21 epochs: dist_mse 60% below random, bonds solved, but FAPE stuck at random baseline (~1.31) and TM-score peaked at 0.131. The model learned “proteins are compact blobs of the right size” but could not learn topology.

Loss Function Derivations

Distance MSE

$$\mathcal{L}_{\text{dist}} = \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} \left( \| \hat{x}_i^{(0)} - \hat{x}_j^{(0)} \| - \| x_i^{(0)} - x_j^{(0)} \| \right)^2$$

Bond Geometry (annealed \(\beta(e) = \min(5.0, 1.0 + 4.0 \cdot \min(e/15, 1))\))

$$\mathcal{L}_{\text{bond}} = \frac{1}{L-1} \sum_{i=1}^{L-1} \left( \| \hat{x}_i^{(0)} - \hat{x}_{i+1}^{(0)} \| - \frac{3.8}{R_g} \right)^2$$

FAPE

$$\mathcal{L}_{\text{fape}} = \frac{1}{N_f \cdot L} \sum_{f=1}^{N_f} \sum_{j=1}^{L} \min\!\Big( \| R_f^{\top}(\hat{x}_j - o_f) - R_f^{*\top}(x_j - o_f^*) \|,\; d_{\text{clamp}} \Big)$$
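A numpy sketch of the clamped FAPE above, with frames given as rotations R and origins o (the d_clamp value in the model's normalized coordinate units is an assumption):

```python
import numpy as np

def local_coords(x, R, o):
    """Express every point x_j in every frame f: R_f^T (x_j - o_f)."""
    diff = x[None, :, :] - o[:, None, :]       # (F, L, 3)
    return np.einsum('fji,flj->fli', R, diff)  # apply R_f^T to each offset

def fape(x_pred, R_pred, o_pred, x_true, R_true, o_true, d_clamp=10.0):
    """Clamped mean point error over all (frame, point) pairs."""
    err = np.linalg.norm(
        local_coords(x_pred, R_pred, o_pred) - local_coords(x_true, R_true, o_true),
        axis=-1)
    return float(np.mean(np.minimum(err, d_clamp)))
```

FAPE is invariant under a shared rigid motion of the prediction (coordinates and frames together), but unlike RMSD it directly penalizes per-residue frame misalignment, which is why it needs the learned rotations discussed in the v11 bug above.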

Chirality (signed volumes)

$$\mathcal{L}_{\chi} = \frac{1}{L-3} \sum_{i=1}^{L-3} \left( \frac{\mathbf{v}_1 \cdot (\mathbf{v}_2 \times \mathbf{v}_3)}{\|\mathbf{v}_1\| \|\mathbf{v}_2\| \|\mathbf{v}_3\|} \bigg|_{\hat{x}} - \frac{\mathbf{v}_1 \cdot (\mathbf{v}_2 \times \mathbf{v}_3)}{\|\mathbf{v}_1\| \|\mathbf{v}_2\| \|\mathbf{v}_3\|} \bigg|_{x} \right)^2$$

Bond Angle

$$\mathcal{L}_{\theta} = \frac{1}{L-2} \sum_{i=1}^{L-2} \left( \cos\hat{\theta}_i - \cos\theta_i \right)^2$$

Radius of Gyration

$$\mathcal{L}_{\text{rg}} = \left( \log \hat{R}_g - \log R_g \right)^2$$

Auxiliary Distance CE (disabled in v10, w=0.03 in v11+)

$$\mathcal{L}_{\text{aux}} = -\frac{1}{|\mathcal{M}'|} \sum_{(i,j) \in \mathcal{M}'} \log p_{ij}\big[\text{bin}(d_{ij})\big]$$

96 bins in v10 (2–40\(\text{\AA}\), \(\Delta\)=0.396\(\text{\AA}\)/bin). Replaced with 32-bin ordinal regression in v11+ after detach-induced feature drift caused the loss to diverge.

Clash Loss (disabled, w=0)

$$\mathcal{L}_{\text{clash}} = \frac{1}{|\mathcal{N}|} \sum_{(i,j) \in \mathcal{N}} \left[ (1 + 2 c_{ij}) \cdot \text{ReLU}(3.0 - d_{ij}) \right]^2$$

Why disabled: 3.0\(\text{\AA}\) threshold in R\(_g\)-normalized coords maps to ~30\(\text{\AA}\) in real space, penalizing nearly all non-bonded pairs. Dominated ~47% of total loss in v9, drowning structural learning signal. Structural losses handle steric quality implicitly.

v10 Training Configuration

| Setting | Value |
|---|---|
| Optimizer | AdamW (\(\beta_1=0.9, \beta_2=0.999\)) |
| Peak LR | \(10^{-4}\) with CosineAnnealingWarmRestarts (\(T_0=15\)) |
| Batch size | 8 (grad accum=2, effective=16) |
| Mixed precision | AMP with GradScaler |
| EMA | Decay=0.999 |
| Self-conditioning | 50% probability |
| Hardware | Single NVIDIA A40 (48 GB) |