Sergio E. Mares
Computational Biology Ph.D. Candidate at UC Berkeley

I am a fifth-year Computational Biology Ph.D. Candidate at UC Berkeley, advised by Professor Nilah Ioannidis at the Center for Computational Biology and Professor Joseph Costello at the UCSF Neurosurgery Department.


My research focuses on building machine learning models for cancer immunotherapy. I develop protein language models for predicting peptide-MHC class I binding affinity and use structure-conditioned diffusion models to design novel immunogenic peptide libraries. The goal is to expand the space of targetable tumor antigens, particularly for brain tumors where current therapeutic options are limited.


In Summer 2025 I interned at Ultima Genomics, where I integrated a DNA sequence simulation into the production sequencing pipeline (reducing reagent use by 50%) and built a scalable single-cell ATAC-seq processing pipeline handling up to 100M cells end-to-end.

pMHC-I binding
Sergio E. Mares, Ariel Espinoza, Nilah M. Ioannidis
Machine Learning in Computational Biology (MLCB), 2025
We test whether domain-specific continued pre-training of protein language models is beneficial for pMHC-I binding affinity prediction. Starting from ESM Cambrian (300M parameters), we perform masked-language modeling on HLA-associated peptides and fine-tune for quantitative IC50 binding affinity prediction.
Structure-guided pMHC-I design
Sergio E. Mares, Ariel Espinoza, Nilah M. Ioannidis
ICML Gen AI and Biology Workshop, 2025
We introduce a structure-guided benchmark of pMHC-I peptides designed using diffusion models conditioned on crystal structure interaction distances, spanning twenty high-priority HLA alleles.
Calcium signaling protein structure
Biraj B. Kayastha, A. Kubo, J. Burch-Konda, R. L. Dohmen, J. L. McCoy, R. R. Rogers, Sergio E. Mares, J. Bevere, A. Huckaby, W. Witt, S. Peng, B. Chaudhary, S. Mohanty, M. Barbier, G. Cook, J. Deng, M. Patrauchan
Scientific Reports, 2022
We study the putative Ca²⁺-binding protein EfhP (PA4107) and CalC as proteins involved in the calcium network, elucidating the mechanisms of bacterial Ca²⁺ signaling in Pseudomonas aeruginosa.
Baculovirus invadosome dynamics
Domokos I. Lauko, Taro Ohkawa, Sergio E. Mares, Matthew D. Welch
Molecular Biology of the Cell, 2021
We investigate how AcMNPV protein actin rearrangement inducing factor-1 (Arif-1) induces the formation of cortical concentrations of polymerized actin (ventral aggregates) in cultured insect cells.
Pseudomonas aeruginosa
Sergio E. Mares, M. King, A. Kubo, A. Khavov, E. Lutter, N. Youssef, M. Patrauchan
Journal of Microbiology, 2020
We study the conservation of carP sequence and its occurrence in diverse phylogenetic groups, finding that carP and its two paralogues are primarily present in P. aeruginosa and belong to the core genome, demonstrating potential as a biomarker.
Myxococcota swarming
Chelsea L. Murphy, R. Yang, T. Decker, C. Cavalliere, V. Andreev, N. Bircher, J. Cornell, R. Dohmen, C. J. Pratt, A. Grinnell, J. Higgs, C. Jett, E. Gillett, R. Khadka, Sergio E. Mares, C. Meili, J. Liu, H. Mukhtar, Mostafa S. Elshahed, Noha H. Youssef
Environmental Microbiology, 2021
Detailed analysis of 13 distinct pathways crucial to predation and cellular differentiation reveals severely curtailed machineries, proposing that these represent a niche adaptation strategy that evolved circa 500 million years ago.
Teaching a Protein Language Model to Speak "Immune"
February 2026
A walkthrough of our MLCB 2025 paper on continued pre-training of protein language models for pMHC-I binding prediction — why we did it, how it works, and what surprised us.
What If We Could Design Immune Peptides from Scratch — Using Physics Instead of Data?
February 2026
A walkthrough of our ICML 2025 workshop paper on generating pMHC-I libraries with diffusion models — the dataset bias problem, our structure-first approach, and why existing predictors completely failed on our designed peptides.

A collection of informal reviews of papers I find interesting — mostly in the protein structure prediction, protein design, and protein language model space. These are from Sergey Ovchinnikov's lab and related groups. Just my thoughts, nothing too formal.

Protein Diffusion Models as Statistical Potentials
Roney, Ou, Ovchinnikov · bioRxiv 2025
What if we could repurpose protein diffusion models as energy functions? ProteinEBM does exactly that — turning a generative model into a scoring function that can rank structures, predict conformational landscapes, and estimate mutation effects.
Designing Novel Solenoid Proteins with In Silico Evolution
Pretorius, Nikov, Washio, Florent, Taunt, Ovchinnikov, Murray · Communications Chemistry 2025
Solenoid proteins are nature's modular building blocks. This paper uses AlphaFold2 as an oracle inside a genetic algorithm to design entirely new solenoid folds — and 20% of them actually work in the lab.
CIRPIN: Learning Circular Permutation-Invariant Representations to Uncover Putative Protein Homologs
Kolodziej, Abulnaga, Ovchinnikov · bioRxiv 2025
Most structure comparison tools miss proteins that are related by circular permutation. CIRPIN fixes this with a clever graph neural network that doesn't care where the chain starts — uncovering thousands of hidden evolutionary relationships.
Hit or Miss: Understanding Emergence and Absence of Homo-oligomeric Contacts in Protein Language Models
Zhang, Akiyama, Cho, Jajoo, Ovchinnikov · bioRxiv 2025
Protein language models are trained on single chains, yet they somehow learn about protein-protein interfaces. This paper digs into how and why — and finds that bigger models keep getting better at inter-chain contacts even after intra-chain accuracy plateaus.
Assessing the Utility of Coevolution-Based Residue–Residue Contact Predictions in a Sequence- and Structure-Rich Era
Kamisetty, Ovchinnikov, Baker · PNAS 2013
The 2013 paper that helped establish when coevolution-based contact prediction is actually useful. A foundational work that set the stage for everything from direct coupling analysis to AlphaFold.

More papers I find interesting

De Novo Design of Protein Structure and Function with RFdiffusion
Watson, Juergens, Bennett et al. · Nature 2023
The paper that brought diffusion models to protein design in a big way. RFdiffusion generates protein backbones from scratch and can design binders, symmetric assemblies, and enzyme scaffolds — many validated experimentally.
Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model (ESMFold)
Lin, Abanades, Rao, Johnson, Rives et al. · Science 2023
What if you could predict protein structure from a single sequence, no alignment needed? ESMFold does this at AlphaFold-like accuracy with a 15 billion parameter language model, enabling structure prediction for 600+ million metagenomic proteins.
Simulating 500 Million Years of Evolution with a Language Model (ESM3)
Hayes, Rao, Akin et al. · Science 2025
ESM3 is a 98-billion-parameter multimodal model that reasons over protein sequence, structure, and function simultaneously. It designed a novel fluorescent protein with only 58% identity to anything in nature — equivalent to 500 million years of evolution.
Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3
Abramson, Adler, Dunger et al. · Nature 2024
AlphaFold 3 moves beyond proteins to predict the structures of complexes involving DNA, RNA, small molecules, and ions — with a diffusion-based architecture that substantially outperforms specialized tools for drug-like interactions.
Protein Language Models Learn Evolutionary Statistics of Interacting Sequence Motifs
Zhang, Wayment-Steele, Brixi, Wang, Kern, Ovchinnikov · PNAS 2024
What do protein language models actually learn? This paper shows ESM-2 stores coevolutionary statistics as motifs of pairwise contacts — bridging the gap between classical coevolution and modern deep learning.
Molecular Modeling and Simulation: An Interdisciplinary Guide
Tamar Schlick
Finished
Pedro Páramo
Juan Rulfo
Currently Reading
On the Origin of Species
Charles Darwin
Currently Reading
Structural Bioinformatics
Philip E. Bourne & Helge Weissig
Currently Reading
Soviet Middlegame Technique
Peter Romanovsky
Currently Reading
Miles de millones
Carl Sagan
Currently Reading
Cien años de soledad
Gabriel García Márquez
Currently Reading
♘ Chess
I've been playing chess since I moved to the US. I mainly play rapid and blitz on Lichess. Feel free to challenge me!
💻 Open Source
Building tools at the intersection of ML and biology. Check out my projects on GitHub.

Last updated: March 7, 2026

Contact Classifier (Stage 1 — Multi-task Encoder)

Complete
Goal: Train a transformer encoder from scratch on CATH 4.2 dataset (18k proteins) to jointly predict CATH class/architecture labels AND inter-residue contact maps from sequence alone. The learned embeddings encode spatial proximity information needed for Stage 2.
Architecture: ContactClassifier — 1.2M params, dim=128, 2 transformer towers, d_pair=64, 2 contact prediction blocks with outer product mean
Training: Single GPU (1080 Ti), batch_size=24, lr=2e-4 with warmup cosine schedule, patience=15 early stopping
Resilience: Per-epoch checkpoints with auto-resume, CSV loss logging, self-resubmitting watchdog system on SLURM
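The "outer product mean" in the contact prediction blocks can be sketched as follows — a single-sequence variant in the spirit of AlphaFold's OuterProductMean (there is no MSA axis to average over here). All names and the random projection are illustrative stand-ins, and dimensions are shrunk for the sketch (the real model uses dim=128, d_pair=64):

```python
import numpy as np

# Hypothetical sketch: lift per-residue embeddings (L, d) into a pairwise
# representation (L, L, d_pair) via an outer product followed by a learned
# projection. Here the projection is a random matrix stand-in.
rng = np.random.default_rng(0)
L, d, d_pair = 60, 16, 8                           # shrunk from 60/128/64

seq_emb = rng.standard_normal((L, d))
W = rng.standard_normal((d * d, d_pair)) * 0.01    # stand-in for a learned layer

# Outer product of every residue pair's embeddings, then linear projection.
outer = np.einsum('id,je->ijde', seq_emb, seq_emb).reshape(L, L, d * d)
pair_repr = outer @ W                              # (L, L, d_pair)

assert pair_repr.shape == (L, L, d_pair)
```

The pairwise tensor is what the contact head then maps to per-pair contact logits.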

Training Progress (all 25 epochs — early stopped)

| Epoch | Val Total Loss | Train Class Acc | Val Class Acc | Train Arch Acc | Contact Recall (Val) | Contact BCE (Val) | LR | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | 4.846 | 47.5% | 41.8% | 16.1% | 69.6% | 0.759 | 3.33e-05 | NEW BEST |
| 2 | 4.626 | 58.2% | 54.9% | 25.4% | 72.3% | 0.725 | 6.67e-05 | NEW BEST |
| 3 | 4.454 | 65.9% | 53.8% | 33.8% | 73.9% | 0.702 | 1.00e-04 | NEW BEST |
| 4 | 4.356 | 70.0% | 58.4% | 39.2% | 73.8% | 0.693 | 1.33e-04 | NEW BEST |
| 5 | 4.354 | 72.9% | 60.7% | 42.4% | 73.7% | 0.691 | 1.67e-04 | NEW BEST |
| 6 | 4.220 | 74.4% | 65.1% | 43.7% | 75.5% | 0.668 | 2.00e-04 | NEW BEST |
| 7 | 4.321 | 75.9% | 64.8% | 44.5% | 75.9% | 0.686 | 2.00e-04 | pat 1 |
| 8 | 3.994 | 77.3% | 65.5% | 46.4% | 77.4% | 0.660 | 1.99e-04 | NEW BEST |
| 9 | 3.998 | 77.2% | 66.0% | 47.0% | 78.0% | 0.665 | 1.98e-04 | pat 1 |
| 10 | 3.988 | 78.1% | 66.3% | 47.8% | 77.3% | 0.659 | 1.97e-04 | BEST (final) |
| 11 | 4.199 | 78.9% | 66.1% | 48.5% | 78.0% | 0.655 | 1.96e-04 | pat 1 |
| 12 | 4.073 | 79.4% | 68.9% | 49.1% | 78.7% | 0.652 | 1.94e-04 | pat 2 |
| 13 | 4.128 | 79.5% | 66.8% | 49.9% | 77.4% | 0.652 | 1.92e-04 | pat 3 |
| 14 | 4.014 | 80.1% | 67.8% | 50.1% | 77.2% | 0.653 | 1.89e-04 | pat 4 |
| 15 | 4.083 | 80.4% | 65.6% | 50.7% | 78.1% | 0.652 | 1.87e-04 | pat 5 |
| 16 | 4.010 | 81.3% | 68.8% | 51.5% | 77.1% | 0.666 | 1.84e-04 | pat 6 |
| 17 | 4.139 | 80.9% | 66.3% | 52.1% | 77.8% | 0.659 | 1.80e-04 | pat 7 |
| 18 | 4.062 | 81.8% | 68.1% | 52.8% | 77.5% | 0.659 | 1.77e-04 | pat 8 |
| 19 | 4.134 | 81.7% | 66.0% | 53.4% | 77.8% | 0.654 | 1.73e-04 | pat 9 |
| 20 | 4.208 | 82.2% | 69.2% | 53.8% | 77.5% | 0.660 | 1.69e-04 | pat 10 |
| 21 | 4.112 | 82.4% | 68.9% | 54.5% | 77.5% | 0.653 | 1.64e-04 | pat 11 |
| 22 | 4.138 | 83.3% | 68.6% | 55.1% | 78.0% | 0.650 | 1.60e-04 | pat 12 |
| 23 | 4.208 | 82.8% | 67.6% | 55.2% | 77.3% | 0.649 | 1.55e-04 | pat 13 |
| 24 | 4.169 | 83.1% | 68.9% | 55.6% | 77.0% | 0.654 | 1.50e-04 | pat 14 |
| 25 | 4.358 | 83.8% | 65.1% | 56.4% | 78.0% | 0.652 | 1.45e-04 | EARLY STOP |

Final Results

  • Best val loss: 3.988 at epoch 10 (best weights saved)
  • Val class accuracy: 66.3% (4-way CATH class), Val architecture accuracy: 31.8% (38+ architectures)
  • Contact recall: 77.3%, Contact BCE: 0.659 — model successfully learned spatial proximity from sequence
  • Train class accuracy: 78.1%, Train arch accuracy: 47.8%
  • Early stopped at epoch 25 (patience 15) — val loss plateaued after epoch 10
  • 1.2M params, trained from scratch on CATH 4.2 (18k proteins)
  • Training survived 7 SLURM job allocations with checkpoint resume

Training Curves

Contact classifier training curves

Contact Map Predictions (3 test proteins)

Ground truth vs predicted contact maps

Each row shows a held-out test protein from a different CATH structural class. The left column is the ground truth contact map (binary: two Cα atoms < 8Å apart), and the right column is the model’s predicted probability of contact from sequence alone. Metrics (precision P, recall R, and Top-L long-range accuracy) are annotated on each prediction panel.

  • 1bf0.A (L=60, Few Secondary Structure): A small protein with sparse, irregular contacts. The model captures the overall topology despite limited structural regularity.
  • 3ggm.A (L=81, Mainly Beta): Beta-sheet proteins produce characteristic off-diagonal block patterns from strand–strand hydrogen bonding. The model recovers these long-range parallel and anti-parallel strand pairings well.
  • 1f9x.A (L=120, Mainly Alpha): Alpha-helical proteins show strong banded diagonal patterns from helix-internal i→i+4 contacts. The model reproduces both the local helical periodicity and inter-helix contacts at larger separations.

These results demonstrate that a 1.2M-parameter transformer encoder trained from scratch on CATH 4.2 (~18k proteins) can learn meaningful spatial proximity signals across all major fold classes — without any pretrained language model or evolutionary information.
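The ground-truth contact definition used throughout (two Cα atoms closer than 8 Å) is easy to make concrete; coordinates below are synthetic, whereas in practice they come from the CATH PDB files:

```python
import numpy as np

# Binary contact map: residues i, j are "in contact" when their Cα atoms
# are closer than 8 Å.
def contact_map(ca: np.ndarray, cutoff: float = 8.0) -> np.ndarray:
    """ca: (L, 3) Cα coordinates in Å → (L, L) binary contact map."""
    diff = ca[:, None, :] - ca[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < cutoff).astype(np.int8)

# Toy example: an extended chain with ideal 3.8 Å Cα spacing.
ca = np.stack([np.arange(10) * 3.8, np.zeros(10), np.zeros(10)], axis=1)
cmap = contact_map(ca)
assert cmap[0, 1] == 1            # neighbours at 3.8 Å < 8 Å
assert cmap[0, 2] == 1            # 7.6 Å, still a contact
assert cmap[0, 3] == 0            # 11.4 Å apart, not a contact
```

On an extended chain only near-diagonal entries light up; folded proteins add the off-diagonal blocks and bands discussed above.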

Learned Embedding Space (PCA & UMAP)

PCA and UMAP of learned protein embeddings colored by CATH class and architecture

Attention-pooled protein embeddings (128-dim) from the encoder’s val+test set, projected via PCA and UMAP. The encoder learns to separate CATH classes without explicit contrastive loss — mainly-alpha and mainly-beta proteins form distinct clusters, while alpha-beta proteins span the intermediate region. UMAP reveals finer sub-structure at the architecture level, with several CATH architectures forming tight, well-separated clusters (e.g., 3.40 Rossmann fold, 1.10 orthogonal bundle). This confirms the multi-task training objective (classification + contact prediction) produces structurally meaningful representations suitable for conditioning the downstream diffusion model.

Protein Backbone Diffusion Model (Stage 2 — v10)

Training
Goal: Generate realistic protein backbone structures (Cα coordinates) conditioned on sequence, using contact-aware embeddings from Stage 1.
Architecture: ContactConditionedPairStack + Denoiser9 (8 EGNN layers, dx_clamp=0.5) + RgPredictor — ~14.6M params total (1.2M frozen encoder + 13.4M trainable)
Key innovations over v8: Contact map probabilities injected via gated projection into pair representation; contact attention bias in denoiser; FAPE loss (frame-aligned point error); predicted radius of gyration instead of fixed Rg=10; 8 denoiser layers (was 6); dx_clamp=0.5

v8 Baseline

  • Val loss 2.086, RMSD ~14.5Å, TM-score ~0.13
  • The information bottleneck from the frozen classifier was identified as the main limitation

Run 1 (Epochs 1–22) — Failed

  • Best val loss 4.82 at epoch 9 (49% reduction from E1), then catastrophic instability at E10–11: bond geometry spiked from 0.14 to 2.4, clash from 5.3 to 12.8.
  • Root causes identified: (1) bond weight annealed in the wrong direction (high→low instead of low→high), (2) cosine LR decayed to 1e-6 floor leaving no recovery capacity, (3) dx_clamp=1.0 amplified instability across 8 layers, (4) loss clamps at 100.0 too generous, (5) clash weight too low (0.5).
  • DDIM sampling never exceeded TM=0.041 (E5). Model early-stopped at E24.

Run 2 (Epochs 1–5) — Failed

  • Applied 7 fixes from Run 1 diagnosis (reversed bond annealing, dx_clamp=0.5, tighter clamping, aux loss, EMA DDIM, etc.).
  • w_clash=2.0 dominated the total loss (~4.8 of ~11 total), drowning out structural signals. LR lacked warmup and started decaying immediately.
  • Val total spiked at E5 (11.0→14.9). DDIM TM declined from 0.035 (E1) to 0.025 (E3). Cancelled after 5 epochs.

v9 Run 3 (Epochs 1–7) — Failed

  • Best: E1 DDIM TM=0.097, RMSD=16.3Å (best ever). Val loss improved to 8.33 by E4.
  • E7 val total spiked to 19.6. Clash loss still dominated (~47% of total at w_clash=1.0). DDIM TM declined steadily: 0.097→0.049→0.028.
  • Root cause: clash threshold of 3.0Å applied in Rg-normalized coordinates maps to ~30Å in real space, penalizing nearly all non-bonded pairs with noisy gradients.

v10 — Training Progress (current)

v10 changes (final)

  • Clash loss removed entirely (was broken in Rg-normalized coordinate space: 3.0Å threshold mapped to ~30Å real space, penalizing nearly all atom pairs with noise)
  • FAPE now uses all residues as targets (was random 16 out of 125, causing high gradient variance epoch-to-epoch)
  • DistanceHead pair input un-detached, then reverted at E5 (see incident report below)
  • Auxiliary distance CE disabled (w_aux=0.0) at E15 — pair representation drift caused exponential divergence (see second incident report below)
  • Patience tracking switched to structural-only loss (excludes aux contribution) to prevent inflated val_total from triggering premature early stopping
  • All prior fixes retained: reversed bond annealing (low→high over 15 epochs), dx_clamp=0.5, tighter loss clamping, 5-epoch linear warmup + cosine restarts

Incident Report: aux_dist_ce Divergence (E6–E12)

What happened: At E6, the auxiliary distance cross-entropy loss (aux_dist_ce) began diverging exponentially: 3.95 → 43.8 → 86.6 → 77.9, eventually reaching 79+ and saturating. By E12, the model was generating random atom clouds (DDIM TM=0.015, RMSD=52.6Å). All structural losses were corrupted.

Root cause: Removing pair.detach() from the DistanceHead input allowed aux_dist_ce gradients to flow back through the ContactConditionedPairStack. This created a positive feedback loop: aux gradients destabilized the pair representation → worse distance predictions → larger aux loss → even larger gradients. The pair representation feeds both the aux head and the denoiser, so the corruption spread to all structural losses by E8.

Fix applied: Re-detached the pair stack (pair.detach()), deleted corrupted checkpoints (E10–E12), and restarted from the E4 best checkpoint (val=2.247) with a fresh optimizer. Training immediately resumed smooth improvement, setting new bests at E7 (2.129) and E8 (2.086).

Trade-off analysis: Detaching the pair stack means aux_dist_ce only trains the DistanceHead MLP, not the pair representation itself. In principle, end-to-end training of the pair stack through aux loss could improve the learned pair features. In practice, the aux loss magnitude (~4.0) is much larger than structural losses (~0.3–1.4), and the cross-entropy gradients are poorly scaled relative to the MSE/FAPE gradients that the pair stack was designed for. The detach acts as a gradient firewall — the pair stack learns from structural losses (which are well-calibrated), while aux_dist_ce provides an independent distance prediction that regularizes the DistanceHead without interfering. This is the safer and empirically superior design. A future approach could use a much smaller aux weight (w_aux=0.01–0.03) or gradient scaling to enable partial end-to-end training without instability.
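The gradient firewall described above comes down to a single `detach()` call; a few lines of PyTorch make the effect concrete. The module names here are illustrative stand-ins, not the project's actual classes:

```python
import torch

# Minimal sketch of the "gradient firewall": the DistanceHead sees a
# detached copy of the pair representation, so its CE loss cannot perturb
# the pair stack. Linear layers stand in for the real modules.
pair_stack = torch.nn.Linear(8, 8)       # stand-in for ContactConditionedPairStack
dist_head = torch.nn.Linear(8, 96)       # stand-in for the DistanceHead MLP

x = torch.randn(4, 8)
pair = pair_stack(x)
logits = dist_head(pair.detach())        # gradients stop here
target = torch.randint(0, 96, (4,))
loss = torch.nn.functional.cross_entropy(logits, target)
loss.backward()

assert pair_stack.weight.grad is None    # pair stack untouched by the aux loss
assert dist_head.weight.grad is not None # only the head is trained
```

Removing the `.detach()` is exactly what let aux gradients flow into the pair stack and triggered the feedback loop in the first incident.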

Incident Report: Second aux_dist_ce Divergence & Resolution (E14–E16)

What happened: After the pair detach fix stabilized training through E13, aux_dist_ce began diverging again at E14: 8.9 → 17.5 → 38.8 (doubling every epoch). Structural losses remained stable and excellent (dst=0.22, fape=1.35, bond=0.006), confirming this was isolated to the DistanceHead.

Root cause: Feature distribution shift. The pair representation is detached before the DistanceHead, so no gradient flows back — but the pair stack continued evolving via the structural losses (dist_mse, FAPE). During the "topology learning phase" (E10–E13), FAPE began dropping below random, driving rapid changes in the pair representation. The DistanceHead's learned feature→bin mapping became stale, producing confidently wrong predictions (CE ≫ ln(96) = 4.56). This is analogous to a classifier trained on features from a "frozen" encoder that is in fact being updated by a different objective.

Impact: Even at the reduced weight of w_aux=0.03, the contribution to total loss grew from 0.27 (E14) to 1.16 (E16), exceeding the structural loss (~0.83). This inflated val_total and corrupted patience tracking, threatening premature early stopping despite continued structural improvement.

Fix applied (E15):

  • Set w_aux=0.0 — aux_dist_ce removed from loss entirely
  • Patience tracking switched to structural-only loss: val_structural = val_total - w_aux * aux_dist_ce
  • Rolled back to E14 checkpoint, reset best_val=0.808 (structural-only), patience=0
  • aux_dist_ce still computed and logged as a free diagnostic

Result: E15 (first epoch post-fix) set a new best with val_structural=0.807. Training total dropped from 1.787 (E16 with aux) to 0.821, reflecting pure structural signal. DDIM metrics: TM=0.129, RMSD=14.53Å.

Lesson: Auxiliary heads that passively observe evolving representations via stop-gradient are inherently fragile. The DistanceHead architecture requires either (a) its own independent pair stack with end-to-end training, or (b) a loss function robust to feature drift (e.g., ordinal regression or soft-label CE instead of hard 96-bin classification). Planned for v11.

| Epoch | Val Structural | Val Dist MSE | Val FAPE | Val Rg Loss | Val Bond Geom | DDIM RMSD | DDIM TM-score | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.938 | 0.536 | 1.311 | 0.232 | 0.167 | 15.49Å | 0.101 | NEW BEST |
| 2 | 2.429 | 0.433 | 1.380 | 0.045 | 0.035 | — | — | NEW BEST |
| 3 | 2.340 | 0.450 | 1.344 | 0.040 | 0.014 | 15.25Å | 0.117 | NEW BEST |
| 4 | 2.247 | 0.355 | 1.352 | 0.039 | 0.016 | — | — | NEW BEST |
| 5 | 2.246 | 0.372 | 1.352 | 0.039 | 0.013 | — | — | NEW BEST |
| 6 | 2.248 | 0.380 | 1.366 | 0.039 | 0.010 | 15.91Å | 0.108 | pat 1 |
| 7 | 2.129 | 0.267 | 1.381 | 0.038 | 0.011 | — | — | NEW BEST |
| 8 | 2.086 | 0.246 | 1.343 | 0.038 | 0.011 | — | — | NEW BEST |
| 9 | 2.106 | 0.241 | 1.351 | 0.037 | 0.013 | 14.92Å | 0.125 | pat 1 |
| 10 | 2.053 | 0.250 | 1.301 | 0.038 | 0.008 | — | — | NEW BEST |
| 11 | 2.024 | 0.237 | 1.325 | 0.037 | 0.008 | — | — | NEW BEST |
| 12 | 2.028 | 0.232 | 1.342 | 0.036 | 0.007 | 14.58Å | 0.131 | pat 1 |
| 13 | 2.006 | 0.223 | 1.354 | 0.036 | 0.006 | — | — | NEW BEST |
| 14 | 0.808 | 0.220 | 1.324 | 0.035 | 0.006 | — | — | NEW BEST (w_aux→0.0) |
| 15 | 0.807 | 0.217 | 1.350 | 0.035 | 0.006 | 14.53Å | 0.129 | NEW BEST |

Loss Reference: Random Baseline & Interpretation

For cross-entropy losses, random performance = ln(N) where N is the number of classes (a uniform predictor assigns 1/N probability to the correct bin, giving −ln(1/N) = ln(N)). For MSE losses, random = E1 value (Gaussian noise baseline). Solid lines = train, dashed = val, dotted = not in loss (w=0).

| Metric | w | Type | Random | Best | Interpretation |
|---|---|---|---|---|---|
| Structural | Σ | weighted | ~2.94 | 0.807 ↓ | Weighted sum of structural components (excludes aux). Since E15, patience tracks this metric. |
| Dist MSE | 1.0 | MSE | ~0.54 | 0.217 ↓ | Pairwise Cα distance error. 60% below random. <0.1 = sub-Å accuracy. |
| Bond | 5.0* | MSE | ~0.17 | 0.006 ↓ | Cα–Cα bond length error. Essentially solved. *Annealed 1→5 over 15 epochs. |
| FAPE | 0.3 | L1 | ~1.31 | 1.324 ↓ | Frame-aligned position error (all residues). Just below random — topology learning starting. Drops <1.0 with correct folds. |
| Rg | 0.5 | MSE | ~0.23 | 0.035 ↓ | Radius of gyration error. Well-learned; correct protein size/compactness. |
| Chirality | 0.1 | MSE | ~0.54 | 0.482 ↓ | Signed volume (dihedral handedness) of Cα quartets. 11% below random. |
| Angle | 0.5 | MSE | ~0.70 | 0.181 ↓ | Cα–Cα–Cα bond angle error (cosine). 74% below random, excellent. |
| Aux Dist CE | 0.0 | CE | ln(96) = 4.56 | 3.95 (--) | Disabled at E15; logged only. Diverged due to pair representation drift (see incident report). Weight history: 0.3→0.03→0.0. |
| Clash | 0.0 | penalty | ~14.0 | 11.6 (--) | Logged but not in loss (w=0). Disabled: 3Å threshold in Rg-space maps to ~30Å real. |
| TM-score | — | DDIM | ~0.10 | 0.131 ↑ | 50-step DDIM sampling. >0.17 = recognizable folds. Target: >0.30. |
| RMSD | — | DDIM | ~15.5Å | 14.53Å ↓ | <10Å = partial fold. <5Å = high quality. |
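The cross-entropy random baseline can be checked in one line; for the MSE baselines (empirical epoch-1 values) a quick Monte Carlo with uniform stand-ins gives the right order of magnitude:

```python
import numpy as np

# A uniform N-way classifier scores ln(N): it assigns 1/N probability to
# the correct bin, so CE = -ln(1/N) = ln(N). For the 96-bin distance head:
assert abs(np.log(96) - 4.56) < 0.01

# The chirality target is a normalized volume in [-1, 1]; a random
# predictor's squared error therefore sits far below the worst case of 4.
# Uniform stand-ins give E[(X - Y)^2] = 2 * Var = 2/3, the same ballpark
# as the ~0.54 observed empirically for the (non-uniform) real values.
rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, 200_000), rng.uniform(-1, 1, 200_000)
mse = float(np.mean((x - y) ** 2))
assert abs(mse - 2 / 3) < 0.01
```

The MSE baselines in the table remain the authoritative numbers; this is only a plausibility check.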

Loss Curves (through Epoch 15)

V10 diffusion training loss curves
v10: 15 epochs completed. Best val structural loss: 0.807 at epoch 15. Patience: 0/15. Best DDIM TM-score: 0.131 (E12). aux_dist_ce disabled (w=0.0) at E15 after exponential divergence due to pair representation drift; patience now tracks structural-only loss. Running on single A40 GPU (savio3) with auto-resume.
Last updated: 2026-03-09 06:30 UTC

Diffusion v10 — Architecture & Loss Function

Overview

We train a denoising diffusion model for protein backbone (Cα) structure generation, conditioned on inter-residue contact maps predicted by a frozen ContactClassifier encoder. The model operates in Rg-normalized coordinate space: all coordinates are divided by the radius of gyration so the diffusion process is scale-invariant. The denoiser is an 8-layer SE(3)-equivariant graph neural network (EGNN) with 14.6M parameters (13.4M trainable, 1.2M frozen encoder). Training uses the CATH dataset (18,024 train / 608 val proteins, max 125 residues).
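The Rg normalization described above can be sketched in a few lines; centering before dividing by the radius of gyration is an assumption of this sketch (the text only specifies the division):

```python
import numpy as np

# Rg-normalize a backbone: center, then divide by the radius of gyration
# so the diffusion process is scale-invariant.
def rg_normalize(x: np.ndarray):
    """x: (L, 3) Cα coordinates in Å → (normalized coords, Rg)."""
    centered = x - x.mean(axis=0)
    rg = np.sqrt(np.mean(np.sum(centered ** 2, axis=-1)))
    return centered / rg, rg

x = np.random.default_rng(0).standard_normal((50, 3)) * 7.0
x_norm, rg = rg_normalize(x)
_, rg_unit = rg_normalize(x_norm)
assert abs(rg_unit - 1.0) < 1e-9                      # normalized structure has Rg = 1
assert np.allclose(x_norm * rg + x.mean(axis=0), x)   # inverse recovers real space
```

At inference, the predicted \(\hat{R}_g\) from the RgPredictor plays the role of `rg` when mapping samples back to real-space coordinates.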

Total Loss

The total loss is a weighted combination of eight components. At epoch \(e\):

$$\mathcal{L}_{\text{total}} = w_{\text{dist}} \cdot \mathcal{L}_{\text{dist}} + \beta(e) \cdot \mathcal{L}_{\text{bond}} + w_{\text{fape}} \cdot \mathcal{L}_{\text{fape}} + w_{\text{rg}} \cdot \mathcal{L}_{\text{rg}} + w_{\chi} \cdot \mathcal{L}_{\chi} + w_{\theta} \cdot \mathcal{L}_{\theta} + w_{\text{aux}} \cdot \mathcal{L}_{\text{aux}} + w_{\text{clash}} \cdot \mathcal{L}_{\text{clash}}$$

Clash loss and auxiliary distance CE are logged but excluded (\(w_{\text{clash}} = 0\), \(w_{\text{aux}} = 0\)). See incident reports for rationale.

1. Distance MSE  \(w_{\text{dist}} = 1.0\)

Mean squared error on all pairwise Cα distances in Rg-normalized space:

$$\mathcal{L}_{\text{dist}} = \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} \left( \| \hat{x}_i^{(0)} - \hat{x}_j^{(0)} \| - \| x_i^{(0)} - x_j^{(0)} \| \right)^2$$

where \(\hat{x}^{(0)}\) is the predicted clean structure, \(x^{(0)}\) is the ground truth, both in Rg-normalized coordinates, and \(\mathcal{M}\) is the set of valid residue pairs. Clamped to max 10.0.

Random baseline: ~0.54 (MSE of Gaussian noise pairwise distances vs true).
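A minimal numpy sketch of this term (the real implementation also masks padded residues; variable names are illustrative):

```python
import numpy as np

# Pairwise Cα distance MSE in Rg-normalized space, clamped at 10.0,
# each unordered pair counted once.
def dist_mse(pred: np.ndarray, true: np.ndarray, clamp: float = 10.0) -> float:
    """pred, true: (L, 3) Cα coordinates, already divided by Rg."""
    def pdist(x):
        return np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    err = (pdist(pred) - pdist(true)) ** 2
    iu = np.triu_indices(len(pred), k=1)       # upper triangle: valid pairs
    return float(np.minimum(err[iu], clamp).mean())

true = np.random.default_rng(0).standard_normal((20, 3))
assert dist_mse(true, true) < 1e-12            # perfect prediction
assert dist_mse(true + 0.1, true) < 1e-12      # rigid translation is free
```

Because the loss compares distance matrices rather than coordinates, it is invariant to rigid motion — which is why FAPE is still needed for global topology.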

2. Bond Geometry  \(\beta(e) = \min(5.0,\; 1.0 + 4.0 \cdot \min(e/15, 1))\)

MSE on consecutive Cα–Cα distances against the ideal 3.8Å bond length (in Rg-normalized space):

$$\mathcal{L}_{\text{bond}} = \frac{1}{L-1} \sum_{i=1}^{L-1} \left( \| \hat{x}_i^{(0)} - \hat{x}_{i+1}^{(0)} \| - \frac{3.8}{R_g} \right)^2$$

The weight is annealed from 1.0 to 5.0 over the first 15 epochs. Starting low prevents bond geometry from dominating early training when the model hasn't learned global structure. As training matures, the increasing weight enforces physically valid backbone geometry. Clamped to max 10.0.

Random baseline: ~0.17. Below 0.02 indicates bonds are within 0.1Å of the ideal 3.8Å spacing.
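The annealed weight \(\beta(e)\) and the bond term translate directly to code; a sketch with illustrative names:

```python
import numpy as np

# Bond weight annealed 1.0 -> 5.0 over the first 15 epochs.
def beta(epoch: int) -> float:
    return min(5.0, 1.0 + 4.0 * min(epoch / 15, 1.0))

# Consecutive Ca-Ca distance MSE against 3.8 Å / Rg, clamped at 10.0.
def bond_loss(pred: np.ndarray, rg: float, clamp: float = 10.0) -> float:
    """pred: (L, 3) Rg-normalized Cα coordinates."""
    d = np.linalg.norm(np.diff(pred, axis=0), axis=-1)
    return float(np.minimum((d - 3.8 / rg) ** 2, clamp).mean())

assert beta(0) == 1.0 and beta(15) == 5.0 and beta(100) == 5.0

# A chain with exact 3.8 Å spacing (normalized with Rg = 10 Å) scores ~0.
chain = np.stack([np.arange(8) * 0.38, np.zeros(8), np.zeros(8)], axis=1)
assert bond_loss(chain, rg=10.0) < 1e-12
```

Starting \(\beta\) low lets the model place atoms freely early on; the ramp then pulls consecutive Cα spacings toward the physical 3.8 Å.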

3. FAPE (Frame-Aligned Point Error)  \(w_{\text{fape}} = 0.3\)

Measures local structural consistency by constructing rigid frames from consecutive Cα triplets and computing the error in each frame's local coordinate system:

$$\mathcal{L}_{\text{fape}} = \frac{1}{N_f \cdot L} \sum_{f=1}^{N_f} \sum_{j=1}^{L} \min\!\Big( \| R_f^{\top}(\hat{x}_j - o_f) - R_f^{*\top}(x_j - o_f^*) \|,\; d_{\text{clamp}} \Big)$$

Frames are built from every other triplet of Cα atoms: the x-axis along \(c_2 - c_0\), z-axis from the cross product, y-axis completing the right-handed system. Unlike AlphaFold's random 14-residue sampling, v10 uses all residues as targets for stable gradients. Clamped at \(d_{\text{clamp}} = 10.0\).

Random baseline: ~1.31. Drops below 1.0 when the model learns correct fold topology. This is the hardest loss to reduce because it requires global structural correctness, not just local geometry.
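A minimal sketch of the FAPE computation as defined above — frames from every other Cα triplet (x-axis along \(c_2 - c_0\), z-axis from the cross product with \(c_1 - c_0\), origin at \(c_1\)), all residues as targets, clamped at 10. This is a simplified reading of the equation, not the repo's exact implementation:

```python
import numpy as np

# Rigid frame from a Cα triplet: rotation (columns = axes) and origin.
def frame(c0, c1, c2):
    x = c2 - c0
    x = x / np.linalg.norm(x)
    z = np.cross(x, c1 - c0)
    z = z / np.linalg.norm(z)
    y = np.cross(z, x)                       # right-handed orthonormal frame
    return np.stack([x, y, z], axis=1), c1

def fape(pred, true, d_clamp=10.0):
    errs = []
    for i in range(0, len(pred) - 2, 2):     # every other triplet
        Rp, op = frame(*pred[i:i + 3])
        Rt, ot = frame(*true[i:i + 3])
        local_p = (pred - op) @ Rp           # all residues in the predicted frame
        local_t = (true - ot) @ Rt           # all residues in the true frame
        d = np.linalg.norm(local_p - local_t, axis=-1)
        errs.append(np.minimum(d, d_clamp))
    return float(np.mean(errs))

true = np.random.default_rng(1).standard_normal((12, 3))
assert fape(true, true) < 1e-9
```

Because errors are measured in each frame's local coordinates, FAPE is invariant to rigid motion of the whole prediction while still penalizing wrong topology.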

4. Radius of Gyration  \(w_{\text{rg}} = 0.5\)

MSE on log-transformed radius of gyration predictions for scale invariance:

$$\mathcal{L}_{\text{rg}} = \left( \log \hat{R}_g - \log R_g \right)^2$$

Since the denoiser works in Rg-normalized space, a separate MLP (\(\texttt{RgPredictor}\)) predicts the absolute radius of gyration from sequence embeddings. This allows recovering real-space coordinates at inference: \(x_{\text{real}} = \hat{R}_g \cdot \hat{x}_{\text{norm}}\).

Random baseline: ~0.23. Converges below 0.05 by E2.

5. Chirality  \(w_{\chi} = 0.1\)

MSE on normalized signed volumes (scalar triple products) of consecutive Cα quartets, ensuring correct backbone handedness:

$$\mathcal{L}_{\chi} = \frac{1}{L-3} \sum_{i=1}^{L-3} \left( \frac{\mathbf{v}_1 \cdot (\mathbf{v}_2 \times \mathbf{v}_3)}{\|\mathbf{v}_1\| \|\mathbf{v}_2\| \|\mathbf{v}_3\|} \bigg|_{\hat{x}} - \frac{\mathbf{v}_1 \cdot (\mathbf{v}_2 \times \mathbf{v}_3)}{\|\mathbf{v}_1\| \|\mathbf{v}_2\| \|\mathbf{v}_3\|} \bigg|_{x} \right)^2$$

where \(\mathbf{v}_1 = x_{i+1} - x_i\), \(\mathbf{v}_2 = x_{i+2} - x_{i+1}\), \(\mathbf{v}_3 = x_{i+3} - x_{i+2}\). The normalization maps volumes to \([-1, 1]\), making the loss scale-invariant. Natural proteins are L-amino acids with consistent chirality; this loss prevents mirror-image structures.

Random baseline: ~0.54 (expected MSE for random values in [-1, 1]).
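The normalized signed volume is a direct transcription of the formula; a mirror image flips its sign, which is exactly what the loss penalizes:

```python
import numpy as np

# Normalized scalar triple product of a Cα quartet, in [-1, 1].
def signed_volume(c: np.ndarray) -> float:
    """c: (4, 3) consecutive Cα coordinates."""
    v1, v2, v3 = c[1] - c[0], c[2] - c[1], c[3] - c[2]
    vol = np.dot(v1, np.cross(v2, v3))
    norm = np.linalg.norm(v1) * np.linalg.norm(v2) * np.linalg.norm(v3)
    return float(vol / norm)

quartet = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]], dtype=float)
mirror = quartet * np.array([1, 1, -1])           # reflect through the xy-plane
assert abs(signed_volume(quartet) - 1.0) < 1e-9   # right-handed, maximal volume
assert abs(signed_volume(mirror) + 1.0) < 1e-9    # mirror image flips the sign
```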

6. Bond Angle  \(w_{\theta} = 0.5\)

MSE on cosines of Cα–Cα–Cα bond angles for consecutive triplets:

$$\mathcal{L}_{\theta} = \frac{1}{L-2} \sum_{i=1}^{L-2} \left( \cos\hat{\theta}_i - \cos\theta_i \right)^2, \quad \cos\theta_i = \frac{(\hat{x}_{i+1} - \hat{x}_i) \cdot (\hat{x}_{i+2} - \hat{x}_{i+1})}{\| \hat{x}_{i+1} - \hat{x}_i \| \| \hat{x}_{i+2} - \hat{x}_{i+1} \|}$$

The ideal Cα–Cα–Cα angle in proteins is ~120° (\(\cos\theta \approx -0.5\)). Working in cosine space avoids discontinuities at 0°/360°.

Random baseline: ~0.70. Below 0.1 indicates correct backbone angles.
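The cosine term follows the formula above (angle between successive displacement vectors, which is what keeps the loss smooth at 0°/360°):

```python
import numpy as np

# Cosines of consecutive Cα–Cα–Cα angles from the displacement vectors.
def angle_cosines(x: np.ndarray) -> np.ndarray:
    """x: (L, 3) Cα coordinates → (L-2,) cosines."""
    v = np.diff(x, axis=0)                        # (L-1, 3) displacement vectors
    a, b = v[:-1], v[1:]
    return np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

# A straight chain: successive displacements are parallel, cosine = +1.
line = np.stack([np.arange(5), np.zeros(5), np.zeros(5)], axis=1).astype(float)
assert np.allclose(angle_cosines(line), 1.0)
```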

7. Auxiliary Distance Cross-Entropy  \(w_{\text{aux}} = 0.0\) (disabled)

Cross-entropy loss on binned pairwise distances, predicted by a DistanceHead MLP from the detached pair representation:

$$\mathcal{L}_{\text{aux}} = -\frac{1}{|\mathcal{M}'|} \sum_{(i,j) \in \mathcal{M}'} \log p_{ij}\big[\text{bin}(d_{ij})\big], \quad \text{bin}(d) = \left\lfloor \frac{d - d_{\min}}{\Delta} \right\rfloor, \quad \Delta = \frac{d_{\max} - d_{\min}}{N_{\text{bins}}}$$

Parameters: \(N_{\text{bins}} = 96\), \(d_{\min} = 2\)Å, \(d_{\max} = 40\)Å, \(\Delta = 0.396\)Å/bin. Distances are computed in real space (\(d_{ij} = R_g \cdot \|x_i - x_j\|\)) then binned. Only pairs within \(d_{\max}\) are included.

The pair representation is detached before entering the DistanceHead, meaning aux gradients train only the MLP head, not the pair stack. This is critical — an earlier experiment without detach caused exponential divergence (see incident report in Training tab).

Random baseline: \(\ln(96) = 4.56\). A uniform predictor assigns \(1/96\) to each bin, giving \(-\ln(1/96) = \ln(96)\). Weight history: 0.3 (E1–E13) → 0.03 (E14) → 0.0 (E15+, disabled). Disabled because the pair representation evolves via structural losses while the DistanceHead observes it through a stop-gradient wall, causing irrecoverable feature drift and exponential CE divergence. Still computed and logged as a diagnostic.
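The binning scheme with the parameters above is simple to verify numerically:

```python
import numpy as np

# 96 uniform bins spanning 2–40 Å, as used by the (disabled) aux CE head.
N_BINS, D_MIN, D_MAX = 96, 2.0, 40.0
DELTA = (D_MAX - D_MIN) / N_BINS                  # ≈ 0.396 Å per bin

def distance_bin(d: float) -> int:
    """Real-space distance in Å → bin index (caller filters d >= D_MAX)."""
    return int((d - D_MIN) // DELTA)

assert abs(DELTA - 0.396) < 1e-3
assert distance_bin(2.0) == 0                     # first bin
assert distance_bin(39.9) == N_BINS - 1           # last bin
# A uniform predictor scores the ln(96) ≈ 4.56 random baseline:
assert abs(-np.log(1 / N_BINS) - 4.56) < 0.01
```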

8. Clash Loss  \(w_{\text{clash}} = 0.0\) (disabled)

Contact-guided steric clash penalty on non-bonded atoms closer than 3.0Å:

$$\mathcal{L}_{\text{clash}} = \frac{1}{|\mathcal{N}|} \sum_{(i,j) \in \mathcal{N}} \left[ (1 + 2 \cdot c_{ij}) \cdot \text{ReLU}(3.0 - d_{ij}) \right]^2$$

where \(\mathcal{N}\) is the set of non-bonded pairs (\(|i - j| > 1\)), \(c_{ij}\) is the predicted contact probability (detached), and \(d_{ij}\) is the Rg-normalized distance.

Why disabled: The 3.0Å threshold is applied in Rg-normalized coordinates, but for a typical protein with \(R_g \approx 10\)Å this maps to \(3.0 \times 10 = 30\)Å in real space — penalizing nearly all non-bonded pairs. The resulting noisy gradients dominated ~47% of the total loss in v9, drowning the structural learning signal. Removing clash brought immediate training stability; the structural losses handle steric quality implicitly, and clash decreases organically as the model generates better structures.
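The unit mismatch is easy to demonstrate numerically; this sketch drops the contact weighting \((1 + 2c_{ij})\) for clarity:

```python
import numpy as np

# Clash penalty as defined above (contact weighting omitted), applied to
# Rg-normalized distances — the source of the bug.
def clash_penalty(d_norm: np.ndarray) -> np.ndarray:
    return np.maximum(3.0 - d_norm, 0.0) ** 2     # ReLU(3.0 - d)^2

rg = 10.0
d_real = np.array([3.5, 10.0, 25.0, 35.0])        # real-space distances in Å
d_norm = d_real / rg                              # what the loss actually saw
penalized = clash_penalty(d_norm) > 0
# Pairs up to ~30 Å apart — nowhere near a steric clash — get penalized:
assert penalized.tolist() == [True, True, True, False]
```

The fix (had the term been kept) would be to apply the 3.0 Å threshold after rescaling by \(R_g\), i.e. in real-space coordinates.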

Training Configuration

Optimizer: AdamW (\(\beta_1=0.9, \beta_2=0.999\))
Peak learning rate: \(10^{-4}\)
LR schedule: 5-epoch linear warmup (\(0.01 \times \text{lr} \to \text{lr}\)), then CosineAnnealingWarmRestarts (\(T_0 = 15\) epochs, \(\eta_{\min} = 10^{-5}\))
Batch size: 8 (grad accumulation = 2, effective = 16)
Mixed precision: AMP with GradScaler, \(\texttt{dx\_clamp} = 0.5\)
EMA: decay = 0.999, used for DDIM evaluation
Self-conditioning: 50% probability during training
Early stopping: patience = 15 epochs on val structural loss (excludes aux_dist_ce)
DDIM evaluation: 50 steps, every 3 epochs, using EMA weights
Hardware: Single NVIDIA A40 (48 GB), ~15 min/epoch
Dataset: CATH 4.3 — 18,024 train / 608 val, max 125 residues
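The LR schedule above can be sketched as a per-epoch function (the real schedule may step per batch; this is a simplified version under that assumption):

```python
import math

# 5-epoch linear warmup from 0.01x peak, then cosine annealing with warm
# restarts (T_0 = 15 epochs, eta_min = 1e-5).
PEAK, ETA_MIN, WARMUP, T0 = 1e-4, 1e-5, 5, 15

def lr_at(epoch: int) -> float:
    if epoch < WARMUP:
        frac = epoch / WARMUP
        return PEAK * (0.01 + (1 - 0.01) * frac)
    t = (epoch - WARMUP) % T0                     # position within the current cycle
    return ETA_MIN + 0.5 * (PEAK - ETA_MIN) * (1 + math.cos(math.pi * t / T0))

assert abs(lr_at(0) - 1e-6) < 1e-12               # 0.01 x peak at the start
assert abs(lr_at(5) - PEAK) < 1e-12               # warmup complete, cycle start
assert abs(lr_at(20) - PEAK) < 1e-12              # warm restart after T_0 epochs
```

The restart at E20 is what gives the run recovery capacity that the v8 schedule (decaying to a 1e-6 floor) lacked.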

DDIM Evaluation Metrics

Every 3 epochs, we generate structures via 50-step DDIM sampling using EMA weights and evaluate against ground truth:

  • TM-score (Template Modeling): Global fold similarity, range [0, 1]. Scores > 0.5 generally indicate the same fold; > 0.17 is above the random-pair baseline. Random structures here score ~0.10.
  • RMSD (Root Mean Square Deviation): Average atomic displacement after optimal superposition. Random placement gives ~15–16Å. Below 5Å is high quality.
  • GDT (Global Distance Test): Fraction of residues within 1–8Å of the true position. Random gives ~3–4%.
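The "optimal superposition" behind the RMSD metric is conventionally computed with the Kabsch algorithm; a self-contained numpy sketch (TM-score and GDT additionally need their length-dependent normalizations):

```python
import numpy as np

# RMSD after optimal rigid superposition (Kabsch algorithm).
def kabsch_rmsd(p: np.ndarray, q: np.ndarray) -> float:
    """p, q: (L, 3) Cα coordinates."""
    p = p - p.mean(axis=0)                         # remove translation
    q = q - q.mean(axis=0)
    u, s, vt = np.linalg.svd(p.T @ q)
    sign = np.sign(np.linalg.det(u @ vt))          # avoid improper rotations
    r = u @ np.diag([1.0, 1.0, sign]) @ vt         # optimal rotation
    return float(np.sqrt(np.mean(np.sum((p @ r - q) ** 2, axis=-1))))

coords = np.random.default_rng(0).standard_normal((30, 3)) * 5
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
assert kabsch_rmsd(coords @ rot + 3.0, coords) < 1e-8  # rigid motion → RMSD ≈ 0
```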