Sergio E. Mares
Computational Biology Ph.D. Candidate at UC Berkeley

I am a fifth-year Computational Biology Ph.D. Candidate at UC Berkeley, advised by Professor Nilah Ioannidis at the Center for Computational Biology and Professor Joseph Costello in the UCSF Department of Neurosurgery.


My research focuses on building machine learning models for cancer immunotherapy. I develop protein language models for predicting peptide-MHC class I binding affinity and use structure-conditioned diffusion models to design novel immunogenic peptide libraries. The goal is to expand the space of targetable tumor antigens, particularly for brain tumors where current therapeutic options are limited.


In Summer 2025 I interned at Ultima Genomics, where I integrated a DNA sequence simulation into the production sequencing pipeline (reducing reagent use by 50%) and built a scalable single-cell ATAC-seq processing pipeline handling up to 100M cells end-to-end.

Publications
pMHC-I binding
Sergio E. Mares, Ariel Espinoza, Nilah M. Ioannidis
Machine Learning in Computational Biology (MLCB), 2025
We test whether domain-specific continued pre-training of protein language models is beneficial for pMHC-I binding affinity prediction. Starting from ESM Cambrian (300M parameters), we perform masked-language modeling on HLA-associated peptides and fine-tune for quantitative IC50 binding affinity prediction.
Structure-guided pMHC-I design
Sergio E. Mares, Ariel Espinoza, Nilah M. Ioannidis
ICML Gen AI and Biology Workshop, 2025
We introduce a structure-guided benchmark of pMHC-I peptides designed using diffusion models conditioned on crystal structure interaction distances, spanning twenty high-priority HLA alleles.
Calcium signaling protein structure
Biraj B. Kayastha, A. Kubo, J. Burch-Konda, R. L. Dohmen, J. L. McCoy, R. R. Rogers, Sergio E. Mares, J. Bevere, A. Huckaby, W. Witt, S. Peng, B. Chaudhary, S. Mohanty, M. Barbier, G. Cook, J. Deng, M. Patrauchan
Scientific Reports, 2022
We characterize the putative Ca²⁺-binding protein EfhP (PA4107) and CalC as components of the calcium signaling network, elucidating the mechanisms of bacterial Ca²⁺ signaling in Pseudomonas aeruginosa.
Baculovirus invadosome dynamics
Domokos I. Lauko, Taro Ohkawa, Sergio E. Mares, Matthew D. Welch
Molecular Biology of the Cell, 2021
We investigate how AcMNPV protein actin rearrangement inducing factor-1 (Arif-1) induces the formation of cortical concentrations of polymerized actin (ventral aggregates) in cultured insect cells.
Pseudomonas aeruginosa
Sergio E. Mares, M. King, A. Kubo, A. Khavov, E. Lutter, N. Youssef, M. Patrauchan
Journal of Microbiology, 2020
We study the conservation of carP sequence and its occurrence in diverse phylogenetic groups, finding that carP and its two paralogues are primarily present in P. aeruginosa and belong to the core genome, demonstrating potential as a biomarker.
Myxococcota swarming
Chelsea L. Murphy, R. Yang, T. Decker, C. Cavalliere, V. Andreev, N. Bircher, J. Cornell, R. Dohmen, C. J. Pratt, A. Grinnell, J. Higgs, C. Jett, E. Gillett, R. Khadka, Sergio E. Mares, C. Meili, J. Liu, H. Mukhtar, Mostafa S. Elshahed, Noha H. Youssef
Environmental Microbiology, 2021
Detailed analysis of 13 distinct pathways crucial to predation and cellular differentiation reveals severely curtailed machineries; we propose that these represent a niche-adaptation strategy that evolved circa 500 million years ago.
Teaching a Protein Language Model to Speak "Immune"
February 2026
A walkthrough of our MLCB 2025 paper on continued pre-training of protein language models for pMHC-I binding prediction — why we did it, how it works, and what surprised us.
What If We Could Design Immune Peptides from Scratch — Using Physics Instead of Data?
February 2026
A walkthrough of our ICML 2025 workshop paper on generating pMHC-I libraries with diffusion models — the dataset bias problem, our structure-first approach, and why existing predictors completely failed on our designed peptides.

A collection of informal reviews of papers I find interesting — mostly in the protein structure prediction, protein design, and protein language model space. These are from Sergey Ovchinnikov's lab and related groups. Just my thoughts, nothing too formal.

Protein Diffusion Models as Statistical Potentials
Roney, Ou, Ovchinnikov · bioRxiv 2025
What if we could repurpose protein diffusion models as energy functions? ProteinEBM does exactly that — turning a generative model into a scoring function that can rank structures, predict conformational landscapes, and estimate mutation effects.
Designing Novel Solenoid Proteins with In Silico Evolution
Pretorius, Nikov, Washio, Florent, Taunt, Ovchinnikov, Murray · Communications Chemistry 2025
Solenoid proteins are nature's modular building blocks. This paper uses AlphaFold2 as an oracle inside a genetic algorithm to design entirely new solenoid folds — and 20% of them actually work in the lab.
CIRPIN: Learning Circular Permutation-Invariant Representations to Uncover Putative Protein Homologs
Kolodziej, Abulnaga, Ovchinnikov · bioRxiv 2025
Most structure comparison tools miss proteins that are related by circular permutation. CIRPIN fixes this with a clever graph neural network that doesn't care where the chain starts — uncovering thousands of hidden evolutionary relationships.
Hit or Miss: Understanding Emergence and Absence of Homo-oligomeric Contacts in Protein Language Models
Zhang, Akiyama, Cho, Jajoo, Ovchinnikov · bioRxiv 2025
Protein language models are trained on single chains, yet they somehow learn about protein-protein interfaces. This paper digs into how and why — and finds that bigger models keep getting better at inter-chain contacts even after intra-chain accuracy plateaus.
Assessing the Utility of Coevolution-Based Residue–Residue Contact Predictions in a Sequence- and Structure-Rich Era
Kamisetty, Ovchinnikov, Baker · PNAS 2013
The 2013 paper that helped establish when coevolution-based contact prediction is actually useful. A foundational work that set the stage for everything from direct coupling analysis to AlphaFold.

More papers I find interesting

De Novo Design of Protein Structure and Function with RFdiffusion
Watson, Juergens, Bennett et al. · Nature 2023
The paper that brought diffusion models to protein design in a big way. RFdiffusion generates protein backbones from scratch and can design binders, symmetric assemblies, and enzyme scaffolds — many validated experimentally.
Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model (ESMFold)
Lin, Akin, Rao et al. · Science 2023
What if you could predict protein structure from a single sequence, no alignment needed? ESMFold does this at AlphaFold-like accuracy with a 15 billion parameter language model, enabling structure prediction for 600+ million metagenomic proteins.
Simulating 500 Million Years of Evolution with a Language Model (ESM3)
Hayes, Rao, Akin et al. · Science 2025
ESM3 is a 98-billion-parameter multimodal model that reasons over protein sequence, structure, and function simultaneously. It designed a novel fluorescent protein with only 58% identity to anything in nature — equivalent to 500 million years of evolution.
Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3
Abramson, Adler, Dunger et al. · Nature 2024
AlphaFold 3 moves beyond proteins to predict the structures of complexes involving DNA, RNA, small molecules, and ions — with a diffusion-based architecture that substantially outperforms specialized tools for drug-like interactions.
Protein Language Models Learn Evolutionary Statistics of Interacting Sequence Motifs
Zhang, Wayment-Steele, Brixi, Wang, Kern, Ovchinnikov · PNAS 2024
What do protein language models actually learn? This paper shows ESM-2 stores coevolutionary statistics as motifs of pairwise contacts — bridging the gap between classical coevolution and modern deep learning.
Molecular Modeling and Simulation: An Interdisciplinary Guide
Tamar Schlick
Finished
Pedro Páramo
Juan Rulfo
Currently Reading
On the Origin of Species
Charles Darwin
Currently Reading
Structural Bioinformatics
Philip E. Bourne & Helge Weissig
Currently Reading
Soviet Middlegame Technique
Peter Romanovsky
Currently Reading
Miles de millones
Carl Sagan
Currently Reading
Cien años de soledad
Gabriel García Márquez
Currently Reading
♘ Chess
I've been playing chess since I moved to the US. I mainly play rapid and blitz on Lichess. Feel free to challenge me!
💻 Open Source
Building tools at the intersection of ML and biology. Check out my projects on GitHub.

Last updated: March 9, 2026

Contact Classifier (Stage 1 — Multi-task Encoder)

Complete
Goal: Train a transformer encoder from scratch on the CATH 4.2 dataset (18k proteins) to jointly predict CATH class/architecture labels AND inter-residue contact maps from sequence alone. The learned embeddings encode spatial proximity information needed for Stage 2.
Architecture: ContactClassifier — 1.2M params, dim=128, 2 transformer towers, d_pair=64, 2 contact prediction blocks with outer product mean
Training: Single GPU (1080 Ti), batch_size=24, lr=2e-4 with warmup cosine schedule, patience=15 early stopping
Resilience: Per-epoch checkpoints with auto-resume, CSV loss logging, self-resubmitting watchdog system on SLURM
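
For intuition, here is a minimal PyTorch sketch of an outer-product contact head in the spirit of the blocks named above. Class and dimension names are illustrative only, not the actual ContactClassifier code:

```python
import torch
import torch.nn as nn

class OuterProductContactHead(nn.Module):
    """Sketch: lift per-residue embeddings (B, L, d) to a symmetric pair
    representation via an outer product, then score residue-residue contacts."""
    def __init__(self, d_model=128, d_proj=32, d_pair=64):
        super().__init__()
        self.proj_a = nn.Linear(d_model, d_proj)
        self.proj_b = nn.Linear(d_model, d_proj)
        self.to_pair = nn.Linear(d_proj * d_proj, d_pair)
        self.to_logit = nn.Linear(d_pair, 1)

    def forward(self, h):                              # h: (B, L, d_model)
        a, b = self.proj_a(h), self.proj_b(h)          # (B, L, d_proj) each
        # Outer product over channels for every residue pair (i, j).
        op = torch.einsum('bic,bjd->bijcd', a, b)      # (B, L, L, d_proj, d_proj)
        pair = self.to_pair(op.flatten(-2))            # (B, L, L, d_pair)
        pair = 0.5 * (pair + pair.transpose(1, 2))     # enforce symmetry
        return self.to_logit(pair).squeeze(-1)         # contact logits (B, L, L)
```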

Training Progress (all 25 epochs — early stopped)

| Epoch | Val Total Loss | Train Class Acc | Val Class Acc | Train Arch Acc | Contact Recall (Val) | Contact BCE (Val) | LR | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | 4.846 | 47.5% | 41.8% | 16.1% | 69.6% | 0.759 | 3.33e-05 | NEW BEST |
| 2 | 4.626 | 58.2% | 54.9% | 25.4% | 72.3% | 0.725 | 6.67e-05 | NEW BEST |
| 3 | 4.454 | 65.9% | 53.8% | 33.8% | 73.9% | 0.702 | 1.00e-04 | NEW BEST |
| 4 | 4.356 | 70.0% | 58.4% | 39.2% | 73.8% | 0.693 | 1.33e-04 | NEW BEST |
| 5 | 4.354 | 72.9% | 60.7% | 42.4% | 73.7% | 0.691 | 1.67e-04 | NEW BEST |
| 6 | 4.220 | 74.4% | 65.1% | 43.7% | 75.5% | 0.668 | 2.00e-04 | NEW BEST |
| 7 | 4.321 | 75.9% | 64.8% | 44.5% | 75.9% | 0.686 | 2.00e-04 | pat 1 |
| 8 | 3.994 | 77.3% | 65.5% | 46.4% | 77.4% | 0.660 | 1.99e-04 | NEW BEST |
| 9 | 3.998 | 77.2% | 66.0% | 47.0% | 78.0% | 0.665 | 1.98e-04 | pat 1 |
| 10 | 3.988 | 78.1% | 66.3% | 47.8% | 77.3% | 0.659 | 1.97e-04 | BEST (final) |
| 11 | 4.199 | 78.9% | 66.1% | 48.5% | 78.0% | 0.655 | 1.96e-04 | pat 1 |
| 12 | 4.073 | 79.4% | 68.9% | 49.1% | 78.7% | 0.652 | 1.94e-04 | pat 2 |
| 13 | 4.128 | 79.5% | 66.8% | 49.9% | 77.4% | 0.652 | 1.92e-04 | pat 3 |
| 14 | 4.014 | 80.1% | 67.8% | 50.1% | 77.2% | 0.653 | 1.89e-04 | pat 4 |
| 15 | 4.083 | 80.4% | 65.6% | 50.7% | 78.1% | 0.652 | 1.87e-04 | pat 5 |
| 16 | 4.010 | 81.3% | 68.8% | 51.5% | 77.1% | 0.666 | 1.84e-04 | pat 6 |
| 17 | 4.139 | 80.9% | 66.3% | 52.1% | 77.8% | 0.659 | 1.80e-04 | pat 7 |
| 18 | 4.062 | 81.8% | 68.1% | 52.8% | 77.5% | 0.659 | 1.77e-04 | pat 8 |
| 19 | 4.134 | 81.7% | 66.0% | 53.4% | 77.8% | 0.654 | 1.73e-04 | pat 9 |
| 20 | 4.208 | 82.2% | 69.2% | 53.8% | 77.5% | 0.660 | 1.69e-04 | pat 10 |
| 21 | 4.112 | 82.4% | 68.9% | 54.5% | 77.5% | 0.653 | 1.64e-04 | pat 11 |
| 22 | 4.138 | 83.3% | 68.6% | 55.1% | 78.0% | 0.650 | 1.60e-04 | pat 12 |
| 23 | 4.208 | 82.8% | 67.6% | 55.2% | 77.3% | 0.649 | 1.55e-04 | pat 13 |
| 24 | 4.169 | 83.1% | 68.9% | 55.6% | 77.0% | 0.654 | 1.50e-04 | pat 14 |
| 25 | 4.358 | 83.8% | 65.1% | 56.4% | 78.0% | 0.652 | 1.45e-04 | EARLY STOP |

Final Results

  • Best val loss: 3.988 at epoch 10 (best weights saved)
  • Val class accuracy: 66.3% (4-way CATH class), Val architecture accuracy: 31.8% (38+ architectures)
  • Contact recall: 77.3%, Contact BCE: 0.659 — model successfully learned spatial proximity from sequence
  • Train class accuracy: 78.1%, Train arch accuracy: 47.8%
  • Early stopped at epoch 25 (patience 15) — val loss plateaued after epoch 10
  • 1.2M params, trained from scratch on CATH 4.2 (18k proteins)
  • Training survived 7 SLURM job allocations with checkpoint resume

Training Curves

Contact classifier training curves

Contact Map Predictions (3 test proteins)

Ground truth vs predicted contact maps

Each row shows a held-out test protein from a different CATH structural class. The left column is the ground truth contact map (binary: two Cα atoms < 8Å apart), and the right column is the model’s predicted probability of contact from sequence alone. Metrics (precision P, recall R, and Top-L long-range accuracy) are annotated on each prediction panel.

  • 1bf0.A (L=60, Few Secondary Structure): A small protein with sparse, irregular contacts. The model captures the overall topology despite limited structural regularity.
  • 3ggm.A (L=81, Mainly Beta): Beta-sheet proteins produce characteristic off-diagonal block patterns from strand–strand hydrogen bonding. The model recovers these long-range parallel and anti-parallel strand pairings well.
  • 1f9x.A (L=120, Mainly Alpha): Alpha-helical proteins show strong banded diagonal patterns from helix-internal i→i+4 contacts. The model reproduces both the local helical periodicity and inter-helix contacts at larger separations.
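
To make the contact definition above concrete, here is a minimal NumPy sketch of the ground-truth map and a Top-L long-range metric (function names and the |i − j| ≥ 24 separation cutoff are my assumptions, following common practice, not the project's code):

```python
import numpy as np

def contact_map(ca, thresh=8.0):
    """Binary contact map from Cα coordinates (L, 3): contact if Cα–Cα < 8 Å."""
    d = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    return d < thresh

def top_l_long_range(prob, truth, min_sep=24):
    """Precision of the top-L predicted contacts at separation |i - j| >= min_sep."""
    L = prob.shape[0]
    i, j = np.triu_indices(L, k=min_sep)       # long-range upper triangle
    top = np.argsort(prob[i, j])[::-1][:L]     # L highest-probability pairs
    return truth[i[top], j[top]].mean()
```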

These results demonstrate that a 1.2M-parameter transformer encoder trained from scratch on CATH 4.2 (~18k proteins) can learn meaningful spatial proximity signals across all major fold classes — without any pretrained language model or evolutionary information.

Learned Embedding Space (PCA & UMAP)

PCA and UMAP of learned protein embeddings colored by CATH class and architecture

Attention-pooled protein embeddings (128-dim) from the encoder’s val+test set, projected via PCA and UMAP. The encoder learns to separate CATH classes without explicit contrastive loss — mainly-alpha and mainly-beta proteins form distinct clusters, while alpha-beta proteins span the intermediate region. UMAP reveals finer sub-structure at the architecture level, with several CATH architectures forming tight, well-separated clusters (e.g., 3.40 Rossmann fold, 1.10 orthogonal bundle). This confirms the multi-task training objective (classification + contact prediction) produces structurally meaningful representations suitable for conditioning the downstream diffusion model.
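
For reference, attention pooling of per-residue embeddings into a single protein vector can be as simple as the following sketch (an assumed form; the encoder's actual pooling head may differ):

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learn one score per residue, softmax over the sequence, weighted sum."""
    def __init__(self, d_model=128):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, h, mask):                # h: (B, L, d), mask: (B, L) bool
        s = self.score(h).squeeze(-1)          # (B, L) per-residue scores
        s = s.masked_fill(~mask, float('-inf'))
        w = s.softmax(dim=-1).unsqueeze(-1)    # attention weights (B, L, 1)
        return (w * h).sum(dim=1)              # pooled embedding (B, d)
```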

Protein Backbone Diffusion Model (Stage 2 — v11)

Training
Goal: Generate realistic protein backbone structures (Cα coordinates) conditioned on sequence, using contact-aware embeddings from Stage 1. v11 replaces the EGNN denoiser with Invariant Point Attention (IPA) — the same architecture class used in AlphaFold2 and RFDiffusion.
Architecture: IPA denoiser (6 layers, 8 heads, 8 query points) + independent aux pair stack (64-dim, 2 layers) + frozen ContactClassifier encoder — 8.4M params (7.2M trainable, 1.2M frozen)
Training: Single A40 GPU (48 GB), batch_size=8, grad_accum=2 (eff=16), LR=5e-5 with 3-epoch warmup + cosine decay (no restarts), T=1000 timesteps, DDIM-50 eval every epoch
Dataset: CATH 4.2 (18,024 train / 608 val proteins, max 125 residues)

Why IPA? The v10 Ceiling

v10 used an 8-layer EGNN denoiser that learned pairwise distance statistics well (dist_mse 60% below random) but could not learn protein topology. FAPE stayed at its random baseline (~1.31) across 21 epochs and TM-score peaked at 0.131 (random ~0.10). EGNN has no concept of local reference frames — it reasons about distances, not backbone geometry. IPA solves this by maintaining and refining per-residue rigid-body frames (rotation + translation) through 3D point attention in local coordinate systems.

v11 Key Changes

  • IPA denoiser — each block does: invariant point attention (scalar + 3D point Q/K/V in local frames + pair bias) → transition MLP → frame update (quaternion + translation, identity-biased init, composed in local frame for SE(3) equivariance)
  • Fixed-scale coordinates — divide by 10Å instead of per-protein Rg normalization (eliminates protein-size-dependent noise schedule bias)
  • SNR-gated frame initialization — Gram-Schmidt frames from Cα at low noise, smooth slerp to identity at high noise where coordinates are near-isotropic
  • Frame-aware self-conditioning — 50% of steps: build clean frames from previous x0 prediction as initial frames for IPA refinement
  • Independent aux pair stack (64-dim, 2 layers) with ordinal regression (32 bins) — fixes v10’s feature-drift divergence where the detached DistanceHead saw a drifting pair representation

v11 Loss Weights

| Loss | v10 | v11 | Rationale |
|---|---|---|---|
| FAPE | 0.3 | 1.0 | IPA can actually optimize frame consistency |
| Bond | 5.0 | 3.0 | Gentler anneal (1→3), avoids tug-of-war |
| Clash | 0.0 | 0.1 | Re-enabled: fixed-scale makes 3.8Å threshold work |
| Aux dist | 0.0 | 0.03 | Conservative start for new independent pair stack |
| Dist MSE | 1.0 | 1.0 | unchanged |
| Chirality | 0.1 | 0.1 | unchanged |
| Angle | 0.5 | 0.5 | unchanged |
| Rg | 0.5 | 0.5 | unchanged |

v11 Loss Curves

V11 diffusion training loss curves

v11 Training Progress

| Epoch | Val Total | Val FAPE | Val Dist MSE | Val Bond | Val Rg | Val Aux | DDIM TM | DDIM RMSD | Status |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.338 | 1.950 | 1.163 | 0.171 | 1.352 | 0.475 | 0.099 | 15.46Å | NEW BEST |

Loss Reference & Targets

| Metric | w | Type | E1 Val | Target / Interpretation |
|---|---|---|---|---|
| FAPE | 1.0 | L1 | 1.950 | Primary metric. v10 ceiling = 1.31, untrained > 2.0. < 1.0 = correct folds. Must be ≥10% below E1 by E5. |
| Dist MSE | 1.0 | MSE | 1.163 | Pairwise Cα distance error. < 0.1 = sub-Å accuracy. |
| Bond | 3.0* | MSE | 0.171 | Cα–Cα bond error. Already near-solved at E1. *Annealed 1→3. |
| Rg | 0.5 | MSE | 1.352 | Radius of gyration error. Expect rapid convergence. |
| Chirality | 0.1 | MSE | 0.509 | Backbone handedness. Random ~0.54. |
| Angle | 0.5 | MSE | 0.558 | Cα bond angles. Random ~0.70. |
| Clash | 0.1 | penalty | 0.001 | Re-enabled in v11 (fixed-scale makes the threshold meaningful). |
| Aux Ordinal | 0.03 | ordinal | 0.475 | Independent pair stack + ordinal regression. Stable (no v10 divergence). |
| TM-score | n/a | DDIM | 0.099 | 50-step DDIM. Target: > 0.15 by E5, > 0.30 by E10. > 0.17 = recognizable folds. |
| RMSD | n/a | DDIM | 15.46Å | < 10Å = partial fold. < 5Å = high quality. |
v11 Epoch 1 — First Signal (20 min on A40):
  • val FAPE = 1.950 — already below the untrained baseline (~2.0+). The IPA frame machinery is learning backbone geometry from E1. This is the metric v10 could never improve.
  • val dist_mse = 1.163 — distance prediction converging; expect rapid improvement.
  • val bond = 0.171 — near-perfect Cα spacing already. Fixed-scale normalization working.
  • val aux = 0.475 — ordinal regression stable. No divergence (v10’s detached head diverged by E6).
  • DDIM: TM = 0.099, RMSD = 15.46Å — random level, expected for E1. Structural quality lags loss by several epochs.
Assessment: Strongly positive. Val FAPE below baseline at E1 confirms IPA can optimize frame consistency. Training/val gap healthy (no overfitting). Patience 0/15. E2 currently in progress (~20 min/epoch). Monitoring every 30 min.
Last updated: 2026-03-09 10:04 UTC — E1 complete, E2 in progress
v10 Historical Results (21 epochs, EGNN — superseded)

v10 used an 8-layer EGNN denoiser (14.6M params). After 21 epochs: dist_mse 60% below random (0.218), bond essentially solved (0.006), but FAPE stuck at random (~1.31) and TM-score peaked at 0.131. Two aux_dist_ce divergence incidents required intervention (pair detach at E5, full disable at E15). Cosine LR restarts caused pathological degradation at the minimum. Best val structural = 0.805 (E16). DDIM best: TM=0.131, RMSD=14.53Å.

V10 diffusion training loss curves (historical)

Diffusion v11 — IPA-Based Frame Denoising (Current)

Why the Pivot from v10

v10 used an 8-layer EGNN denoiser that updates Cα coordinates through distance-weighted pairwise messages. After 21 epochs on CATH, it achieved dist_mse 60% below random (learning “proteins are compact blobs of the right size”) but FAPE remained at its random baseline (~1.31) and TM-score peaked at 0.131 (random ~0.10). The model could not learn correct protein topology.

Root cause: EGNN has no concept of local reference frames. It passes messages based on pairwise distances and updates coordinates through distance-weighted vectors. FAPE measures frame-aligned point error — whether predicted local coordinate frames match ground truth frames — which EGNN has no inductive bias to optimize. Increasing w_fape on the old architecture means fighting the inductive bias.

v11 Overview

v11 replaces the EGNN denoiser with an Invariant Point Attention (IPA) structure module that explicitly maintains and refines per-residue rigid-body frames (rotation ∈ SO(3) + translation ∈ ℝ³). The architecture class is the same as that used in AlphaFold2's structure module and RFdiffusion. Everything upstream (frozen ContactClassifier encoder, pair stack, Rg predictor) is carried forward from v10 — these components work.

8.4M total parameters (7.2M trainable, 1.2M frozen encoder). Training on CATH (18,024 train / 608 val, max 125 residues) on a single A40 GPU.

IPA Denoiser (6 layers)

Each IPA block performs three operations:

  1. Invariant Point Attention — standard multi-head attention on the single representation, augmented with (a) pair bias from the contact-conditioned pair stack and (b) 3D point attention: each head generates query/key/value points in ℝ³ that are transformed into each residue’s local frame. Attention weights depend on geometric distances between learned points — invariant to global rotation/translation.
  2. Transition MLP — 2-layer feedforward on the single representation.
  3. Frame update — predicts a small quaternion + translation update per residue, composed in the local frame (right-multiplication for SE(3) equivariance). Initialized near-zero so frames are approximately preserved in early training.

The pair representation is static through the IPA stack (not updated). If FAPE stalls after E15, adding outer-product pair updates is the first planned intervention.
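
A hedged sketch of the frame update in step 3, using the identity-biased quaternion parameterization (helper names and details are mine; the real module will differ):

```python
import torch
import torch.nn.functional as F

def quat_to_rot(q):
    """Unit quaternion (..., 4) in (w, x, y, z) order -> rotation matrix (..., 3, 3)."""
    w, x, y, z = q.unbind(-1)
    rows = [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y),
            2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x),
            2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]
    return torch.stack(rows, dim=-1).reshape(*q.shape[:-1], 3, 3)

def compose_frame_update(R, t, dq_xyz, dt):
    """Right-compose a small local-frame update onto frame (R, t).
    dq_xyz: (..., 3) imaginary quaternion part, predicted near zero so the
    update starts close to identity; dt: (..., 3) local-frame translation."""
    one = torch.ones_like(dq_xyz[..., :1])
    dq = F.normalize(torch.cat([one, dq_xyz], dim=-1), dim=-1)  # w fixed to 1
    R_new = R @ quat_to_rot(dq)                     # rotate within the local frame
    t_new = t + (R @ dt.unsqueeze(-1)).squeeze(-1)  # translate along local axes
    return R_new, t_new
```

Because the update is composed on the right, a global rotation applied to (R, t) commutes with the update, which is what gives the SE(3) equivariance noted above.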

SNR-Gated Frame Initialization

Per-residue rigid frames are built from noised Cα coordinates via Gram-Schmidt orthogonalization on consecutive backbone triplets. At high noise (timesteps where SNR(t) < 1.0, roughly t > 700), the noised coordinates are near-isotropic and Gram-Schmidt becomes numerically unstable. The frame confidence mechanism smoothly blends toward identity frames:

$$\text{conf}(t) = \text{clamp}\!\left(\frac{\text{SNR}(t) - 0.2}{1.0 - 0.2},\, 0,\, 1\right), \qquad R_{\text{init}} = \text{slerp}(I,\, R_{\text{GS}},\, \text{conf})$$

The first IPA layer’s frame update is scaled by this confidence, so unreliable initial frames at high noise are attenuated rather than propagated.
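
A minimal implementation of the formula above, with the slerp done by axis-angle scaling (my own sketch, not the training code; numerical edge cases are only lightly handled):

```python
import torch

def frame_confidence(snr, lo=0.2, hi=1.0):
    """conf(t) = clamp((SNR(t) - 0.2) / (1.0 - 0.2), 0, 1)."""
    return ((snr - lo) / (hi - lo)).clamp(0.0, 1.0)

def slerp_from_identity(R, conf):
    """Geodesic interpolation I -> R by factor conf: scale the rotation angle."""
    cos = ((R.diagonal(dim1=-2, dim2=-1).sum(-1) - 1) / 2).clamp(-1 + 1e-6, 1 - 1e-6)
    theta = torch.acos(cos)                              # rotation angle of R
    axis = torch.stack([R[..., 2, 1] - R[..., 1, 2],     # from the skew part of R
                        R[..., 0, 2] - R[..., 2, 0],
                        R[..., 1, 0] - R[..., 0, 1]], dim=-1)
    axis = axis / axis.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    ang = conf * theta                                   # scaled rotation angle
    K = torch.zeros_like(R)                              # skew matrix of the axis
    K[..., 0, 1], K[..., 0, 2] = -axis[..., 2], axis[..., 1]
    K[..., 1, 0], K[..., 1, 2] = axis[..., 2], -axis[..., 0]
    K[..., 2, 0], K[..., 2, 1] = -axis[..., 1], axis[..., 0]
    s = torch.sin(ang)[..., None, None]
    c = torch.cos(ang)[..., None, None]
    eye = torch.eye(3, dtype=R.dtype, device=R.device).expand_as(R)
    return eye + s * K + (1 - c) * (K @ K)               # Rodrigues' formula
```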

Frame-Aware Self-Conditioning

On 50% of training steps: run a no-grad forward pass to get the previous prediction x̂₀, build clean frames from it (treated as t = 0), and use those as the initial frames for the second pass. At high noise, where the frames built from xₜ collapse to identity, self-conditioning provides the model's best guess at clean local geometry — the IPA layers refine good frames instead of building them from scratch. This is qualitatively more powerful than v10's coordinate-only self-conditioning.
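
In pseudo-PyTorch (denoiser and build_frames are hypothetical stand-ins for the actual modules):

```python
import torch

def denoise_with_self_conditioning(denoiser, build_frames, x_t, t, init_frames):
    """Frame-aware self-conditioning: on half of training steps, seed IPA
    with clean frames built from the model's own previous x0 estimate."""
    if torch.rand(()) < 0.5:
        with torch.no_grad():
            x0_prev = denoiser(x_t, t, init_frames)   # first pass, no gradients
        init_frames = build_frames(x0_prev)           # clean frames, treated as t=0
    return denoiser(x_t, t, init_frames)              # gradient-carrying pass
```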

Fixed-Scale Coordinates

v11 replaces per-protein Rg normalization with division by a fixed constant (10Å). Why: Rg normalization caused the noise schedule to be protein-size-dependent. A protein with Rg=5Å had coordinate values ~1.0 after normalization while Rg=25Å gave ~0.2–0.5, meaning the same noise level destroyed more signal for larger proteins. This silently caps TM-scores and looks like a “plateau” rather than a systematic bias. All successful protein diffusion models (FrameDiff, RFDiffusion, Genie) use fixed-scale coordinates.

Independent Auxiliary Distance Head

v10’s DistanceHead read from the main pair stack through detach(), causing feature drift divergence (aux_dist_ce: 3.95→38.8 over 3 epochs). v11 uses a completely independent lightweight pair stack (64-dim, 2 triangle attention layers, ~500K params) reading from the frozen encoder outputs. Ordinal regression with 32 bins replaces 96-bin cross-entropy — adjacent-bin errors are penalized proportionally to distance, not as hard misclassifications.
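
One standard way to implement ordinal regression over distance bins is cumulative binary targets, sketched below. This is a common formulation I am assuming, not necessarily the exact one used here:

```python
import torch
import torch.nn.functional as F

def ordinal_distance_loss(logits, dist, d_min=2.0, d_max=40.0, n_bins=32):
    """logits: (..., n_bins - 1) cumulative logits for P(d > edge_k);
    dist: (...,) real-space distances. An error that crosses more bin
    edges incurs more wrong targets, so adjacent-bin mistakes cost less
    than distant ones."""
    edges = torch.linspace(d_min, d_max, n_bins + 1, device=dist.device)[1:-1]
    targets = (dist.unsqueeze(-1) > edges).float()    # (..., n_bins - 1)
    return F.binary_cross_entropy_with_logits(logits, targets)
```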

v11 Loss Weights

| Loss | v10 | v11 | Change rationale |
|---|---|---|---|
| FAPE | 0.3 | 1.0 | IPA can actually optimize frame consistency |
| Bond | 5.0 | 3.0 | Gentler anneal (1→3 over E1–E10), avoids tug-of-war |
| Clash | 0.0 | 0.1 | Re-enabled: fixed-scale coords make 3.8Å threshold meaningful |
| Aux dist | 0.0 | 0.03 | Conservative start for new independent pair stack |
| Dist MSE | 1.0 | 1.0 | unchanged |
| Angle | 0.5 | 0.5 | unchanged |
| Chirality | 0.1 | 0.1 | unchanged |
| Rg | 0.5 | 0.5 | unchanged |

v11 Training Protocol

LR: 5e-5 (halved from v10). Cosine decay to 1e-6 over 55 epochs with 3-epoch warmup, no restarts (v10 restarts caused pathological degradation at the LR minimum). 1000 diffusion timesteps (was 200). DDIM-50 evaluation every epoch. DDPM training with cosine noise schedule.

Phase 1 (E1-E10): FAPE must be ≥10% below E1 baseline by E5, ≥20% by E10. Phase 2 (E10-E40): Target TM-score > 0.3. Phase 3 (E40-E60): LR decay for final refinement and EMA stabilization.
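
The stated schedule as a sketch, at epoch granularity (my own helper, assuming the peak, floor, warmup, and horizon given above):

```python
import math

def v11_lr(epoch, peak=5e-5, floor=1e-6, warmup=3, total=55):
    """3-epoch linear warmup to 5e-5, then cosine decay to 1e-6, no restarts."""
    if epoch < warmup:
        return peak * (epoch + 1) / warmup
    progress = min((epoch - warmup) / max(total - warmup, 1), 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```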


Diffusion v10 — Architecture & Loss Function (Previous)

Overview

We train a denoising diffusion model for protein backbone (Cα) structure generation, conditioned on inter-residue contact maps predicted by a frozen ContactClassifier encoder. The model operates in Rg-normalized coordinate space: all coordinates are divided by the radius of gyration so the diffusion process is scale-invariant. The denoiser is an 8-layer SE(3)-equivariant graph neural network (EGNN) with 14.6M parameters (13.4M trainable, 1.2M frozen encoder). Training uses the CATH dataset (18,024 train / 608 val proteins, max 125 residues).

Total Loss

The total loss is a weighted combination of eight components. At epoch \(e\):

$$\mathcal{L}_{\text{total}} = w_{\text{dist}} \cdot \mathcal{L}_{\text{dist}} + \beta(e) \cdot \mathcal{L}_{\text{bond}} + w_{\text{aux}} \cdot \mathcal{L}_{\text{aux}} + w_{\chi} \cdot \mathcal{L}_{\chi} + w_{\theta} \cdot \mathcal{L}_{\theta} + w_{\text{fape}} \cdot \mathcal{L}_{\text{fape}} + w_{\text{rg}} \cdot \mathcal{L}_{\text{rg}} + w_{\text{clash}} \cdot \mathcal{L}_{\text{clash}}$$

Clash loss and auxiliary distance CE are logged but excluded (\(w_{\text{clash}} = 0\), \(w_{\text{aux}} = 0\)). See incident reports for rationale.

1. Distance MSE  \(w_{\text{dist}} = 1.0\)

Mean squared error on all pairwise Cα distances in Rg-normalized space:

$$\mathcal{L}_{\text{dist}} = \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} \left( \| \hat{x}_i^{(0)} - \hat{x}_j^{(0)} \| - \| x_i^{(0)} - x_j^{(0)} \| \right)^2$$

where \(\hat{x}^{(0)}\) is the predicted clean structure, \(x^{(0)}\) is the ground truth, both in Rg-normalized coordinates, and \(\mathcal{M}\) is the set of valid residue pairs. Clamped to max 10.0.

Random baseline: ~0.54 (MSE of Gaussian noise pairwise distances vs true).
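
Equivalently, in PyTorch (a sketch; masking and clamp placement in the real code may differ):

```python
import torch

def distance_mse(x_hat, x, mask):
    """x_hat, x: (L, 3) Rg-normalized Cα coordinates; mask: (L,) bool."""
    d_hat, d = torch.cdist(x_hat, x_hat), torch.cdist(x, x)
    pair_ok = mask[:, None] & mask[None, :]          # valid residue pairs M
    sq_err = ((d_hat - d) ** 2).clamp(max=10.0)      # clamped per the spec above
    return sq_err[pair_ok].mean()
```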

2. Bond Geometry  \(\beta(e) = \min(5.0,\; 1.0 + 4.0 \cdot \min(e/15, 1))\)

MSE on consecutive Cα–Cα distances against the ideal 3.8Å bond length (in Rg-normalized space):

$$\mathcal{L}_{\text{bond}} = \frac{1}{L-1} \sum_{i=1}^{L-1} \left( \| \hat{x}_i^{(0)} - \hat{x}_{i+1}^{(0)} \| - \frac{3.8}{R_g} \right)^2$$

The weight is annealed from 1.0 to 5.0 over the first 15 epochs. Starting low prevents bond geometry from dominating early training when the model hasn't learned global structure. As training matures, the increasing weight enforces physically valid backbone geometry. Clamped to max 10.0.

Random baseline: ~0.17. Below 0.02 indicates bonds are within 0.1Å of the ideal 3.8Å spacing.
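
The annealed weight and the loss itself are compact enough to show directly (a sketch under the definitions above):

```python
import torch

def bond_weight(epoch):
    """beta(e) = min(5.0, 1.0 + 4.0 * min(e / 15, 1)): ramps 1.0 -> 5.0 over 15 epochs."""
    return min(5.0, 1.0 + 4.0 * min(epoch / 15, 1.0))

def bond_loss(x_hat, rg):
    """Consecutive Cα–Cα distances vs the ideal 3.8 Å, in Rg-normalized units."""
    d = (x_hat[1:] - x_hat[:-1]).norm(dim=-1)
    return ((d - 3.8 / rg) ** 2).clamp(max=10.0).mean()
```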

3. FAPE (Frame-Aligned Point Error)  \(w_{\text{fape}} = 0.3\)

Measures local structural consistency by constructing rigid frames from consecutive Cα triplets and computing the error in each frame's local coordinate system:

$$\mathcal{L}_{\text{fape}} = \frac{1}{N_f \cdot L} \sum_{f=1}^{N_f} \sum_{j=1}^{L} \min\!\Big( \| R_f^{\top}(\hat{x}_j - o_f) - R_f^{*\top}(x_j - o_f^*) \|,\; d_{\text{clamp}} \Big)$$

Frames are built from every other triplet of Cα atoms: the x-axis along \(c_2 - c_0\), z-axis from the cross product, y-axis completing the right-handed system. Unlike AlphaFold's random 14-residue sampling, v10 uses all residues as targets for stable gradients. Clamped at \(d_{\text{clamp}} = 10.0\).

Random baseline: ~1.31. Drops below 1.0 when the model learns correct fold topology. This is the hardest loss to reduce because it requires global structural correctness, not just local geometry.
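
A minimal sketch of the frame construction and the clamped frame-aligned error, with the triplet stride and edge handling simplified relative to the description above:

```python
import torch
import torch.nn.functional as F

def frames_from_ca(x):
    """Rigid frames from Cα triplets (c0, c1, c2): x-axis along c2 - c0,
    z-axis from a cross product, y-axis completing a right-handed basis."""
    c0, c1, c2 = x[:-2], x[1:-1], x[2:]
    ex = F.normalize(c2 - c0, dim=-1)
    ez = F.normalize(torch.cross(ex, c1 - c0, dim=-1), dim=-1)
    ey = torch.cross(ez, ex, dim=-1)
    R = torch.stack([ex, ey, ez], dim=-1)      # (F, 3, 3), columns are the axes
    return R, c1                               # frame origins at the middle Cα

def fape(x_hat, x, d_clamp=10.0):
    """Every residue expressed in every local frame; clamped error, averaged."""
    R_hat, o_hat = frames_from_ca(x_hat)
    R_true, o_true = frames_from_ca(x)
    # local coords: R^T (x_j - o_f), done for all frames f and residues j at once
    loc_hat = torch.einsum('fab,fja->fjb', R_hat, x_hat[None] - o_hat[:, None])
    loc_true = torch.einsum('fab,fja->fjb', R_true, x[None] - o_true[:, None])
    return (loc_hat - loc_true).norm(dim=-1).clamp(max=d_clamp).mean()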

4. Radius of Gyration  \(w_{\text{rg}} = 0.5\)

MSE on log-transformed radius of gyration predictions for scale invariance:

$$\mathcal{L}_{\text{rg}} = \left( \log \hat{R}_g - \log R_g \right)^2$$

Since the denoiser works in Rg-normalized space, a separate MLP (\(\texttt{RgPredictor}\)) predicts the absolute radius of gyration from sequence embeddings. This allows recovering real-space coordinates at inference: \(x_{\text{real}} = \hat{R}_g \cdot \hat{x}_{\text{norm}}\).

Random baseline: ~0.23. Converges below 0.05 by E2.

5. Chirality  \(w_{\chi} = 0.1\)

MSE on normalized signed volumes (scalar triple products) of consecutive Cα quartets, ensuring correct backbone handedness:

$$\mathcal{L}_{\chi} = \frac{1}{L-3} \sum_{i=1}^{L-3} \left( \frac{\mathbf{v}_1 \cdot (\mathbf{v}_2 \times \mathbf{v}_3)}{\|\mathbf{v}_1\| \|\mathbf{v}_2\| \|\mathbf{v}_3\|} \bigg|_{\hat{x}} - \frac{\mathbf{v}_1 \cdot (\mathbf{v}_2 \times \mathbf{v}_3)}{\|\mathbf{v}_1\| \|\mathbf{v}_2\| \|\mathbf{v}_3\|} \bigg|_{x} \right)^2$$

where \(\mathbf{v}_1 = x_{i+1} - x_i\), \(\mathbf{v}_2 = x_{i+2} - x_{i+1}\), \(\mathbf{v}_3 = x_{i+3} - x_{i+2}\). The normalization maps volumes to \([-1, 1]\), making the loss scale-invariant. Natural proteins are L-amino acids with consistent chirality; this loss prevents mirror-image structures.

Random baseline: ~0.54 (expected MSE for random values in [-1, 1]).
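
As code (a sketch of the definition above):

```python
import torch

def normalized_signed_volumes(x):
    """Scalar triple products of consecutive Cα quartets, normalized to [-1, 1]."""
    v1 = x[1:-2] - x[:-3]
    v2 = x[2:-1] - x[1:-2]
    v3 = x[3:] - x[2:-1]
    num = (v1 * torch.cross(v2, v3, dim=-1)).sum(dim=-1)
    den = (v1.norm(dim=-1) * v2.norm(dim=-1) * v3.norm(dim=-1)).clamp(min=1e-8)
    return num / den

def chirality_loss(x_hat, x):
    return ((normalized_signed_volumes(x_hat) - normalized_signed_volumes(x)) ** 2).mean()
```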

6. Bond Angle  \(w_{\theta} = 0.5\)

MSE on cosines of Cα–Cα–Cα bond angles for consecutive triplets:

$$\mathcal{L}_{\theta} = \frac{1}{L-2} \sum_{i=1}^{L-2} \left( \cos\hat{\theta}_i - \cos\theta_i \right)^2, \quad \cos\theta_i = \frac{(\hat{x}_{i+1} - \hat{x}_i) \cdot (\hat{x}_{i+2} - \hat{x}_{i+1})}{\| \hat{x}_{i+1} - \hat{x}_i \| \| \hat{x}_{i+2} - \hat{x}_{i+1} \|}$$

The ideal Cα–Cα–Cα angle in proteins is ~120° (\(\cos\theta \approx -0.5\)). Working in cosine space avoids discontinuities at 0°/360°.

Random baseline: ~0.70. Below 0.1 indicates correct backbone angles.

7. Auxiliary Distance Cross-Entropy  \(w_{\text{aux}} = 0.0\) (disabled)

Cross-entropy loss on binned pairwise distances, predicted by a DistanceHead MLP from the detached pair representation:

$$\mathcal{L}_{\text{aux}} = -\frac{1}{|\mathcal{M}'|} \sum_{(i,j) \in \mathcal{M}'} \log p_{ij}\big[\text{bin}(d_{ij})\big], \quad \text{bin}(d) = \left\lfloor \frac{d - d_{\min}}{\Delta} \right\rfloor, \quad \Delta = \frac{d_{\max} - d_{\min}}{N_{\text{bins}}}$$

Parameters: \(N_{\text{bins}} = 96\), \(d_{\min} = 2\)Å, \(d_{\max} = 40\)Å, \(\Delta = 0.396\)Å/bin. Distances are computed in real space (\(d_{ij} = R_g \cdot \|x_i - x_j\|\)) then binned. Only pairs within \(d_{\max}\) are included.

The pair representation is detached before entering the DistanceHead, meaning aux gradients train only the MLP head, not the pair stack. This is critical — an earlier experiment without detach caused exponential divergence (see incident report in Training tab).

Random baseline: \(\ln(96) = 4.56\). A uniform predictor assigns \(1/96\) to each bin, giving \(-\ln(1/96) = \ln(96)\). Weight history: 0.3 (E1–E13) → 0.03 (E14) → 0.0 (E15+, disabled). Disabled because the pair representation evolves via structural losses while the DistanceHead observes it through a stop-gradient wall, causing irrecoverable feature drift and exponential CE divergence. Still computed and logged as a diagnostic.
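
The binning and cross-entropy, sketched (diagnostic only, since \(w_{\text{aux}} = 0\); variable names are mine):

```python
import torch
import torch.nn.functional as F

def aux_distance_ce(logits, d_real, d_min=2.0, d_max=40.0, n_bins=96):
    """logits: (P, n_bins) from the DistanceHead on detached pair features;
    d_real: (P,) real-space distances R_g * ||x_i - x_j||. Pairs beyond
    d_max are excluded, matching the definition above."""
    keep = d_real < d_max
    width = (d_max - d_min) / n_bins                        # 0.396 Å per bin
    bins = ((d_real[keep] - d_min) / width).long().clamp(0, n_bins - 1)
    return F.cross_entropy(logits[keep], bins)
```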

8. Clash Loss  \(w_{\text{clash}} = 0.0\) (disabled)

Contact-guided steric clash penalty on non-bonded atoms closer than 3.0Å:

$$\mathcal{L}_{\text{clash}} = \frac{1}{|\mathcal{N}|} \sum_{(i,j) \in \mathcal{N}} \left[ (1 + 2 \cdot c_{ij}) \cdot \text{ReLU}(3.0 - d_{ij}) \right]^2$$

where \(\mathcal{N}\) is the set of non-bonded pairs (\(|i - j| > 1\)), \(c_{ij}\) is the predicted contact probability (detached), and \(d_{ij}\) is the Rg-normalized distance.

Why disabled: The 3.0Å threshold is applied in Rg-normalized coordinates, but for a typical protein with \(R_g \approx 10\)Å, this maps to \(3.0 \times 10 = 30\)Å in real space — penalizing nearly all non-bonded pairs. This produced noisy gradients that dominated ~47% of the total loss in v9, drowning the structural learning signal. Removing clash led to immediate training stability and the structural losses handle steric quality implicitly (clash decreases organically as the model generates better structures).

Training Configuration

Optimizer: AdamW (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\))
Peak learning rate: \(10^{-4}\)
LR schedule: 5-epoch linear warmup (\(0.01 \times \text{lr} \to \text{lr}\)), then CosineAnnealingWarmRestarts (\(T_0 = 15\) epochs, \(\eta_{\min} = 10^{-5}\))
Batch size: 8 (grad accumulation = 2, effective = 16)
Mixed precision: AMP with GradScaler, \(\texttt{dx\_clamp} = 0.5\)
EMA: decay = 0.999, used for DDIM evaluation
Self-conditioning: 50% probability during training
Early stopping: patience = 15 epochs on val structural loss (excludes aux_dist_ce)
DDIM evaluation: 50 steps, every 3 epochs, using EMA weights
Hardware: single NVIDIA A40 (48 GB), ~15 min/epoch
Dataset: CATH 4.2 (18,024 train / 608 val, max 125 residues)

DDIM Evaluation Metrics

Every 3 epochs, we generate structures via 50-step DDIM sampling using EMA weights and evaluate against ground truth:

  • TM-score (Template Modeling): Global fold similarity, range [0, 1]. Scores above ~0.17 are distinguishable from random pairs; above 0.5, the structures share the same fold. Random structures score ~0.10.
  • RMSD (Root Mean Square Deviation): Average atomic displacement after optimal superposition. Random placement gives ~15–16Å. Below 5Å is high quality.
  • GDT (Global Distance Test): Fraction of residues within 1–8Å of the true position. Random gives ~3–4%.
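
For reference, the standard TM-score definition behind these numbers (the textbook formula, not taken from the project code):

$$\text{TM} = \max \left[ \frac{1}{L} \sum_{i=1}^{L} \frac{1}{1 + \left( d_i / d_0(L) \right)^2} \right], \qquad d_0(L) = 1.24 \sqrt[3]{L - 15} - 1.8$$

where \(d_i\) is the distance between the i-th pair of corresponding residues and the maximum is taken over superpositions; the length-dependent \(d_0(L)\) is what makes the score comparable across protein sizes.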