A collection of informal reviews of papers I find interesting — mostly in the protein
structure prediction, protein design, and protein language model space. These are from
Sergey Ovchinnikov's lab
and related groups. Just my thoughts, nothing too formal.
Protein Diffusion Models as Statistical Potentials
Roney, Ou, Ovchinnikov · bioRxiv 2025
What if we could repurpose protein diffusion models as energy functions?
ProteinEBM does exactly that — turning a generative model into a scoring function
that can rank structures, predict conformational landscapes, and estimate mutation effects.
Protein Diffusion Models as Statistical Potentials
James P. Roney, Chenxi Ou, Sergey Ovchinnikov · bioRxiv, 2025
Alright, so here's the setup. Diffusion models have been crushing it in protein structure generation — you've probably heard of RFdiffusion and friends. But this paper asks a really clever question: what if instead of just using these models to generate new proteins, we could use them as energy functions? Like, what if the model's learned score function could tell us how "good" a given protein conformation is?
That's basically what ProteinEBM is. They take a pre-trained protein diffusion model and derive an energy-based model from it. The key insight is that the denoising process in a diffusion model implicitly learns a probability distribution over protein structures. If you can extract the energy landscape from that distribution, you've got yourself a statistical potential that knows what realistic protein conformations look like.
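The mechanics can be illustrated with a one-dimensional toy. Nothing below is the paper's actual construction, and the score function, energy, and integration scheme are all stand-ins, but it shows the core fact the paper leans on: a score model implicitly defines an energy (up to a constant) that you can recover and use for ranking.

```python
import numpy as np

# Toy sketch (not ProteinEBM's implementation): a diffusion model's score
# network estimates s(x) = -dE/dx, the negative gradient of an energy.
# Given the score, an energy *difference* between two conformations can be
# recovered by integrating -s(x) along a path between them. Here the
# "learned" score comes from a 1-D Gaussian toy, so the recovered
# difference can be checked against the closed-form energy.

MU, SIGMA = 2.0, 0.5                      # toy data distribution N(MU, SIGMA^2)

def score(x):
    """Stand-in for a trained score network: grad log p(x)."""
    return -(x - MU) / SIGMA**2

def energy_difference(a, b, n=10_000):
    """E(b) - E(a) via trapezoidal integration of -score over [a, b]."""
    xs = np.linspace(a, b, n)
    fs = -score(xs)
    return float(np.sum((fs[:-1] + fs[1:]) * (xs[1] - xs[0]) / 2))

def exact_energy(x):
    return 0.5 * (x - MU) ** 2 / SIGMA**2    # -log p(x) up to a constant

dE = energy_difference(2.1, 3.0)
assert abs(dE - (exact_energy(3.0) - exact_energy(2.1))) < 1e-4
assert dE > 0   # 3.0 sits further from the data mean: higher energy
```

Ranking structures, conformational states, or mutants then reduces to comparing such energy differences, which is exactly the kind of repurposing the paper is about.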
What's cool is how broadly useful this turns out to be. They show that ProteinEBM can rank the correctness of protein structures (which fold is more native-like?), do actual structure prediction from sequence, sample multiple conformational states of flexible proteins, and even predict the energetic effects of point mutations. And it's competitive with or better than both traditional physics-based force fields and other ML methods across these tasks.
The mutation effect prediction part is particularly interesting to me. There's been a ton of work on predicting delta-delta-G from sequence-based models, but having a structure-aware energy function that can do this feels like the right approach. You're literally asking "how does this mutation change the energy landscape?" which is what's actually happening physically.
The big picture here is that generative models contain way more information than we typically extract from them. We train them to generate, but the learned distribution itself is a rich model of protein physics. This paper is a nice reminder that sometimes the most interesting applications of a model aren't the ones it was originally trained for.
Read the paper
Designing Novel Solenoid Proteins with In Silico Evolution
Pretorius, Nikov, Washio, Florent, Taunt, Ovchinnikov, Murray · Communications Chemistry 2025
Solenoid proteins are nature's modular building blocks. This paper
uses AlphaFold2 as an oracle inside a genetic algorithm to design entirely
new solenoid folds — and 20% of them actually work in the lab.
Designing Novel Solenoid Proteins with In Silico Evolution
Daniella Pretorius, Georgi I. Nikov, Kono Washio, Steve-William Florent, Henry N. Taunt, Sergey Ovchinnikov, James W. Murray · Communications Chemistry (Nature), 2025
Solenoid proteins are one of those protein architectures that feel almost too elegant. They're built from repeating structural units — think of them like a spiral staircase made of protein. Nature uses them everywhere: from beta-helices in antifreeze proteins to TPR repeats that mediate protein-protein interactions. But the natural repertoire of solenoid topologies is limited. What if we could design new ones?
This paper takes a really fun approach. Instead of using the fancy generative models that everyone's been working on lately, they go old school: a genetic algorithm. But with a twist — they use AlphaFold2 as the fitness function. So you have a population of sequences that evolve over generations, and at each step AlphaFold2 scores how well each sequence is predicted to fold into the desired solenoid topology. There's also a solenoid discrimination network that helps filter candidates.
What I like about this is the simplicity of the concept. You don't need to train a specialized generative model. You just need a good oracle (AlphaFold2) and a search strategy (the genetic algorithm). It's kind of like directed evolution, but entirely in silico. And it works — they designed alpha-solenoids, beta-solenoids, and alpha-beta solenoids, spanning both natural topologies and folds that haven't been seen in nature.
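The oracle-plus-search loop is simple enough to sketch end to end. This is purely illustrative: the paper's oracle is AlphaFold2 (plus the solenoid discriminator), while here it is replaced by a toy fitness that rewards matches to a hypothetical ideal repeat sequence; all names and constants are mine.

```python
import random

# Minimal in-silico evolution loop (illustrative only). In the paper the
# fitness oracle is AlphaFold2 judging the predicted fold; here it is a
# toy oracle scoring similarity to a hidden "ideal" solenoid repeat.
AA = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "LEALKEKL" * 4              # hypothetical ideal repeat protein

def oracle_fitness(seq):
    """Stand-in for the AF2-based fitness (e.g. confidence of the fold)."""
    return sum(a == b for a, b in zip(seq, TARGET)) / len(TARGET)

def mutate(seq, rate=0.05):
    return "".join(random.choice(AA) if random.random() < rate else c
                   for c in seq)

def evolve(pop_size=50, generations=200, seed=0):
    random.seed(seed)
    pop = ["".join(random.choice(AA) for _ in TARGET)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=oracle_fitness, reverse=True)
        parents = pop[: pop_size // 5]           # truncation selection
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(pop_size - len(parents))]
    return max(pop, key=oracle_fitness)

best = evolve()
assert oracle_fitness(best) >= 0.75   # far above the random baseline (~0.05)
```

Swap the toy oracle for a structure predictor scoring your target topology and you have the shape of the paper's pipeline: the search strategy stays dumb, the oracle carries all the structural knowledge.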
The experimental validation is solid too. They selected a panel of designs, expressed them in E. coli, and five came out highly stable. That's a 20% experimental success rate, which is actually pretty good for de novo protein design. It's not 100%, but for a method that's essentially guess-and-check with AlphaFold2, it's encouraging.
One thing that struck me is how this approach reveals that the space of possible solenoid topologies is larger than what nature has explored. There are stable folds out there that evolution just never stumbled upon. That's a pretty cool thought — we're not just redesigning nature, we're expanding the protein universe.
Read the paper
CIRPIN: Learning Circular Permutation-Invariant Representations to Uncover Putative Protein Homologs
Kolodziej, Abulnaga, Ovchinnikov · bioRxiv 2025
Most structure comparison tools miss proteins that are related by circular permutation.
CIRPIN fixes this with a clever graph neural network that doesn't care where the chain starts —
uncovering thousands of hidden evolutionary relationships.
CIRPIN: Learning Circular Permutation-Invariant Representations
Aiden R. Kolodziej, S. Mazdak Abulnaga, Sergey Ovchinnikov · bioRxiv, 2025
Here's a problem that's been hiding in plain sight for years. Say you have two proteins that fold into essentially the same 3D shape, but one of them has its chain "rewired" — the N-terminus of one lines up with some internal position of the other. This is called circular permutation, and it's surprisingly common in nature. Proteins can evolve to swap where their chain starts and ends while keeping the same overall fold.
The problem? Most of our structural comparison tools — things like TM-align, Foldseek, DALI — are sensitive to chain topology. They compare structures in a way that cares about which residue comes first, second, third in the linear sequence. So two circularly permuted proteins that look basically identical in 3D can score as completely unrelated. That's a pretty big blind spot.
CIRPIN tackles this head-on with a graph neural network that's explicitly designed to be invariant to circular permutations. The clever bit is the data augmentation: during training, they synthetically circularly permute proteins so the model learns representations that are the same regardless of where you cut the chain. It's one of those ideas that seems obvious in hindsight but actually requires careful implementation.
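Both ingredients are easy to see in a toy. The augmentation really is just re-cutting the chain (`np.roll`), and any descriptor built from the unordered set of pairwise distances is automatically immune to it. CIRPIN learns a much richer invariant representation with a GNN; this sketch (my construction, not theirs) only demonstrates the concept.

```python
import numpy as np

# (1) the augmentation: a synthetic circular permutation re-cuts the chain;
# (2) an order-invariant descriptor: the sorted multiset of pairwise
#     Ca-Ca distances, which cannot change when the chain is re-cut.

rng = np.random.default_rng(0)
coords = rng.normal(size=(30, 3))             # toy Ca trace, 30 residues

def circular_permute(xyz, cut):
    """Training augmentation: start the chain at position `cut` instead."""
    return np.roll(xyz, -cut, axis=0)

def invariant_descriptor(xyz):
    d = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    iu = np.triu_indices(len(xyz), k=1)
    return np.sort(d[iu])                     # multiset of distances

ref = invariant_descriptor(coords)
for cut in (1, 7, 15):
    permuted = circular_permute(coords, cut)
    assert np.allclose(invariant_descriptor(permuted), ref)
```

A distance multiset throws away far too much information to be useful for real homology search, which is why learning the invariance (rather than hand-coding it) is the interesting part.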
The results are striking. They search through the SCOPe database and the AlphaFold Cluster Representatives and find thousands of novel circularly permuted pairs that previous methods completely missed. My favorite finding is that PDZ domains — one of the most studied protein families in structural biology — naturally exist in four distinct circularly permuted forms. Four! And we didn't know about some of these until CIRPIN found them.
This matters because circular permutation isn't just a structural curiosity. It can change a protein's function, stability, and interactions. If our homology detection tools are systematically missing these relationships, we're leaving a lot of biology on the table. CIRPIN fills that gap nicely.
Read the paper
Hit or Miss: Understanding Emergence and Absence of Homo-oligomeric Contacts in Protein Language Models
Zhang, Akiyama, Cho, Jajoo, Ovchinnikov · bioRxiv 2025
Protein language models are trained on single chains, yet they somehow learn about
protein-protein interfaces. This paper digs into how and why — and finds that
bigger models keep getting better at inter-chain contacts even after intra-chain accuracy plateaus.
Hit or Miss: Understanding Emergence and Absence of Homo-oligomeric Contacts in Protein Language Models
Zhidian Zhang, Yo Akiyama, Yehlin Cho, Samarth Jajoo, Sergey Ovchinnikov · bioRxiv, 2025
This one scratches a theoretical itch I've had for a while. Protein language models like ESM-2 are trained on individual protein sequences — single chains, one at a time. They never see protein complexes during training. And yet, when you extract attention maps or contact predictions from these models, they somehow know about inter-subunit contacts in homo-oligomers (proteins that assemble from multiple copies of the same chain). How?
The authors dig into this with a really systematic analysis. First, the scaling behavior is fascinating: as you make the model bigger, its ability to predict inter-chain contacts keeps improving, even after single-chain contact prediction accuracy has plateaued. That's a striking emergent property — the model discovers quaternary structure information as a byproduct of learning more about individual sequences.
There's also a cool comparison between ESM-2 and MSA-based approaches (MSA Pairformer). When you restrict multiple sequence alignments to close evolutionary neighbors, the MSA approach edges out ESM-2 on interface contact recovery (0.44 vs 0.33). But ESM-2 doesn't need any alignment at all — it works from a single sequence. So there's a tradeoff between the coevolutionary signal you get from alignments and the implicit knowledge baked into a massive language model.
One really practical result: the largest ESM-2 model can accurately distinguish genuine biological interfaces from crystal packing artifacts. This is a classic problem in structural biology — when you solve a crystal structure, you see contacts between protein copies in the crystal lattice, and it's not always obvious which of those contacts are biologically real. Having a sequence-based model that can make this call is genuinely useful.
The "hit or miss" framing in the title is apt. The model doesn't learn all interfaces equally. Some it nails, others it completely misses. Understanding which ones and why is where the real mechanistic insight lies, and this paper makes good progress on that front.
Read the paper
Assessing the Utility of Coevolution-Based Residue–Residue Contact Predictions in a Sequence- and Structure-Rich Era
Kamisetty, Ovchinnikov, Baker · PNAS 2013
The 2013 paper that helped establish when coevolution-based contact
prediction is actually useful. A foundational work that set the stage for
everything from direct coupling analysis to AlphaFold.
Assessing the Utility of Coevolution-Based Contact Predictions
Hetunandan Kamisetty, Sergey Ovchinnikov, David Baker · PNAS, 2013
Going back to 2013 for this one, and honestly, reading it now with the benefit of hindsight is wild. This is from the era before AlphaFold, before protein language models, before any of the deep learning revolution in structural biology. The big question was: can we use the patterns of correlated mutations in multiple sequence alignments to predict which residues are in physical contact in a protein's 3D structure?
The answer had been "sort of, sometimes" for a while. What this paper does is lay out clearly when coevolution-based contact prediction actually works well and when it doesn't. The key finding is a simple but powerful rule of thumb: you need at least 5L non-redundant aligned sequences (where L is the protein length) for the predictions to be reliable. Below that threshold, the signal-to-noise ratio just isn't good enough.
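The 5L rule is the kind of thing you can check on any alignment in a few lines. A quick sketch of such a check, using a simple greedy >80%-identity filter as the redundancy criterion (the exact filtering scheme here is illustrative, not the paper's):

```python
# Rule of thumb from the paper: coevolution-based contact prediction
# becomes reliable at roughly 5L non-redundant aligned sequences,
# where L is the protein length.

def nonredundant_count(msa, max_identity=0.8):
    """Greedy filter: keep sequences <=80% identical to any kept one."""
    kept = []
    for seq in msa:
        if all(sum(a == b for a, b in zip(seq, k)) / len(seq) <= max_identity
               for k in kept):
            kept.append(seq)
    return len(kept)

def enough_for_coevolution(msa, L):
    return nonredundant_count(msa) >= 5 * L

# toy MSA: three identical copies collapse to one non-redundant sequence
msa = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKL", "LMNPQRSTVW"]
assert nonredundant_count(msa) == 2
assert not enough_for_coevolution(msa, L=10)   # would need ~50 sequences
```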
They also make a practical argument about when accurate contacts are most useful for structure modeling. If you already have a close homolog with a known structure, the contacts don't add much — you can just thread your sequence onto the known template. But when your best structural template is distant, predicted contacts can bridge the gap and dramatically improve models. It's about finding the sweet spot where you have enough sequences for coevolution but not enough structural information from homology alone.
What makes this paper historically important is that it helped establish the credibility of coevolution-based methods at a time when many people were skeptical. The computational community needed convincing that these predictions were accurate enough to be useful, and this paper provided the benchmarks and the practical guidelines that moved the field forward.
Looking back, this work was laying the intellectual groundwork for everything that followed. The idea that evolutionary information contains structural information — that patterns in sequences encode 3D contacts — is the same insight that powers AlphaFold's Evoformer and the MSA processing in ESMFold. The methods got fancier, but the core principle established here still holds.
Read the paper
More papers I find interesting
De Novo Design of Protein Structure and Function with RFdiffusion
Watson, Juergens, Bennett et al. · Nature 2023
The paper that brought diffusion models to protein design in a big way.
RFdiffusion generates protein backbones from scratch and can design binders,
symmetric assemblies, and enzyme scaffolds — many validated experimentally.
De Novo Design of Protein Structure and Function with RFdiffusion
Joseph L. Watson, David Juergens, Nathaniel R. Bennett et al. · Nature, 2023
If there's one paper that made everyone in structural biology sit up and pay attention to diffusion models, it's this one. RFdiffusion takes the RoseTTAFold structure prediction architecture and fine-tunes it on a denoising task: starting from random noise, iteratively denoise until you get a plausible protein backbone. It's the same idea as image diffusion models (Stable Diffusion, DALL-E), but for 3D protein structures.
What makes this paper really stand out isn't just the method — it's the sheer range of design challenges they tackle. Unconditional monomer design? Check. Topology-constrained design? Check. Protein binder design, where you specify a target surface and the model hallucinates a complementary protein? Check. Symmetric homo-oligomer design? Also check. Enzyme active site scaffolding? Yep, that too.
And they don't just design things computationally. There's extensive experimental validation. They express hundreds of designs, characterize them with various biophysical methods, and even solve cryo-EM structures of some symmetric assemblies. The designed binders actually bind. The symmetric assemblies actually assemble. It's not just pretty pictures from a generative model — these things work in the real world.
The underlying trick is elegant: by training on the denoising objective, the model learns a gradient field over protein structure space that points "uphill" toward realistic folds. Conditioning the denoising on different constraints (target surface, symmetry operations, motif coordinates) lets you steer generation toward specific design goals. It's a really flexible framework.
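The loop itself is conceptually tiny. In the sketch below the trained network is replaced by a stand-in that simply points toward one fixed "realistic" backbone, and motif conditioning is done in the crudest possible way (clamping coordinates each step). RFdiffusion's real update works on SE(3) frames with a learned model; this only shows the shape of the iteration.

```python
import numpy as np

# Conceptual denoising loop: start from noise, repeatedly step along the
# model's predicted direction, and re-impose the conditioning each step.

rng = np.random.default_rng(0)
target = rng.normal(size=(20, 3))           # pretend realistic Ca backbone
motif = target[5:10].copy()                 # conditioning: fix these residues

def predicted_direction(x):
    """Stand-in for the denoiser's predicted update direction."""
    return target - x

x = rng.normal(size=(20, 3)) * 5.0          # start from pure noise
start_err = np.linalg.norm(x - target)
for _ in range(50):
    x = x + 0.1 * predicted_direction(x)    # small denoising step
    x[5:10] = motif                         # hard constraint on the motif

assert np.linalg.norm(x - target) < 0.01 * start_err
assert np.allclose(x[5:10], motif)
```

The point of the real method is that `predicted_direction` is learned from the PDB, so the same loop pulls noise toward the manifold of plausible backbones rather than toward one memorized target.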
For my own work on pMHC design, RFdiffusion is the backbone generation engine. So I have a personal stake in understanding this paper well. The conditioning mechanism is what lets us specify MHC groove geometry and get peptide backbones that fit — it's a direct application of the tools developed here.
Read the paper
Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model (ESMFold)
Lin, Akin, Rao, Johnson, Rives et al. · Science 2023
What if you could predict protein structure from a single sequence, no alignment needed?
ESMFold does this at AlphaFold-like accuracy with a 15-billion-parameter language model,
enabling structure prediction for 600+ million metagenomic proteins.
ESMFold: Protein Structure from a Language Model
Zeming Lin, Halil Akin, Roshan Rao, Jeff Johnson, Alexander Rives et al. · Science, 2023
AlphaFold2 was a revolution, but it has a practical bottleneck: it needs multiple sequence alignments (MSAs). For every protein you want to fold, you first have to search databases, build an alignment, and then run the model. This is slow, and for orphan proteins with few homologs, the alignments can be thin and noisy.
ESMFold's pitch is simple: what if a large enough language model, trained on enough protein sequences, could learn the evolutionary information implicitly? No alignment needed — just give it a single sequence and it predicts the structure. The language model (ESM-2, up to 15 billion parameters) acts as a replacement for the entire MSA pipeline.
The accuracy is impressive. It doesn't quite match AlphaFold2 on every benchmark, but it gets close enough to be extremely useful, especially because it's orders of magnitude faster. And for proteins where building a good MSA is hard or impossible (think metagenomic sequences with no close relatives), ESMFold can still make predictions. That's a huge practical advantage.
The real flex in this paper is the ESM Metagenomic Atlas. They ran ESMFold on 617 million metagenomic protein sequences and predicted structures for all of them. This was simply not feasible with AlphaFold2's MSA requirement. The resulting atlas reveals an enormous amount of structural diversity in environmental sequences — folds that we've never seen in lab-characterized proteins.
For the protein language model community, ESMFold is an important proof of concept. It shows that the representations learned by masked language modeling on protein sequences contain enough structural information to fold proteins. The model isn't just learning sequence patterns — it's learning physics. That's conceptually very satisfying and has big implications for how we think about what these models are actually capturing.
Read the paper
Simulating 500 Million Years of Evolution with a Language Model (ESM3)
Hayes, Rao, Akin et al. · Science 2025
ESM3 is a 98-billion-parameter multimodal model that reasons over protein sequence,
structure, and function simultaneously. It designed a novel fluorescent protein with
only 58% identity to anything in nature — equivalent to 500 million years of evolution.
Simulating 500 Million Years of Evolution with ESM3
Thomas Hayes, Roshan Rao, Halil Akin et al. · Science, 2025
Okay, so this paper is kind of a big deal. ESM3 is a 98-billion-parameter model that doesn't just look at protein sequences — it simultaneously reasons over sequence, structure, and function. It's multimodal in the protein sense: trained on 3.15 billion sequences, 236 million structures, and 539 million function annotations. That's 771 billion tokens total.
The headline result is esmGFP, a computationally designed fluorescent protein that has only 58% sequence identity to any known fluorescent protein. To put that in perspective, natural GFP variants across all of biology (jellyfish, corals, sea anemones) typically share more than 58% identity with each other. The authors estimate this level of divergence would take over 500 million years of natural evolution to achieve. And yet, when they expressed esmGFP in the lab, it fluoresced. It actually worked.
That result is remarkable, but let's be honest about what it means and doesn't mean. The model didn't "simulate evolution" in any mechanistic sense — it's not running a phylogenetic process. What it did was learn the manifold of viable protein sequences well enough to navigate far from known sequences while staying on the functional surface. That's impressive in a different way. It means the model has a deep enough understanding of protein fitness landscapes to make large creative leaps.
The multimodal training is key here. By jointly learning from sequence, structure, and function data, ESM3 can be prompted in any combination of these modalities. Want a protein with a specific fold? Prompt with structure. Want a specific function? Prompt with function tokens. Want both? Prompt with both. This flexibility is what enables the GFP design — they could specify "be fluorescent" and "fold like a beta-barrel" and let the model figure out a sequence that satisfies both constraints.
The scale is also hard to ignore. 98 billion parameters is massive, and the training data spans essentially all known protein diversity. Whether this approach scales further (can you design even more novel proteins with even bigger models?) is an open question, but ESM3 suggests the answer is probably yes.
Read the paper
Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3
Abramson, Adler, Dunger et al. · Nature 2024
AlphaFold 3 moves beyond proteins to predict the structures of complexes involving
DNA, RNA, small molecules, and ions — with a diffusion-based architecture that
substantially outperforms specialized tools for drug-like interactions.
AlphaFold 3: Biomolecular Interactions, Not Just Proteins
Josh Abramson, Jonas Adler, Jack Dunger et al. · Nature, 2024
AlphaFold 2 was a protein structure prediction tool. AlphaFold 3 is something broader — it's a biomolecular interaction prediction tool. The difference matters. Biology doesn't run on isolated proteins; it runs on proteins bound to DNA, RNA, small molecules, metal ions, and each other. AF3 handles all of these in a unified framework.
The architectural change is significant. AF2 used an iterative recycling mechanism with specialized modules for MSA processing and pairwise reasoning. AF3 replaces much of this with a diffusion-based approach. The model predicts atomic coordinates directly through a denoising process, which turns out to be more natural for handling the diversity of molecular types (you can't really use the same residue-level representation for a protein and a small molecule).
The results on protein-ligand interactions are what caught the drug discovery community's attention. AF3 substantially outperforms established docking tools for predicting how drugs bind to proteins. That's a big deal because molecular docking has been a cornerstone of computational drug discovery for decades, and having a learned model surpass physics-based approaches suggests a real paradigm shift.
For protein-nucleic acid interactions, AF3 also beats specialized tools. And for antibody-antigen prediction — notoriously one of the hardest problems in structural biology because CDR loops are so flexible — it significantly improves over AF2-Multimer. These are all areas where having accurate structural predictions has immediate practical applications.
The catch (there's always a catch) is that diffusion-based structure prediction introduces stochasticity. You get different predictions on different runs, which is both a feature (you can sample conformational diversity) and a complication (you need to run multiple seeds and assess confidence carefully). But overall, AF3 represents a genuine step forward in our ability to model the full complexity of cellular molecular machinery.
Read the paper
Protein Language Models Learn Evolutionary Statistics of Interacting Sequence Motifs
Zhang, Wayment-Steele, Brixi, Wang, Kern, Ovchinnikov · PNAS 2024
What do protein language models actually learn? This paper shows ESM-2
stores coevolutionary statistics as motifs of pairwise contacts —
bridging the gap between classical coevolution and modern deep learning.
What Do Protein Language Models Actually Learn?
Zhidian Zhang, Hannah K. Wayment-Steele, Garyk Brixi, Haobo Wang, Dorothee Kern, Sergey Ovchinnikov · PNAS, 2024
This paper is important because it tries to open the black box. We all know that protein language models like ESM-2 learn useful representations — you can predict structure, function, fitness effects, and all sorts of things from their embeddings. But what exactly are they learning? Is it deep, emergent reasoning about protein physics? Or something simpler?
The answer, at least for contact prediction, turns out to be more interpretable than you might expect. The authors develop an unsupervised method to probe what ESM-2 has learned and find that the model stores statistics of coevolving residues — essentially the same coevolutionary signal that classical methods like direct coupling analysis (DCA) extract from multiple sequence alignments. But instead of computing these statistics from alignments at inference time, ESM-2 has memorized them during pre-training.
Even more specifically, ESM-2 predicts contacts by storing "motifs" — small groups of pairwise contacts that tend to co-occur. It's not doing some deep chain of reasoning across the entire protein. It's more like pattern matching against a library of local contact patterns. This is both reassuring (the model is learning something real and interpretable) and a little humbling (it's not doing anything as sophisticated as we might have imagined).
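The classical statistic in question is easy to make concrete. Mutual information between MSA columns is the simplest coevolution signal (DCA is a more sophisticated relative); the toy alignment below is mine, built so that two columns covary the way contacting residues often do.

```python
import numpy as np
from collections import Counter

# Classical coevolution statistic of the kind the paper argues ESM-2 has
# internalized: mutual information between alignment columns. Columns 0
# and 2 of this toy MSA covary perfectly; column 1 is independent noise.

msa = ["AKC", "AEC", "GKD", "GED", "AKC", "GED"]

def column_mi(msa, i, j):
    n = len(msa)
    pi = Counter(s[i] for s in msa)
    pj = Counter(s[j] for s in msa)
    pij = Counter((s[i], s[j]) for s in msa)
    mi = 0.0
    for (a, b), c in pij.items():
        mi += (c / n) * np.log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
    return mi

# the covarying pair (0,2) carries more MI than the independent pair (0,1)
assert column_mi(msa, 0, 2) > column_mi(msa, 0, 1)
```

The paper's claim, roughly, is that ESM-2 has baked statistics like these into its weights during pre-training, organized as recurring motifs of co-occurring contacts, rather than recomputing them from an alignment at inference time.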
One surprising finding: ESM-2 doesn't actually need the full sequence context to predict inter-residue contacts. You can mask or truncate large portions of the sequence and the model still predicts contacts for the remaining region. This suggests the model is doing local pattern recognition rather than global inference, which is consistent with the motif-based explanation.
For the field, this kind of mechanistic understanding is crucial. If we know what the model learns, we can design better models, identify failure modes, and figure out what information is still missing. It also connects the protein language model revolution back to the coevolution-based methods that came before — showing that there's a continuous intellectual thread from DCA to transformer attention maps.
Read the paper
♘ Chess
I've been playing chess since I moved to the US. I mainly play rapid and blitz on Lichess.
Feel free to challenge me!
💻 Open Source
Building tools at the intersection of ML and biology. Check out my projects on GitHub.
Last updated: March 7, 2026
Goal: Train a transformer encoder from scratch on the CATH 4.2 dataset (18k proteins) to jointly predict CATH class/architecture labels AND inter-residue contact maps from sequence alone. The learned embeddings encode the spatial proximity information needed for Stage 2.
Architecture
ContactClassifier — 1.2M params, dim=128, 2 transformer towers, d_pair=64, 2 contact prediction blocks with outer product mean
Training
Single GPU (1080 Ti), batch_size=24, lr=2e-4 with warmup cosine schedule, patience=15 early stopping
Resilience
Per-epoch checkpoints with auto-resume, CSV loss logging, self-resubmitting watchdog system on SLURM
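The "outer product mean" step named in the architecture line can be sketched shape-wise: per-residue embeddings (dim=128) become pair features (d_pair=64) by taking the outer product of the embeddings of residues i and j and projecting it down. The projection below is a random stand-in for the learned linear layer, and unlike Evoformer-style models there is no MSA dimension here to average over.

```python
import numpy as np

# Outer-product construction of pair features from per-residue embeddings.
# Shapes follow the config above (dim=128, d_pair=64); W is a random
# stand-in for a learned projection.

L, dim, d_pair = 10, 128, 64
rng = np.random.default_rng(0)
h = rng.normal(size=(L, dim))                     # per-residue embeddings
W = rng.normal(size=(dim * dim, d_pair)) / dim    # "learned" projection

outer = np.einsum("id,je->ijde", h, h).reshape(L, L, dim * dim)
pair = outer @ W                                  # (L, L, d_pair)

assert pair.shape == (L, L, d_pair)
```

This is what lets a per-residue encoder feed a per-residue-pair head like contact prediction: every (i, j) cell gets its own feature vector derived from both endpoints.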
Training Progress (all 25 epochs — early stopped)
| Epoch | Val Total Loss | Train Class Acc | Val Class Acc | Train Arch Acc | Contact Recall (Val) | Contact BCE (Val) | LR | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | 4.846 | 47.5% | 41.8% | 16.1% | 69.6% | 0.759 | 3.33e-05 | NEW BEST |
| 2 | 4.626 | 58.2% | 54.9% | 25.4% | 72.3% | 0.725 | 6.67e-05 | NEW BEST |
| 3 | 4.454 | 65.9% | 53.8% | 33.8% | 73.9% | 0.702 | 1.00e-04 | NEW BEST |
| 4 | 4.356 | 70.0% | 58.4% | 39.2% | 73.8% | 0.693 | 1.33e-04 | NEW BEST |
| 5 | 4.354 | 72.9% | 60.7% | 42.4% | 73.7% | 0.691 | 1.67e-04 | NEW BEST |
| 6 | 4.220 | 74.4% | 65.1% | 43.7% | 75.5% | 0.668 | 2.00e-04 | NEW BEST |
| 7 | 4.321 | 75.9% | 64.8% | 44.5% | 75.9% | 0.686 | 2.00e-04 | pat 1 |
| 8 | 3.994 | 77.3% | 65.5% | 46.4% | 77.4% | 0.660 | 1.99e-04 | NEW BEST |
| 9 | 3.998 | 77.2% | 66.0% | 47.0% | 78.0% | 0.665 | 1.98e-04 | pat 1 |
| 10 | 3.988 | 78.1% | 66.3% | 47.8% | 77.3% | 0.659 | 1.97e-04 | BEST (final) |
| 11 | 4.199 | 78.9% | 66.1% | 48.5% | 78.0% | 0.655 | 1.96e-04 | pat 1 |
| 12 | 4.073 | 79.4% | 68.9% | 49.1% | 78.7% | 0.652 | 1.94e-04 | pat 2 |
| 13 | 4.128 | 79.5% | 66.8% | 49.9% | 77.4% | 0.652 | 1.92e-04 | pat 3 |
| 14 | 4.014 | 80.1% | 67.8% | 50.1% | 77.2% | 0.653 | 1.89e-04 | pat 4 |
| 15 | 4.083 | 80.4% | 65.6% | 50.7% | 78.1% | 0.652 | 1.87e-04 | pat 5 |
| 16 | 4.010 | 81.3% | 68.8% | 51.5% | 77.1% | 0.666 | 1.84e-04 | pat 6 |
| 17 | 4.139 | 80.9% | 66.3% | 52.1% | 77.8% | 0.659 | 1.80e-04 | pat 7 |
| 18 | 4.062 | 81.8% | 68.1% | 52.8% | 77.5% | 0.659 | 1.77e-04 | pat 8 |
| 19 | 4.134 | 81.7% | 66.0% | 53.4% | 77.8% | 0.654 | 1.73e-04 | pat 9 |
| 20 | 4.208 | 82.2% | 69.2% | 53.8% | 77.5% | 0.660 | 1.69e-04 | pat 10 |
| 21 | 4.112 | 82.4% | 68.9% | 54.5% | 77.5% | 0.653 | 1.64e-04 | pat 11 |
| 22 | 4.138 | 83.3% | 68.6% | 55.1% | 78.0% | 0.650 | 1.60e-04 | pat 12 |
| 23 | 4.208 | 82.8% | 67.6% | 55.2% | 77.3% | 0.649 | 1.55e-04 | pat 13 |
| 24 | 4.169 | 83.1% | 68.9% | 55.6% | 77.0% | 0.654 | 1.50e-04 | pat 14 |
| 25 | 4.358 | 83.8% | 65.1% | 56.4% | 78.0% | 0.652 | 1.45e-04 | EARLY STOP |
Final Results
- Best val loss: 3.988 at epoch 10 (best weights saved)
- Val class accuracy: 66.3% (4-way CATH class), Val architecture accuracy: 31.8% (38+ architectures)
- Contact recall: 77.3%, Contact BCE: 0.659 — model successfully learned spatial proximity from sequence
- Train class accuracy: 78.1%, Train arch accuracy: 47.8%
- Early stopped at epoch 25 (patience 15) — val loss plateaued after epoch 10
- 1.2M params, trained from scratch on CATH 4.2 (18k proteins)
- Training survived 7 SLURM job allocations with checkpoint resume
Training Curves
Contact Map Predictions (3 test proteins)
Each row shows a held-out test protein from a different CATH structural class. The left column is the ground truth contact map (binary: two Cα atoms < 8Å apart), and the right column is the model’s predicted probability of contact from sequence alone. Metrics (precision P, recall R, and Top-L long-range accuracy) are annotated on each prediction panel.
- 1bf0.A (L=60, Few Secondary Structure): A small protein with sparse, irregular contacts. The model captures the overall topology despite limited structural regularity.
- 3ggm.A (L=81, Mainly Beta): Beta-sheet proteins produce characteristic off-diagonal block patterns from strand–strand hydrogen bonding. The model recovers these long-range parallel and anti-parallel strand pairings well.
- 1f9x.A (L=120, Mainly Alpha): Alpha-helical proteins show strong banded diagonal patterns from helix-internal i→i+4 contacts. The model reproduces both the local helical periodicity and inter-helix contacts at larger separations.
These results demonstrate that a 1.2M-parameter transformer encoder trained from scratch on CATH 4.2 (~18k proteins) can learn meaningful spatial proximity signals across all major fold classes — without any pretrained language model or evolutionary information.
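The ground-truth target described above, written out as code, is a minimal sketch of how a binary contact map is derived from coordinates (using the < 8 Å Cα-Cα criterion stated in the caption):

```python
import numpy as np

# Binary L x L contact map: residues i, j are "in contact" when their
# Ca atoms are closer than 8 Angstroms.

def contact_map(ca_coords, cutoff=8.0):
    """Contact map from an (L, 3) array of Ca coordinates."""
    d = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    return (d < cutoff).astype(np.int8)

# toy chain: three Ca atoms on a line, 5 Angstroms apart
ca = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
cm = contact_map(ca)
assert cm.tolist() == [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
```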
Learned Embedding Space (PCA & UMAP)
Attention-pooled protein embeddings (128-dim) from the encoder’s val+test set, projected via PCA and UMAP. The encoder learns to separate CATH classes without explicit contrastive loss — mainly-alpha and mainly-beta proteins form distinct clusters, while alpha-beta proteins span the intermediate region. UMAP reveals finer sub-structure at the architecture level, with several CATH architectures forming tight, well-separated clusters (e.g., 3.40 Rossmann fold, 1.10 orthogonal bundle). This confirms the multi-task training objective (classification + contact prediction) produces structurally meaningful representations suitable for conditioning the downstream diffusion model.
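The attention pooling that produces one 128-dim vector per protein is a standard construction, sketched here with a random stand-in for the learned query vector: score each residue against the query, softmax the scores, and take the weighted mean of the residue embeddings.

```python
import numpy as np

# Attention pooling: per-residue embeddings (L, dim) -> one (dim,) vector.
# The query q stands in for a learned pooling parameter.

L, dim = 50, 128
rng = np.random.default_rng(0)
h = rng.normal(size=(L, dim))          # per-residue embeddings
q = rng.normal(size=(dim,))            # "learned" pooling query

scores = h @ q
w = np.exp(scores - scores.max())
w /= w.sum()                           # softmax over residues
protein_embedding = w @ h              # attention-pooled (dim,) vector

assert protein_embedding.shape == (dim,)
assert np.isclose(w.sum(), 1.0)
```

Unlike mean pooling, the learned query lets the model up-weight structurally informative residues, which is presumably why the pooled space separates CATH classes so cleanly.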
Goal: Generate realistic protein backbone structures (Cα coordinates) conditioned on sequence, using contact-aware embeddings from Stage 1.
Architecture
ContactConditionedPairStack + Denoiser9 (8 EGNN layers, dx_clamp=0.5) + RgPredictor — ~14.6M params total (1.2M frozen encoder + 13.4M trainable)
Key innovations over v8
Contact map probabilities injected via gated projection into pair representation; contact attention bias in denoiser; FAPE loss (frame-aligned point error); predicted radius of gyration instead of fixed Rg=10; 8 denoiser layers (was 6), dx_clamp=0.5
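The gated projection described above can be sketched shape-wise. The sigmoid gate and tensor names are my guesses at a typical implementation, not the actual code: contact probabilities are projected to d_pair channels and added to the pair representation, scaled by a learned gate so the model can down-weight unreliable contacts.

```python
import numpy as np

# Gated injection of encoder contact probabilities into the pair
# representation. W_proj and gate_logit stand in for learned parameters.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

L, d_pair = 16, 64
rng = np.random.default_rng(0)
pair = rng.normal(size=(L, L, d_pair))        # pair representation
contacts = rng.uniform(size=(L, L, 1))        # frozen-encoder contact probs
W_proj = rng.normal(size=(1, d_pair)) * 0.1   # stand-in learned projection
gate_logit = np.zeros(d_pair)                 # learned; init gives gate=0.5

updated = pair + sigmoid(gate_logit) * (contacts @ W_proj)
assert updated.shape == (L, L, d_pair)
```

Initializing the gate near zero contribution is a common trick so the injection starts as a no-op and only grows if it actually helps the denoiser.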
v8 Baseline
- Val loss 2.086, RMSD ~14.5 Å, TM-score ~0.13
- The information bottleneck from the frozen classifier was identified as the main limitation
Run 1 (Epochs 1–22) — Failed
- Best val loss 4.82 at epoch 9 (49% reduction from E1), then catastrophic instability at E10–11: bond geometry spiked from 0.14 to 2.4, clash from 5.3 to 12.8.
- Root causes identified: (1) bond weight annealed in the wrong direction (high→low instead of low→high), (2) cosine LR decayed to 1e-6 floor leaving no recovery capacity, (3) dx_clamp=1.0 amplified instability across 8 layers, (4) loss clamps at 100.0 too generous, (5) clash weight too low (0.5).
- DDIM sampling never exceeded TM=0.041 (E5). Model early-stopped at E24.
Run 2 (Epochs 1–5) — Failed
- Applied 7 fixes from Run 1 diagnosis (reversed bond annealing, dx_clamp=0.5, tighter clamping, aux loss, EMA DDIM, etc.).
- w_clash=2.0 dominated the total loss (~4.8 of ~11 total), drowning out structural signals. LR lacked warmup and started decaying immediately.
- Val total spiked at E5 (11.0→14.9). DDIM TM declined from 0.035 (E1) to 0.025 (E3). Cancelled after 5 epochs.
v9 Run 3 (Epochs 1–7) — Failed
- Best: E1 DDIM TM=0.097, RMSD=16.3Å (best ever). Val loss improved to 8.33 by E4.
- E7 val total spiked to 19.6. Clash loss still dominated (~47% of total at w_clash=1.0). DDIM TM declined steadily: 0.097→0.049→0.028.
- Root cause: clash threshold of 3.0Å applied in Rg-normalized coordinates maps to ~30Å in real space, penalizing nearly all non-bonded pairs with noisy gradients.
v10 — Training Progress (current)
v10 changes (final)
- Clash loss removed entirely (was broken in Rg-normalized coordinate space: 3.0Å threshold mapped to ~30Å real space, penalizing nearly all atom pairs with noise)
- FAPE now uses all residues as targets (was random 16 out of 125, causing high gradient variance epoch-to-epoch)
- DistanceHead pair input undetached — reverted at E5 (see incident report below)
- Auxiliary distance CE disabled (w_aux=0.0) at E15 — pair representation drift caused exponential divergence (see second incident report below)
- Patience tracking switched to structural-only loss (excludes aux contribution) to prevent inflated val_total from triggering premature early stopping
- All prior fixes retained: reversed bond annealing (low→high over 15 epochs), dx_clamp=0.5, tighter loss clamping, 5-epoch linear warmup + cosine restarts
Incident Report: aux_dist_ce Divergence (E6–E12)
What happened: At E6, the auxiliary distance cross-entropy loss (aux_dist_ce) began diverging exponentially:
3.95 → 43.8 → 86.6 → 77.9, eventually reaching 79+ and saturating. By E12, the model was generating
random atom clouds (DDIM TM=0.015, RMSD=52.6Å). All structural losses were corrupted.
Root cause: Removing pair.detach() from the DistanceHead input allowed aux_dist_ce gradients
to flow back through the ContactConditionedPairStack. This created a positive feedback loop: aux gradients destabilized the
pair representation → worse distance predictions → larger aux loss → even larger gradients. The pair representation
feeds both the aux head and the denoiser, so the corruption spread to all structural losses by E8.
Fix applied: Re-detached the pair stack (pair.detach()), deleted corrupted checkpoints (E10–E12),
and restarted from the E4 best checkpoint (val=2.247) with a fresh optimizer. Training immediately resumed smooth improvement,
setting new bests at E7 (2.129) and E8 (2.086).
Trade-off analysis: Detaching the pair stack means aux_dist_ce only trains the DistanceHead MLP, not the pair
representation itself. In principle, end-to-end training of the pair stack through aux loss could improve the learned pair
features. In practice, the aux loss magnitude (~4.0) is much larger than structural losses (~0.3–1.4), and the
cross-entropy gradients are poorly scaled relative to the MSE/FAPE gradients that the pair stack was designed for. The detach
acts as a gradient firewall — the pair stack learns from structural losses (which are well-calibrated), while aux_dist_ce
provides an independent distance prediction that regularizes the DistanceHead without interfering. This is the safer and
empirically superior design. A future approach could use a much smaller aux weight (w_aux=0.01–0.03) or gradient scaling
to enable partial end-to-end training without instability.
Incident Report: Second aux_dist_ce Divergence & Resolution (E14–E16)
What happened: After the pair detach fix stabilized training through E13, aux_dist_ce began diverging
again at E14: 8.9 → 17.5 → 38.8 (doubling every epoch). Structural losses remained stable and excellent
(dst=0.22, fape=1.35, bond=0.006), confirming this was isolated to the DistanceHead.
Root cause: Feature distribution shift. The pair representation is detached before the DistanceHead, so
no gradient flows back — but the pair stack continued evolving via structural losses (dist_mse, FAPE). During the
"topology learning phase" (E10–E13), FAPE began dropping below random, causing rapid pair representation changes. The
DistanceHead's learned feature→bin mapping became stale, producing confidently wrong predictions (CE ≫ ln(96) = 4.56).
This is analogous to a classifier trained on features from a "frozen" encoder where the encoder is actually being updated
by a different objective.
Impact: Even at the reduced weight of w_aux=0.03, the contribution to total loss grew from
0.27 (E14) to 1.16 (E16), exceeding the structural loss (~0.83). This inflated val_total and corrupted patience tracking,
threatening premature early stopping despite continued structural improvement.
Fix applied (E15):
- Set w_aux=0.0 — aux_dist_ce removed from loss entirely
- Patience tracking switched to structural-only loss: val_structural = val_total - w_aux * aux_dist_ce
- Rolled back to E14 checkpoint, reset best_val=0.808 (structural-only), patience=0
- aux_dist_ce still computed and logged as a free diagnostic
Result: E15 (first epoch post-fix) set a new best with val_structural=0.807. Training total
dropped from 1.787 (E16 with aux) to 0.821, reflecting pure structural signal. DDIM metrics: TM=0.129, RMSD=14.53Å.
Lesson: Auxiliary heads that passively observe evolving representations via stop-gradient are inherently
fragile. The DistanceHead architecture requires either (a) its own independent pair stack with end-to-end training, or (b) a
loss function robust to feature drift (e.g., ordinal regression or soft-label CE instead of hard 96-bin classification).
Planned for v11.
| Epoch | Val Structural | Val Dist MSE | Val FAPE | Val Rg Loss | Val Bond Geom | DDIM RMSD | DDIM TM-score | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.938 | 0.536 | 1.311 | 0.232 | 0.167 | 15.49Å | 0.101 | NEW BEST |
| 2 | 2.429 | 0.433 | 1.380 | 0.045 | 0.035 | — | — | NEW BEST |
| 3 | 2.340 | 0.450 | 1.344 | 0.040 | 0.014 | 15.25Å | 0.117 | NEW BEST |
| 4 | 2.247 | 0.355 | 1.352 | 0.039 | 0.016 | — | — | NEW BEST |
| 5 | 2.246 | 0.372 | 1.352 | 0.039 | 0.013 | — | — | NEW BEST |
| 6 | 2.248 | 0.380 | 1.366 | 0.039 | 0.010 | 15.91Å | 0.108 | pat 1 |
| 7 | 2.129 | 0.267 | 1.381 | 0.038 | 0.011 | — | — | NEW BEST |
| 8 | 2.086 | 0.246 | 1.343 | 0.038 | 0.011 | — | — | NEW BEST |
| 9 | 2.106 | 0.241 | 1.351 | 0.037 | 0.013 | 14.92Å | 0.125 | pat 1 |
| 10 | 2.053 | 0.250 | 1.301 | 0.038 | 0.008 | — | — | NEW BEST |
| 11 | 2.024 | 0.237 | 1.325 | 0.037 | 0.008 | — | — | NEW BEST |
| 12 | 2.028 | 0.232 | 1.342 | 0.036 | 0.007 | 14.58Å | 0.131 | pat 1 |
| 13 | 2.006 | 0.223 | 1.354 | 0.036 | 0.006 | — | — | NEW BEST |
| 14 | 0.808 | 0.220 | 1.324 | 0.035 | 0.006 | — | — | NEW BEST (w_aux→0.0) |
| 15 | 0.807 | 0.217 | 1.350 | 0.035 | 0.006 | 14.53Å | 0.129 | NEW BEST |
Loss Reference: Random Baseline & Interpretation
For cross-entropy losses, random performance = ln(N) where N is the number of classes
(a uniform predictor assigns 1/N probability to the correct bin, giving −ln(1/N) = ln(N)).
For MSE losses, random = E1 value (Gaussian noise baseline). Solid lines = train, dashed = val, dotted = not in loss (w=0).
| Metric | w | Type | Random | Best | Interpretation |
|---|---|---|---|---|---|
| Structural | Σ | weighted | ~2.94 | 0.807 ↓ | Weighted sum of structural components (excludes aux). Since E15, patience tracks this metric. |
| Dist MSE | 1.0 | MSE | ~0.54 | 0.217 ↓ | Pairwise Cα distance error. 60% below random. <0.1 = sub-Å accuracy. |
| Bond | 5.0* | MSE | ~0.17 | 0.006 ↓ | Cα–Cα bond length error. Essentially solved. *Annealed 1→5 over 15 epochs. |
| FAPE | 0.3 | L1 | ~1.31 | 1.324 ↓ | Frame-aligned position error (all residues). Just below random — topology learning starting. Drops <1.0 with correct folds. |
| Rg | 0.5 | MSE | ~0.23 | 0.035 ↓ | Radius of gyration error. Well-learned; correct protein size/compactness. |
| Chirality | 0.1 | MSE | ~0.54 | 0.482 ↓ | Signed volume (dihedral handedness) of Cα quartets. 11% below random. |
| Angle | 0.5 | MSE | ~0.70 | 0.181 ↓ | Cα–Cα–Cα bond angle error (cosine). 74% below random, excellent. |
| Aux Dist CE | 0.0 | CE | ln(96) = 4.56 | 3.95 (—) | Disabled at E15; logged only. Diverged due to pair representation drift (see incident report). Weight was 0.3→0.03→0.0. |
| Clash | 0.0 | penalty | ~14.0 | 11.6 (—) | Logged but not in loss (w=0). Disabled: 3Å threshold in Rg-space maps to ~30Å real. |
| TM-score | — | DDIM | ~0.10 | 0.131 ↑ | 50-step DDIM sampling. >0.17 = recognizable folds. Target: >0.30. |
| RMSD | — | DDIM | ~15.5Å | 14.53Å ↓ | <10Å = partial fold. <5Å = high quality. |
Loss Curves (through Epoch 15)
v10: 15 epochs completed. Best val structural loss: 0.807 at epoch 15. Patience: 0/15. Best DDIM TM-score: 0.131 (E12). aux_dist_ce disabled (w=0.0) at E15 after exponential divergence due to pair representation drift; patience now tracks structural-only loss. Running on single A40 GPU (savio3) with auto-resume.
Last updated: 2026-03-09 06:30 UTC
Diffusion v10 — Architecture & Loss Function
Overview
We train a denoising diffusion model for protein backbone (Cα) structure generation,
conditioned on inter-residue contact maps predicted by a frozen ContactClassifier encoder.
The model operates in Rg-normalized coordinate space: all coordinates are divided by
the radius of gyration so the diffusion process is scale-invariant. The denoiser is an
8-layer SE(3)-equivariant graph neural network (EGNN) with 14.6M parameters (13.4M trainable,
1.2M frozen encoder). Training uses the CATH dataset (18,024 train / 608 val proteins,
max 125 residues).
Total Loss
The total loss is a weighted combination of eight components (two of which currently carry zero weight). At epoch \(e\):
$$\mathcal{L}_{\text{total}} = w_{\text{dist}} \cdot \mathcal{L}_{\text{dist}} + \beta(e) \cdot \mathcal{L}_{\text{bond}} + w_{\text{fape}} \cdot \mathcal{L}_{\text{fape}} + w_{\text{rg}} \cdot \mathcal{L}_{\text{rg}} + w_{\chi} \cdot \mathcal{L}_{\chi} + w_{\theta} \cdot \mathcal{L}_{\theta} + w_{\text{aux}} \cdot \mathcal{L}_{\text{aux}} + w_{\text{clash}} \cdot \mathcal{L}_{\text{clash}}$$
Clash loss and auxiliary distance CE are logged but excluded (\(w_{\text{clash}} = 0\), \(w_{\text{aux}} = 0\)). See incident reports for rationale.
1. Distance MSE \(w_{\text{dist}} = 1.0\)
Mean squared error on all pairwise Cα distances in Rg-normalized space:
$$\mathcal{L}_{\text{dist}} = \frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} \left( \| \hat{x}_i^{(0)} - \hat{x}_j^{(0)} \| - \| x_i^{(0)} - x_j^{(0)} \| \right)^2$$
where \(\hat{x}^{(0)}\) is the predicted clean structure, \(x^{(0)}\) is the ground truth, both in
Rg-normalized coordinates, and \(\mathcal{M}\) is the set of valid residue pairs. Clamped to max 10.0.
Random baseline: ~0.54 (MSE of Gaussian noise pairwise distances vs true).
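A NumPy sketch of this loss, assuming \(\mathcal{M}\) is all \(i \neq j\) pairs (the real code presumably also masks padded residues):

```python
import numpy as np

def distance_mse(pred, true, clamp=10.0):
    """Clamped MSE over all pairwise Ca distances (Rg-normalized coords).

    pred, true: (L, 3) arrays of predicted / ground-truth coordinates.
    """
    def pdist(x):
        diff = x[:, None, :] - x[None, :, :]
        return np.sqrt((diff ** 2).sum(-1))

    L = pred.shape[0]
    mask = ~np.eye(L, dtype=bool)             # valid pairs: i != j
    sq_err = (pdist(pred) - pdist(true)) ** 2
    return np.minimum(sq_err, clamp)[mask].mean()
```

Note the loss is invariant to global translation and rotation of the prediction, since only pairwise distances enter.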
2. Bond Geometry \(\beta(e) = \min(5.0,\; 1.0 + 4.0 \cdot \min(e/15, 1))\)
MSE on consecutive Cα–Cα distances against the ideal 3.8Å bond length (in Rg-normalized space):
$$\mathcal{L}_{\text{bond}} = \frac{1}{L-1} \sum_{i=1}^{L-1} \left( \| \hat{x}_i^{(0)} - \hat{x}_{i+1}^{(0)} \| - \frac{3.8}{R_g} \right)^2$$
The weight is annealed from 1.0 to 5.0 over the first 15 epochs. Starting low prevents bond geometry
from dominating early training when the model hasn't learned global structure. As training matures, the
increasing weight enforces physically valid backbone geometry. Clamped to max 10.0.
Random baseline: ~0.17. Below 0.02 indicates bonds are within 0.1Å of the ideal 3.8Å spacing.
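Both the annealing schedule and the loss itself are simple enough to sketch (names are illustrative, not from the codebase):

```python
import numpy as np

def bond_weight(epoch, w_start=1.0, w_end=5.0, anneal_epochs=15):
    """Linear low-to-high annealing of the bond-geometry weight beta(e)."""
    frac = min(epoch / anneal_epochs, 1.0)
    return min(w_end, w_start + (w_end - w_start) * frac)

def bond_loss(pred, rg, ideal=3.8, clamp=10.0):
    """MSE of consecutive Ca-Ca distances vs the ideal 3.8 A bond length,
    expressed in Rg-normalized units (target = 3.8 / Rg)."""
    d = np.linalg.norm(pred[1:] - pred[:-1], axis=-1)
    return np.minimum((d - ideal / rg) ** 2, clamp).mean()
```

The low-to-high direction matters: Run 1 annealed high-to-low and destabilized, per the incident notes above.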
3. FAPE (Frame-Aligned Point Error) \(w_{\text{fape}} = 0.3\)
Measures local structural consistency by constructing rigid frames from consecutive Cα triplets
and computing the error in each frame's local coordinate system:
$$\mathcal{L}_{\text{fape}} = \frac{1}{N_f \cdot L} \sum_{f=1}^{N_f} \sum_{j=1}^{L} \min\!\Big( \| R_f^{\top}(\hat{x}_j - o_f) - R_f^{*\top}(x_j - o_f^*) \|,\; d_{\text{clamp}} \Big)$$
Frames are built from every other triplet of Cα atoms: the x-axis along \(c_2 - c_0\), z-axis from the
cross product, y-axis completing the right-handed system. Unlike AlphaFold's random 14-residue sampling,
v10 uses all residues as targets for stable gradients. Clamped at \(d_{\text{clamp}} = 10.0\).
Random baseline: ~1.31. Drops below 1.0 when the model learns correct fold topology.
This is the hardest loss to reduce because it requires global structural correctness, not just local geometry.
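A loop-based NumPy sketch of the frame construction and loss. I'm assuming the frame origin is the central Cα of each triplet, which the text doesn't state explicitly; the actual implementation may differ:

```python
import numpy as np

def frames(coords):
    """Rigid frames from every other Ca triplet (c0, c1, c2):
    x-axis along c2 - c0, z-axis from the cross product with c1 - c0,
    y-axis completing a right-handed system. Origin assumed to be c1."""
    Rs, os = [], []
    for i in range(0, len(coords) - 2, 2):      # every other triplet
        c0, c1, c2 = coords[i], coords[i + 1], coords[i + 2]
        x = c2 - c0
        x = x / np.linalg.norm(x)
        z = np.cross(x, c1 - c0)
        z = z / np.linalg.norm(z)
        y = np.cross(z, x)                      # right-handed completion
        Rs.append(np.stack([x, y, z], axis=1))  # columns = frame axes
        os.append(c1)
    return np.array(Rs), np.array(os)

def fape(pred, true, d_clamp=10.0):
    """Clamped frame-aligned point error, averaged over frames x residues."""
    Rp, op = frames(pred)
    Rt, ot = frames(true)
    err = 0.0
    for f in range(len(Rp)):
        local_p = (pred - op[f]) @ Rp[f]        # = R^T (x - o), rowwise
        local_t = (true - ot[f]) @ Rt[f]
        d = np.linalg.norm(local_p - local_t, axis=-1)
        err += np.minimum(d, d_clamp).mean()
    return err / len(Rp)
```

Because the error is measured in each frame's local coordinates, FAPE is invariant to a global rigid motion of the prediction but, unlike distance MSE, it is sensitive to chirality and topology.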
4. Radius of Gyration \(w_{\text{rg}} = 0.5\)
MSE on log-transformed radius of gyration predictions for scale invariance:
$$\mathcal{L}_{\text{rg}} = \left( \log \hat{R}_g - \log R_g \right)^2$$
Since the denoiser works in Rg-normalized space, a separate MLP (\(\texttt{RgPredictor}\))
predicts the absolute radius of gyration from sequence embeddings. This allows
recovering real-space coordinates at inference: \(x_{\text{real}} = \hat{R}_g \cdot \hat{x}_{\text{norm}}\).
Random baseline: ~0.23. Converges below 0.05 by E2.
5. Chirality \(w_{\chi} = 0.1\)
MSE on normalized signed volumes (scalar triple products) of consecutive Cα quartets,
ensuring correct backbone handedness:
$$\mathcal{L}_{\chi} = \frac{1}{L-3} \sum_{i=1}^{L-3} \left( \frac{\mathbf{v}_1 \cdot (\mathbf{v}_2 \times \mathbf{v}_3)}{\|\mathbf{v}_1\| \|\mathbf{v}_2\| \|\mathbf{v}_3\|} \bigg|_{\hat{x}} - \frac{\mathbf{v}_1 \cdot (\mathbf{v}_2 \times \mathbf{v}_3)}{\|\mathbf{v}_1\| \|\mathbf{v}_2\| \|\mathbf{v}_3\|} \bigg|_{x} \right)^2$$
where \(\mathbf{v}_1 = x_{i+1} - x_i\), \(\mathbf{v}_2 = x_{i+2} - x_{i+1}\), \(\mathbf{v}_3 = x_{i+3} - x_{i+2}\).
The normalization maps volumes to \([-1, 1]\), making the loss scale-invariant.
Natural proteins are L-amino acids with consistent chirality; this loss prevents mirror-image structures.
Random baseline: ~0.54 (expected MSE for random values in [-1, 1]).
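A sketch of the normalized signed-volume computation (function name is mine):

```python
import numpy as np

def chirality_loss(pred, true):
    """MSE of normalized signed volumes over consecutive Ca quartets."""
    def signed_volumes(x):
        v1 = x[1:-2] - x[:-3]
        v2 = x[2:-1] - x[1:-2]
        v3 = x[3:] - x[2:-1]
        vol = np.einsum('ij,ij->i', v1, np.cross(v2, v3))
        norm = (np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1)
                * np.linalg.norm(v3, axis=1))
        return vol / norm                       # in [-1, 1]
    return ((signed_volumes(pred) - signed_volumes(true)) ** 2).mean()
```

Mirroring a structure flips every signed volume, so a mirror-image prediction is penalized even though all pairwise distances match.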
6. Bond Angle \(w_{\theta} = 0.5\)
MSE on cosines of Cα–Cα–Cα bond angles for consecutive triplets:
$$\mathcal{L}_{\theta} = \frac{1}{L-2} \sum_{i=1}^{L-2} \left( \cos\hat{\theta}_i - \cos\theta_i \right)^2, \quad \cos\theta_i = \frac{(\hat{x}_{i+1} - \hat{x}_i) \cdot (\hat{x}_{i+2} - \hat{x}_{i+1})}{\| \hat{x}_{i+1} - \hat{x}_i \| \| \hat{x}_{i+2} - \hat{x}_{i+1} \|}$$
The ideal Cα–Cα–Cα angle in proteins is ~120° (\(\cos\theta \approx -0.5\)).
Working in cosine space avoids discontinuities at 0°/360°.
Random baseline: ~0.70. Below 0.1 indicates correct backbone angles.
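The cosine-space angle loss, sketched in NumPy (illustrative names):

```python
import numpy as np

def angle_loss(pred, true):
    """MSE of cos(theta) for consecutive Ca-Ca-Ca triplets, where theta is
    the angle between successive displacement vectors (per the formula)."""
    def cos_angles(x):
        u = x[1:-1] - x[:-2]
        v = x[2:] - x[1:-1]
        return (np.einsum('ij,ij->i', u, v)
                / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)))
    return ((cos_angles(pred) - cos_angles(true)) ** 2).mean()
```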
7. Auxiliary Distance Cross-Entropy \(w_{\text{aux}} = 0.0\) (disabled)
Cross-entropy loss on binned pairwise distances, predicted by a DistanceHead MLP from the
detached pair representation:
$$\mathcal{L}_{\text{aux}} = -\frac{1}{|\mathcal{M}'|} \sum_{(i,j) \in \mathcal{M}'} \log p_{ij}\big[\text{bin}(d_{ij})\big], \quad \text{bin}(d) = \left\lfloor \frac{d - d_{\min}}{\Delta} \right\rfloor, \quad \Delta = \frac{d_{\max} - d_{\min}}{N_{\text{bins}}}$$
Parameters: \(N_{\text{bins}} = 96\), \(d_{\min} = 2\)Å, \(d_{\max} = 40\)Å, \(\Delta = 0.396\)Å/bin.
Distances are computed in real space (\(d_{ij} = R_g \cdot \|x_i - x_j\|\)) then binned.
Only pairs within \(d_{\max}\) are included.
The pair representation is detached before entering the DistanceHead, meaning aux gradients
train only the MLP head, not the pair stack. This is critical — an earlier experiment without detach
caused exponential divergence (see incident report in Training tab).
Random baseline: \(\ln(96) = 4.56\). A uniform predictor assigns \(1/96\) to each bin,
giving \(-\ln(1/96) = \ln(96)\). Weight history: 0.3 (E1–E13) → 0.03 (E14) → 0.0 (E15+, disabled).
Disabled because the pair representation evolves via structural losses while the DistanceHead observes it
through a stop-gradient wall, causing irrecoverable feature drift and exponential CE divergence.
Still computed and logged as a diagnostic.
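The binning scheme and the random baseline are easy to make concrete (constants from the parameters above):

```python
import numpy as np

N_BINS, D_MIN, D_MAX = 96, 2.0, 40.0
DELTA = (D_MAX - D_MIN) / N_BINS          # ~0.396 A per bin

def distance_bin(d):
    """Map a real-space distance (A) to its bin index; -1 if out of range
    (pairs beyond d_max are excluded from the loss)."""
    if d < D_MIN or d >= D_MAX:
        return -1
    return int((d - D_MIN) // DELTA)

def uniform_ce():
    """Cross-entropy of a uniform predictor: -ln(1/N) = ln(N_bins)."""
    return -np.log(1.0 / N_BINS)
```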
8. Clash Loss \(w_{\text{clash}} = 0.0\) (disabled)
Contact-guided steric clash penalty on non-bonded atoms closer than 3.0Å:
$$\mathcal{L}_{\text{clash}} = \frac{1}{|\mathcal{N}|} \sum_{(i,j) \in \mathcal{N}} \left[ (1 + 2 \cdot c_{ij}) \cdot \text{ReLU}(3.0 - d_{ij}) \right]^2$$
where \(\mathcal{N}\) is the set of non-bonded pairs (\(|i - j| > 1\)), \(c_{ij}\) is the predicted
contact probability (detached), and \(d_{ij}\) is the Rg-normalized distance.
Why disabled: The 3.0Å threshold is applied in Rg-normalized coordinates, but for a
typical protein with \(R_g \approx 10\)Å, this maps to \(3.0 \times 10 = 30\)Å in real space —
penalizing nearly all non-bonded pairs. This produced noisy gradients that dominated ~47% of the total
loss in v9, drowning the structural learning signal. Removing clash led to immediate training stability
and the structural losses handle steric quality implicitly (clash decreases organically as the model
generates better structures).
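To see the unit bug concretely, here is a sketch of the v9-style penalty evaluated in Rg-normalized coordinates (my reconstruction from the formula above, not the actual code):

```python
import numpy as np

def clash_loss(coords_norm, contact_prob, threshold=3.0):
    """Contact-weighted clash penalty as formulated above, applied in
    Rg-normalized coordinates -- the source of the bug."""
    L = coords_norm.shape[0]
    diff = coords_norm[:, None] - coords_norm[None, :]
    d = np.sqrt((diff ** 2).sum(-1))
    idx = np.arange(L)
    nonbonded = np.abs(idx[:, None] - idx[None, :]) > 1
    pen = ((1 + 2 * contact_prob) * np.maximum(threshold - d, 0.0)) ** 2
    return pen[nonbonded].mean()

# The bug: after normalization, typical pairwise distances are O(1)
# (the structure has Rg = 1), so a threshold of 3.0 normalized units
# flags nearly every non-bonded pair. In real space that threshold is
# 3.0 * Rg, i.e. ~30 A for a typical Rg of 10 A.
rg = 10.0
real_space_threshold = 3.0 * rg           # 30 A
```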
Training Configuration
| Optimizer | AdamW (\(\beta_1=0.9, \beta_2=0.999\)) |
| Peak learning rate | \(10^{-4}\) |
| LR schedule | 5-epoch linear warmup (\(0.01 \times \text{lr} \to \text{lr}\)), then CosineAnnealingWarmRestarts (\(T_0 = 15\) epochs, \(\eta_{\min} = 10^{-5}\)) |
| Batch size | 8 (grad accumulation = 2, effective = 16) |
| Mixed precision | AMP with GradScaler, \(\texttt{dx\_clamp} = 0.5\) |
| EMA | Decay = 0.999, used for DDIM evaluation |
| Self-conditioning | 50% probability during training |
| Early stopping | Patience = 15 epochs on val structural loss (excludes aux_dist_ce) |
| DDIM evaluation | 50 steps, every 3 epochs, using EMA weights |
| Hardware | Single NVIDIA A40 (48 GB), ~15 min/epoch |
| Dataset | CATH 4.3: 18,024 train / 608 val, max 125 residues |
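The warmup-plus-restarts schedule can be sketched per epoch. The actual run presumably uses PyTorch's `CosineAnnealingWarmRestarts` after the warmup phase; this only reproduces the shape:

```python
import math

PEAK_LR, ETA_MIN, WARMUP, T0 = 1e-4, 1e-5, 5, 15

def lr_at(epoch):
    """Linear warmup from 0.01*lr to lr over WARMUP epochs, then cosine
    annealing with warm restarts every T0 epochs."""
    if epoch < WARMUP:
        frac = epoch / WARMUP
        return PEAK_LR * (0.01 + (1.0 - 0.01) * frac)
    t = (epoch - WARMUP) % T0              # position within current cycle
    return ETA_MIN + 0.5 * (PEAK_LR - ETA_MIN) * (1 + math.cos(math.pi * t / T0))
```

The restarts give the optimizer recovery capacity late in training — the lack of which was flagged as a root cause of the Run 1 failure.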
DDIM Evaluation Metrics
Every 3 epochs, we generate structures via 50-step DDIM sampling using EMA weights and evaluate against
ground truth:
- TM-score (Template Modeling): Global fold similarity, range [0, 1]. Scores above ~0.17 exceed the random-pair baseline; > 0.5 generally indicates the same fold. Random structures here score ~0.10.
- RMSD (Root Mean Square Deviation): Average atomic displacement after optimal superposition.
Random placement gives ~15–16Å. Below 5Å is high quality.
- GDT (Global Distance Test): Fraction of residues within 1–8Å of the true position.
Random gives ~3–4%.
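Of these three metrics, RMSD is the one that fits in a few lines — a NumPy sketch via the Kabsch algorithm (TM-score and GDT need TM-align-style tooling, so they're omitted here):

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD after optimal rigid superposition (Kabsch algorithm).

    P, Q: (L, 3) coordinate arrays. Both are centered, the optimal proper
    rotation is found via SVD of the covariance matrix, and RMSD is
    computed after aligning P onto Q.
    """
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # reflection correction keeps the rotation proper (det = +1)
    sign = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, sign]) @ Vt
    return np.sqrt(((P @ R - Q) ** 2).sum(axis=1).mean())
```

Because of the determinant correction, a mirror-image prediction is *not* superposed onto its target — consistent with the chirality loss treating mirrored backbones as wrong.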