Scored designs accumulated over 24 hours of continuous generation, scoring, and validation. Wave 1 (676 designs) completed in ~2 hours. Wave 2 gen10k campaign added ~9,000 designs over 6 hours. Wave 3 contributed another ~3,500. Growth plateaued at 14,229 as generation jobs finished.
BoltzGen: No hotspot conditioning — binder targets determined by model
RFdiffusion Checkpoints
Complex_base_ckpt.pt — standard binder design (mostly helical)
Complex_beta_ckpt.pt — diverse topology binder design (mixed sheet/helix)
Design Naming Convention
- beta_e2_100-026: beta campaign, E2 face, 100-design batch, design #26
- sat_e2-0675: satellite E2 campaign, design #675
- g10k_e2_sm_j11_binder_125: gen10k E2 small, job 11, binder #125
- cul_small_10-017: cullin face, small binder, 10-design batch, #17
- e2_med_10-027: E2 face, medium binder, 10-design batch, #27
- BG Small 3: BoltzGen small pilot, design #3
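The naming patterns above can be split into fields mechanically. The following is a minimal sketch, assuming the six example names are representative of the full dataset; the function name `parse_design_name` and the regular expressions are illustrative and are not taken from the pipeline itself.

```python
import re

# Patterns inferred only from the six example names above; real names may vary.
_PATTERNS = [
    # g10k_e2_sm_j11_binder_125 -> campaign prefix, job number, binder number
    re.compile(r"^(?P<campaign>g10k_[a-z0-9]+_[a-z]+)_j(?P<job>\d+)_binder_(?P<design>\d+)$"),
    # beta_e2_100-026, cul_small_10-017, e2_med_10-027 -> campaign, batch size, design number
    re.compile(r"^(?P<campaign>[a-z0-9]+_[a-z0-9]+)_(?P<batch>\d+)-(?P<design>\d+)$"),
    # sat_e2-0675 -> campaign and design number, no batch field
    re.compile(r"^(?P<campaign>[a-z0-9]+_[a-z0-9]+)-(?P<design>\d+)$"),
    # BG Small 3 -> BoltzGen pilot, size label, design number
    re.compile(r"^(?P<campaign>BG) (?P<size>\w+) (?P<design>\d+)$"),
]


def parse_design_name(name):
    """Return a dict of fields for a design name, or None if no pattern matches."""
    for pattern in _PATTERNS:
        match = pattern.match(name)
        if match:
            fields = match.groupdict()
            # Convert numeric fields so "026" and "26" compare equal.
            for key in ("job", "batch", "design"):
                if key in fields:
                    fields[key] = int(fields[key])
            return fields
    return None


print(parse_design_name("beta_e2_100-026"))
print(parse_design_name("g10k_e2_sm_j11_binder_125"))
```

Names that follow none of the four patterns return `None`, so unexpected formats surface rather than being silently misparsed.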
Campaigns:
- beta_e2: Early E2 face designs from beta RFdiffusion checkpoint
- sat_e2: Satellite E2 campaigns (diverse hotspot sampling)
- gen10k / g10k: 10,000-scale generation campaign
- cul_small / cul_med: Cullin face, small (40-65 AA) / medium (90-120 AA) binders
- e2_small / e2_med: E2 face, small / medium binders
- BoltzGen: Designs from BoltzGen generative model (not RFdiffusion)
AlphaFold3 Validation
28 top designs validated with AlphaFold3. AF3 ipTM is the ground-truth orthogonal validation.
Most RFdiffusion+MPNN E2 face designs show massive Boltz-2 score inflation (delta, taken here as Boltz-2 ipTM minus AF3 ipTM, of 0.5-0.86).
BoltzGen designs targeting the full RBX1 surface show better AF3 agreement.
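The inflation comparison can be sketched as a per-design difference averaged by pipeline. This assumes delta means Boltz-2 ipTM minus AF3 ipTM, which the report implies but does not define; the record keys and the example values below are made up for illustration and are not results from this analysis.

```python
def boltz_inflation(boltz_iptm, af3_iptm):
    """Boltz-2 ipTM minus AF3 ipTM; positive means Boltz-2 scored the design higher."""
    return boltz_iptm - af3_iptm


def mean_delta_by_pipeline(records):
    """Average inflation delta for each pipeline label."""
    totals = {}
    for rec in records:
        delta = boltz_inflation(rec["boltz_iptm"], rec["af3_iptm"])
        total, count = totals.get(rec["pipeline"], (0.0, 0))
        totals[rec["pipeline"]] = (total + delta, count + 1)
    return {name: total / count for name, (total, count) in totals.items()}


# Hypothetical values chosen only to exercise the function.
example = [
    {"pipeline": "rfdiffusion_e2", "boltz_iptm": 0.95, "af3_iptm": 0.15},
    {"pipeline": "rfdiffusion_e2", "boltz_iptm": 0.90, "af3_iptm": 0.30},
    {"pipeline": "boltzgen", "boltz_iptm": 0.70, "af3_iptm": 0.60},
]
print(mean_delta_by_pipeline(example))
```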
Boltz-2 vs AF3 ipTM
Each point is one validated design. Dashed line = perfect agreement. BoltzGen designs (green) cluster near the diagonal, while RFdiffusion+MPNN E2 designs (blue) show extreme inflation. Edge color indicates AF3 flag: black = top hit, green = promote, orange = review, gray = discard.
AF3 vs Boltz ipTM per design
Solid bars = AF3 ipTM, faded bars = Boltz ipTM. Sorted by AF3 score. Only design_0673 (AF3=0.85) and design_0357 (AF3=0.78) pass ipTM > 0.5.
OpenFold3 (OF3) shows much better agreement with Boltz-2 than AF3 does: the point cloud sits higher. Top OF3 hits (design_1459 at 0.776, design_0283 at 0.749) still show significant Boltz inflation, but the correlation is stronger. E2 face designs (blue) show more inflation than Cullin (red).
AF3 vs OpenFold3 ipTM
For the 20 designs with both AF3 and OF3 scores, OF3 is consistently more generous than AF3. Most designs fall above the diagonal. The two methods agree on relative ranking but not absolute values — OF3 ipTM is ~2-3x higher than AF3 ipTM for the same designs.
Three-method comparison (top 20 by OF3)
Side-by-side Boltz-2 (blue, faded), OpenFold3 (red), and AF3 (green) ipTM for the top 20 OF3 designs. The gap between Boltz and OF3/AF3 is consistent across designs. Faint green bars indicate designs without AF3 data.
Pipeline Performance Comparison
Meta-analysis of design pipelines across campaigns. All plots generated from Python analysis scripts.
Pipeline Comparison (Boxplots)
Boxplots of ipTM, ipSAE, and pLDDT by sub-campaign. Scale-up and wave2 MPNN designs consistently outperform BoltzGen across all metrics.
Hit Rate Comparison
Cumulative hit rate curves showing the fraction of designs above various ipTM thresholds. Scale-up and wave2 campaigns show similar performance profiles.
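A hit rate curve of this kind reduces to counting the designs above each threshold. The sketch below assumes a strict "greater than" comparison, which the report does not specify; `hit_rate_curve` is an illustrative name.

```python
def hit_rate_curve(scores, thresholds):
    """Fraction of scores strictly above each threshold."""
    n = len(scores)
    if n == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(s > t for s in scores) / n for t in thresholds}


# Hypothetical ipTM values for illustration.
print(hit_rate_curve([0.2, 0.4, 0.6, 0.8], [0.3, 0.5, 0.7]))
```

Running one call per campaign on the same threshold grid gives directly comparable curves.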
ipTM vs Binder Length
Relationship between binder length and ipTM score. Small binders (40-65 AA) dominate the dataset and show wide ipTM variation. No clear length-performance correlation observed.
Metric Distributions
ipTM by Campaign
Distribution of ipTM scores stratified by campaign. Both E2 face campaigns (scale-up and wave2) show bimodal distributions with peaks near 0.3 and 0.8.
ipSAE by Campaign
ipSAE distribution by campaign (real values from Boltz-2 ranking). Higher ipSAE indicates tighter predicted interface contacts.
pLDDT by Campaign
pLDDT distribution across campaigns. Most scored designs cluster between 0.6 and 0.8 pLDDT, with the best designs reaching 0.892.
Binder Length Distribution
Binder length distribution across all designs colored by campaign. The majority are small binders (40-65 AA) from E2 face campaigns.
Hotspot Set Analysis
Each RFdiffusion campaign was conditioned on a specific set of RBX1 hotspot residues. Four distinct hotspot sets were used:
Joint distribution of ipSAE and ipTM colored by hotspot set. Marginal histograms show per-set density. Stars mark the best design in each set. The E2 Enhanced set (with ESM-2 DMS-derived residues I54, I84, C42, C53) produces the highest ipSAE values, suggesting that adding mutation-sensitive residues to the hotspot specification improves interface quality.
ipSAE vs binder length by hotspot set
Binder length vs ipSAE by hotspot set. Small binders (40-65 AA) dominate the E2 Enhanced set. BoltzGen designs span a wider length range but achieve lower ipSAE. No strong correlation between length and interface quality within any set, though binders of 50-60 AA (mid-range within the small class) appear slightly enriched among top performers.
Hit rate by hotspot set
Fraction of designs exceeding ipSAE thresholds for each hotspot set. The E2 Enhanced set maintains the highest hit rate at the stricter thresholds. The Cullin set shows competitive hit rates despite targeting a different face, while BoltzGen has the highest fraction above ipSAE=0.3 but drops off sharply above 0.5.
Efficiency frontier: ipSAE vs pLDDT
Pareto front (dashed line) of designs optimizing both ipSAE and pLDDT. Designs on the frontier achieve the best trade-off between interface quality and structural confidence. Most frontier designs are from the E2 Enhanced set, but a few Cullin and E2 Standard designs appear at high pLDDT.
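A two-objective Pareto front like the one described can be found with a sort and a single sweep. This is a minimal sketch, assuming both ipSAE and pLDDT are maximized and each design is an `(ipsae, plddt)` tuple; exact duplicate points are kept only once.

```python
def pareto_front(points):
    """Return the points not dominated in (ipSAE, pLDDT), both maximized.

    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    # Sort by ipSAE descending (pLDDT descending breaks ties), then sweep:
    # a point joins the front only if its pLDDT beats every point seen so far.
    ordered = sorted(points, key=lambda p: (-p[0], -p[1]))
    front = []
    best_plddt = float("-inf")
    for ipsae, plddt in ordered:
        if plddt > best_plddt:
            front.append((ipsae, plddt))
            best_plddt = plddt
    return front


# Hypothetical (ipSAE, pLDDT) pairs for illustration.
designs = [(0.8, 0.6), (0.7, 0.7), (0.6, 0.65), (0.5, 0.9), (0.8, 0.5)]
print(pareto_front(designs))
```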
Amino acid composition by ipSAE quality tier
Amino acid frequency comparison between top-tier (ipSAE > 0.7), mid-tier (0.3-0.7), and failed (ipSAE < 0.1) designs. Blue = charged (R,K,D,E), red = hydrophobic (F,L,I,M,V,W), green = polar (S,T,C,N,Q,Y). Top-tier designs show higher glutamate (E) and leucine (L) frequency, while failed designs are enriched in alanine (A) and glycine (G), suggesting that oversimplified sequences with low complexity correlate with poor interface formation.
Evolutionary Analysis of RBX1
Conservation analysis from 165-sequence MSA (MAFFT) of RBX1 homologs across eukaryotes.
834 raw homologs collected via HMMER, filtered to 272 non-redundant sequences, aligned with 1155 total rows including outgroups.
RBX1 is deeply conserved across eukaryotes. The RING-H2 domain (residues ~27–104) shows uniformly high conservation
(>0.7 for nearly all positions), with zinc-coordinating cysteines and histidines reaching near-perfect conservation (0.91–1.0).
The N-terminal tail (residues 1–20) is more variable, consistent with it being unstructured in the NMR ensemble (2LGV).
The Cullin face is significantly more conserved than the E2 face (mean 0.938 vs 0.839). This makes sense:
the Cullin interaction is constitutive (RBX1 is always bound to a Cullin scaffold in vivo), whereas the E2 interface cycles through
multiple E2 partners. The higher conservation at the Cullin face means binders targeting it are more likely to disrupt a
functionally critical interaction, but the surface may also be harder to compete with due to the tight, conserved binding.
Four perfectly conserved residues stand out: W33, W35, and G73 (all Cullin face), and R46 (E2 face).
These are absolutely invariant across all 165 sequences in the MSA. W33 and W35 form a tryptophan pair that likely stacks against
the Cullin surface—a classic hot-spot motif. R46 is the catalytic arginine critical for E2 activation.
Gap fraction is low (<2%) for the core domain (residues 36–105), meaning the MSA is well-aligned in the
structured region. The N/C-terminal tails show higher gaps (10–27%), reflecting length variation among homologs.
This gives confidence that the conservation scores in the core are reliable.
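Per-column conservation and gap fraction can be computed directly from the alignment. The sketch below assumes conservation is defined as 1 minus the Shannon entropy normalized by log2(20); the report names Shannon entropy conservation but does not give the exact normalization, so the absolute values may differ from those quoted.

```python
import math
from collections import Counter


def column_stats(column, alphabet_size=20):
    """Return (conservation, gap_fraction) for one non-empty alignment column.

    Conservation is 1 - H / log2(alphabet_size), where H is the Shannon
    entropy of the non-gap residues. This normalization is an assumption.
    """
    residues = [c for c in column if c != "-"]
    gap_fraction = 1 - len(residues) / len(column)
    if not residues:
        return 0.0, gap_fraction
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return 1 - entropy / math.log2(alphabet_size), gap_fraction


print(column_stats("WWWW"))  # invariant column, no gaps
print(column_stats("AC--"))  # two residues, half gaps
```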
Design implications: The 60/40 E2/Cullin split is well-justified. The Cullin face offers a tighter, more conserved
target with multiple tryptophan hot-spots, favoring high-affinity designs. The E2 face is more diverse, potentially offering more
epitope options but requiring designs that can outcompete the native E2 partners. The zinc sites should be included in the target
structure but not directly targeted—they are buried and structurally critical, not surface-accessible.
RBX1 Sequence — Conservation Colored
Each residue colored by conservation score.
ESM-2 Deep Mutational Scanning
Masked marginal log-likelihood ratios (dLLR) computed with ESM-2 (650M params) for all single-point mutations across the 108-residue RBX1 sequence.
More negative dLLR = more deleterious mutation. Sensitivity = mean |dLLR| across all 20 amino acids at each position.
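The two definitions above can be written out for a single position. This sketch assumes the per-position log-probabilities have already been obtained by masking that position and reading the model output; it does not call ESM-2 itself, and the input dict is a hypothetical stand-in. Counting all 20 amino acids, including the wild type (whose dLLR is 0), matches the 108 x 20 = 2,160 scored mutations.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dllr_row(log_probs, wild_type):
    """dLLR for each amino acid at one position: log p(mutant) - log p(wild type)."""
    wt_lp = log_probs[wild_type]
    return {aa: log_probs[aa] - wt_lp for aa in AMINO_ACIDS}


def sensitivity(log_probs, wild_type):
    """Mean |dLLR| over all 20 amino acids; the wild type contributes 0."""
    row = dllr_row(log_probs, wild_type)
    return sum(abs(v) for v in row.values()) / len(AMINO_ACIDS)


# Hypothetical log-probabilities at a position that strongly prefers cysteine.
log_probs = {aa: -3.0 for aa in AMINO_ACIDS}
log_probs["C"] = -0.5
print(dllr_row(log_probs, "C")["A"])
print(sensitivity(log_probs, "C"))
```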
Mutations Scored: 2,160
Spearman r (vs Evo): 0.581
Pearson r (vs Evo): 0.714
Most Deleterious dLLR: -18.25
Most Beneficial dLLR: +2.54
Predicted Contacts: 31
Mutation Effect Heatmap
Log-likelihood ratio for each of 20 amino acids at each position. Blue = tolerated/beneficial, red = deleterious. Wild-type residue marked with black dot.
Per-Position Mutation Sensitivity
Mean |dLLR| across all 20 amino acids. Higher = less tolerant of mutations. Colored by functional role.
ESM Sensitivity vs Conservation
Spearman r = 0.581, Pearson r = 0.714. Strong agreement between model-predicted and evolutionary constraint.
ESM-2 and evolution largely agree on which residues are critical. The Spearman correlation of 0.581 (p < 10^-10)
and Pearson of 0.714 (p < 10^-17) between ESM-2 sensitivity and Shannon entropy conservation suggest that the protein language model
has captured genuine structural and functional constraints from sequence alone.
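Both correlation coefficients can be reproduced from two per-position vectors. The following is a dependency-free sketch, assuming the sensitivity and conservation values are aligned by residue index; Spearman is computed as Pearson on average ranks, and p-values are not computed here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A monotonic but non-linear relationship gives a Spearman value of 1 with a lower Pearson value; here the reported Pearson (0.714) exceeds the Spearman (0.581), which points the other way.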
Zinc-coordinating cysteines dominate the sensitivity landscape. C42, C56, C83, C75, C53, C68, and C94 are all among the top 15
most sensitive positions. Any mutation at these sites is catastrophic (dLLR < -10), consistent with their role as structural zinc ligands
that maintain the RING-H2 fold. The most deleterious single mutation in the entire protein is I54W (dLLR = -18.25), a bulky tryptophan
substitution in the hydrophobic core adjacent to C53 and C56.
The Cullin face has the single most sensitive residue (D36, sensitivity = 13.0), which also has near-perfect evolutionary
conservation (0.945). This aspartate likely forms critical salt bridges in the Cullin interaction. For binder design, targeting residues
around D36 could be highly effective at disrupting the complex.
ESM-2 identifies some positions as sensitive that conservation misses. I54 (ESM sensitivity rank 5, conservation only 0.786)
and I84 (rank 11, conservation 0.802) are moderately conserved but ESM predicts they are among the most intolerant of mutations —
likely because they play critical roles in hydrophobic packing that the MSA alone doesn't fully capture.
Design implications: Binders should maximize contacts with high-sensitivity residues (especially D36, C42, R46, F79, W87)
since these positions cannot easily mutate to escape binding. The DMS data also suggests that the N-terminal tail (residues 1–20)
has low mutation sensitivity, confirming it is a poor target for binder design.
AF3 Validation Insights
Comprehensive analysis of what predicts AlphaFold3 validation success.
~51 designs validated with AF3, analyzed using Lasso regression on sequence and structural features.
AF3 is our ground-truth orthogonal validation — Boltz-2 scores are heavily inflated for E2-face RFdiffusion designs.
AF3 Validated: ~51
Top Hits (>=0.7): 7
Promote (0.5-0.7): 5
Review (0.4-0.5): 4
Discard (<0.4): 35
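The four flag categories above amount to binning AF3 ipTM. This is a minimal sketch, assuming each lower bound is inclusive (the report gives ">=0.7" for top hits but only ranges for the other bins); `af3_flag` is an illustrative name.

```python
from collections import Counter


def af3_flag(af3_iptm):
    """Map an AF3 ipTM value to its flag; lower bounds treated as inclusive."""
    if af3_iptm >= 0.7:
        return "top hit"
    if af3_iptm >= 0.5:
        return "promote"
    if af3_iptm >= 0.4:
        return "review"
    return "discard"


# Hypothetical scores for illustration; tally how many fall in each bin.
scores = [0.85, 0.78, 0.55, 0.45, 0.10]
print(Counter(af3_flag(s) for s in scores))
```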
Lasso Regression: What Predicts AF3 Success?
LassoCV regression on 20 sequence and design features to predict AF3 ipTM. Features with non-zero
coefficients are the strongest predictors after regularization. Positive = helps AF3 validation,
negative = hurts.
Lasso regression coefficients (standardized). The strongest predictor of AF3 success
is being a BoltzGen design (untargeted, diverse topology), followed by sequence features like aromatic content
and lower alanine fraction. High Boltz-2 ipTM is actually a weak or negative predictor — designs with inflated
Boltz scores tend to fail AF3 validation.
Top Feature Correlations
Top 8 features correlated with AF3 ipTM. Each subplot shows one feature vs AF3 score
with Ridge regression line and Pearson r. Points colored by pipeline (blue=RFdiff E2, red=RFdiff Cullin, green=BoltzGen).
What Distinguishes Success from Failure?
Comparison of successful designs (AF3 ipTM >= 0.5) vs failed designs (AF3 ipTM < 0.3)
across key features. Successful designs tend to have higher sequence entropy (more diverse amino acid usage),
more aromatic residues, and lower alanine content.
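The sequence features compared here can be derived from the binder sequence alone. The sketch below assumes Shannon entropy is computed in bits over amino acid frequencies and that aromatic means F, W, and Y; the feature names are illustrative and may not match those used in the actual regression.

```python
import math
from collections import Counter


def sequence_features(seq):
    """Composition features for one non-empty amino acid sequence."""
    counts = Counter(seq)
    n = len(seq)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "shannon_entropy": entropy,  # bits; higher means more diverse AA usage
        "unique_aa": len(counts),
        "aromatic_frac": sum(counts[a] for a in "FWY") / n,
        "ala_frac": counts["A"] / n,
        "gly_frac": counts["G"] / n,
    }


# Toy sequences: a low-complexity one and a more diverse one.
print(sequence_features("AAAAGGAA"))
print(sequence_features("AFWYKDEL"))
```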
Amino Acid Composition
Amino acid frequency comparison between AF3-validated (ipTM >= 0.5) and
AF3-failed (ipTM < 0.2) designs. Successful designs are enriched in structurally important residues
(F, W, Y aromatics; charged residues) and depleted in simple residues (A, G).
BoltzGen vs RFdiffusion in AF3
BoltzGen designs show dramatically better AF3 validation than RFdiffusion+MPNN
designs. BoltzGen mean AF3 ipTM is 2-3x higher, despite lower Boltz-2 scores. This suggests RFdiffusion+MPNN
designs may be optimizing for Boltz-2 scoring artifacts rather than genuine binding.
Key Findings
1. BoltzGen designs validate significantly better than RFdiffusion+MPNN designs in AF3,
despite having lower Boltz-2 scores. This is the single strongest predictor of AF3 success.
2. Sequence diversity matters: designs with higher Shannon entropy (more diverse AA usage)
and more unique amino acids tend to validate better. Low-complexity sequences rich in alanine and glycine
consistently fail.
3. Aromatic residues (F, W, Y) are enriched in successful designs — these contribute to
specific hydrophobic contacts at the interface that AF3 can validate.
4. High Boltz-2 ipTM is NOT predictive of AF3 success — in fact, it may be slightly
anti-correlated. Designs with Boltz ipTM > 0.95 mostly fail AF3 validation (delta > 0.7).
5. Cullin-face designs may show better AF3 agreement than E2-face designs, but this rests on a single
data point: the one Cullin design tested (design_0357) has AF3 ipTM = 0.78 with low delta.
6. For future campaigns: prioritize BoltzGen-style generation, increase aromatic content in
ProteinMPNN sampling, and reduce alanine/glycine bias. Consider Cullin-face targeting which appears more
AF3-compatible.