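Boltz-1 Paper Review

Introduction

Since AlphaFold3, the center of gravity in biomolecular modeling has shifted from predicting the structure of a single protein to predicting complex interactions. The direction is to handle proteins, nucleic acids, ligands, ions, and modified residues in a single all-atom coordinate framework. The problem was openness. The AlphaFold3 paper was a major turning point, but at publication its model weights and training code were not released, and use was limited mainly to a server.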

Boltz-1 targets exactly this gap. The paper is titled “Boltz-1: Democratizing Biomolecular Interaction Modeling.” It is academic-led work centered on MIT CSAIL and the Jameel Clinic, with involvement from Genesis Research/Genesis Therapeutics. This post looks at Boltz-1 not as “the model that beat AF3” but as an attempt to bring AF3-style biomolecular interaction prediction into open infrastructure.

This framing also matters for the follow-up posts. BoltzDesign1 differentiates backward through the Boltz-1 predictor to hallucinate binders, Boltz-2 adds an affinity prediction layer, and BoltzGen extends the line to design and generation. Boltz-1 itself is not a generator. It does, however, become the evaluator and structural backbone infrastructure for the later design pipelines.

So the central question of a Boltz-1 review is less “what score did this model get” and more “what role can an open predictor play in a downstream design loop.” Candidate generation, structure prediction, confidence ranking, physical sanity checks, and affinity prediction are distinct layers. Of these, Boltz-1 opened up the structure prediction and confidence layers.

The openness problem left after AF3

The big change in AlphaFold3 was scope. Where AlphaFold2 effectively standardized protein structure prediction, AF3 brought a broader biomolecular context into one model: protein-protein, protein-ligand, and protein-nucleic acid complexes, plus ions and modifications.

Such a model is very useful in drug discovery and protein design, because it can evaluate generated binder candidates, check ligand poses, and model nucleic acid/protein complexes. With a closed model, however, researchers cannot easily fix failure modes themselves, and dropping the model freely into commercial or academic workflows is awkward.

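So the motivation behind Boltz-1 is not method novelty alone. The core question is whether a reproducible and modifiable AF3-style system can be built. The paper states that the training code, inference code, model weights, datasets, and benchmarks are released under the MIT license.

This difference is large in day-to-day protein design. A closed server is quick to try, but it limits how far you can analyze failed cases, attach domain-specific filtering, or couple the model tightly to an internal candidate generation pipeline. An open model, even with slightly lower performance numbers, lets you instrument failure modes and adapt it to your workflow. The word “democratizing” in the Boltz-1 title points at exactly this.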

Model scope: biomolecular interaction prediction

Boltz-1 treats proteins, ligands, nucleic acids, ions, and modified residues as targets of all-atom coordinate prediction. Proteins and nucleic acids are represented with one token per residue or base, while ligands and modified residues are represented roughly with one token per heavy atom. MSAs and molecular conformer information are also part of the input.
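To make the tokenization rule concrete, here is a minimal sketch of how a token budget might be counted under it. The `Entity` class, the `count_tokens` helper, and the example sizes are hypothetical and not taken from the Boltz-1 codebase; the only rule assumed is the one described above (one token per residue or base for polymers, one token per heavy atom otherwise).

```python
from dataclasses import dataclass

POLYMER_KINDS = ("protein", "dna", "rna")


@dataclass
class Entity:
    kind: str           # "protein", "dna", "rna", "ligand", "ion", or "modified_residue"
    n_residues: int     # polymer length; unused for atom-level entities
    n_heavy_atoms: int  # heavy-atom count; used for atom-level entities


def count_tokens(entities):
    """One token per standard residue/base, one token per heavy atom otherwise."""
    total = 0
    for e in entities:
        if e.kind in POLYMER_KINDS:
            total += e.n_residues
        else:
            total += e.n_heavy_atoms
    return total


# Toy complex: a 250-residue protein plus a 30-heavy-atom ligand.
toy_complex = [Entity("protein", 250, 2000), Entity("ligand", 0, 30)]
n_tokens = count_tokens(toy_complex)
print(n_tokens)  # 280
```

The practical consequence is that a small ligand can cost as many tokens as a short peptide, which matters once crop sizes are fixed in tokens.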

The overall architecture is AF3-style. The trunk builds token and pair representations, the Pairformer updates the interaction representation, and the diffusion module generates all-atom coordinates. The confidence model evaluates the reliability of the predicted structure.

There is also an important boundary. Boltz-1 is a structure predictor. It is not a model that designs new sequences or generates binders. In a design workflow it can serve as a candidate evaluator or filter, but a good prediction is not by itself a design success.

Missing this boundary causes confusion when reading the later Boltz models. BoltzDesign1 is predictor inversion, Boltz-2 is affinity prediction, and BoltzGen is generative design. All three connect to Boltz-1, but each rests on a different evidence layer. The success of Boltz-1 lies in benchmark structure accuracy and open reproducibility.

Dense MSA pairing

MSA pairing matters for multimeric protein complexes. The co-evolution signal changes depending on which sequence row is paired with which partner-chain row. Boltz-1 performs dense MSA pairing using taxonomy information.

This approach tries to balance the paired signal against the density of the unpaired MSA. Pairing too conservatively leaves the paired MSA sparse, while pairing too loosely injects noise from wrong pairings. Boltz-1 treats this trade-off as an engineering component.
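The pairing idea can be sketched as follows. This is a deliberately simplified illustration, not the paper's Dense MSA Pairing algorithm: the `pair_msas` function, its input layout, and the rule "pair rows that share a taxonomy ID, in order, and keep the rest as unpaired" are all assumptions made for the example.

```python
from collections import defaultdict


def pair_msas(msas):
    """Pair MSA rows across chains by shared taxonomy ID.

    msas: {chain_id: [(taxonomy_id_or_None, sequence), ...]}, query row first.
    Returns (paired_rows, unpaired), where each paired row maps
    chain_id -> sequence and unpaired maps chain_id -> leftover sequences.
    """
    chains = list(msas)
    by_tax = {c: defaultdict(list) for c in chains}
    unpaired = {c: [] for c in chains}
    for c in chains:
        for tax, seq in msas[c][1:]:  # skip the query row
            if tax is None:
                unpaired[c].append(seq)
            else:
                by_tax[c][tax].append(seq)

    # The query sequences always form the first paired row.
    paired = [{c: msas[c][0][1] for c in chains}]
    all_taxa = set().union(*(by_tax[c].keys() for c in chains))
    for tax in sorted(all_taxa):
        # Only as many rows as the chain with the fewest hits can be paired.
        depth = min(len(by_tax[c][tax]) for c in chains)
        for k in range(depth):
            paired.append({c: by_tax[c][tax][k] for c in chains})
        for c in chains:
            unpaired[c].extend(by_tax[c][tax][depth:])
    return paired, unpaired


toy_msas = {
    "A": [(None, "MKT"), (9606, "MKS"), (10090, "MRT"), (562, "MKA")],
    "B": [(None, "GHL"), (9606, "GHI"), (562, "GYL"), (562, "GFL")],
}
paired, unpaired = pair_msas(toy_msas)
print(len(paired), unpaired)
```

In this toy input the taxa 562 and 9606 appear in both chains, so they yield paired rows, while the mouse-only row in chain A and the surplus row in chain B fall back to the unpaired pool. How strict the pairing criterion is moves rows between those two pools, which is the trade-off described above.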

This component is less visible than the benchmark numbers, but it matters in practice for complex prediction. For protein-protein complexes and multi-chain assemblies, how the input MSA is constructed becomes part of the model's performance.

Unified cropping

The training crop is another important engineering choice in Boltz-1. A full complex structure is hard to feed in at once, and looking only at local interactions drops the global context. Boltz-1 uses a unified cropping algorithm that interpolates between spatial crops and contiguous crops.

A spatial crop captures the 3D neighborhood well but can be weak on sequence-contiguous context. A contiguous crop keeps chain-local context but can miss residues that are close in space, such as those at an interface or in a ligand pocket. Boltz-1 samples the neighborhood size so that training moves between these two extremes.
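The interpolation between the two crop types can be sketched roughly like this. The sketch assumes a particular reading of the idea (walk tokens outward from a center, and for each one add its sequence neighbors within `neighborhood` positions on the same chain); the `unified_crop` function, the token layout, and the stopping rule are illustrative assumptions, not the paper's Algorithm.

```python
import math


def unified_crop(tokens, center, neighborhood, max_tokens):
    """Select token indices around `center`.

    tokens: list of (chain_id, position_in_chain, (x, y, z)).
    neighborhood = 0 behaves like a purely spatial crop; a large value
    pulls in whole contiguous stretches and behaves like a contiguous crop.
    """
    center_xyz = tokens[center][2]
    # Visit tokens from nearest to farthest from the crop center.
    order = sorted(range(len(tokens)),
                   key=lambda i: math.dist(tokens[i][2], center_xyz))
    selected = set()
    for i in order:
        chain, pos, _ = tokens[i]
        block = [j for j, (c, p, _) in enumerate(tokens)
                 if c == chain and abs(p - pos) <= neighborhood
                 and j not in selected]
        # Simplification: stop as soon as the next block would overflow.
        if len(selected) + len(block) > max_tokens:
            break
        selected.update(block)
    return sorted(selected)


# Two parallel 6-residue chains, 1.5 Å apart (toy geometry).
toy_tokens = ([("A", i, (float(i), 0.0, 0.0)) for i in range(6)]
              + [("B", i, (float(i), 1.5, 0.0)) for i in range(6)])

spatial = unified_crop(toy_tokens, center=0, neighborhood=0, max_tokens=4)
contiguous = unified_crop(toy_tokens, center=0, neighborhood=5, max_tokens=6)
print(spatial)     # [0, 1, 6, 7]
print(contiguous)  # [0, 1, 2, 3, 4, 5]
```

With `neighborhood=0` the crop spans both chains near the center, while a large neighborhood spends the whole budget on one chain. Sampling the neighborhood size per training example exposes the model to both regimes.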

Choices like this are very practical for structure prediction models. Even with the same dataset and architecture, the interaction representation can differ depending on which crops the model saw during training.

Robust pocket conditioning

Boltz-1 also handles ligand pocket conditioning. Real users more often know only a few key residues than the entire ligand pocket. During training, Boltz-1 supplies pocket/binder features in a subset of iterations and reveals only some of the pocket residues that lie within 6 Å.

The design does not build a separate pocket-conditioned model. Instead, a single unified model is made robust to partial pocket specification. In other words, even without a complete pocket annotation, a limited set of residue hints can help ligand pose prediction.
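A minimal sketch of the "reveal only part of the pocket" idea follows. Only the 6 Å cutoff comes from the text above. The helper names, the input layout, and in particular the uniform choice of how many residues to reveal are assumptions for illustration; the paper may use a different sampling rule.

```python
import math
import random


def pocket_residues(residue_atoms, ligand_atoms, cutoff=6.0):
    """Residue ids with any heavy atom within `cutoff` Å of any ligand heavy atom."""
    return sorted(
        rid for rid, atoms in residue_atoms.items()
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms)
    )


def reveal_subset(pocket, rng):
    """Reveal a random non-empty subset of the pocket (uniform size: an assumption)."""
    if not pocket:
        return []
    k = rng.randint(1, len(pocket))
    return sorted(rng.sample(pocket, k))


# Toy coordinates: residues 10 and 11 are near the ligand, residue 12 is far away.
toy_residues = {
    10: [(0.0, 0.0, 0.0)],
    11: [(4.0, 0.0, 0.0)],
    12: [(20.0, 0.0, 0.0)],
}
toy_ligand = [(1.0, 0.0, 0.0)]

pocket = pocket_residues(toy_residues, toy_ligand)
revealed = reveal_subset(pocket, random.Random(0))
print(pocket)    # [10, 11]
print(revealed)  # a non-empty subset of the pocket
```

Training on partial subsets like `revealed` is what lets a user pass only a handful of known contact residues at inference time.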

That said, pocket conditioning is prediction support. Specifying a pocket does not guarantee ligand binding affinity or selectivity. The evidence here is the structure prediction benchmark and ligand pose proxies.

Kabsch diffusion interpolation

Boltz-1 introduces Kabsch interpolation in the reverse diffusion step. With a non-equivariant denoising model, a mismatch can arise between the aligned MSE used at training time and the coordinate update used at inference time. Boltz-1 tries to reduce this mismatch by Kabsch-aligning the noisy and denoised coordinates before interpolating between them.
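The step can be written out as a short derivation. The notation is introduced here for illustration and is not the paper's: $\hat{x}$ is the noisy structure at noise level $\hat{t}$, $\tilde{x}$ is the denoiser output, $t_{\text{next}}$ is the next noise level, and $\eta$ is a step-scale hyperparameter. The plain update is assumed to be the EDM-style step used by AF3-style samplers, and the direction of alignment (noisy onto denoised) is also an assumption.

```latex
% Plain reverse step, rewritten as an interpolation between noisy and denoised
x^{\text{plain}}_{\text{next}}
  = \hat{x} + \eta\,\frac{t_{\text{next}} - \hat{t}}{\hat{t}}\,(\hat{x} - \tilde{x})
  = \lambda\,\hat{x} + (1-\lambda)\,\tilde{x},
\qquad
\lambda = 1 + \eta\,\frac{t_{\text{next}} - \hat{t}}{\hat{t}}

% Kabsch step: best rigid fit of the noisy coordinates onto the denoised ones
(R^{*}, b^{*}) = \arg\min_{R \in SO(3),\; b \in \mathbb{R}^{3}}
  \sum_{i} \bigl\| R\,\hat{x}_{i} + b - \tilde{x}_{i} \bigr\|^{2}

% Interpolate in the aligned frame
\hat{x}^{\text{align}}_{i} = R^{*}\hat{x}_{i} + b^{*},
\qquad
x^{\text{Kabsch}}_{\text{next}}
  = \lambda\,\hat{x}^{\text{align}} + (1-\lambda)\,\tilde{x}
```

For $\eta = 1$ the weight reduces to $\lambda = t_{\text{next}}/\hat{t}$, which lies between 0 and 1 because the noise level decreases. If the denoiser returns a structure that is rotated or shifted relative to the noisy input, the plain interpolation averages two misaligned point clouds and distorts the geometry, whereas the aligned version interpolates only the internal structure.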

This part is somewhat technical, but it is an important detail in an AF3-style diffusion predictor. All-atom coordinate generation has rotation and translation symmetry. How alignment is handled can affect the stability of the denoising trajectory and the physical plausibility of the result.

The physical validity steering added later in Boltz-1x can be read in the same context. The challenge for Boltz-1-type models is not only average structure accuracy but also reducing failure modes such as steric clashes, overlapping chains, and distorted ligand internal geometry.

Confidence model redesign

Boltz-1 also substantially reworks confidence prediction. Instead of attaching only a small confidence head as in AF3, it fine-tunes a full trunk-like confidence model initialized from the pretrained trunk weights. Activations from the diffusion trajectory are also aggregated recurrently and fed into the confidence prediction.

This choice matters especially in design workflows. When many candidates are generated and then ranked or filtered, the confidence score drives the actual precision. Still, confidence is a proxy for structural accuracy. It is not a direct prediction of binding affinity or biological function.

This is also why Boltz-2 treats affinity prediction as a separate target. Boltz-1 confidence is a signal about how plausible the structure is, and binding thermodynamics is a different evidence layer.

Data and training setup

Boltz-1 uses PDB structures released before September 30, 2021, together with the OpenFold distillation dataset. The PDB training data includes entries with resolution up to 9 Å (resolution ≤ 9 Å) and parses Biological Assembly 1. Ligands are handled with the CCD dictionary plus conformer and atom matching.

Training runs for 68k steps in total with batch size 128. The first 53k iterations use a crop size of 384 tokens / 3456 atoms and draw half from PDB and half from the OpenFold distillation set. The last 15k iterations fine-tune on PDB only with a crop size of 512 tokens / 4608 atoms.

The paper compares this with AF3, which trained a similar architecture for 150k steps at batch size 256, and describes Boltz-1 as trained with roughly 4x less compute. This comparison is best read not as an exact reproduction claim but as evidence of a compute regime in which an open-source group can approach AF3-level benchmarks.
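The "roughly 4x" figure can be checked with back-of-the-envelope arithmetic on the numbers quoted above. The sketch below counts training examples seen (steps times batch size) as a crude proxy for compute. That proxy is an assumption: it ignores crop size, hardware, and any other differences between the two training runs.

```python
# Steps and batch sizes as quoted in the text.
af3_steps, af3_batch = 150_000, 256
boltz_steps, boltz_batch = 68_000, 128

af3_examples = af3_steps * af3_batch        # examples seen by AF3
boltz_examples = boltz_steps * boltz_batch  # examples seen by Boltz-1
ratio = af3_examples / boltz_examples

print(af3_examples)     # 38400000
print(boltz_examples)   # 8704000
print(round(ratio, 2))  # 4.41
```

By this count the ratio is about 4.4, consistent with the "roughly 4x" statement.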

Benchmark setup

Boltz-1 is evaluated on a recent PDB-derived test set and on CASP15. The recent PDB test set contains 541 structures after filtering, and CASP15 contains 66 structures. The baselines are AlphaFold3 and Chai-1.

All models generate 5 outputs using 200 sampling steps and 10 recycling rounds, with precomputed MSAs of up to 16,384 sequences. Evaluation reports both the top-confidence prediction and the best-of-5 oracle.

The metrics are mean all-atom LDDT, the fraction of protein-protein interfaces with DockQ > 0.23, protein-ligand LDDT-PLI, and the fraction of ligands with pocket-aligned RMSD < 2 Å. Each metric looks at a different interaction layer. Overall structure accuracy, the protein-protein interface, the protein-ligand interface, and the ligand pose are better not merged into a single number.
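Two of these metrics are threshold-based success rates, which can be sketched as below. Only the thresholds (DockQ > 0.23, RMSD < 2 Å) come from the text. The `success_rate` helper and the toy per-target values are invented for illustration and are not benchmark data.

```python
def success_rate(values, threshold, higher_is_better=True):
    """Fraction of targets that pass a strict threshold."""
    if not values:
        raise ValueError("need at least one value")
    if higher_is_better:
        hits = sum(v > threshold for v in values)
    else:
        hits = sum(v < threshold for v in values)
    return hits / len(values)


# Invented per-target values, for illustration only.
toy_dockq = [0.10, 0.25, 0.60, 0.22]
toy_ligand_rmsd = [1.2, 3.5, 1.9, 2.0]

dockq_rate = success_rate(toy_dockq, 0.23)
rmsd_rate = success_rate(toy_ligand_rmsd, 2.0, higher_is_better=False)
print(dockq_rate)  # 0.5
print(rmsd_rate)   # 0.5
```

Because these are pass/fail rates, a target sitting just under a threshold counts the same as a complete miss, which is one more reason to read them next to the continuous LDDT-type metrics rather than in place of them.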

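Top-1 and oracle results should also be read separately. Top-1 is closer to asking whether the confidence model picks the most plausible structure in real use. Oracle picks the best of the 5 samples, so it shows the potential of the model's sample pool, but it does not mean a real user can know the correct structure in advance. For candidate filtering, top-1 precision is more practical. For method comparison, oracle serves as a supplementary indicator of generation capacity.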
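The difference between the two selection rules is easy to see in code. The sketch below uses invented (confidence, metric) pairs for five samples of one target; the `top1_and_oracle` helper is hypothetical.

```python
def top1_and_oracle(samples):
    """samples: list of (confidence, metric) pairs, higher metric is better.

    Top-1 reports the metric of the most confident sample.
    Oracle reports the best metric among all samples, which requires
    knowing the ground truth and is therefore not available in real use.
    """
    top1 = max(samples, key=lambda s: s[0])[1]
    oracle = max(metric for _, metric in samples)
    return top1, oracle


# Invented values for five diffusion samples of a single target.
toy_samples = [(0.81, 0.62), (0.77, 0.70), (0.85, 0.58), (0.60, 0.40), (0.79, 0.66)]
top1, oracle = top1_and_oracle(toy_samples)
print(top1, oracle)  # 0.58 0.7
```

Here the most confident sample is not the most accurate one, so top-1 trails oracle. The size of that gap is a rough measure of how much a better confidence model could still recover.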

Main result: a benchmark range similar to AF3/Chai-1

In Figure 4, Boltz-1, AlphaFold3, and Chai-1 show broadly similar performance on the PDB test set and CASP15. AlphaFold3 is slightly ahead on mean LDDT, and the authors suggest that extra distillation data may have helped on RNA/DNA complexes.

For protein-protein DockQ > 0.23, AF3 is slightly higher than Boltz-1 and Chai-1 on the test set, but the difference is within the confidence intervals. For protein-ligand LDDT-PLI and ligand RMSD < 2 Å, AF3 and Boltz-1 are slightly higher than Chai-1, again with differences that fall within the confidence intervals.

In the Table 1 ablation, increasing the number of recycling rounds and diffusion steps generally improves performance, but results are close to a plateau beyond 3 recycling rounds / 50 diffusion steps. At 3 recycles / 200 steps, test-set top-1 results are mean LDDT 0.716, DockQ > 0.23 0.625, LDDT-PLI 0.580, and ligand RMSD < 2 Å 0.545. The corresponding oracle values are 0.729, 0.654, 0.621, and 0.581.
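The gap between top-1 and oracle in these numbers can be computed directly. The values below are the ones quoted in the paragraph above; the dictionary keys are just labels chosen for this sketch.

```python
# Test-set values at 3 recycles / 200 steps, as quoted in the text.
top1_scores = {
    "mean_lddt": 0.716,
    "dockq_gt_0.23": 0.625,
    "lddt_pli": 0.580,
    "ligand_rmsd_lt_2A": 0.545,
}
oracle_scores = {
    "mean_lddt": 0.729,
    "dockq_gt_0.23": 0.654,
    "lddt_pli": 0.621,
    "ligand_rmsd_lt_2A": 0.581,
}

gaps = {name: round(oracle_scores[name] - top1_scores[name], 3)
        for name in top1_scores}
for name, gap in gaps.items():
    print(f"{name}: {gap:.3f}")
```

The gap is smallest for mean LDDT (0.013) and largest for the ligand-related metrics (0.041 for LDDT-PLI, 0.036 for ligand RMSD), which suggests that sample ranking matters more for ligand poses than for overall structure accuracy.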

Openness as evidence

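The most important result of Boltz-1 is not only the benchmark numbers. The paper and the README state that the training and inference code, model weights, datasets, and benchmarks are released under the MIT license. It also matters that this is an open license permitting both academic and commercial use.

Because of this scope of release, Boltz-1 goes beyond a performance comparison with AF3 and Chai-1. Researchers can reproduce and modify the architecture, training, benchmarks, and failure cases themselves. The rapid appearance of follow-up lines such as Boltz-1x, Boltz-2, BoltzDesign1, and BoltzGen can also be seen as an effect of this open infrastructure.

So asking only “is it better than AF3” when reading Boltz-1 is too narrow. The more important question is “has AF3-style biomolecular interaction prediction become community-modifiable infrastructure.”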

Failure mode: physical plausibility

Boltz-1 approaches the AF3-level benchmark range, but it also reports failure modes. Figure 5 shows non-physical structures such as overlapping chains and steric clashes. The authors explain that some overlapping-chain failures may be related to overlapping ligands in the PDB training data or to crop size limitations on large complexes.

These failures are not minor rendering glitches. Even when a ligand pose looks close by RMSD, a wrong chiral center or bond geometry changes the chemistry. Protein chain overlap can inflate interface confidence and feed bad inputs to downstream docking, MD, or affinity models. The physical steering in Boltz-1x targets exactly this gap.

This matters for design filtering. Even when the structure predictor's confidence is high, physical validity has to be checked separately. Problems with ligand internal geometry, covalent bonds, chirality, clashes, and inter-chain overlap can create false positives in a downstream design pipeline.
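One of these checks, inter-chain clash counting, can be sketched in a few lines. This is a minimal illustration and not the validity check used by Boltz-1 or Boltz-1x: the `count_interchain_clashes` helper and the 2.0 Å cutoff are assumptions, and a real check would use element-specific radii and a spatial index instead of the all-pairs loop.

```python
import math
from itertools import combinations


def count_interchain_clashes(atoms, min_dist=2.0):
    """Count atom pairs from different chains that are closer than `min_dist` Å.

    atoms: list of (chain_id, (x, y, z)). The all-pairs loop is O(n^2),
    which is fine for a sketch but too slow for large complexes.
    """
    return sum(
        1
        for (chain_a, pos_a), (chain_b, pos_b) in combinations(atoms, 2)
        if chain_a != chain_b and math.dist(pos_a, pos_b) < min_dist
    )


# Toy coordinates: one chain-B atom sits between two chain-A atoms.
toy_atoms = [
    ("A", (0.0, 0.0, 0.0)),
    ("A", (1.5, 0.0, 0.0)),
    ("B", (0.5, 0.0, 0.0)),
    ("B", (10.0, 0.0, 0.0)),
]
n_clashes = count_interchain_clashes(toy_atoms)
print(n_clashes)  # 2
```

The two chain-A atoms are only 1.5 Å apart but are not counted, since bonded atoms within a chain are legitimately close. A check like this runs independently of the confidence score, which is the point made above.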

Boltz-1x's use of inference-time steering to improve chirality, stereochemistry, ligand internal geometry, inter-chain clashes, and covalent bonds is the follow-up response to this problem. For a Boltz-1 review, it is enough to keep this as a reading rule: “prediction accuracy and physical validity are not the same metric.”

Connection to BoltzDesign1

BoltzDesign1 does not fine-tune Boltz-1 directly into a binder generator. Instead, it differentiates backward through Boltz-1's Pairformer and Confidence modules and uses them as a design objective. It does not backpropagate through the entire diffusion module, relying on distogram and confidence proxies instead.

Understanding this connection requires pinning down the role of Boltz-1 precisely. Boltz-1 is not a model that generates candidates. It predicts the structure of a biomolecular complex and assigns a confidence. BoltzDesign1 is a derived workflow that turns that predictor's internal representation into a design objective.

In other words, Boltz-1's benchmark accuracy and confidence quality become the foundation of the later design pipelines. But the in silico success or wet-lab results of a later pipeline do not automatically count as validation of Boltz-1 itself.

The line leading to Boltz-2 and BoltzGen

Boltz-2 moves toward combining Boltz-style structure modeling with affinity prediction. If Boltz-1 confidence is a proxy for structural accuracy, Boltz-2 is a step toward handling binding strength more directly.

BoltzGen goes further into all-atom generative binder design. It tries to bring broad modalities such as nanobodies, protein binders, peptides, and small-molecule binders under a single design specification language.

Seen along this lineage, Boltz-1 is the starting point: an “open AF3-style predictor.” Affinity prediction, inverse design, and universal binder generation are stacked on top of it. That is why the Boltz-1 review serves as the reference point for reviews of the later Boltz models.

Figure by figure

Figure 1 shows example predictions from the test set. The examples have TM-scores around 0.95, but they should be read as a representative showcase.

Figure 2 explains the difference between AF3 reverse diffusion and Boltz-1 Kabsch interpolation. This figure is important for understanding the coordinate diffusion engineering in Boltz-1.

Figure 3 shows the Boltz-1 architecture and the redesigned confidence model. It is the figure to consult for how the Pairformer, diffusion module, and confidence model connect.

Figure 4 is the main benchmark. It compares AF3, Chai-1, and Boltz-1 on the PDB test set and CASP15 using LDDT, DockQ, LDDT-PLI, and ligand RMSD.

Figure 5 shows failure examples. Leaving Figure 5 out of a review would make Boltz-1 look like a clean success story about structure prediction accuracy alone. This is where the need to check physical validity separately comes from.

Algorithms 2–4 are Dense MSA Pairing, Unified Cropping, and Robust Pocket-Conditioning. The practical contribution of Boltz-1 lies not only in the architecture diagram but also in engineering details like these.

Reading the evidence layers separately

The evidence for Boltz-1 is best read in three layers. The first is benchmark structure accuracy. LDDT, DockQ, LDDT-PLI, and ligand RMSD belong here. This layer asks whether the model reaches a prediction range similar to AF3 and Chai-1.

The second is reproducibility and open infrastructure evidence. The scope of the code, weights, dataset, and benchmark release, along with the license, belongs here. The greatest significance of Boltz-1 lies in this layer.

The third is downstream utility. Follow-up work such as BoltzDesign1, Boltz-2, and BoltzGen uses a Boltz-style predictor for design, affinity, and generation. This layer shows how the open infrastructure of Boltz-1 connects to the actual design ecosystem.

Keeping a balanced reading

Boltz-1 is a biomolecular complex structure predictor, not a binder generator. Holding on to this distinction makes the paper's contribution clearer. The core of Boltz-1 is not designing new sequences. It is turning AF3-style interaction modeling into reproducible infrastructure through open code, weights, datasets, and benchmarks.

Benchmark accuracy is structure prediction evidence. Binding, affinity, specificity, function, and developability are layers handled by downstream assays or separate models. The confidence score, too, is better understood as a structural accuracy proxy than as an affinity score.

Reported performance depends on sampling and recycling counts, MSA setup, benchmark filtering, handling of OOM and other failures, and the choice between top-1 and oracle. Physical validity failures also remain. Overlapping chains, steric clashes, and ligand geometry issues are the problems Boltz-1x goes on to address.

Where it actually sits in a design pipeline

If Boltz-1 goes into a protein design pipeline, its most natural position is as the structural evaluator after generation. Models such as RFdiffusion, BindCraft, BoltzDesign1, and BoltzGen produce candidates, and a Boltz-1-type predictor evaluates the target complex pose and confidence. Physical validity checks, affinity predictors, docking/FEP, and wet-lab assays follow.
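The evaluator stage can be sketched as a simple filter over already-scored candidates. Everything here is hypothetical: the `Candidate` fields, the thresholds, and the assumption that a confidence value and a clash count have already been computed upstream. It only illustrates where a structural filter sits, not how Boltz-1 is invoked.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    confidence: float  # structural confidence from the predictor (assumed 0..1)
    n_clashes: int     # result of a separate physical validity check


def structural_filter(candidates, min_confidence=0.8, max_clashes=0):
    """Keep candidates that pass both checks, most confident first.

    Passing this filter means "structurally plausible", not "binds".
    """
    kept = [c for c in candidates
            if c.confidence >= min_confidence and c.n_clashes <= max_clashes]
    return sorted(kept, key=lambda c: c.confidence, reverse=True)


toy_candidates = [
    Candidate("b1", 0.91, 0),
    Candidate("b2", 0.95, 3),  # most confident, but fails the clash check
    Candidate("b3", 0.70, 0),  # physically clean, but low confidence
    Candidate("b4", 0.84, 0),
]
survivors = [c.name for c in structural_filter(toy_candidates)]
print(survivors)  # ['b1', 'b4']
```

Note that the most confident candidate is dropped by the clash check. Survivors would then move on to affinity prediction, docking/FEP, and eventually wet-lab assays.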

What Boltz-1 provides at this stage is not “this candidate is a real binder” but something closer to “can this candidate form a plausible complex structure in the given biomolecular context.” The difference looks small but matters. Structural plausibility can be a good first filter, but it does not replace binding affinity, specificity, expression, or developability.

Assessment: what open AF3-style infrastructure means

The value of Boltz-1 is not simply “performance similar to AF3.” It lies in turning AF3-style biomolecular interaction prediction into open infrastructure that researchers can run, modify, and plug into downstream pipelines themselves.

From a protein design perspective, Boltz-1 is better viewed as an evaluator or filter than as a generator. It is the layer that checks what complex a generated candidate forms with the target, whether the ligand pose is plausible, and what the confidence is. Because this layer was opened, follow-up work such as BoltzDesign1, Boltz-2, and BoltzGen could follow quickly.

The evidence for Boltz-1 is the structure prediction benchmark and openness. Within that scope, “open biomolecular interaction modeling infrastructure” is the description that fits Boltz-1 best.

References

- Paper: “Boltz-1: Democratizing Biomolecular Interaction Modeling”
- Authors: Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Tally Portnoi, Itamar Chinn, Jacob Silterra, Tommi Jaakkola, Regina Barzilay
- bioRxiv DOI: https://doi.org/10.1101/2024.11.19.624167
- GitHub: https://github.com/jwohlwend/boltz
- Raw source: `raw/papers/Boltz-1/boltz1.pdf`
- Extracted source: `raw/papers/Boltz-1/extracted/boltz1.txt`