LigandMPNN Paper Review

The core message of the ProteinMPNN review was simple: building a backbone is not the same as building a protein. What actually gets synthesized and expressed is an amino acid sequence, not a set of 3D coordinates, so a generated backbone still has to be realized as a foldable sequence.

ProteinMPNN alone, however, still leaves out part of the context. It is strong at designing a sequence from backbone geometry, but by default it sees only protein backbone atoms. For problems where surrounding ligand, nucleotide, or metal atoms directly determine residue identity, such as small-molecule binders, enzyme active sites, DNA/RNA-binding proteins, and metal-binding sites, important information is missing.

The LigandMPNN paper, "Atomic context-conditioned protein sequence design using LigandMPNN", addresses this gap. It keeps ProteinMPNN's inverse folding framing but takes the non-protein atomic context around the protein as input, and designs both sequence and sidechain conformation. Where ProteinMPNN asks "which sequence realizes this backbone?", LigandMPNN asks "given this ligand, base, or metal, which sequence and sidechains should the surrounding residues have?"

The limits of backbone-only sequence design

Backbone-only inverse folding is very useful for handling foldability. A backbone generator such as RFdiffusion proposes candidate structures, ProteinMPNN realizes them as sequences, and AF2/ESMFold checks whether those sequences are likely to fold back. This flow is close to the basic skeleton of the modern protein design pipeline.

But protein function does not arise in empty space. For a small-molecule binder, you need to look at the ligand pose and its donor/acceptor pattern. For a DNA-binding protein, the geometry of the bases and the phosphate backbone matters. At a metal-binding site, coordination geometry and element identity strongly constrain the choice of residue. In an enzyme active site, the catalytic residues and a transition-state-like geometry also have to fit.

If you try to realize such problems as a sequence from the backbone alone, the signal around the ligand is lost. Whether a hydrogen bond donor is needed near the ligand, whether aromatic stacking is needed, or whether histidine or cysteine is needed for metal coordination can only be decided by looking at the position and chemistry of the ligand or metal atoms. This is exactly where LigandMPNN starts.

Attaching an atomic context graph to the protein graph

LigandMPNN does not discard much of ProteinMPNN's basic architecture. It keeps the overall frame of a message-passing neural network in which protein residues are nodes and the distances and geometry among the N, Cα, C, O, and virtual Cβ atoms serve as edge features.
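To make the backbone edge features concrete, here is a minimal sketch of how a virtual Cβ can be placed from N, Cα, and C, and how the 5 x 5 inter-atomic distances between two residues could be collected. The coefficients are the ones I recall from the public ProteinMPNN code, not something stated in this review, so treat them as an assumption. The function names and the dictionary layout are hypothetical, and the real models add further encoding on top of raw distances.

```python
import math
from itertools import product

ATOMS = ("N", "CA", "C", "O", "CB")

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def virtual_cb(n, ca, c):
    """Place an idealized C-beta from backbone atoms (assumed coefficients)."""
    b = _sub(ca, n)
    c_vec = _sub(c, ca)
    a = _cross(b, c_vec)
    return tuple(
        -0.58273431 * a[i] + 0.56802827 * b[i] - 0.54067466 * c_vec[i] + ca[i]
        for i in range(3)
    )

def with_virtual_cb(residue):
    """Copy a residue dict {'N','CA','C','O'} and add a virtual 'CB' entry."""
    atoms = dict(residue)
    atoms["CB"] = virtual_cb(residue["N"], residue["CA"], residue["C"])
    return atoms

def pair_distances(res_i, res_j):
    """All 25 distances between the five atoms of two residues."""
    ai, aj = with_virtual_cb(res_i), with_virtual_cb(res_j)
    return {(p, q): math.dist(ai[p], aj[q]) for p, q in product(ATOMS, ATOMS)}

if __name__ == "__main__":
    res = {"N": (-0.525, 1.363, 0.0), "CA": (0.0, 0.0, 0.0),
           "C": (1.526, 0.0, 0.0), "O": (2.153, -1.062, 0.0)}
    cb = virtual_cb(res["N"], res["CA"], res["C"])
    print("CA-CB distance:", round(math.dist(res["CA"], cb), 2))
```

With near-ideal backbone geometry the virtual Cβ lands about 1.5 Å from Cα, which is a quick sanity check for the placement.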

On top of this, non-protein context atoms are added separately. The paper uses three graphs together. The first is a protein-only graph similar to ProteinMPNN's. The second is an intra-ligand graph that connects the nearby context atoms around a residue to one another. The third is a protein-ligand graph that carries the geometry between protein residues and context atoms.

In the baseline setting, the 25 nearest context atoms are used per residue. Context atoms can include small molecules, nucleotides, metals, and, where needed, fixed sidechain atoms. Each context atom is represented by its chemical element type and distance information, together with protein atom to context atom distances and angle-based features.
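The "25 nearest context atoms per residue" step can be sketched as a plain nearest-neighbor selection. This is only an illustration under stated assumptions: the function name, the `(element, xyz)` tuple layout, and the choice of a single anchor coordinate per residue are mine, not the paper's implementation.

```python
import math

def nearest_context_atoms(anchor_xyz, context_atoms, k=25):
    """Return up to k context atoms closest to one residue anchor point.

    context_atoms is a list of (element, (x, y, z)) tuples. The anchor is
    whatever representative residue coordinate the caller chooses (assumption).
    Each result is (element, xyz, distance), sorted by increasing distance.
    """
    scored = [(el, xyz, math.dist(anchor_xyz, xyz)) for el, xyz in context_atoms]
    scored.sort(key=lambda item: item[2])
    return scored[:k]

if __name__ == "__main__":
    # Toy context: 30 carbon atoms on a line plus one zinc ion close to the anchor.
    context = [("C", (float(i), 0.0, 0.0)) for i in range(3, 33)]
    context.append(("Zn", (0.0, 2.1, 0.0)))
    picked = nearest_context_atoms((0.0, 0.0, 0.0), context)
    print(len(picked), picked[0][0], round(picked[0][2], 1))
```

Keeping the element label next to each coordinate matters here, because the element type is one of the features the review describes as feeding into the model.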

LigandMPNN does not feed the ligand in as a simple label. It brings the actual atom positions and chemical types into graph message passing, so the identity of residues around the ligand can change to fit the ligand geometry.

[Figure 1] Architecture: an extension of ProteinMPNN

In Figure 1, LigandMPNN is laid out as a natural extension of ProteinMPNN. The protein backbone graph is left as it is, and the surrounding atomic context is encoded in separate graphs and then merged into the protein residue representation. Sequence decoding proceeds in a random autoregressive order, as in ProteinMPNN.
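Random-order autoregressive decoding can be sketched as follows: designable positions are shuffled, and each one is sampled conditioned on everything decoded or fixed so far. The `logits_fn` callback is a hypothetical stand-in for the trained network, and the function names are assumptions for illustration. The default temperature of 0.1 mirrors the sampling temperature quoted for the rocuronium redesign later in this review.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    weights = [math.exp(value - top) for value in scaled]  # numerically stable
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def decode_random_order(n_res, logits_fn, fixed=None, temperature=0.1, seed=0):
    """Decode residues one at a time in a shuffled order.

    logits_fn(position, decoded) must return 20 logits and may inspect the
    residues decoded so far. `fixed` maps positions to residues kept as context.
    Returns the designed sequence and the decoding order that was used.
    """
    rng = random.Random(seed)
    decoded = dict(fixed or {})
    order = [i for i in range(n_res) if i not in decoded]
    rng.shuffle(order)
    for pos in order:
        logits = logits_fn(pos, decoded)
        decoded[pos] = AMINO_ACIDS[sample_with_temperature(logits, temperature, rng)]
    return "".join(decoded[i] for i in range(n_res)), order

if __name__ == "__main__":
    def toy_logits(pos, decoded):
        # Toy model: strongly prefers histidine everywhere.
        return [10.0 if aa == "H" else 0.0 for aa in AMINO_ACIDS]

    print(decode_random_order(5, toy_logits, fixed={2: "C"}))
```

A low temperature sharpens the distribution toward the top-scoring residue, which is why the toy example returns histidine at every designable position.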

Thanks to this architecture, most of ProteinMPNN's strengths carry over. Fixed residues or fixed chains can be given as context, and the model can be used for partial sequence design and multichain design. The difference is that the context is no longer confined to the protein backbone and now extends to ligands, nucleotides, metals, and fixed sidechain atoms.

Another extension is sidechain packing. LigandMPNN does not only generate a sequence; it also autoregressively predicts distributions over the chi1 through chi4 sidechain torsion angles. In other words, it handles "which amino acid goes here?" together with "which conformation should that sidechain take toward the ligand?"

[Figure 2] Sequence recovery: the gap widens around ligands, nucleotides, and metals

Figure 2 is LigandMPNN's most direct benchmark. The metric is native sequence recovery for residues within 5.0 Å of a non-protein context atom. This range targets positions that are likely to interact directly with the ligand, base, or metal.
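The benchmark metric can be sketched in a few lines: pick the residues that lie within the cutoff of any context atom, then count how often the designed residue matches the native one at those positions. Which residue atoms are used for the distance test is left to the caller here, and the function names are hypothetical, so this is a sketch of the idea rather than the paper's evaluation script.

```python
import math

def interface_positions(residue_atoms, context_xyz, cutoff=5.0):
    """Indices of residues with any supplied atom within `cutoff` of any context atom.

    residue_atoms is a list (one entry per residue) of lists of (x, y, z) tuples.
    """
    return [
        index
        for index, atoms in enumerate(residue_atoms)
        if any(math.dist(a, c) <= cutoff for a in atoms for c in context_xyz)
    ]

def sequence_recovery(native, designed, positions):
    """Fraction of the given positions where the designed residue equals the native one."""
    if not positions:
        raise ValueError("no positions to evaluate")
    matches = sum(native[i] == designed[i] for i in positions)
    return matches / len(positions)

if __name__ == "__main__":
    residues = [[(0.0, 0.0, 0.0)], [(4.0, 0.0, 0.0)], [(20.0, 0.0, 0.0)]]
    ligand = [(1.0, 0.0, 0.0)]
    near = interface_positions(residues, ligand)
    print(near, sequence_recovery("HDA", "HEA", near))
```

Restricting the metric to near-context positions is what makes it sensitive to ligand information; averaged over the whole protein, the same difference would be diluted by residues far from the site.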

The results are clear. Around small molecules, recovery is 50.4% for Rosetta, 50.5% for ProteinMPNN, and 63.3% for LigandMPNN. Around nucleotides, it is 35.2% for Rosetta, 34.0% for ProteinMPNN, and 50.5% for LigandMPNN. Around metals, it is 36.0% for Rosetta, 40.6% for ProteinMPNN, and 77.5% for LigandMPNN.

These numbers show that LigandMPNN recovers much of the signal that backbone-only inverse folding misses. The gap is especially large around metals. Because element identity and geometry strongly constrain residue choice in metal coordination, this can be read as the effect of explicitly providing the chemical type of the context atoms.
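As a small worked calculation, the reported recovery values can be turned into percentage-point gains for LigandMPNN over each baseline. The numbers are the ones quoted in this review; the dictionary and function names are just illustrative.

```python
# Native sequence recovery (%) near each context type, as quoted in the review.
REPORTED = {
    "small molecule": {"Rosetta": 50.4, "ProteinMPNN": 50.5, "LigandMPNN": 63.3},
    "nucleotide": {"Rosetta": 35.2, "ProteinMPNN": 34.0, "LigandMPNN": 50.5},
    "metal": {"Rosetta": 36.0, "ProteinMPNN": 40.6, "LigandMPNN": 77.5},
}

def gain_in_points(baseline, model="LigandMPNN"):
    """Percentage-point difference between `model` and `baseline` per context type."""
    return {
        context: round(values[model] - values[baseline], 1)
        for context, values in REPORTED.items()
    }

if __name__ == "__main__":
    for baseline in ("Rosetta", "ProteinMPNN"):
        print(baseline, gain_in_points(baseline))
```

Relative to ProteinMPNN the gains come out to 12.8, 16.5, and 36.9 points for small molecules, nucleotides, and metals, which makes the size of the metal gap explicit.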

Native sequence recovery, however, is not the same thing as design success. Matching the residues of natural structures more often means that a context-aware representation explains residue identity better. Experimental success in terms of binding affinity, specificity, catalytic activity, or expression is a separate layer of evidence.

[Figure S1] Ablation: what actually matters

The supplementary ablation is important for understanding LigandMPNN in practical terms. Reducing the number of context atoms hurts performance, especially around nucleotides. A plausible reason is that nucleotides are larger and contain more atoms than small molecules or metals, so a small number of context atoms cannot adequately cover the information around the interface.

Removing both the protein-ligand graph and the ligand graph lowers sequence recovery by about 3%, and removing only the ligand graph lowers it by about 1%. Removing the chemical element type degrades performance most around metals. For small molecules and nucleotides, element identity can be inferred to some extent from geometry alone, whereas for metals the element itself is directly tied to the coordination chemistry.

There is also an interesting result. A model trained with sidechain-only context loses only about 3 to 4% in sequence recovery around small molecules. The paper interprets this as a degree of generalization made possible by the similarity between sidechains and small molecules in carbon/oxygen/nitrogen chemistry and volume exclusion. This can be taken as indirect evidence that LigandMPNN is not merely memorizing specific ligands.

[Figure 3] Sidechain packing: the problem after sequence

LigandMPNN also predicts sidechain conformations. This is particularly important for small-molecule binder and enzyme design. Even with the same amino acid in place, a hydrogen bond or metal coordination will not form unless the sidechain points toward the ligand.

The paper compares chi angle recovery for residues within 5.0 Å of a context atom. Chi1 recovery around small molecules is 76.0% for Rosetta, 83.3% for ProteinMPNN, and 86.1% for LigandMPNN. Around nucleotides the values are reported as 66.2%, 65.6%, and 71.4%, and around metals as 68.6%, 76.7%, and 79.3%, in the same order.
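Chi angle recovery is typically computed by measuring a dihedral angle from four atom positions and counting predictions that fall within a tolerance of the native angle, taking periodicity into account. The sketch below illustrates that idea. The tolerance is deliberately a required argument because this review does not state the threshold the paper uses, and the function names are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def chi_recovery(native, predicted, tolerance_deg):
    """Fraction of predicted angles within `tolerance_deg` of the native ones."""
    if len(native) != len(predicted) or not native:
        raise ValueError("need two non-empty angle lists of equal length")
    hits = sum(angular_difference(n, p) <= tolerance_deg
               for n, p in zip(native, predicted))
    return hits / len(native)

if __name__ == "__main__":
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
    # 20 degrees is an illustrative tolerance, not a value taken from the paper.
    print(chi_recovery([60.0, -170.0, 180.0], [65.0, 175.0, -60.0], 20.0))
```

A fuller evaluation would also treat symmetric sidechains, where a 180 degree flip of the terminal torsion gives an equivalent conformation, as matches. This sketch does not handle that case.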

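These results mean that LigandMPNN also improves sidechain packing, but the margin is not as striking as for sequence recovery. The paper itself interprets much of the sidechain packing information as coming from the protein context rather than the ligand context. All models also struggle more with distal torsions such as chi3 and chi4. LigandMPNN's sidechain output is therefore a useful signal for evaluating candidates, but it should not be read as precise pose validation in its own right.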

[Figure 4] Experimental validation: rescuing weak small-molecule binders

Figure 4 is the section that shows the role LigandMPNN plays in a real design problem. The paper uses LigandMPNN to redesign small-molecule binders that had already been designed with Rosetta but were weak or had failed. The targets are rocuronium and cholic acid.

Rocuronium binder redesign

Rocuronium is a muscle relaxant. According to the Supplementary Methods, this experiment starts from 2,119 rocuronium binder designs from an earlier study. That set had already been screened by yeast surface display and fluorescence-activated cell sorting. Some hits were found, but the designs revisited here are candidates whose original sequences did not show sufficient binding.

LigandMPNN takes the original backbone and ligand binding pose as context and generates 8 sequences per input at temperature 0.1. The new sequences are then relaxed with Rosetta FastRelax and passed through filters that include ligand hydrogen bonds, Rosetta ddG < -30.0, Rosetta contact molecular surface > 200, AlphaFold pLDDT, Cα-RMSD, and sidechain RMSD. In the end, 189 designs were ordered as eBlocks and tested by yeast display flow cytometry.
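The filtering step can be pictured as a simple conjunction of thresholds over per-design metrics. In the sketch below, only the ddG and contact molecular surface cutoffs come from the text above. The hydrogen bond, pLDDT, and RMSD thresholds are not given in this review, so they are required arguments, and the values in the usage example are purely illustrative. The `DesignMetrics` class and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DesignMetrics:
    name: str
    ligand_hbonds: int
    rosetta_ddg: float
    contact_molecular_surface: float
    plddt: float
    ca_rmsd: float
    sidechain_rmsd: float

def passes_filters(m, *, min_hbonds, min_plddt, max_ca_rmsd, max_sidechain_rmsd,
                   max_ddg=-30.0, min_cms=200.0):
    """True if a design clears every threshold (defaults: ddG < -30, CMS > 200)."""
    return (
        m.ligand_hbonds >= min_hbonds
        and m.rosetta_ddg < max_ddg
        and m.contact_molecular_surface > min_cms
        and m.plddt >= min_plddt
        and m.ca_rmsd <= max_ca_rmsd
        and m.sidechain_rmsd <= max_sidechain_rmsd
    )

if __name__ == "__main__":
    designs = [
        DesignMetrics("design_a", 2, -35.2, 240.0, 91.0, 0.8, 1.1),
        DesignMetrics("design_b", 1, -25.0, 260.0, 93.0, 0.6, 0.9),  # ddG too weak
    ]
    # Illustrative thresholds only; the review does not report these values.
    kept = [d.name for d in designs
            if passes_filters(d, min_hbonds=1, min_plddt=85.0,
                              max_ca_rmsd=1.5, max_sidechain_rmsd=2.0)]
    print(kept)
```

Writing the filters as one explicit predicate keeps the selection criteria auditable, which matters when interpreting how many designs survived to be ordered.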

The assay is a yeast display binding assay that reads the streptavidin-PE signal with 1 µM biotinylated rocuronium. This readout is first-line binding evidence for whether the displayed design binds the rocuronium probe. Figure 4A shows binding signal being rescued in some candidates after LigandMPNN redesign.

What this result shows is that LigandMPNN can use ligand context to make the sequences of previously failed or weak designs more ligand-compatible. It is not, however, a case where LigandMPNN built a pocket from scratch on its own or produced binders directly without filtering. The backbone and ligand pose were given, and the experimental candidates were selected through several computational filters.

Cholic acid binder redesign

The cholic acid case is closer to an affinity improvement example. According to the Supplement, it starts from a micromolar CHD binder with a crystal structure and a reported Kd of 5.3 µM. LigandMPNN and Rosetta each generated 1,000 sequences, and on the LigandMPNN side 105 sequences were picked at random and ordered.

The experiments proceed through yeast surface display binding and fluorescence polarization. The fluorescence polarization measurement in the main figure is an assay that reads binding between cholic acid-FITC and the purified or displayed binder to track changes in affinity. The paper reports that LigandMPNN redesign increased the affinity of the cholic acid binder by as much as 100-fold.

Here too, readers should pay attention to the denominator and the starting conditions. LigandMPNN improved the sequence given a ligand pose and an existing binder backbone. The result is therefore a strong example of successful context-aware sequence redesign, but it is hard to read as evidence of de novo pocket generation.

DNA-binding proteins and broader usage

The LigandMPNN paper connects not only to small molecules but also to DNA-binding protein design. The authors explain that LigandMPNN was used for protein-DNA interface sequence design in the design of sequence-specific DNA-binding proteins, and that one crystal structure matched the design model well and was deposited as PDB 8TAC.

The published abstract summarizes LigandMPNN as being supported by more than 100 experimentally validated small-molecule and DNA-binding proteins and four X-ray crystal structures. This wording shows that the model has actually been used in a variety of atomic-context design workflows. Because the target, denominator, assay, and filtering depth can differ between use cases, it is safer to view LigandMPNN as a sequence-design component used across several design campaigns than to read these numbers as a single universal hit rate.

From ProteinMPNN to LigandMPNN

ProteinMPNN was the practical reference point for backbone-conditioned sequence design. A backbone generator such as RFdiffusion proposes candidate structures, ProteinMPNN realizes them as sequences, and AF2/ESMFold checks whether those sequences are likely to fold back.

LigandMPNN pushes this flow one step toward the functional site. Sequence design is no longer a problem of looking at the backbone alone; it now has to consider ligands, nucleotides, metals, and fixed sidechain atoms as well. If ProteinMPNN asks for a "foldable sequence", LigandMPNN asks for a "context-compatible sequence and sidechains".

This difference fits well with follow-up work such as RFdiffusion2 and RFdiffusion3. RFdiffusion2 builds active-site scaffolds conditioned on catalytic atom geometry, and RFdiffusion3 aims to generate all-atom scenes that include ligand, DNA, and motif atoms. The role of LigandMPNN-style models grows at the stage where the backbones or active-site contexts produced by such generators are realized as actual sequences.

The scope of context-aware sequence design

LigandMPNN's strengths are clear. It extends sequence design from the protein backbone alone to the non-protein atomic context. Native sequence recovery around small molecules, nucleotides, and metals rose substantially, sidechain packing improved, and the rocuronium and cholic acid redesigns and the DNA-binding protein examples tie the model to experiments.

These results do not mean that LigandMPNN builds pockets from scratch, finds ligand poses, or automatically guarantees binding or function. LigandMPNN is a model that realizes a given backbone and given context coordinates as a better sequence and sidechain conformation. Upstream, pocket/scaffold generation and the ligand pose matter; downstream, Rosetta/AF2/geometric filtering and wet-lab assays remain central to interpreting the outcome.

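Limitations

First, LigandMPNN is not a backbone generator. What the paper's results show is that, given a backbone and a ligand, nucleotide, or metal context, sequence design becomes more context-aware. The quality of the pocket backbone or the ligand pose itself has to be treated as a separate layer of evidence.

Second, native sequence recovery and chi angle recovery are useful proxies, but only proxies. These benchmarks tell us that the model reproduces residue identity and sidechain conformation around natural structures more faithfully. They do not speak directly to binding affinity, specificity, catalytic activity, or cellular function.

Third, rare chemistry calls for caution. The paper explains that interpretation of performance becomes more uncertain for compounds containing elements that are rare or nearly absent in the PDB, and that elements that are absent altogether are mapped to the closest occurring element. In other words, LigandMPNN's context-awareness is most convincing within its training distribution.

Fourth, the strong performance in metal contexts is both a strength and a boundary for interpretation. The large drop in performance around metals when the element type is removed means that the model uses metal identity as an important signal. By the same token, this benchmark alone does not justify expecting the same level of reliability for metal coordination chemistry outside the training distribution.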

Assessment

LigandMPNN is the best paper to read right after ProteinMPNN. If ProteinMPNN was the practical bridge from backbone to sequence, LigandMPNN adds real functional context to that bridge in the form of ligands, nucleotides, and metals.

In my view, the significance of this paper is not that it "automatically makes small-molecule binders". More precisely, it clearly marks out the areas where treating sequence design as a backbone-only problem is no longer sufficient. For enzymes, sensors, small-molecule binders, and DNA-binding proteins, residue identity is determined together with the surrounding atom-level context. LigandMPNN is the model that brought this fact into ProteinMPNN-style deep inverse folding.

LigandMPNN is therefore less a solo lead than a key supporting actor in the pipeline. RFdiffusion or Rosetta produces pocket/scaffold candidates, LigandMPNN proposes ligand-aware sequences and sidechains, AF2/Rosetta/geometric filters narrow down the candidates, and finally an assay confirms binding or function. To understand how protein design is moving from backbone generation toward atomic context-aware design, LigandMPNN is a paper that has to be covered.

References

• Dauparas et al. 2025, Nature Methods, "Atomic context-conditioned protein sequence design using LigandMPNN", DOI: 10.1038/s41592-025-02626-1
• Code: https://github.com/dauparas/LigandMPNN
• Related wiki pages: LigandMPNN, ProteinMPNN, Sequence Design, Context-aware Design, Candidate Filtering, Wet-lab Validation