PLAID Paper Review

Introduction

One of the biggest bottlenecks in building a protein generator is data. Generating structures directly requires structure-annotated data such as the PDB or AFDB. These datasets provide high-quality geometry, but they fall short of the evolutionary breadth and annotation density of sequence databases. Structural data is also far smaller than sequence databases and carries a strong crystallization bias. Sequence databases, by contrast, are vastly larger and richly annotated, but sequence alone makes it hard to directly control sidechain geometry or active-site arrangement.

PLAID tackles this gap by going around it. The paper is titled “Controllable All-Atom Protein Generation with Latent Diffusion.” It is a bioRxiv preprint from researchers at UC Berkeley, Caltech, Prescient Design / Genentech, Microsoft Research, and NYU, and the code and weights are public. The core question is simple: if we generate in the internal latent space of a folding model that has already been trained to predict structure from sequence, can we train on sequence-scale data and still get all-atom structures as output?

This post treats PLAID not as a coordinate-space all-atom diffusion model but as a latent-diffusion protein generator whose generation target is a folding-model latent. That differs from approaches such as Pallatom or Protpardelle, which denoise atom coordinates directly. PLAID compresses the sequence-to-structure representation produced by ESMFold, runs diffusion in that compressed latent space, and then recovers sequence and all-atom structure with frozen decoders.

Routing around the structural-data bottleneck with latents

All-atom protein generation is a more ambitious problem than backbone generation. Protein function is not determined by backbone shape alone. The positions of sidechain functional groups, packing, charge, hydrogen bonds, and hydrophobic patterns govern how interfaces and active sites actually behave. All-atom generation therefore tries to get sequence, sidechains, and atomistic geometry right together.

The problem is training data. Structure-first generators handle 3D motifs and backbone geometry well, but they are tied to the size and biases of structural data. Sequence-first protein language models can exploit evolutionary-scale sequence corpora, but they struggle to address atomistic determinants directly. PLAID uses a folding model as the bridge between the two.

ESMFold takes a sequence as input and builds an internal latent representation, which its structure module decodes into an all-atom structure. PLAID takes the latent just before the structure module as its generative target. The generative model can then be trained on sequence databases, while its output passes through the ESMFold decoder to become a sequence and an all-atom structure. The paper calls this Protein Latent Induced Diffusion.

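The framing is quite clever. Instead of learning a raw coordinate distribution from scratch, the model generates on top of the sequence-structure prior that the folding model has already learned. The weakness is just as clear: the structures it can generate, and the calibration of their confidence scores, inherit the inductive biases of the ESMFold latent and decoder. That PLAID emits all-atom output does not mean it makes the same claim as a direct coordinate-space all-atom generator.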

Figure 1: Turning the ESMFold latent into a generation target

Figure 1 shows PLAID's overall architecture. First, a sequence-to-structure latent is extracted from ESMFold. This latent is an intermediate representation that carries sequence and structure information together. Next, the CHEAP autoencoder compresses the latent. Finally, a DiT-style diffusion model is trained in the compressed latent space, and sampled latents are passed to a frozen sequence decoder and the frozen ESMFold structure decoder to recover sequence and structure.

CHEAP compression here is not a mere engineering detail. The paper reports that using the ESMFold latent directly as the diffusion target made training unstable: some channels have massive activations, and the noise timestep did not map smoothly onto the degree of corruption in the decoded sequence and structure. CHEAP reduces the channel dimension from 1024 to 32 and downsamples the length dimension by a factor of 2. After compression, the latent-space signal-to-noise ratio tracks changes in decoded sequence and structure quality more closely.
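To make the size of that compression concrete, here is a minimal sketch of the shape bookkeeping only. It does not implement CHEAP; the function names are hypothetical, and rounding odd lengths up (as if padded) is an assumption, not something stated in the review.

```python
import math

# Shape bookkeeping only: (length, 1024) latent -> (length / 2, 32) latent.
# Function names are hypothetical; this is not the CHEAP encoder itself.

def compressed_shape(length: int, out_channels: int = 32, shorten: int = 2):
    """Return the (positions, channels) shape after compression.

    Odd lengths are rounded up, assuming the sequence is padded first.
    """
    return (math.ceil(length / shorten), out_channels)


def compression_ratio(length: int, in_channels: int = 1024) -> float:
    """Ratio of element counts before and after compression."""
    positions, channels = compressed_shape(length)
    return (length * in_channels) / (positions * channels)


print(compressed_shape(128))   # (64, 32)
print(compression_ratio(128))  # 64.0
```

Under these numbers the diffusion model sees about 64 times fewer values per protein than the raw latent would give it, which is part of why the compressed space is a more tractable target.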

Extended Data Figure 3 shows that this choice matters in practice. Generating ESMFold latents directly, without CHEAP compression, yields repetitive sequences and low quality. PLAID's performance therefore cannot be summed up in the single phrase “latent diffusion.” It results from the folding-model latent, CHEAP compression, the noise schedule, v-prediction, MinSNR, self-conditioning, and classifier-free guidance working together.

Pfam-scale sequence training and the prompt interface

PLAID's generative model is trained on a sequence database rather than a structure database. The paper uses the September 2023 Pfam release, which it describes as containing 57,595,205 sequences and 20,795 families. Of these, 24,637,236 sequences carry GO term annotations, roughly 43% of the total going by the two counts quoted here. Conditioning uses 2,219 Gene Ontology terms and 3,617 organism labels.

This is what creates PLAID's most interesting interface. Function and organism prompts can be supplied compositionally through classifier-free guidance. The conditions are not specific residue indices or coordinate motifs. They are functional and taxonomic keywords such as “heme binding,” “GPCR activity,” or “E. coli.”
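The standard classifier-free guidance rule behind this kind of prompting can be sketched in a few lines. This is a toy illustration under stated assumptions: `toy_denoiser` is a hypothetical stand-in for a trained conditional network, `None` plays the role of the "no condition" label, and the exact way PLAID combines the two conditions may differ.

```python
# Toy classifier-free guidance. `toy_denoiser` is a hypothetical stand-in
# for a trained network; None means "no condition given".

def toy_denoiser(x, function, organism):
    """Pretend prediction: each supplied condition shifts the output."""
    shift = (0.5 if function is not None else 0.0) + (0.25 if organism is not None else 0.0)
    return [xi + shift for xi in x]


def guided_prediction(x, function, organism, scale):
    """Extrapolate from the unconditional toward the conditional prediction."""
    uncond = toy_denoiser(x, None, None)
    cond = toy_denoiser(x, function, organism)
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


x = [0.0, 1.0]
print(guided_prediction(x, "heme binding", "E. coli", scale=3.0))  # [2.25, 3.25]
```

With `scale=0` the result is the unconditional prediction, with `scale=1` it is the plain conditional one, and larger values push samples further toward the prompted family.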

Motif-scaffolding methods usually require that the desired 3D motif be known in advance. That is powerful when designing protein binders or enzyme active sites, but not every function can be reduced to a well-defined motif. By using GO-level annotations as the generation condition, PLAID offers a route for pushing samples toward a functional family even when the motif coordinates are unknown.

Of course, a prompt does not guarantee function. GO terms are broad annotations, and an organism label is not an expression or developability assay. Still, it matters from the standpoint of protein design interfaces. Supplying a semantic function keyword instead of a coordinate motif is comparable to ESM3's function-token prompting. The difference is that ESM3 is a multimodal token language model, whereas PLAID is folding-model latent diffusion.

Unconditional generation: quality proxies for all-atom output

In the unconditional benchmark, PLAID generates proteins with lengths from 64 to 512. It produces 64 samples per target length, for a total of n = 3,648 evaluated proteins. The comparisons are all-atom generation baselines such as ProteinGenerator and Protpardelle. The evaluation axes are cross-consistency, self-consistency, designability, diversity, novelty, and biophysical distributional conformity.
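The sample count can be sanity-checked with a short calculation. The step of 8 residues between target lengths is an assumption (the review does not state the spacing); it is simply the spacing at which the quoted numbers line up.

```python
# Sanity check on the benchmark size. The step of 8 between target lengths
# is an assumption chosen because it reproduces the quoted total.
lengths = list(range(64, 513, 8))   # 64, 72, ..., 512
samples_per_length = 64
total = len(lengths) * samples_per_length

print(len(lengths), total)  # 57 3648
```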

In Extended Data Table 1, PLAID shows the highest cross-modal consistency among the all-atom baselines. It is reported at ccTM 0.69, ccRMSD 9.47, and a designability (ccRMSD < 2 Å) of 0.32. ProteinGenerator comes in at ccTM 0.58, ccRMSD 11.86, and designability 0.08, and Protpardelle at ccTM 0.44, ccRMSD 24.28, and designability 0.00. The natural reference is ccTM 1.00, ccRMSD 0.07, and designability 1.00.

Extended Data Table 2 reports that PLAID produces 1,171 designable samples, 809 designable sequence clusters, and 522 designable structure clusters. ProteinGenerator is at 309/309/309 and Protpardelle at 0/0/0. The Wasserstein distances for biophysical attributes such as molecular weight, aromaticity, instability index, pI, hydropathy, and charge at pH 7 are also mostly lower than for the baselines.

These results show that PLAID can use sequence-scale training and a folding-model latent to produce plausible, diverse all-atom outputs. In particular, the larger number of designable clusters relative to the baselines suggests the result is not concentrated in a handful of high-confidence samples but maintains a reasonable quality-diversity tradeoff. That said, designability is a computational proxy: it measures how well the structure obtained by folding the generated sequence agrees with the generated structure. The paper also states that it uses Cα RMSD rather than all-atom RMSD because atom counts differ between different sequences. It is an all-atom generation benchmark, but not experimental all-atom structure validation.
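As a minimal sketch of how the designability number is formed, the function below takes precomputed Cα ccRMSD values (in Å) and returns the fraction under the 2 Å cutoff. The function name and the toy input values are hypothetical; computing the RMSD itself (folding plus superposition) is outside this sketch.

```python
# Designability as the fraction of samples with ccRMSD below a cutoff.
# Input values are assumed to be precomputed C-alpha RMSDs in angstroms.

def designability(cc_rmsd_values, threshold=2.0):
    values = list(cc_rmsd_values)
    if not values:
        raise ValueError("need at least one RMSD value")
    return sum(r < threshold for r in values) / len(values)


# Toy values: two of the four samples fall strictly below 2 A.
print(designability([0.8, 1.9, 2.0, 9.5]))  # 0.5

# Consistency check on the reported numbers: 1,171 designable of 3,648.
print(round(1171 / 3648, 2))  # 0.32
```

The second line shows that the designable-sample count in Extended Data Table 2 and the 0.32 designability in Extended Data Table 1 agree with each other.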

Figures 4 and 5: patterns produced by function and organism prompts

Figure 4 covers function-conditioned generation. The paper states that, for several enzymatic function prompts, the generated structures reproduce the active-site sidechain identities and orientations of the closest ligand-bound Foldseek hit. The interesting claim is that a GO-level prompt alone is enough for non-adjacent sidechain positions to form together, even when the active-site residues are not adjacent in sequence.

The transmembrane transporter activity prompt yields multi-pass helical bundles with a hydrophobic exterior and a polar channel interior. The GPCR activity prompt yields a seven-helix topology and an amphipathic pattern, and DeepTMHMM is also reported to predict a seven-helix topology for the generated GPCR-prompted sequences.

Figure 5 covers organism-conditioned generation. Given organism prompts such as human, mouse, E. coli, or Glycine max, the generated latent embeddings separate in t-SNE space in a way that resembles the distributions of real UniRef50 proteins. The paper reports that 66.5% of human-conditioned designs have their nearest mmseqs easy-search neighbor in Homo sapiens.

These results are evidence that PLAID does, to some degree, perform annotation-conditioned latent generation. Most of it, however, is in silico, visual, or proxy evidence. It is interesting that GO prompts induced active-site-like arrangements, but broad enzyme activity has not been confirmed. Likewise, the organism prompt is best read as reflecting a taxonomy-like sequence distribution; it does not imply optimization for an expression host or low immunogenicity.

Figure 6: building a wet-lab anchor with heme binding

The strongest experimental evidence in PLAID is heme-binding protein generation. The paper used the GO keyword “heme-binding” and the target organism E. coli as prompts. It generated 1,000 sequences/structures each at lengths 120 and 160, classifying the 120-residue designs as Globin-like folds and the 160-residue designs as H-NOX-like folds. In addition, ProteinMPNN inverse folding was applied to the designable candidates among the 120-residue PLAID-generated structures to create the IF Globin group.

Filtering proceeds in several stages. First, the designability of each generated sequence is checked, and candidates with the axial histidine positions needed for heme coordination are kept. Next, Foldseek is used to select candidates with TM-score > 0.5 to a known heme-binding fold, and AlphaFold3 predicts the joint heme-protein structure. The heme-protein ipTM threshold was set to 0.7 for length 120 and 0.6 for length 160, and non-inverse-folded sequences were additionally required to have ESMFold pLDDT > 60. Finally, after clustering at 90% sequence identity, high-ipTM candidates from each cluster were chosen to form the wet-lab set.
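The filtering cascade described above can be sketched as a small selection function. Everything about the data layout here is hypothetical: the `Candidate` fields, the precomputed `cluster_id` (standing in for 90% identity clustering), treating the ipTM threshold as inclusive, and keeping exactly one top-ipTM candidate per cluster are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical record layout; thresholds follow the text above.
IPTM_THRESHOLD = {120: 0.7, 160: 0.6}


@dataclass(frozen=True)
class Candidate:
    name: str
    length: int
    designable: bool
    has_axial_histidine: bool
    foldseek_tm: float      # TM-score to a known heme-binding fold
    heme_iptm: float        # predicted heme-protein ipTM
    plddt: float            # ESMFold pLDDT
    inverse_folded: bool
    cluster_id: int         # assumed precomputed at 90% sequence identity


def passes_filters(c: Candidate) -> bool:
    if not (c.designable and c.has_axial_histidine):
        return False
    if c.foldseek_tm <= 0.5:
        return False
    if c.heme_iptm < IPTM_THRESHOLD[c.length]:  # inclusive cutoff is an assumption
        return False
    # The pLDDT requirement applies only to directly generated sequences.
    if not c.inverse_folded and c.plddt <= 60:
        return False
    return True


def select_for_wet_lab(candidates):
    """Keep the highest-ipTM passing candidate in each cluster."""
    best = {}
    for c in candidates:
        if not passes_filters(c):
            continue
        current = best.get(c.cluster_id)
        if current is None or c.heme_iptm > current.heme_iptm:
            best[c.cluster_id] = c
    return sorted(best.values(), key=lambda c: c.name)


candidates = [
    Candidate("g1", 120, True, True, 0.62, 0.81, 72.0, False, 0),
    Candidate("g2", 120, True, True, 0.58, 0.74, 65.0, False, 0),   # same cluster, lower ipTM
    Candidate("g3", 120, True, False, 0.70, 0.90, 80.0, False, 1),  # no axial histidine
    Candidate("h1", 160, True, True, 0.55, 0.61, 55.0, False, 2),   # pLDDT too low
    Candidate("if1", 120, True, True, 0.66, 0.78, 50.0, True, 3),   # inverse-folded, pLDDT not required
]
print([c.name for c in select_for_wet_lab(candidates)])  # ['g1', 'if1']
```

Writing the cascade out this way makes one point of the review easy to see: the wet-lab hits are survivors of several stacked filters, not raw samples.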

The experimental validation set consists of 17 IF Globin, 34 Globin, and 9 H-NOX designs. After expression in E. coli, lysate UV-vis spectra were measured in the ferric and ferrous states. As a result, a substantial share of the IF Globin designs showed a heme-binding signal, only some of the directly generated Globin designs did, and the H-NOX designs showed no detectable heme binding.

The representative candidates IF Globin E6 and Globin A3 were confirmed again after purification. The purified yield was 2.5 mg/L for Globin A3 and 122.5 mg/L for IF Globin E6. A3 is reported to have precipitated during concentration. Both purified proteins showed spectral signatures consistent with a bis-histidine heme-bound protein. The Figure 6 caption states that A3 is a Globin sequence generated directly by PLAID, with a structural novelty TM-score of 0.37 relative to the closest known protein in the PDB.

This heme result is the most important anchor for the claim that PLAID can produce functional protein candidates from a semantic prompt. It also needs careful interpretation. A UV-vis spectral signature is evidence of heme binding, but it says nothing about KD, kinetics, a high-resolution experimental structure, validation of the designed heme pose, specificity, or developability. The H-NOX group failed, and because IF Globin gave stronger results than direct Globin, direct co-generation and inverse-folding post-processing have to be assessed separately.

Ablations: compression and the sampling recipe drive performance

Extended Data Table 3 shows how PLAID's recipe affects performance. The baseline configuration, with a cosine noise schedule and noise prediction, gives ccTM 0.54, scTM 0.55, and perplexity 16.97. Adding MinSNR raises both ccTM and scTM to 0.59. The combination of a sigmoid noise schedule, v-diffusion, and MinSNR gives ccTM 0.56 and scTM 0.58. Adding self-conditioning on top of that, configuration E, is best at ccTM 0.70, scTM 0.65, and perplexity 15.38.
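For readers unfamiliar with these recipe terms, here is a scalar sketch of a cosine schedule, the v-prediction target, and Min-SNR loss weighting as they are commonly defined in the diffusion literature. The exact parameterization PLAID uses may differ, and `gamma=5.0` is a common default from the Min-SNR literature, not a value taken from this review.

```python
import math

# Scalar illustration of common definitions; not PLAID's exact recipe.

def cosine_alpha_sigma(t: float):
    """Variance-preserving cosine schedule for t in (0, 1): alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)


def snr(t: float) -> float:
    alpha, sigma = cosine_alpha_sigma(t)
    return (alpha * alpha) / (sigma * sigma)


def v_target(x0: float, eps: float, t: float) -> float:
    """v-prediction target: v = alpha * eps - sigma * x0."""
    alpha, sigma = cosine_alpha_sigma(t)
    return alpha * eps - sigma * x0


def x0_from_v(x_t: float, v: float, t: float) -> float:
    """Recover the clean sample from a noisy sample and v."""
    alpha, sigma = cosine_alpha_sigma(t)
    return alpha * x_t - sigma * v


def min_snr_weight_eps(t: float, gamma: float = 5.0) -> float:
    """Min-SNR weight for a noise-prediction loss: min(SNR, gamma) / SNR."""
    return min(snr(t), gamma) / snr(t)


# Round trip: noise a value, form v, and recover the original.
x0, eps, t = 0.3, -1.2, 0.4
alpha, sigma = cosine_alpha_sigma(t)
x_t = alpha * x0 + sigma * eps
print(abs(x0_from_v(x_t, v_target(x0, eps, t), t) - x0) < 1e-12)  # True
```

The weighting down-weights low-noise timesteps, where the SNR is very large, so that they do not dominate training.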

Removing condition dropout drops ccTM/scTM back to 0.57/0.57. This suggests that the conditioning dropout used for classifier-free guidance is not only a device for enabling prompts but also affects generative quality. Extended Data Figure 4 reports that, with an accelerated sampler, 20-step sampling with DPM++2M-style solvers is also feasible.
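Condition dropout itself is a small training-time step: some fraction of the time, the labels are replaced with a "no condition" marker so the same network also learns the unconditional distribution. The sketch below is illustrative only; the function name, the use of `None` as the null label, the `p_drop` value, and dropping each label independently are assumptions rather than details from the paper.

```python
import random

# Illustrative condition dropout. Dropping each label independently and
# the value of p_drop are assumptions, not details from the paper.

def drop_conditions(function, organism, p_drop, rng):
    """Replace each label with None with probability p_drop."""
    kept_function = None if rng.random() < p_drop else function
    kept_organism = None if rng.random() < p_drop else organism
    return kept_function, kept_organism


rng = random.Random(0)
batch = [drop_conditions("heme binding", "E. coli", 0.3, rng) for _ in range(1000)]
fully_unconditional = sum(f is None and o is None for f, o in batch)
print(fully_unconditional)  # roughly 9% of examples when p_drop = 0.3
```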

Sampling speed is another of PLAID's strengths. In Extended Data Table 4, for 100 samples of a 600-residue protein in the batched setting, PLAID 100M takes 1.64 seconds/sample for latent sampling and 15.12 seconds/sample for structure decoding. PLAID 2B takes 19.34 + 15.07 seconds/sample. The paper notes that latent sampling and decoding can be decoupled, or latents prefiltered, but says it did not actively exploit this, for the sake of fair comparison.
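Adding up the quoted timings shows where the time goes. The numbers are the ones quoted above; the dictionary layout and function name are just for illustration.

```python
# Per-sample timings quoted above: (latent sampling, structure decoding) in seconds.
TIMINGS = {"PLAID 100M": (1.64, 15.12), "PLAID 2B": (19.34, 15.07)}


def total_and_decode_share(sample_s: float, decode_s: float):
    total = sample_s + decode_s
    return total, decode_s / total


for name, (sample_s, decode_s) in TIMINGS.items():
    total, share = total_and_decode_share(sample_s, decode_s)
    print(f"{name}: {total:.2f} s/sample, decoding is {share:.0%} of the total")
# PLAID 100M: 16.76 s/sample, decoding is 90% of the total
# PLAID 2B: 34.41 s/sample, decoding is 44% of the total
```

For the 100M model, structure decoding dominates, which is why prefiltering latents before decoding would be the obvious place to save time.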

This ablation section is why PLAID should not be reduced to “a model that bolts diffusion onto the ESMFold latent.” The latent representation, compression, noise schedule, prediction objective, self-conditioning, CFG dropout, and sampler are all part of the method.

Distance from ESM3, ProteinGenerator, and coordinate-space generators

PLAID looks close to ESM3, but they are not the same kind of model. ESM3 is a multimodal protein language model that handles sequence, structure tokens, and function keywords together as discrete token tracks. PLAID does not do token language modeling; it learns diffusion in the ESMFold latent space. Both handle function-level prompts, but ESM3 takes the token-completion route and PLAID takes the folding-model latent route.

The difference from ProteinGenerator is also clear. ProteinGenerator fine-tunes RoseTTAFold into a sequence-space diffusion model that denoises sequence and structure together starting from a noised amino-acid sequence. That makes it well suited to injecting sequence-side constraints directly into the generation trajectory, as in sequence property guidance, multistate design, and peptide caging. PLAID generates a folding-model latent rather than the sequence tokens themselves, so its generation target and control interface are different.

PLAID also has to be distinguished from coordinate/atomistic generators such as Pallatom, Protpardelle, and Proteina-Atomistica. Those methods take atom coordinates, residue frames, and sidechain representations as direct generation targets. PLAID obtains its all-atom output through the frozen ESMFold decoder. It therefore gains the advantage of sequence-scale data and annotations, but it can hardly avoid the influence of decoder bias and folding-model confidence calibration.

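This comparison brings PLAID's position into sharper focus. PLAID is not a paper that simply offers a better coordinate denoiser in the all-atom generation race. It is closer to an architectural bet: “if you generate the latent of an already-trained folding model, you can get sequence-scale training and atomistic output at the same time.”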

What the evidence covers

It helps to split PLAID's evidence into three layers.

The first is representation-level evidence. Compressing the ESMFold latent with CHEAP and using it as the diffusion target produces samples from which sequence and structure are recovered together. In the unconditional benchmark, PLAID shows higher cross-consistency and diversity than the baselines.

The second is conditioning evidence. GO function prompts and organism prompts induce active-site-like sidechain arrangements, membrane topology, and taxonomic sequence distributions. This layer shows that semantic prompts can steer generation, but most of it is computational proxy.

The third is wet-lab evidence. Under the heme-binding prompt, some candidates showed a UV-vis spectral signature, and the two representative proteins still showed heme-bound spectral behavior after purification. This makes PLAID one step stronger than a purely in silico method. It is, however, a narrow task, and the hits are selected candidates that passed several filtering stages.

Describing PLAID as “making the protein you want from a function prompt alone” therefore goes beyond what the paper shows. A more accurate statement is: “generating a folding-model latent lets you bring sequence-scale annotations into an all-atom generation interface, and it produced a limited but real signal in heme binding.”

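Limitations

PLAID is a bioRxiv preprint. As of the time the source for this review was captured, it had not been peer reviewed. The public code and weights are a plus, but the reported capabilities and their reproducibility still leave room for independent verification.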

The unconditional generation benchmark consists mostly of computational proxies. ccTM, ccRMSD, scTM, pLDDT, perplexity, Foldseek/MMseqs novelty, and biophysical distributional conformity are useful, but they do not substitute for expression, solubility, monomericity, activity, or binding specificity.

The heme-binding validation is a strong anchor, but its scope is narrow. The UV-vis spectral signal supports heme binding, but it provides no affinity or kinetics. There is also no high-resolution experimental structure, so whether the designed heme pose was actually realized has not been confirmed directly. The failure of the H-NOX-like designs, and the stronger results from the inverse-folded Globin group, are further reasons to treat the direct co-generation claim with caution.

Finally, PLAID is a model that leans on the ESMFold-derived latent and decoder. That is both a strength and a constraint. Reusing the folding model's structural prior buys data efficiency and output plausibility, but it can also import the regions the folding model handles poorly, its calibration errors, and its decoder bias.

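Assessment

PLAID's biggest contribution is that it looks at the all-atom generation problem in a new coordinate system. Rather than generating coordinates directly, it takes as its generation target the latent representation that a sequence-to-structure predictor has already built. That brings sequence-database scale and GO/organism annotations into the protein generator as a control interface.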

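This direction is especially interesting for function design where the motif is not well defined. Where RFdiffusion-style methods work strongly with 3D motifs and target interfaces, PLAID offers a different entry point: “all-atom candidate generation that starts from a functional keyword.” The heme-binding validation shows that this entry point is not purely illusory.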

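Still, it is hard to call PLAID a general-purpose enzyme design platform or a binder design solution at this point. The strongest wet-lab evidence so far is the heme-binding UV-vis signal, and it is reported alongside the failed H-NOX group and the advantage of the inverse-folded Globin designs. Broad function, affinity, specificity, cellular activity, and developability can only be claimed with confidence once there is separate evidence for them.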

Even so, this paper occupies an important place on the map of all-atom protein generation, because it sets up “folding-model latent diffusion” as an axis alongside direct coordinate diffusion, sequence-space diffusion, and multimodal language modeling. When comparing protein design models from here on, the first thing to establish is whether the generation target is coordinates, tokens, a sequence representation, or the latent of a pretrained predictor. PLAID is a paper that makes that distinction quite clear.

References

- Lu et al., “Controllable All-Atom Protein Generation with Latent Diffusion,” bioRxiv preprint, DOI: 10.1101/2024.12.02.626353.
- Code and weights: https://github.com/amyxlu/plaid, https://huggingface.co/amyxlu/plaid/tree/main.
- Comparison context: ESM3, ProteinGenerator, Protpardelle, Pallatom, Protein Language Model, All-Atom Generation.