MultiFlow Paper Review

Introduction

The gap in the backbone-first pipeline

Discrete Flow Model: building a flow on discrete states

MultiFlow: sequence DFM + structure FrameFlow

Evidence: a co-design benchmark centered on self-consistency

The performance that distillation produced

Inverse folding and forward folding

What MultiFlow says about the difficulty of co-design

Where it sits relative to RFdiffusion and Chroma

Assessment: a good reference point for the co-design formulation

References

Introduction

If you read protein generator papers, one pipeline keeps coming up. First a backbone structure is generated, then an inverse folding model such as ProteinMPNN assigns a sequence that fits that backbone. The sequence is then refolded with ESMFold or AlphaFold2 to check self-consistency. Since RFdiffusion, many protein design workflows have shared this backbone-first structure.

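MultiFlow asks a different question of this structure. A protein is an object whose sequence and structure are determined together, so why should the structure be generated first and the sequence attached as post-processing? An amino-acid sequence is made of discrete tokens, while a backbone structure is continuous geometry. Can the two be handled together inside a single generative process?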

“Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design”, posted to arXiv in 2024, is a methods paper that addresses this question. MultiFlow attempts sequence/structure co-generation by combining a Discrete Flow Model for the discrete amino-acid sequence with an SE(3) flow in the FrameFlow (Yim et al.) line for the continuous backbone structure. It is not a wet-lab binder design paper; it is closer to a method paper about how to formulate protein co-design.

The gap in the backbone-first pipeline

The backbone-first pipeline is practical. A backbone generator produces plausible folds or interface geometries, and ProteinMPNN finds a sequence that realizes that backbone. A structure prediction model then checks whether the generated backbone and the predicted structure agree. This flow recurs across RFdiffusion, FrameDiff, and the Genie family.

But the approach leaves a gap. Protein sequence and structure are not independent stages. Which sequences are possible is constrained by the backbone geometry, and which backbones are stable is constrained by the sequence distribution. Generating the backbone first and attaching the sequence afterwards hands this coupling over to downstream inverse folding and filtering.

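MultiFlow tries to bring this coupling inside the model. The expectation is that generating sequence and structure simultaneously lets the model learn the interdependence between sequence space and structure space more directly. That does not automatically mean better experimental performance, and most of the evidence in the paper is self-consistency metrics. Even so, the formulation itself is an important turning point.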

Discrete Flow Model: building a flow on discrete states

Flow matching is usually a good fit for continuous data. A vector field that moves a noise distribution to the data distribution is learned, and sampling follows that flow. For continuous geometry such as protein backbone coordinates this is natural. The problem is the amino-acid sequence, which consists of discrete states: 20 amino-acid tokens plus something like a mask.

The technical core of MultiFlow is the Discrete Flow Model (DFM). The paper constructs the probability flow for discrete data as a Continuous Time Markov Chain (CTMC) trajectory. A noise-to-data interpolation is defined on the discrete state space, and a denoising network is trained with a cross-entropy objective so that sampling follows that trajectory.
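To make the interpolation concrete, here is a sketch of the masking case for a single position. The notation is mine and is only meant as an illustration: M is the mask state, δ{a, b} is 1 when a = b and 0 otherwise, t = 0 is pure noise and t = 1 is data. The rate in the second line is the simplest CTMC rate consistent with this path, not necessarily the exact expression used in the paper.

```latex
% Masking interpolant: at time t the clean token x_1 is visible with probability t
p_{t|1}(x_t \mid x_1) = t\,\delta\{x_t, x_1\} + (1 - t)\,\delta\{x_t, M\}

% A conditional CTMC rate that generates this path (jump from M to x_1 only)
R_t(x_t, j \mid x_1) = \frac{\delta\{x_t, M\}\,\delta\{j, x_1\}}{1 - t}

% Cross-entropy objective for the denoiser p^\theta_{1|t}
\mathcal{L}_{\mathrm{ce}} =
  \mathbb{E}_{x_1 \sim p_{\mathrm{data}},\; t \sim \mathcal{U}(0,1),\; x_t \sim p_{t|1}(\cdot \mid x_1)}
  \left[ -\log p^{\theta}_{1|t}(x_1 \mid x_t) \right]
```

A quick sanity check on the rate: the mass sitting in M is 1 - t and the mass at x_1 must grow at rate d/dt t = 1, so the per-unit-mass jump rate out of M has to be 1 / (1 - t).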

Here the CTMC stochasticity becomes a knob that controls how many sequence transitions occur during sampling. The paper explains that discrete diffusion models can be viewed as a special case at a particular stochasticity setting. In other words, MultiFlow's DFM is an attempt to connect discrete sequence generation to the continuous flow matching family.
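The following is a minimal sketch of what such a sampler can look like for a mask-based path, written in plain Python with an Euler discretization. It assumes a masked position unmasks at rate (1 + eta * t) / (1 - t) and an unmasked position is re-masked at rate eta, which keeps the marginals of a linear masking path unchanged; `toy_denoiser`, `MASK`, and `eta` are hypothetical names for illustration, and the random denoiser stands in for the trained network. It is not the authors' implementation.

```python
import random

MASK = "#"  # hypothetical mask symbol
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_denoiser(tokens, t, rng):
    """Stand-in for the learned p(x_1 | x_t): proposes one clean residue per position."""
    return [rng.choice(AMINO_ACIDS) for _ in tokens]


def sample_sequence(length, steps=100, eta=0.0, denoiser=toy_denoiser, seed=0):
    """Euler simulation of a masking CTMC from t=0 (all mask) to t=1 (no mask)."""
    rng = random.Random(seed)
    x = [MASK] * length
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so 1 - t is never zero
        proposal = denoiser(x, t, rng)
        # Jump probabilities for this step: rate * dt, clamped to 1.
        unmask_p = min(1.0, dt * (1.0 + eta * t) / (1.0 - t))
        remask_p = min(1.0, dt * eta)
        new_x = list(x)
        for i, tok in enumerate(x):
            if tok == MASK:
                if rng.random() < unmask_p:
                    new_x[i] = proposal[i]
            elif rng.random() < remask_p:
                new_x[i] = MASK  # extra stochasticity: undo an earlier decision
        x = new_x
    # Fill any position still masked after the last step.
    final = denoiser(x, 1.0, rng)
    return "".join(p if tok == MASK else tok for tok, p in zip(x, final))


if __name__ == "__main__":
    print(sample_sequence(30, eta=0.0))
    print(sample_sequence(30, eta=5.0))
```

With `eta=0.0` every position is decided exactly once. Larger `eta` lets the sampler revisit positions, which means more transitions per sample.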

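The method itself is not tied to proteins. The paper's title puts generative flows on discrete state spaces front and center, and protein co-design is the application. From a protein design point of view, though, what matters most is that the technique “ties sequence and structure together within the same flow family.”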

MultiFlow: sequence DFM + structure FrameFlow

In the protein application, MultiFlow combines two modalities. The sequence is represented with 20 amino acids plus a mask state, and the structure is represented with per-residue local frames, that is, a translation and a rotation for each residue. The structure side uses an SE(3) flow in the FrameFlow (Yim et al.) line.

During training, the sequence noise level and the structure noise level are sampled independently. This choice is what makes several inference modes available. Starting both from noise gives co-generation. Providing the structure and generating the sequence gives inverse folding. Providing the sequence and generating the structure gives forward folding. Handling several conditional sampling modes within one framework is one of MultiFlow's strengths.
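As a rough illustration of how independent noise levels translate into inference modes, the sketch below draws two separate times per training example and maps each mode to a starting pair (t_seq, t_struct), with 0 meaning pure noise and 1 meaning clean data. The function names, the mode labels, and the uniform time distribution are my assumptions; structure corruption on SE(3) is left out to keep the example short.

```python
import random

MASK = "#"  # hypothetical mask symbol


def sample_training_times(rng):
    """Draw sequence and structure noise levels independently (assumed uniform)."""
    return rng.random(), rng.random()


def corrupt_sequence(seq, t_seq, rng):
    """Keep each residue with probability t_seq, mask it otherwise."""
    return "".join(c if rng.random() < t_seq else MASK for c in seq)


def start_times(mode):
    """Starting (t_seq, t_struct) for each inference mode; a modality at 1.0 is held fixed."""
    table = {
        "co_generation": (0.0, 0.0),    # both modalities start from noise
        "inverse_folding": (0.0, 1.0),  # structure given, sequence generated
        "forward_folding": (1.0, 0.0),  # sequence given, structure generated
    }
    return table[mode]


if __name__ == "__main__":
    rng = random.Random(0)
    t_seq, t_struct = sample_training_times(rng)
    print(t_seq, t_struct, corrupt_sequence("MKTAYIAKQR", t_seq, rng))
```

Because the network sees every combination of (t_seq, t_struct) during training, the corner cases in `start_times` are in-distribution at inference time.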

This design makes an interesting contrast with Chroma. Chroma factorizes generation into a backbone prior and ChromaDesign, and composes geometric and semantic conditions through its conditioner framework. MultiFlow tries to place sequence and structure more directly inside a single multimodal flow. Both belong to the trend of “extending protein generation beyond a single backbone generator,” but Chroma is closer to a programmable conditional generation system, while MultiFlow is closer to a discrete-continuous multimodal flow formulation.

Evidence: a co-design benchmark centered on self-consistency

MultiFlow's protein experiments use 18,684 PDB proteins of length 60–384 aa. The evaluation centers on self-consistency: the generated sequence is refolded with ESMFold or AlphaFold2, and scRMSD measures how close the predicted structure is to the generated structure. A sample with scRMSD < 2 Å is counted as designable.
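The designability number itself is a simple fraction once per-sample scRMSD values are available. A minimal sketch follows, assuming the scRMSD values (in Å) have already been computed by refolding and aligning. `best_sc_rmsd` reflects my reading that settings with several candidate sequences per structure keep the lowest scRMSD; treat that as an assumption rather than the paper's exact protocol.

```python
def best_sc_rmsd(per_sequence_rmsds):
    """Lowest scRMSD among the candidate sequences for one structure (assumed protocol)."""
    return min(per_sequence_rmsds)


def designability(sc_rmsds, threshold=2.0):
    """Fraction of samples whose scRMSD is strictly below the threshold (in angstroms)."""
    if not sc_rmsds:
        raise ValueError("need at least one scRMSD value")
    return sum(r < threshold for r in sc_rmsds) / len(sc_rmsds)


if __name__ == "__main__":
    # Made-up values for illustration only, not results from the paper.
    print(designability([1.0, 2.5, 1.9, 3.0]))
```

Note that the metric says nothing about how far above or below the threshold a sample lands, so two models with the same designability can have very different scRMSD distributions.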

In Table 3, MultiFlow reports designability 0.88, diversity 143, and novelty 0.68 in the co-design 1 setting. In the same table, Protein Generator is listed at co-design 1 designability 0.34 and Protpardelle at 0.05. In the PMPNN 8 setting, MultiFlow's designability rises to 0.99, with diversity 156 and novelty 0.68.

These numbers show that MultiFlow is quite strong on sequence/structure co-design self-consistency. But the nature of the metric should not be overlooked. Designability is an in silico proxy that checks whether the generated sequence refolds, under a prediction model, into the generated structure. It is not a direct measurement of expression, solubility, folding, function, or binding.

The PMPNN setting is another important point. MultiFlow claims sequence/structure co-generation, yet designability goes up further when ProteinMPNN sequences are used instead of its own. This shows the quality of the generated structures and the model's backbone generation capability, and at the same time it shows that co-designing the sequence remains a hard problem.

The performance that distillation produced

The most important caveat about the MultiFlow results is distillation. In the distillation step, the paper replaces the original sequences with ProteinMPNN sequences and adds 4,179 examples of MultiFlow-generated structures that passed ProteinMPNN self-consistency. This step has a large effect on performance.
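The filtering half of that step can be pictured as the loop below. This is only a sketch under stated assumptions: `redesign` stands for an inverse folding call (ProteinMPNN in the paper) and `sc_rmsd` for a refold-and-compare call. Both are hypothetical callables supplied by the caller, and the 2 Å threshold is borrowed from the designability definition rather than confirmed as the distillation cutoff.

```python
def distill(generated_backbones, redesign, sc_rmsd, threshold=2.0):
    """Keep (backbone, sequence) pairs whose redesigned sequence refolds close to the backbone."""
    kept = []
    for backbone in generated_backbones:
        sequence = redesign(backbone)  # hypothetical inverse folding call
        if sc_rmsd(backbone, sequence) < threshold:  # hypothetical refold-and-compare call
            kept.append((backbone, sequence))
    return kept


if __name__ == "__main__":
    # Toy stand-ins: backbones are labels, and the "scRMSD" is looked up from a table.
    fake_rmsd = {"bb1": 1.2, "bb2": 4.0, "bb3": 1.9}
    pairs = distill(
        ["bb1", "bb2", "bb3"],
        redesign=lambda bb: "SEQ_" + bb,
        sc_rmsd=lambda bb, seq: fake_rmsd[bb],
    )
    print(pairs)
```

The loop makes the dependence explicit: whatever passes the filter has already been endorsed by the same inverse folding and structure prediction tools that later score the model.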

In Table 3, MultiFlow without distillation drops to a co-design designability of 0.41, while the full MultiFlow with distillation reaches 0.88. The gap is large. MultiFlow's strong result should therefore be read not as co-design ability obtained from raw PDB training alone, but as the result of a pipeline that includes ProteinMPNN/ESMFold-style oracles and self-training.

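This caveat does not so much lower the paper's value as show how hard co-design is. A formulation that generates sequence and structure simultaneously is attractive, but actual performance still leans heavily on an external inverse folding model and a structure-prediction proxy. The paper sets out to move past the limits of the backbone-first pipeline, yet its evaluation and distillation still rely on tools from the backbone-first ecosystem.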

Inverse folding and forward folding

MultiFlow tests inverse folding and forward folding as well as co-generation. In Table 4, inverse folding is fairly close: ProteinMPNN has scRMSD 1.9 ± 2.7 and MultiFlow 2.2 ± 2.6. On the task of producing a sequence given a structure, MultiFlow is not far from ProteinMPNN.

Forward folding, by contrast, is weak. Against ESMFold's RMSD of 2.7 ± 3.9, MultiFlow falls far behind at 15.3 ± 4.5. On the problem of predicting a structure from a given sequence, the gap to a specialized structure predictor is large. The paper itself presents MultiFlow as a step toward a general-purpose multimodal generative model rather than as a state-of-the-art model on every task.

This result is unsurprising. Forward folding is a problem that AlphaFold/ESMFold-type models have optimized with enormous amounts of data and dedicated architectures. Proposing a discrete-continuous co-generation framework does not mean MultiFlow immediately replaces structure prediction models. The point of the paper is that “one model can handle several conditional modes,” not that it “beats every task specialist.”

What MultiFlow says about the difficulty of co-design

MultiFlow shows the appeal of the term co-design well, and it shows the difficulty too. Generating sequence and structure together is natural in principle, because a protein is neither a sequence-only nor a structure-only object. But tying discrete tokens and continuous geometry into a single probability path is technically hard, and evaluation is not easy either.

MultiFlow's current evidence is centered on self-consistency. scRMSD < 2 Å, novelty, and diversity are useful metrics, but they do not speak to experimental foldability or function. For binder design in particular, target interface, affinity, specificity, developability, and wet-lab validation all have to be examined separately. MultiFlow does not go into that territory.

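Instead, MultiFlow raises a structural question about the protein design pipeline. Will protein generators keep going with the combination of backbone-first generation plus inverse folding, or will they move to co-design models in which sequence and structure evolve together? MultiFlow is a paper that demonstrates the latter possibility at the method level.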

Where it sits relative to RFdiffusion and Chroma

RFdiffusion is the paper that pushed target-conditioned backbone/interface generation into practical design tasks. A backbone is generated, passed through ProteinMPNN and AF2 filtering, and carried through to wet-lab binder validation. In this respect RFdiffusion has strong experimental impact as a design pipeline.

Chroma is a programmable generation system that combines a protein prior with a conditioner framework. It factorizes backbone, sequence, and sidechains, and can apply symmetry, motif, shape, and semantic conditions at generation time. Its wet-lab validation focuses on foldability and structural accuracy rather than target binding.

MultiFlow sits on a different axis from both. Its focus is less on making protein design programmable and more on a formulation that generates discrete sequence and continuous structure together within one flow framework. For that reason, MultiFlow is better placed as a methodological anchor in the lineage of co-design models than as a milestone in practical binder design.

Assessment: a good reference point for the co-design formulation

MultiFlow's strengths are clear. It handles amino-acid sequences with a Discrete Flow Model and combines that with a continuous structure flow to implement sequence/structure co-generation. By controlling the training noise level independently per modality, it handles co-generation, inverse folding, and forward folding in the same framework. It is a paper that shows well why discrete-continuous multimodal generation is needed in protein design.

The limitations are equally clear. There is no wet-lab validation. Sidechain atoms are not modeled directly. Evidence leading to binder/interface design, motif scaffolding, ligand context, or functional design is not the focus of the paper. The strong co-design result depends heavily on distillation and external oracles. In forward folding there is a large gap to specialized predictors.

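So MultiFlow is less “a pipeline for picking experimental candidates right now” and more a methodological reference point for how to formulate protein co-design. It is a good paper to cite when explaining the gap in the backbone-first design pipeline and when discussing why sequence/structure co-generation is hard. Its performance claims are safest read within the scope of the self-consistency proxy.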

References

• Campbell, A. et al. “Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design”, arXiv:2402.04997, 2024.

• Code: https://github.com/jasonkyuyim/multiflow

• Main points of comparison: RFdiffusion, Chroma, FrameDiff, ProteinMPNN, Protein Generator, Protpardelle.