BVP lays out biology-native data infrastructure for the AI era

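Bessemer Venture Partners (BVP) presents the thesis that the next competitive edge in AI drug discovery will come from biology-native data infrastructure rather than from the performance of any single model. The piece treats large-scale biological datasets, agentic R&D workflows, and closed-loop lab automation as the core operating foundation for next-generation AI biotech. It reads more like a long market analysis, focused on which companies can build a sustainable moat after the Cambrian explosion of AI biology models.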

Summary

•

Bessemer Venture Partners (BVP) is a US-based venture capital firm, and this piece is closer to an investment thesis on which infrastructure companies and full-stack biotechs are favored over the long term at the intersection of AI/ML and biotech.

•

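The starting point is the claim that the bottleneck in drug development is not a shortage of hypotheses but a shortage of the resources and data infrastructure needed to test hypotheses quickly and efficiently. The piece frames the problem this way: going from target ID to a clinical candidate still takes more than five years, and roughly 90% of drugs that enter clinical trials fail.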

•

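As a case of AI drug discovery starting to translate into clinical results, the piece cites the Phase IIa results for rentosertib, Insilico Medicine's TNIK inhibitor. According to the piece, it is a first-in-class small molecule for which both target discovery and molecule design were carried out with generative AI.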

•

Recent market signals cited include GSK-NOETIK, Eli Lilly-Chai Discovery, Isomorphic Labs' pharma partnerships, and Anthropic's acquisition of Coefficient Bio. The context is that large pharma and frontier AI labs have begun to treat AI-driven drug discovery as a strategic capability rather than as mere software tooling.

•

BVP puts forward three core principles: biology-native data at scale, agentic AI across R&D workflows, and closed-loop lab automation. Companies that combine all three are seen as able to build a durable advantage in the AI biology era.

•

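Biology-native data is described not simply as a large dataset but as data whose modality, scale, fidelity, and context fit the drug mechanism and the development decisions at hand. Public resources such as PDB, UniProt, and ChEMBL are powerful, but the piece points out that they center on static structures and sequences, which leaves gaps for learning downstream drug properties.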

•

The bias in PDB gets particular emphasis. Proteins that are stable, soluble, and amenable to crystallization are overrepresented, while membrane proteins, intrinsically disordered proteins, transient complexes, and dynamic conformational ensembles are comparatively scarce. The problem is that states important for druggability, such as allosteric sites and ligand-induced conformations, are poorly captured in static snapshots.

•

At the hit-to-lead and development candidate stages, confirming binding is not enough. Properties such as developability, immunogenicity, off-target effects, thermostability, solubility, and aggregation propensity are needed, but the piece explains that high-quality public supervised datasets for these properties are lacking.

•

Agentic AI workflows are presented as an R&D operating layer that connects literature, patents, databases, bioinformatics pipelines, experiment planning, and report writing. The view is that modular infrastructure for quickly testing and orchestrating new models and tools matters more than any one particular model.

•

Closed-loop lab automation refers to a setup in which in silico design outputs are quickly validated in wet-lab experiments and the results are fed back into the models and the decision loop. The piece holds that the design-make-test-analyze cycle can take up to three years in lead optimization, and that CRO outsourcing can limit iteration speed and data consistency.

•

On the lab automation side, the piece mentions Medra, Automata, Dash Bio, and Lila Sciences, among others. Their approaches differ, spanning robot control, instrument orchestration, automated CROs, and fully automated labs, but the shared goal is to shorten the cycle time of the biological learning loop.

•

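From an AI-bio perspective, the key implication of the piece is that the moat is shifting from a single foundation model to the biological learning loop. It reads as an argument that companies that tie together proprietary data generation, agentic R&D workflows, and automated wet-lab feedback to run the design-test-learn cycle faster will be favored over the long term.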

Source

https://www.bvp.com/atlas/building-biology-native-data-infrastructure-for-the-ai-era