Focus Area: Drug Discovery

June 06, 2025

Boltz-2: Towards accurate and efficient binding prediction

Accurately modeling biomolecular interactions is a central challenge in modern biology. While recent advances, such as AlphaFold3 and Boltz-1, have substantially improved our ability to predict biomolecular complex structures, these models still fall short in predicting binding affinity, a critical property underlying molecular function and therapeutic efficacy. Here, we present Boltz-2, a new structural biology foundation model that exhibits strong performance for both structure and affinity prediction. Boltz-2 introduces controllability features including experimental method conditioning, distance constraints, and multi-chain template integration for structure prediction, and is, to our knowledge, the first AI model to approach the performance of free-energy perturbation (FEP) methods in estimating small molecule–protein binding affinity. Crucially, it achieves strong correlation with experimental readouts on many benchmarks, while being at least 1000× more computationally efficient than FEP. By coupling Boltz-2 with a generative model for small molecules, we demonstrate an effective workflow to find diverse, synthesizable, high-affinity binders, as estimated by absolute FEP simulations on the TYK2 target. To foster broad adoption and further innovation at the intersection of machine learning and biology, we are releasing Boltz-2 weights, inference, and training code under a permissive open license, providing a robust and extensible foundation for both academic and industrial research.

Co-authors: Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Dominique Beaini, Tommi Jaakkola, Regina Barzilay

February 06, 2025

Protein codes promote selective subcellular compartmentalization

Cells have evolved mechanisms to distribute ~10 billion protein molecules to subcellular compartments where diverse proteins involved in shared functions must assemble. Here, we demonstrate that proteins with shared functions share amino acid sequence codes that guide them to compartment destinations. A protein language model, ProtGPS, was developed that predicts with high performance the compartment localization of human proteins excluded from the training set. ProtGPS successfully guided generation of novel protein sequences that selectively assemble in the nucleolus. ProtGPS identified pathological mutations that change this code and lead to altered subcellular localization of proteins. Our results indicate that protein sequences contain not only a folding code, but also a previously unrecognized code governing their distribution to diverse subcellular compartments.

Contributors: Henry R. Kilgore, Itamar Chinn, Peter G. Mikhael, Ilan Mitnikov, Catherine Van Dongen, Guy Zylberberg, Lena Afeyan, Salman F. Banani, Susana Wilson-Hawken, Tong Ihn Lee, and Richard A. Young

February 04, 2025

Deep learning enhances the prediction of HLA class I-presented CD8+ T cell epitopes in foreign pathogens

Accurate in silico determination of CD8+ T cell epitopes would greatly enhance T cell-based vaccine development, but current prediction models are not reliably successful. Here, motivated by recent successes applying machine learning to complex biology, we curated a dataset of 651,237 unique human leukocyte antigen class I (HLA-I) ligands and developed MUNIS, a deep learning model that identifies peptides presented by HLA-I alleles. MUNIS shows improved performance compared with existing models in predicting peptide presentation and CD8+ T cell epitope immunodominance hierarchies. Moreover, application of MUNIS to proteins from Epstein–Barr virus led to successful identification of both established and novel HLA-I epitopes which were experimentally validated by in vitro HLA-I-peptide stability and T cell immunogenicity assays. MUNIS performs comparably to an experimental stability assay in terms of immunogenicity prediction, suggesting that deep learning can reduce experimental burden and accelerate identification of CD8+ T cell epitopes for rapid T cell vaccine development.

Contributors: Anusha Nathan, Nitan Shalon, Charles R. Crain, Rhoda Tano-Menka, Benjamin Goldberg, Emma Richards, Gaurav D. Gaiha

November 22, 2024

Boltz-1 Democratizing Biomolecular Interaction Modeling

Understanding biomolecular interactions is fundamental to advancing fields like drug discovery and protein design. In this paper, we introduce Boltz-1, an open-source deep learning model that incorporates innovations in model architecture, speed optimization, and data processing to achieve AlphaFold3-level accuracy in predicting the 3D structures of biomolecular complexes. Boltz-1 demonstrates performance on par with state-of-the-art commercial models on a range of diverse benchmarks, setting a new benchmark for commercially accessible tools in structural biology. By releasing the training and inference code, model weights, datasets, and benchmarks under the MIT open license, we aim to foster global collaboration, accelerate discoveries, and provide a robust platform for advancing biomolecular modeling.

Contributors: Jeremy Wohlwend, Gabriele Corso, Saro Passaro, Mateo Reveiz, Ken Leidal, Wojtek Swiderski, Tally Portnoi, Itamar Chinn, Jacob Silterra

May 22, 2024

MicrobioRaman: an open-access web repository for microbiological Raman spectroscopy data

Here we present the establishment of an open-access web-based repository for microbiological Raman spectroscopy data. The data collection, called ‘MicrobioRaman’ (https://www.ebi.ac.uk/biostudies/MicrobioRaman/studies), was inspired by the great success and usefulness of research databases such as GenBank and UniProt. This centralized repository, residing within the BioStudies database (which is maintained by a public institution, the European Bioinformatics Institute), minimizes the risk of data loss or eventual abandonment, offering a long-term common reference for analysis with advantages in accessibility and transparency over commercial data analysis tools. We feel that MicrobioRaman will provide a foundation for this growing field by serving as an open-access repository for sharing microbiological Raman data and through the codification of a set of reporting standards.

Contributors: Kang Soo Lee, Zachary Landry, Awais Athar, Uria Alcolombri, Pratchaya Pramoj Na Ayutthaya, David Berry, Philippe de Bettignies, Ji-Xin Cheng, Gabor Csucs, Li Cui, Volker Deckert, Thomas Dieing, Jennifer Dionne, Ondrej Doskocil, Glen D’Souza, Cristina García-Timermans, Notburga Gierlinger, Keisuke Goda, Roland Hatzenpichler, Richard Henshaw, Wei Huang, Ievgeniia Iermak, Natalia Ivleva, Janina Kneipp, Patrick Kubryk, Kirsten Küsel, Tae Kwon Lee, Sung Sik Lee, Bo Ma, Clara Martínez-Pérez, Pavel Matousek, Rainer U. Meckenstock, Wei Min, Peter Mojzeš, Oliver Müller, Naresh Kumar, Per Halkjær Nielsen, Ioan Notingher, Márton Palatinszky, Fátima C. Pereira, Giuseppe Pezzotti, Zdenek Pilat, Filip Plesinger, Jürgen Popp, Alexander Probst, Alessandra Riva, Amr. Saleh, Ota Samek, Haley Sapers, Olga Schubert, Astrid Stubbusch, Gordon Taylor, Michael Wagner, Jing Wang, Huabing Yin, Yang Yue, Renato Zenobi, Jacopo Zini, Ugis Sarkans & Roman Stocker

May 01, 2024

DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents

Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion process to encode data into a simple Gaussian distribution. However, encoding a complex, potentially multimodal data distribution into a single continuous Gaussian distribution arguably represents an unnecessarily challenging learning problem. We propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff) to simplify this task by introducing complementary discrete latent variables. We augment DMs with learnable discrete latents, inferred with an encoder, and train DM and encoder end-to-end. DisCo-Diff does not rely on pre-trained networks, making the framework universally applicable. The discrete latents significantly simplify learning the DM's complex noise-to-data mapping by reducing the curvature of the DM's generative ODE. An additional autoregressive transformer models the distribution of the discrete latents, a simple step because DisCo-Diff requires only a few discrete variables with small codebooks. We validate DisCo-Diff on toy data, several image synthesis tasks, and molecular docking, and find that introducing discrete latents consistently improves model performance. For example, DisCo-Diff achieves state-of-the-art FID scores on class-conditioned ImageNet-64/128 datasets with an ODE sampler.

Contributors: Yilun Xu, Gabriele Corso, Arash Vahdat, Karsten Kreis
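
The DisCo-Diff abstract above notes that a diffusion process encodes data, even multimodal data, into a single Gaussian. A minimal sketch of that idea follows, assuming a generic variance-preserving forward process; this is not necessarily the parameterization used in the paper, and the function name `forward_diffuse`, the toy two-cluster data, and the `alpha_bar` values are illustrative assumptions.

```python
import numpy as np

def forward_diffuse(x0, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I).

    alpha_bar = 1 leaves the data untouched; alpha_bar = 0 yields pure noise.
    """
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

rng = np.random.default_rng(0)

# Toy bimodal data: two tight clusters around -3 and +3 (illustrative only).
modes = rng.choice([-3.0, 3.0], size=20_000)
x0 = modes + 0.1 * rng.standard_normal(modes.shape)

# As alpha_bar shrinks, both modes are mapped into one unit Gaussian.
for alpha_bar in (1.0, 0.5, 0.0):
    xt = forward_diffuse(x0, alpha_bar, rng)
    print(f"alpha_bar={alpha_bar:.1f}  mean={xt.mean():+.2f}  std={xt.std():.2f}")
```

Both clusters end up in the same Gaussian, so the reverse (noise-to-data) mapping has to decide which mode each noise sample belongs to. That is the difficulty the abstract says the discrete latents are meant to ease.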

May 01, 2024

Causal Discovery with Fewer Conditional Independence Tests

Many questions in science center around the fundamental problem of understanding causal relationships. However, most constraint-based causal discovery algorithms, including the well-celebrated PC algorithm, often incur an exponential number of conditional independence (CI) tests, posing limitations in various applications. Addressing this, our work focuses on characterizing what can be learned about the underlying causal graph with a reduced number of CI tests. We show that it is possible to learn a coarser representation of the hidden causal graph with a polynomial number of tests. This coarser representation, named Causal Consistent Partition Graph (CCPG), comprises a partition of the vertices and a directed graph defined over its components. CCPG satisfies consistency of orientations and additional constraints which favor finer partitions. Furthermore, it reduces to the underlying causal graph when the causal graph is identifiable. As a consequence, our results offer the first efficient algorithm for recovering the true causal graph with a polynomial number of tests, in special cases where the causal graph is fully identifiable through observational data and potentially additional interventions.

Contributors: Kirankumar Shiragur, Jiaqi Zhang
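
To make the "exponential number of CI tests" in the abstract above concrete, the sketch below counts every distinct test of the form "X independent of Y given S" on n variables, where S ranges over all subsets of the remaining n - 2 variables. This is the size of the full search space, not the exact number of tests any particular algorithm runs, and the function name `exhaustive_ci_tests` is a hypothetical helper introduced here.

```python
from math import comb

def exhaustive_ci_tests(n):
    """Count all (unordered pair, conditioning set) combinations on n variables.

    There are C(n, 2) pairs, and each pair can be conditioned on any of the
    2**(n - 2) subsets of the remaining variables.
    """
    return comb(n, 2) * 2 ** (n - 2)

# The number of pairs alone grows quadratically; the conditioning sets
# are what make the total grow exponentially.
for n in (4, 10, 20, 30):
    print(f"n={n:>2}  pairs={comb(n, 2):>4}  all CI tests={exhaustive_ci_tests(n):,}")
```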

May 01, 2024

Harmonic Self-Conditioned Flow Matching for joint Multi-Ligand Docking and Binding Site Design

A significant amount of protein function requires binding small molecules, including enzymatic catalysis. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Towards this goal, we first develop HarmonicFlow, an improved generative process over 3D protein-ligand binding structures based on our self-conditioned flow matching objective. FlowSite extends this flow model to jointly generate a protein pocket's discrete residue types and the molecule's binding 3D structure. We show that HarmonicFlow improves upon state-of-the-art generative processes for docking in simplicity, generality, and average sample quality in pocket-level docking. Enabled by this structure modeling, FlowSite designs binding sites substantially better than baseline approaches.

Contributors: Hannes Stark, Bowen Jing

March 24, 2024

CLIPZyme: Reaction-Conditioned Virtual Screening of Enzymes

Computational screening of naturally occurring proteins has the potential to identify efficient catalysts among the hundreds of millions of sequences that remain uncharacterized. Current experimental methods remain time-, cost-, and labor-intensive, limiting the number of enzymes they can reasonably screen. In this work, we propose a computational framework for in-silico enzyme screening. Through a contrastive objective, we train CLIPZyme to encode and align representations of enzyme structures and reaction pairs. With no standard computational baseline, we compare CLIPZyme to existing EC (enzyme commission) predictors applied to virtual enzyme screening and show improved performance in scenarios where limited information on the reaction is available (BEDROC of 44.69%). Additionally, we evaluate combining EC predictors with CLIPZyme and show its generalization capacity on both unseen reactions and protein clusters.

Contributor: Itamar Chinn
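
The CLIPZyme abstract above mentions a contrastive objective that aligns enzyme and reaction representations. The abstract does not give the loss or the encoders, so the following is only a sketch of a generic CLIP-style symmetric contrastive loss on random stand-in embeddings; the function names, the temperature of 0.07, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(enzyme_emb, reaction_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities.

    Row i of each matrix is assumed to be a matching enzyme/reaction pair;
    every other row in the batch serves as a negative.
    """
    e = enzyme_emb / np.linalg.norm(enzyme_emb, axis=1, keepdims=True)
    r = reaction_emb / np.linalg.norm(reaction_emb, axis=1, keepdims=True)
    logits = e @ r.T / temperature
    idx = np.arange(len(e))
    loss_e2r = -log_softmax(logits, axis=1)[idx, idx].mean()  # enzyme -> reaction
    loss_r2e = -log_softmax(logits, axis=0)[idx, idx].mean()  # reaction -> enzyme
    return 0.5 * (loss_e2r + loss_r2e)

rng = np.random.default_rng(0)
reactions = rng.standard_normal((8, 16))                  # stand-in reaction embeddings
aligned = reactions + 0.05 * rng.standard_normal((8, 16))  # enzymes near their partners
mismatched = aligned[::-1]                                 # every pairing broken

print("aligned loss:   ", clip_style_loss(aligned, reactions))
print("mismatched loss:", clip_style_loss(mismatched, reactions))
```

Under this kind of objective the loss is small when each enzyme embedding sits closest to its own reaction, so at screening time candidate enzymes can be ranked by similarity to a query reaction.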

March 04, 2024

Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design

Combining discrete and continuous data is an important capability for generative models. We present Discrete Flow Models (DFMs), a new flow-based model of discrete data that provides the missing link in enabling flow-based generative models to be applied to multimodal continuous and discrete data problems. Our key insight is that the discrete equivalent of continuous space flow matching can be realized using Continuous Time Markov Chains. DFMs benefit from a simple derivation that includes discrete diffusion models as a specific instance while allowing improved performance over existing diffusion-based approaches. We utilize our DFMs method to build a multimodal flow-based modeling framework. We apply this capability to the task of protein co-design, wherein we learn a model for jointly generating protein structure and sequence. Our approach achieves state-of-the-art co-design performance while allowing the same multimodal model to be used for flexible generation of the sequence or structure.

Contributors: Andrew Campbell, Jason Yim, Tom Rainforth