Inspired by the effectiveness of genetic algorithms and the importance of synthesizability in molecular design, we present SynGA, a simple genetic algorithm that operates directly over synthesis routes. Our method features custom crossover and mutation operators that explicitly constrain it to synthesizable molecular space. By modifying the fitness function, we demonstrate the effectiveness of SynGA on a variety of design tasks, including synthesizable analog search and sample-efficient property optimization, for both 2D and 3D objectives. Furthermore, by coupling SynGA with a machine learning-based filter that focuses the building block set, we boost SynGA to state-of-the-art performance. For property optimization, this manifests as a model-based variant SynGBO, which employs SynGA and block filtering in the inner loop of Bayesian optimization. Since SynGA is lightweight and enforces synthesizability by construction, our hope is that SynGA can serve not only as a strong standalone baseline but also as a versatile module that can be incorporated into larger synthesis-aware workflows in the future.
Co-authors: Alston Lo, Connor W. Coley, Wojciech Matusik
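As a rough illustration of what "enforces synthesizability by construction" can mean for a genetic algorithm, the toy sketch below keeps every individual inside a fixed valid set because its crossover and mutation operators only ever recombine or draw from that set. Everything here is a hypothetical stand-in: a "route" is just a tuple of block IDs, `BLOCKS`, `crossover`, `mutate`, and `evolve` are illustrative names, and real synthesis routes (with reactions and tree structure) are far richer. This is not the SynGA implementation.

```python
import random

# Hypothetical toy setup: a "route" is a tuple of building-block IDs, and a
# route counts as valid when every ID comes from the allowed block set.
BLOCKS = ["A", "B", "C", "D", "E", "F"]


def crossover(p1, p2, rng):
    """One-point crossover: a prefix of p1 joined to a suffix of p2."""
    cut = rng.randint(1, min(len(p1), len(p2)) - 1)
    return p1[:cut] + p2[cut:]


def mutate(route, rng):
    """Replace one position with a block drawn from the allowed set."""
    i = rng.randrange(len(route))
    return route[:i] + (rng.choice(BLOCKS),) + route[i + 1:]


def evolve(fitness, pop_size=20, length=4, generations=30, seed=0):
    """Run a small elitist GA and return the fittest route found."""
    rng = random.Random(seed)
    pop = [tuple(rng.choice(BLOCKS) for _ in range(length))
           for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged, refill the rest with offspring.
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pop = parents + children
    return max(pop, key=fitness)


# Swapping the fitness function changes the design task; this one is a
# made-up objective that simply counts occurrences of block "A".
best = evolve(lambda route: route.count("A"))
```

Because both operators only produce tuples built from `BLOCKS`, no validity check or repair step is needed after crossover or mutation, which is the property this sketch is meant to show.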
Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
Co-authors: Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu
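The abstract above does not spell out how the two bounds are combined, so the following is only a sketch of why having both is useful. The symbols $L_\theta$, $U_\theta$, and $A$ are notation assumed here for a lower bound, an upper bound, and a reward-derived advantage; they are not taken from the paper.

```latex
% Assumed notation: L_\theta(x) \le \log \pi_\theta(x) \le U_\theta(x), advantage A(x).
\begin{align*}
A(x) \ge 0 &\;\Longrightarrow\; A(x)\,\log \pi_\theta(x) \;\ge\; A(x)\,L_\theta(x), \\
A(x) < 0   &\;\Longrightarrow\; A(x)\,\log \pi_\theta(x) \;\ge\; A(x)\,U_\theta(x), \\
\text{hence}\quad
\mathbb{E}\big[A(x)\,\log \pi_\theta(x)\big]
  &\;\ge\; \mathbb{E}\big[\mathbf{1}\{A(x) \ge 0\}\,A(x)\,L_\theta(x)
  + \mathbf{1}\{A(x) < 0\}\,A(x)\,U_\theta(x)\big].
\end{align*}
```

A lower bound alone (such as the ELBO) gives a valid surrogate only for the non-negative-advantage samples; multiplying it by a negative advantage flips the inequality, which is where an upper bound is needed.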
The performance of flow matching and diffusion models can be greatly improved at inference time using reward adaptation algorithms, yet efficiency remains a major limitation. While several algorithms have been proposed, we demonstrate that a common bottleneck is the sampling method these algorithms rely on: many require sampling Markov transitions via SDE sampling, which is significantly less efficient and often less performant than ODE sampling. To remove this bottleneck, we introduce GLASS Flows, a new sampling paradigm that simulates a “flow matching model within a flow matching model” to sample Markov transitions. As we show in this work, this “inner” flow matching model can be retrieved from any pre-trained model without any re-training, effectively combining the efficiency of ODEs with the stochastic evolution of SDEs. On large-scale text-to-image models, we show that GLASS Flows eliminate the trade-off between stochastic evolution and efficiency. GLASS Flows improve state-of-the-art performance in text-to-image generation, making them a simple, drop-in solution for inference-time scaling of flow and diffusion models.
Co-authors: Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T. Q. Chen, Yaron Lipman, Brian Karrer
Progress and potential
Blending polymers is a cost-effective strategy to develop functional materials using existing components, yet the design space is vast, and traditional trial-and-error approaches are inefficient. In this work, we introduce an autonomous, data-driven workflow integrated with a robotic platform for discovering functional random heteropolymer blends. This system successfully identified blends that outperform their individual components in protein stabilization. While previous efforts have focused primarily on the monomer composition of random heteropolymers, our results highlight the potential to make discoveries from complex polymer blend systems. This methodology could be generalized to other material discovery campaigns, from optimizing electrolytes for batteries to improving drug excipient combinations. The dataset released with this study also provides a valuable resource for advancing polymer informatics in blend design.
Highlights
• A data-driven robotic platform was developed to discover functional polymer blends
• The platform enabled efficient optimization from high-dimensional blending spaces
• Blends of random heteropolymers can outperform individual components in function
• Segment-level features correlated with improved protein stabilization
Contributors: Guangqi Wu, Tianyi Jin, Alfredo Alexander-Katz, Connor Coley
Designing new enzymes typically begins with idealized arrangements of catalytic functional groups around a reaction transition state, then attempts to generate protein structures that precisely position these groups. Current AI-based methods can create active enzymes but require predefined residue positions and rely on reverse-building residue backbones from side-chain placements, which limits design flexibility. Here we show that a new deep generative model, RoseTTAFold diffusion 2 (RFdiffusion2), overcomes these constraints by designing enzymes directly from functional group geometries without specifying residue order or performing inverse rotamer generation. RFdiffusion2 successfully generates scaffolds for all 41 active sites in a diverse benchmark, compared to 16 using previous methods. We further design enzymes for three distinct catalytic mechanisms and identify active candidates after experimentally testing fewer than 96 sequences in each case. These results highlight the potential of atomic-level generative modeling to create de novo enzymes directly from reaction mechanisms.
Contributors: Woody Ahern, Jason Yim, Doug Tischer, Saman Salike, Seth M. Woodbury, Donghyo Kim, Indrek Kalvet, Yakov Kipnis, Brian Coventry, Han Raut Altae-Tran, Magnus S. Bauer, Regina Barzilay, Tommi S. Jaakkola, Rohith Krishna, David Baker
We introduce BoltzGen, an all-atom generative model for designing proteins and peptides across all modalities to bind a wide range of biomolecular targets. BoltzGen builds strong structural reasoning capabilities about target-binder interactions into its generative design process. This is achieved by unifying design and structure prediction, resulting in a single model that also reaches state-of-the-art folding performance. BoltzGen’s generation process can be controlled with a flexible design specification language over covalent bonds, structure constraints, binding sites, and more. We experimentally validate these capabilities in a total of eight diverse wetlab design campaigns with functional and affinity readouts across 26 targets. The experiments span binder modalities from nanobodies to disulfide-bonded peptides and include targets ranging from disordered proteins to small molecules. For instance, we test 15 nanobody and protein binder designs against each of nine novel targets with low similarity to any protein with a known bound structure. For both binder modalities, this yields nanomolar binders for 66% of targets. We release model weights, data, and both inference and training code at: https://github.com/HannesStark/boltzgen.
Co-authors: Hannes Stark, Felix Faltings, MinGyu Choi, Yuxin Xie, Eunsu Hur, Timothy O’Donnell, Anton Bushuiev, Talip Uçar, Saro Passaro, Weian Mao, Mateo Reveiz, Roman Bushuiev, Tomáš Pluska, Josef Sivic, Karsten Kreis, Arash Vahdat, Shamayeeta Ray, Jonathan T. Goldstein, Andrew Savinov, Jacob A. Hambalek, Anshika Gupta, Diego A. Taquiri-Diaz, Yaotian Zhang, A. Katherine Hatstat, Angelika Arada, Nam Hyeong Kim, Ethel Tackie-Yarboi, Dylan Boselli, Lee Schnaider, Chang C. Liu, Gene-Wei Li, Denes Hnisz, David M. Sabatini, William F. DeGrado, Jeremy Wohlwend, Gabriele Corso, Regina Barzilay, Tommi Jaakkola
Recent advances in artificial intelligence (AI) have propelled materials discovery by identifying unique composition pathways at unprecedented speed. However, experimental characterization, the step where new materials are actually tested, still lags behind. Traditional characterization requires specialized instruments that measure electromagnetic responses in a painstaking, expert-driven process. SpectroGen offers a transformative solution. By coupling physics-inspired distribution models (e.g., Gaussians and Lorentzians) with a robust variational autoencoder framework, SpectroGen rapidly generates “virtual” spectra that correlate almost perfectly with actual measurements. This approach effectively bridges the gap between AI-driven materials discovery and real-world verification. SpectroGen’s universal compatibility also makes it flexible: any spectroscopy technique that can be represented by analytic functions may be harnessed within its platform.
The potential impact is substantial. High-throughput screening—vital for developing next-generation catalysts, batteries, superconductors, and pharmaceuticals—can now be accelerated without sacrificing accuracy. Researchers stand to gain significant time and resource savings, as they can prioritize the most promising candidate materials for detailed follow-up. This synergy of fast AI-driven discovery and swift AI-enabled characterization could catalyze breakthroughs vital to society, from clean energy solutions to advanced medical treatments. Beyond accelerating fundamental research, SpectroGen’s capacity for rapid prototyping and validation is poised to reshape how we innovate, ultimately translating into critically needed technologies that better serve humanity.
Contributors: Yanmin Zhu, Loza F. Tadesse
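For readers unfamiliar with the analytic line shapes mentioned in the SpectroGen summary, the sketch below writes out the standard area-normalized Gaussian and Lorentzian profiles and sums a few of them into a toy spectrum. The function names, the `peaks` tuple layout, and the example peak values are illustrative assumptions; this shows only the analytic functions, not the SpectroGen model.

```python
import math


def gaussian(x, center, sigma):
    """Area-normalized Gaussian line shape with standard deviation sigma."""
    z = (x - center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def lorentzian(x, center, gamma):
    """Area-normalized Lorentzian with half width at half maximum gamma."""
    return (gamma / math.pi) / ((x - center) ** 2 + gamma ** 2)


def spectrum(xs, peaks):
    """Sum of weighted line shapes.

    peaks is a list of (shape_function, amplitude, center, width) tuples;
    this layout is a hypothetical convenience for the example.
    """
    return [sum(amp * shape(x, c, w) for shape, amp, c, w in peaks)
            for x in xs]


# Made-up example: one Gaussian and one Lorentzian peak on a coarse grid.
xs = [i * 0.5 for i in range(0, 41)]  # 0.0, 0.5, ..., 20.0
ys = spectrum(xs, [(gaussian, 1.0, 5.0, 1.0), (lorentzian, 2.0, 12.0, 0.5)])
```

The two shapes differ mainly in their tails: the Gaussian decays exponentially away from its center, while the Lorentzian decays only as the inverse square of the distance.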
Current clinical antibiotics are largely broad-spectrum agents that can alter the gut microbiome and promote colonization by Enterobacteriaceae, which are often drug resistant. This includes adherent-invasive Escherichia coli (AIEC), particularly in patients with inflammatory bowel disease, in which dysbiosis creates a niche for this pathogen to colonize. There is an urgent and unmet need for novel narrow-spectrum and microbiome-sparing antibiotics. Here we screened 10,747 bioactive small molecules for antibacterial activity against AIEC and discovered enterololin, an antibacterial compound with targeted activity against Enterobacteriaceae species. Enterololin could overcome intrinsic and acquired resistance mechanisms in clinical isolates when combined with a subinhibitory concentration of SPR741, a polymyxin B analogue used here to increase outer membrane permeability in Gram-negative bacteria. Molecular substructure- and deep learning-guided mechanism-of-action investigations revealed that enterololin perturbs lipoprotein trafficking through a mechanism involving the LolCDE complex. Laboratory-evolved resistant mutants predominantly mapped to lolC and lolE, with an in vitro frequency of resistance of ~10⁻⁸ to 10⁻⁷. Enterololin showed low mammalian cytotoxicity (HEK293 half-maximal inhibitory concentration ~100 µg ml⁻¹) and suppressed AIEC infection in mouse models when administered in combination with SPR741, while largely preserving the overall microbiome composition. This study highlights the utility of deep learning methods for predicting molecular interactions and identifies a promising Enterobacteriaceae-specific antibacterial candidate for further development.
Co-authors: Denise B. Catacutan, Vian Tran, Autumn Arnold, Jeremie Alexander, Gabriele Corso, Yeganeh Yousefi, Megan M. Tu, Stewart McLellan, Dominique Tertigas, Jakob Magolan, Michael G. Surette, Eric D. Brown, Brian K. Coombes
Advances in artificial intelligence (AI) are fueling a new paradigm of discoveries in natural sciences. Today, AI has started to advance natural sciences by improving, accelerating, and enabling our understanding of natural phenomena at a wide range of spatial and temporal scales, giving rise to a new area of research known as AI for science (AI4Science). Being an emerging research paradigm, AI4Science is unique in that it is an enormous and highly interdisciplinary area. Thus, a unified and technical treatment of this field is needed, yet challenging. This work aims to provide a technically thorough account of a subarea of AI4Science; namely, AI for quantum, atomistic, and continuum systems. These areas aim at understanding the physical world from the subatomic (wavefunctions and electron density), atomic (molecules, proteins, materials, and interactions), to macro (fluids, climate, and subsurface) scales and form an important subarea of AI4Science. A unique advantage of focusing on these areas is that they largely share a common set of challenges, thereby allowing a unified and foundational treatment. A key common challenge is how to capture physics first principles, especially symmetries, in natural systems by deep learning methods. We provide an in-depth yet intuitive account of techniques to achieve equivariance to symmetry transformations. We also discuss other common technical challenges, including explainability, out-of-distribution generalization, knowledge transfer with foundation and large language models, and uncertainty quantification. To facilitate learning and education, we provide categorized lists of resources that we found to be useful. We strive to be thorough and unified and hope this initial effort may trigger more community interest and efforts to further advance AI4Science.
Co-authors: Xuan Zhang, Limei Wang, Jacob Helwig, Youzhi Luo, Cong Fu, Yaochen Xie, Meng Liu, Yuchao Lin, Zhao Xu, Keqiang Yan, Keir Adams, Maurice Weiler, Xiner Li, Tianfan Fu, Yucheng Wang, Alex Strasser, Haiyang Yu, YuQing Xie, Xiang Fu, Shenglong Xu, Yi Liu, Yuanqi Du, Alexandra Saxton, Hongyi Ling, Hannah Lawrence, Hannes Stärk, Shurui Gui, Carl Edwards, Nicholas Gao, Adriana Ladera, Tailin Wu, Elyssa F. Hofgard, Aria Mansouri Tehrani, Rui Wang, Ameya Daigavane, Montgomery Bohde, Jerry Kurtin, Qian Huang, Tuong Phung, Minkai Xu, Chaitanya K. Joshi, Simon V. Mathis, Kamyar Azizzadenesheli, Ada Fang, Alán Aspuru-Guzik, Erik Bekkers, Michael Bronstein, Marinka Zitnik, Anima Anandkumar, Stefano Ermon, Pietro Liò, Rose Yu, Stephan Günnemann, Jure Leskovec, Heng Ji, Jimeng Sun, Regina Barzilay, Tommi Jaakkola, Connor W. Coley, Xiaoning Qian, Xiaofeng Qian, Tess Smidt and Shuiwang Ji
We develop ProxelGen, a protein structure generative model that operates on 3D densities as opposed to the prevailing 3D point cloud representations. Representing proteins as voxelized densities, or proxels, enables new tasks, conditioning capabilities, and a straightforward path for employing convolutional model architectures with different inductive biases than previous generative models. We generate proteins encoded as proxels via a 3D CNN-based VAE in conjunction with a diffusion model operating on its latent space. Compared to state-of-the-art models, ProxelGen's samples achieve higher novelty and better FID scores while maintaining designability of the training set. ProxelGen's advantages are demonstrated in a standard motif scaffolding benchmark, and we show how 3D density-based generation allows for more flexible shape conditioning.
Contributors: Felix Faltings, Hannes Stärk, Regina Barzilay, Tommi Jaakkola