Accurate registration of brain MRI scans is fundamental for cross-subject analysis in neuroscientific studies. This involves aligning both the cortical surface of the brain and the interior volume. Traditional methods treat volumetric and surface-based registration separately, which often leads to inconsistencies that limit downstream analyses. We propose a deep learning framework, UCS, that registers 3D brain MRI scans by jointly aligning both cortical and subcortical regions through a unified volume-and-surface-based representation. Our approach leverages an intermediate spherical coordinate space to bridge anatomical surface topology with volumetric anatomy, enabling consistent and anatomically accurate alignment. By integrating spherical registration into the learning process, our method ensures geometric coherence between the volume and surface domains. In a series of experiments on both in-domain and out-of-domain datasets, our method consistently outperforms both classical and machine learning-based registration methods, improving the Dice score by up to 7 points while maintaining regular deformation fields. Additionally, it is orders of magnitude faster than the standard method for this task and is simpler to use because it requires no additional inputs beyond an MRI scan. Its superior accuracy, fast inference, and ease of use set a new standard for joint cortical and subcortical registration.
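Since the comparison above hinges on Dice overlap between warped anatomical label maps, here is a minimal sketch of how such a metric is typically computed; the function names and label handling are illustrative assumptions, not taken from the UCS codebase.

```python
import numpy as np

def dice_score(seg_a: np.ndarray, seg_b: np.ndarray, label: int) -> float:
    """Dice overlap for one anatomical label between two segmentations."""
    a = seg_a == label
    b = seg_b == label
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom > 0 else 1.0

def mean_dice(warped_seg: np.ndarray, fixed_seg: np.ndarray, labels) -> float:
    """Mean Dice of a warped (moving) segmentation against the fixed one,
    averaged over cortical and subcortical labels."""
    return float(np.mean([dice_score(warped_seg, fixed_seg, l) for l in labels]))
```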
Co-authors: Mazdak Abulnaga, Andrew Hoopes, Malte Hoffmann, Robin Magnet, Maks Ovsjanikov, Lilla Zollei, John Guttag, Bruce Fischl, Adrian V Dalca
Generative models of 3D cardiovascular anatomy can synthesize informative structures for clinical research and medical device evaluation, but face a trade-off between geometric controllability and realism. We propose CardioComposer: a programmable, inference-time framework for generating multi-class anatomical label maps from interpretable ellipsoidal primitives. These primitives represent geometric attributes such as the size, shape, and position of discrete substructures. We specifically develop differentiable measurement functions based on voxel-wise geometric moments, enabling loss-based gradient guidance during diffusion model sampling. We demonstrate that these losses can constrain individual geometric attributes in a disentangled manner and provide compositional control over multiple substructures. Finally, we show that our method is compatible with a broad range of anatomical systems containing non-convex substructures, spanning cardiac, vascular, and skeletal organs. We release our code at https://github.com/kkadry/CardioComposer.
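As a sketch of the kind of differentiable measurement function described above, the snippet below computes probability-weighted geometric moments (centroid and second central moments) of a soft label map with PyTorch. It is a hedged illustration under assumed conventions, not the authors' implementation.

```python
import torch

def soft_moments(prob: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """prob: (D, H, W) voxel-wise probability of one substructure.
    Returns the probability-weighted centroid and covariance (second
    central moments), both differentiable with respect to prob."""
    grids = torch.meshgrid(
        *[torch.arange(s, dtype=prob.dtype) for s in prob.shape], indexing="ij"
    )
    coords = torch.stack([g.reshape(-1) for g in grids], dim=-1)  # (N, 3)
    w = prob.reshape(-1)
    w = w / w.sum()
    centroid = (w[:, None] * coords).sum(dim=0)  # position of the substructure
    centered = coords - centroid
    cov = (w[:, None, None] * centered[:, :, None] * centered[:, None, :]).sum(dim=0)
    return centroid, cov  # cov eigenvalues encode ellipsoid size and shape

# A guidance loss could then penalize deviation from a target primitive,
# e.g. loss = ((centroid - target_position) ** 2).sum(), and its gradient
# can steer the diffusion sampler.
```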
Co-authors: Karim Kadry, Shoaib A. Goraya, Ajay Manicka, Abdalla Abdelwahed, Naravich Chutisilp, Farhad R. Nezami, Elazer R Edelman
Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in malicious queries. Prior jailbreak research mainly augments these queries with additional string transformations to maximize attack success rate (ASR). However, the impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries. By evaluating LLMs across seven benchmarks, we find that nearly all models exhibit ASR inflation. Notably, the inflation correlates with an LLM's relative attention to style patterns, which also overlap more with its instruction-tuning data when inflation occurs. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs, six fine-tuning style settings, and two real-world instruction-tuning datasets, SafeStyle consistently outperforms baselines in maintaining LLM safety.
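A minimal sketch of the ASR-inflation definition, assuming a benchmark judge function `is_harmful` (a placeholder name, not from the paper):

```python
def attack_success_rate(responses, is_harmful) -> float:
    """Fraction of model responses judged harmful by the benchmark judge."""
    return sum(is_harmful(r) for r in responses) / len(responses)

def asr_inflation(styled_responses, plain_responses, is_harmful) -> float:
    """ASR with style patterns present minus ASR with styles removed
    from the otherwise identical malicious queries."""
    return (attack_success_rate(styled_responses, is_harmful)
            - attack_success_rate(plain_responses, is_harmful))
```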
Co-authors: Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Menon Suriyakumar, Marzyeh Ghassemi Learn more
Large language models (LLMs) often produce confident yet incorrect responses, and uncertainty quantification offers one path toward more robust use. Recent works routinely rely on self-consistency to estimate aleatoric uncertainty (AU), yet this proxy collapses when models are overconfident and produce the same incorrect answer across samples. We analyze this regime and show that cross-model semantic disagreement is higher on incorrect answers precisely when AU is low. Motivated by this, we introduce an epistemic uncertainty (EU) term that operates in the black-box access setting: EU uses only generated text from a small, scale-matched ensemble and is computed as the gap between inter-model and intra-model sequence-semantic similarity. We then define total uncertainty (TU) as the sum of AU and EU. In a comprehensive study across five 7-9B instruction-tuned models and ten long-form tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures where AU is low. We further characterize when EU is most useful via agreement and complementarity diagnostics.
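The EU term can be sketched in a few lines: given sampled answers from a small ensemble, EU is the gap between average intra-model and inter-model semantic similarity, and TU is then simply AU + EU. The `embed` encoder and the cosine-similarity choice below are assumptions for illustration.

```python
import itertools
import numpy as np

def _mean_pairwise_sim(vecs_a, vecs_b=None) -> float:
    """Mean cosine similarity over pairs (within one set, or across two)."""
    if vecs_b is None:
        pairs = itertools.combinations(vecs_a, 2)  # needs >= 2 samples per model
    else:
        pairs = itertools.product(vecs_a, vecs_b)
    sims = [float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            for a, b in pairs]
    return float(np.mean(sims))

def epistemic_uncertainty(samples_per_model, embed) -> float:
    """samples_per_model: one list of answer strings per ensemble model."""
    embs = [[embed(s) for s in samples] for samples in samples_per_model]
    intra = np.mean([_mean_pairwise_sim(e) for e in embs])
    inter = np.mean([_mean_pairwise_sim(ea, eb)
                     for ea, eb in itertools.combinations(embs, 2)])
    return float(intra - inter)  # low inter-model agreement => high EU
```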
Co-authors: Kimia Hamidieh, Veronika Thost, Walter Gerych, Mikhail Yurochkin, Marzyeh Ghassemi
Vision-language models (VLMs), including CLIP, are known to encode biases, such as spurious correlations that falsely associate background attributes with particular labels. Debiasing approaches typically aim to isolate and remove subspaces corresponding to a target concept by projecting embeddings away from the concept. This strategy succeeds in debiasing VLM embeddings with respect to the concepts considered but can amplify biased shortcuts in unconsidered concepts. In practice, it is impossible to enumerate all possible biases, meaning that an increase in bias can go unobserved during evaluation. We propose a debiasing approach for a set of known concepts such that the relationship to the remaining, unconsidered concepts is minimally changed. We achieve this by rotating the VLM's embeddings within only a relevant subspace, rather than removing these subspaces, which mitigates unintended bias amplification.
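To make the geometric idea concrete, the sketch below contrasts classic projection-based removal with a rotation applied only inside the concept subspace; the orthonormal basis B and rotation R are hypothetical stand-ins for what such a method would learn.

```python
import numpy as np

def project_away(x: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Classic debiasing: remove the concept subspace entirely.
    x: (d,) embedding; B: (d, k) orthonormal basis of the concept subspace."""
    return x - B @ (B.T @ x)

def rotate_within_subspace(x: np.ndarray, B: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Rotate only the in-subspace coordinates with R (k x k); the
    orthogonal complement, carrying all unconsidered concepts, is
    left untouched."""
    coords = B.T @ x                          # coordinates in the concept subspace
    return x - B @ coords + B @ (R @ coords)  # replace with rotated component
```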
Co-authors: Walter Gerych, Cassandra Parent, Quinn Perian, Rafiya Javed, Justin Solomon, Marzyeh Ghassemi
Methods
We developed a deep learning model to Predict changes in left ventricULar Systolic function from Electrocardiograms (ECG) of patients who have Heart Failure (PULSE-HF). The model integrates 12-lead ECG waveforms with a patient's history of prior LVEF measurements to calculate the likelihood that the LVEF will be less than 40% during the year after the ECG is obtained. The model was retrospectively developed and tested using data from one hospital and externally validated on retrospective cohorts from two other hospitals. The internal development data were collected between January 1, 2000, and June 30, 2021. The external validation datasets were collected between January 1, 2000, and June 30, 2021 at one hospital and between 2008 and 2019 at the other.
Findings
PULSE-HF demonstrates strong discriminatory ability in forecasting whether the LVEF will fall below 40% within the next year, achieving areas under the receiver operating characteristic curve (AUROC) of 87.5–91.4% across all three HF cohorts. Among patients with HF who have a baseline LVEF above 40%, PULSE-HF effectively identified those at risk of worsening LVEF, with AUROCs of 81.6–86.3% across all three datasets. PULSE-HF's discriminatory ability remained consistently high across a range of subgroups with different comorbidities and regardless of medical therapy. Assuming an underlying prevalence of LVEF worsening of 10% per year and a sensitivity of 80%, PULSE-HF's negative predictive values exceed 97%. Lastly, we demonstrate that a lead I version of PULSE-HF performs similarly to the model that uses all 12 ECG leads.
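The NPV claim can be sanity-checked with a short calculation: at 10% prevalence and 80% sensitivity, NPV depends only on specificity. The specificity used below is an illustrative assumption, not a figure from the study.

```python
def npv(prevalence: float, sensitivity: float, specificity: float) -> float:
    """Negative predictive value from the standard 2x2 identities."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

print(npv(0.10, 0.80, 0.75))  # ~0.971, consistent with NPV > 97%
```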
Interpretation
PULSE-HF robustly predicts worsening LVEF in patients who have a prior diagnosis of HF. The method provides a platform for identifying patients at increased risk of worsening systolic dysfunction.
Co-authors: Teya Bergamaschi, Tiffany Yau, Payal Chandak, Abena Kyereme-Tuah, Judy Hung, Hanna Gaggin, Isaac S Kohane, Collin M Stultz
Background
Physical restraints are widely used in intensive care units (ICUs) despite uncertain clinical benefit and risks. We aimed to characterise patterns of restraint use, demographic and clinical predictors, and temporal trends before and after introduction of federal restraint-related reporting requirements.
Methods
We conducted a retrospective cross-sectional study of 51,838 adults admitted to ICUs at Beth Israel Deaconess Medical Center, Boston, MA, USA, between 2008 and 2022, using data from the Medical Information Mart for Intensive Care IV (MIMIC-IV) electronic health record repository. Primary outcome was the proportion of ICU days with documented physical restraint use. Associations between restraint use and demographic and clinical factors were estimated using a binomial generalised linear model with a logit link. Propensity score matching compared Black and White patients under varying adjustment specifications.
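As a hedged sketch of the primary analysis, the snippet below fits a binomial GLM with a logit link to per-patient (restrained days, unrestrained days) counts using statsmodels; the covariates and data are placeholders, not the study's variables.

```python
import pandas as pd
import statsmodels.api as sm

# Toy per-patient data: days restrained out of total ICU days.
df = pd.DataFrame({
    "restrained_days": [2, 0, 5],
    "icu_days": [4, 3, 7],
    "age": [71, 58, 64],
    "male": [1, 0, 1],
})
X = sm.add_constant(df[["age", "male"]])
# Binomial endog as (successes, failures): the proportion of ICU days
# with documented restraint, per patient.
y = df[["restrained_days"]].assign(
    unrestrained_days=df["icu_days"] - df["restrained_days"]
)
model = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit link by default
print(model.summary())
```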
Findings
Among 51,838 patients (mean age 63.8 years; 57% male), 21,091 (40.7%) experienced physical restraint. Use increased from 36.9% in 2008–10 to 44.0% in 2020–22 (p < 0.0001). Asian (aOR 0.84, 95% CI 0.79–0.89) and Hispanic/Latino patients (aOR 0.87, 95% CI 0.83–0.92) had lower odds of restraint than White patients. Propensity score matching between Black and White patients revealed that racial patterns were highly sensitive to model specification: excluding demographic characteristics revealed significant disparities, which were attenuated when psychiatric diagnoses were also excluded. Matched White patients were not representative of all White ICU patients but rather a subset resembling Black patients on observed characteristics.
Contributors: Maximin Lange, Leo A. Celi, Ben Carter, Jesse D. Raffa, Sharon C. O'Donoghue, Marzyeh Ghassemi, Tom J. Pollard
Access to accurate predictions of patients’ outcomes can enhance decision making within healthcare institutions. Hartford HealthCare has been collaborating with academics and consultants to predict short- and medium-term outcomes for all inpatients across their seven hospitals. We develop machine learning models that predict the probabilities of next 24-hour/48-hour discharge and intensive care unit transfers, end-of-stay mortality, and discharge dispositions. All models achieve a high out-of-sample area under the receiver operating characteristic curve (AUROC; 75.7%–92.5%) and are well calibrated. In addition, combining 48-hour discharge predictions with doctors’ predictions simultaneously enables more patient discharges (10%–28.7%) and fewer 7-day/30-day readmissions (p < 0.001). We implement an automated pipeline that extracts data and updates predictions every morning, as well as user-friendly software and a color-coded alert system to communicate these patient-level predictions to clinical teams. Since its deployment, more than 200 doctors, nurses, and case managers across seven hospitals have been using the tool in their daily patient review process. With our tool, we find that doctors start the administrative discharge process earlier, leading to a significant reduction in the average length of stay (0.63 days per patient). We anticipate substantial financial benefits (between $52 and $67 million annually) for the healthcare system.
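For readers unfamiliar with the reported metrics, here is a brief sketch of how out-of-sample discrimination (AUROC) and calibration are typically assessed with scikit-learn; the data are illustrative.

```python
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

y_true = [0, 1, 0, 1, 1, 0, 0, 1]                # e.g., 48-hour discharge labels
y_prob = [0.2, 0.8, 0.3, 0.6, 0.9, 0.1, 0.4, 0.7]  # model-predicted probabilities

auroc = roc_auc_score(y_true, y_prob)
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=2)
print(f"AUROC = {auroc:.3f}")
print(list(zip(mean_pred, frac_pos)))  # well calibrated if the columns track closely
```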
Contributors: Liangyuan Na, Kimberly Villalobos Carballo, Jean Pauphilet, Ali Haddad-Sisakht, Daniel Kombert, Melissa Boisjoli-Langlois, Andrew Castiglione, Maram Khalifa, Pooja Hebbal, Barry Stein, Dimitris Bertsimas
Cancer patients often lack timely education and personalized support due to clinician workload. This quality improvement study develops and evaluates a Large Language Model (LLM) agent, MedEduChat, which is integrated with the clinic’s electronic health records (EHR) and designed to enhance prostate cancer patient education. Fifteen non-metastatic prostate cancer patients and three clinicians recruited from the Mayo Clinic interacted with the agent between May 2024 and April 2025. Findings showed that MedEduChat achieved a high usability score (UMUX = 83.7/100) and improved patients’ health confidence (Health Confidence Score rose from 9.9 to 13.9). Clinicians evaluated the patient-chat interaction history and rated MedEduChat as highly correct (2.9/3), complete (2.7/3), and safe (2.7/3), with moderate personalization (2.3/3). This study highlights the potential of LLM agents to improve patient engagement and health education.
Contributors: Yuexing Hao, Jason Holmes, Mark R. Waddle, Brian J. Davis, Nathan Y. Yu, Kristin S. Vickers, Heather Preston, Drew Margolin, Corinna E. Löckenhoff, Aditya Vashistha, Saleh Kalantari, Marzyeh Ghassemi, and Wei Liu
For an LLM to correctly respond to an instruction, it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that syntactic templates, i.e., frequent sequences of part-of-speech (PoS) tags, are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 ± 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models and show that it occurs on a subset of the FlanV2 dataset in both open (OLMo-2-7B; Llama-4-Maverick) and closed (GPT-4o) models. Finally, we present a case study on the implications for LLM security, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
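As an illustrative sketch, syntactic templates of the kind described above can be surfaced by counting frequent PoS n-grams; the template length and the NLTK resources below are assumptions, not the paper's settings.

```python
from collections import Counter
import nltk

# Tokenizer and tagger models; newer NLTK versions may instead require
# "punkt_tab" and "averaged_perceptron_tagger_eng".
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_ngrams(texts, n=4):
    """Count PoS-tag n-grams across a corpus; frequent ones are templates."""
    counts = Counter()
    for text in texts:
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
        counts.update(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return counts

corpus = ["List three causes of inflation.", "List three symptoms of flu."]
print(pos_ngrams(corpus).most_common(3))  # templates shared across domains
```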
Contributors: Chantal Shaib, Vinith Suriyakumar, Byron Wallace, Marzyeh Ghassemi