Access to accurate predictions of patients’ outcomes can enhance decision making within healthcare institutions. Hartford HealthCare has been collaborating with academics and consultants to predict short- and medium-term outcomes for all inpatients across their seven hospitals. We develop machine learning models that predict the probabilities of next 24-hour/48-hour discharge and intensive care unit transfers, end-of-stay mortality, and discharge dispositions. All models achieve high out-of-sample area under the receiver operating characteristic curve (75.7%–92.5%) and are well calibrated. In addition, combining 48-hour discharge predictions with doctors’ predictions simultaneously enables more patient discharges (10%–28.7%) and fewer 7-day/30-day readmissions (p < 0.001). We implement an automated pipeline that extracts data and updates predictions every morning, as well as user-friendly software and a color-coded alert system to communicate these patient-level predictions to clinical teams. Since its deployment, more than 200 doctors, nurses, and case managers across seven hospitals have been using the tool in their daily patient review process. With our tool, we find that doctors start the administrative discharge process earlier, leading to a significant reduction in the average length of stay (0.63 days per patient). We anticipate substantial financial benefits (between $52 and $67 million annually) for the healthcare system.
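As a rough illustration of the evaluation summarized above, the sketch below computes AUROC and a simple per-bin calibration check for a binary discharge classifier. This is not the deployed pipeline; the labels and predicted probabilities are illustrative stand-ins.

```python
# Minimal sketch (not the deployed Hartford HealthCare pipeline):
# evaluating a binary 48-hour-discharge classifier with AUROC and a
# simple calibration check. Labels and scores below are made up.

def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_bins(labels, scores, n_bins=4):
    """Mean predicted probability vs. observed event rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        bins[min(int(s * n_bins), n_bins - 1)].append((y, s))
    return [(sum(s for _, s in b) / len(b), sum(y for y, _ in b) / len(b))
            for b in bins if b]

labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.4, 0.2, 0.8, 0.9, 0.5, 0.7]
print(auroc(labels, scores))          # 0.9375 on this toy data
print(calibration_bins(labels, scores))
```

A well-calibrated model keeps the two numbers in each bin close to one another, which is what "well calibrated" refers to above.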
Contributors: Liangyuan Na, Kimberly Villalobos Carballo, Jean Pauphilet, Ali Haddad-Sisakht, Daniel Kombert, Melissa Boisjoli-Langlois, Andrew Castiglione, Maram Khalifa, Pooja Hebbal, Barry Stein, Dimitris Bertsimas
Cancer patients often lack timely education and personalized support due to clinician workload. This quality improvement study develops and evaluates a Large Language Model (LLM) agent, MedEduChat, which is integrated with the clinic’s electronic health records (EHR) and designed to enhance prostate cancer patient education. Fifteen non-metastatic prostate cancer patients and three clinicians recruited from the Mayo Clinic interacted with the agent between May 2024 and April 2025. Findings showed that MedEduChat has a high usability score (UMUX = 83.7/100) and improves patients’ health confidence (Health Confidence Score rose from 9.9 to 13.9). Clinicians evaluated the patient-chat interaction history and rated MedEduChat as highly correct (2.9/3), complete (2.7/3), and safe (2.7/3), with moderate personalization (2.3/3). This study highlights the potential of LLM agents to improve patient engagement and health education.
Contributors: Yuexing Hao, Jason Holmes, Mark R. Waddle, Brian J. Davis, Nathan Y. Yu, Kristin S. Vickers, Heather Preston, Drew Margolin, Corinna E. Löckenhoff, Aditya Vashistha, Saleh Kalantari, Marzyeh Ghassemi, and Wei Liu
For an LLM to correctly respond to an instruction, it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that "syntactic templates", frequent sequences of Part-of-Speech (PoS) tags, are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick) and closed (GPT-4o) models. Finally, we present a case study on the implications for LLM security, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
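To make the "syntactic template" notion concrete, the toy sketch below counts PoS-tag n-grams that recur across a corpus. The PoS sequences are hand-written stand-ins; an actual pipeline would first run a PoS tagger over the text.

```python
# Illustrative sketch of the "syntactic template" idea: find
# Part-of-Speech (PoS) n-grams that recur across a corpus.
# The tag sequences below are hand-written stand-ins for tagger output.
from collections import Counter

def frequent_templates(pos_sequences, n=3, min_count=2):
    """Return every PoS n-gram seen at least min_count times."""
    counts = Counter()
    for seq in pos_sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return {t: c for t, c in counts.items() if c >= min_count}

corpus = [
    ["DT", "NN", "VBZ", "DT", "JJ", "NN"],  # e.g. "The model shows a clear trend"
    ["DT", "NN", "VBZ", "DT", "NN"],        # e.g. "The method uses a template"
    ["PRP", "VBP", "DT", "JJ", "NN"],       # e.g. "We see a strong effect"
]
print(frequent_templates(corpus))
```

If one domain's training examples overwhelmingly share such templates, a model can latch onto the template as a proxy for the domain, which is the spurious correlation studied above.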
Contributors: Chantal Shaib, Vinith Suriyakumar, Byron Wallace, Marzyeh Ghassemi
Deep learning models leveraging electronic health records (EHR) for opportunistic screening of type 2 diabetes (T2D) can improve current practices by identifying individuals who may need further glycemic testing. Accurate onset prediction and subtyping are crucial for targeted interventions, but existing methods treat the tasks separately, thus limiting clinical utility. In this paper, we introduce a novel deep metric learning (DML) model that unifies both tasks by learning a latent space based on sample similarity. In onset prediction, the DML model predicts the onset of T2D 7 years later with an AUC of 0.754, outperforming logistic regression (AUC 0.706), clinical risk factors (AUC 0.693), and glycemic measures (AUC 0.632). For subtyping, we identify three subtypes with varying prevalences of obesity-related, cardiovascular, and mental health conditions. Additionally, the subtype with fewer comorbidities shows earlier metformin initiation and a greater reduction in HbA1c. We validated these findings using data from 300 U.S. hospitals in the All of Us program (T2D, n = 7567) and the Massachusetts General Brigham Biobank (T2D, n = 3298), demonstrating the transferability of our model and subtypes across cohorts.
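As a toy illustration of how one latent space can serve both tasks, the sketch below assigns a subtype by nearest centroid; patients embedded close together would likewise share outcomes. The embeddings, centroids, and subtype names are illustrative stand-ins for what a trained deep metric learning model would produce.

```python
# Toy sketch of the shared-latent-space idea: a patient embedding is
# assigned the subtype whose centroid is closest. Embeddings, centroids,
# and subtype labels are invented stand-ins, not the paper's model.
import math

def nearest_subtype(embedding, centroids):
    """Return the subtype whose centroid is closest in the latent space."""
    return min(centroids, key=lambda k: math.dist(embedding, centroids[k]))

centroids = {
    "obesity-related": (0.9, 0.1),
    "cardiovascular": (0.1, 0.9),
    "fewer-comorbidities": (0.1, 0.1),
}
patient = (0.8, 0.2)  # a hypothetical patient's learned embedding
print(nearest_subtype(patient, centroids))
```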
Contributors: Qixuan Jin, Haoran Zhang, Lukasz Szczerbinski, Jiacheng Zhu, Walter Gerych, Xuhai Xu, Kai Wang, Sarah Hsu, Ravi Mandla, Aaron J. Deutsch, Alisa Manning, Josep M. Mercader, Thomas Hartvigsen, Miriam S. Udler, Marzyeh Ghassemi
While treatment remains essential, disease prevention often proves more effective in improving outcomes, enhancing well-being and reducing healthcare costs. Despite this understanding, preventative medical practices are still underutilized. Continuous monitoring technologies can help to address this gap by enabling early symptom detection, tracking disease recurrence and assessing treatment responses, yet few of the technologies have been integrated into clinical practice. In this Review, we discuss notable advances in continuous monitoring and the barriers to their translation. We focus on technologies that enable either continuous measurement for at least one week or periodic measurements for at least one month, including remotely interfacing technologies, wearables and other directly interfacing systems, and internally interfacing implanted devices. Continuous monitoring improves disease-risk assessment, tracks disease progression and enhances overall health management. However, broader and more reliable datasets from diverse clinical trials, alongside supportive policies and financial incentives, will be essential to overcoming translational barriers and to integrating these technologies into healthcare.
Contributors: Jack Chen, Patricia Jastrzebska-Perfect, Peter Chai, Mehmet Girayhan Say, Jiaobing Tu, Wei Gao, Florencia Halperin, Joshua Korzenik, Hen-Wei Huang, Dina Katabi, Giovanni Traverso
Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed "accuracy-on-the-line." This pattern is often taken to imply that spurious correlations (correlations that improve ID but reduce OOD performance) are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.
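A simplified, non-gradient illustration of the failure mode described above: across several models, flag OOD examples whose correctness decreases as ID accuracy increases. This is not the OODSelect algorithm itself, and all numbers are made up.

```python
# Simplified illustration (not the gradient-based OODSelect): flag OOD
# examples whose correctness anticorrelates with models' ID accuracy.
# All accuracies and correctness values below are invented.

def pearson(xs, ys):
    """Pearson correlation for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

id_accuracy = [0.70, 0.80, 0.90]  # three models, ranked by ID accuracy
ood_correct = [                   # per-example OOD correctness per model
    [1, 1, 1],  # example 0: constant, uninformative -> skipped
    [1, 1, 0],  # example 1: the best ID model fails OOD
    [0, 1, 1],  # example 2: tracks ID accuracy
]
flagged = [i for i, row in enumerate(ood_correct)
           if len(set(row)) > 1 and pearson(id_accuracy, row) < 0]
print(flagged)  # examples where "accuracy-on-the-line" reverses
```

Aggregating all three examples would show OOD accuracy rising with ID accuracy, which is how heterogeneous OOD sets can hide such subsets.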
Contributors: Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi
There are known racial disparities in the organ transplant allocation system in the United States. While recent research has focused on designing scores and matching algorithms for organ allocation, prior work has yet to study how transplant center physician decisions on offer acceptance—the final step in the allocation process—contribute to the observed disparities. In this paper, we use data from the Scientific Registry of Transplant Recipients to examine the role of candidate race in the acceptance of heart, liver, and lung transplant offers. We find that Black race was associated with significantly lower odds of offer acceptance for livers and lungs. Further, existing allocation scores such as MELD and LAS did not account for clinical factors that made Black patients harder to match. Our analysis also revealed that donor-candidate race match was associated with significantly higher odds of offer acceptance for hearts, livers, and lungs. Finally, we found that rejecting an offer was associated with lower survival times for all three organs. Our findings demonstrate the additional barriers that Black patients face in accessing organ transplants and the consequences of these barriers on patient survival. Overall, our work highlights the limitations of technical solutions to socio-technical problems; new allocation scores and other algorithmic updates will not improve equity if they do not explicitly account for gaps in the ensuing human decisions.
Contributors: Hammaad Adam, Rene S. Bermea, Ming Ying Yang, Leo Anthony Celi
Purpose
Sybil is a validated publicly available deep learning–based algorithm that can accurately predict lung cancer risk from a single low-dose computed tomography (LDCT) scan. We aimed to study the effect of image reconstruction parameters and CT scanner manufacturer on Sybil's performance.
Materials and Methods
Using LDCTs of a subset of the National Lung Screening Trial participants, which we previously used for internal validation of the Sybil algorithm (test set), we ran the Sybil algorithm on LDCT series pairs matched on kilovoltage peak, milliampere-seconds, reconstruction interval, reconstruction diameter, and either reconstruction filter or axial slice thickness. We also evaluated the cumulative effect of these parameters by combining the best- and the worst-performing parameters. A subanalysis compared Sybil's performance by CT manufacturer. We considered any LDCT positive if future lung cancer was subsequently confirmed by biopsy or surgical resection. The areas under the curve (AUCs) for each series pair were compared using DeLong's test.
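The matched-series comparison above can be illustrated with a paired bootstrap, a simpler stand-in for DeLong's test that likewise respects the pairing of series from the same patients. All labels and scores below are invented.

```python
# Hedged sketch: comparing AUCs of two matched LDCT series with a paired
# bootstrap (a simpler stand-in for DeLong's test). Data are invented.
import random

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_p(labels, a, b, n_boot=1000, seed=0):
    """Two-sided percentile-bootstrap p-value for AUC(a) != AUC(b),
    resampling patients so the series pairing is preserved."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # a resample needs both classes for an AUC
            continue
        diffs.append(auc(ys, [a[i] for i in idx]) - auc(ys, [b[i] for i in idx]))
    lo = sum(d <= 0 for d in diffs) / n_boot
    hi = sum(d >= 0 for d in diffs) / n_boot
    return min(1.0, 2 * min(lo, hi))

labels   = [0, 0, 0, 0, 1, 1, 1, 1]
series_a = [0.2, 0.3, 0.1, 0.4, 0.9, 0.8, 0.7, 0.6]  # e.g. one filter
series_b = [0.2, 0.6, 0.1, 0.4, 0.9, 0.5, 0.7, 0.6]  # e.g. the other
print(auc(labels, series_a), auc(labels, series_b))
print(paired_bootstrap_p(labels, series_a, series_b))
```

DeLong's test, used in the actual analysis, computes this comparison analytically rather than by resampling.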
Results
There was no difference in Sybil's performance between 1049 pairs of standard versus bone reconstruction filter (AUC at 1 year 0.84 [95% confidence interval (CI): 0.70–0.99] vs 0.86 [95% CI: 0.75–0.98], P = 0.87) and 1961 pairs of standard versus lung reconstruction filter (AUC at 1 year 0.98 [95% CI: 0.97–0.99] vs 0.98 [95% CI: 0.96–0.99], P = 0.81). Similarly, there was no difference in 1288 pairs comparing 2-mm versus 5-mm axial slice thickness (AUC at 1 year 0.98 [95% CI: 0.94–1.00] vs 0.99 [95% CI: 0.97–0.99], P = 0.68). The best-case scenario combining a lung reconstruction filter with 2-mm slice thickness compared with the worst-case scenario combining a bone reconstruction filter with 2.5-mm slice thickness uncovered a significantly different performance at years 2–4 (P = 0.03). Subanalysis showed no significant difference in performance between Siemens and Toshiba scanners.
Conclusions
Sybil's predictive performance for future lung cancer risk is robust across different reconstruction filters and axial slice thicknesses, demonstrating its versatility in various imaging settings. Combining favorable reconstruction parameters can significantly enhance predictive ability at years 2–4. The absence of significant differences between Siemens and Toshiba scanners further supports Sybil's versatility.
Contributors: Judit Simon, Alexander Graur, Allison Chang, Steven J. Skates, Raymond Osarogiagbon
Background
The ability to non-invasively measure left atrial pressure would facilitate the identification of patients at risk of pulmonary congestion and guide proactive heart failure care. Wearable cardiac monitors, which record single-lead electrocardiogram data, provide information that can be leveraged to infer left atrial pressures.
Methods
We developed a deep neural network using single-lead electrocardiogram data to determine when the left atrial pressure is elevated. The model was developed and internally evaluated using a cohort of 6739 samples from the Massachusetts General Hospital (MGH) and externally validated on a cohort of 4620 samples from a second institution. We then evaluated the model on patch-monitor electrocardiographic data from a small prospective cohort.
Results
The model achieves an area under the receiver operating characteristic curve of 0.80 for detecting elevated left atrial pressures on an internal holdout dataset from MGH and 0.76 on an external validation set from a second institution. A further prospective dataset was obtained using single-lead electrocardiogram data with a patch-monitor from patients who underwent right heart catheterization at MGH. Evaluation of the model on this dataset yielded an area under the receiver operating characteristic curve of 0.875 for identifying elevated left atrial pressures for electrocardiogram signals acquired close to the time of the right heart catheterization procedure.
Conclusions
These results demonstrate the utility and the potential of ambulatory cardiac hemodynamic monitoring with electrocardiogram patch-monitors.
Contributors: Daphne E. Schlesinger, Ridwan Alam, Roey Ringel, Eugene Pomerantsev, Srikanth Devireddy, Pinak Shah, Joseph Garasic
Without careful dissection of the ways in which biases can be encoded into artificial intelligence (AI) health technologies, there is a risk of perpetuating existing health inequalities at scale. One major source of bias is the data that underpins such technologies. The STANDING Together recommendations aim to encourage transparency regarding limitations of health datasets and proactive evaluation of their effect across population groups. Draft recommendation items were informed by a systematic review and stakeholder survey. The recommendations were developed using a Delphi approach, supplemented by a public consultation and international interview study. Overall, more than 350 representatives from 58 countries provided input into this initiative. 194 Delphi participants from 25 countries voted and provided comments on 32 candidate items across three electronic survey rounds and one in-person consensus meeting. The 29 STANDING Together consensus recommendations are presented here in two parts. Recommendations for Documentation of Health Datasets provide guidance for dataset curators to enable transparency around data composition and limitations. Recommendations for Use of Health Datasets aim to enable identification and mitigation of algorithmic biases that might exacerbate health inequalities. These recommendations are intended to prompt proactive inquiry rather than acting as a checklist. We hope to raise awareness that no dataset is free of limitations, so transparent communication of data limitations should be perceived as valuable, and absence of this information as a limitation. We hope that adoption of the STANDING Together recommendations by stakeholders across the AI health technology lifecycle will enable everyone in society to benefit from technologies which are safe and effective.
Contributors: Joseph E Alderman, Joanne Palmer, Elinor Laws, Melissa D McCradden, Johan Ordish, Marzyeh Ghassemi, Stephen R Pfohl, Negar Rostamzadeh, Heather Cole-Lewis, Ben Glocker, Melanie Calvert, Tom J Pollard, Jaspret Gill, Jacqui Gath, Adewale Adebajo, Jude Beng, Cassandra H Leung, Stephanie Kuku, Lesley-Anne Farmer, Rubeta N Matin, Bilal A Mateen, Francis McKay, Katherine Heller, Alan Karthikesalingam, Darren Treanor, Maxine Mackintosh, Lauren Oakden-Rayner, Russell Pearson, Arjun K Manrai, Puja Myles, Judit Kumuthini, Zoher Kapacee, Neil J Sebire, Lama H Nazer, Jarrel Seah, Ashley Akbari, Lew Berman, Judy W Gichoya, Lorenzo Righetto, Diana Samuel, William Wasswa, Maria Charalambides, Anmol Arora, Sameer Pujari, Charlotte Summers, Elizabeth Sapey, Sharon Wilkinson, Vishal Thakker, Alastair Denniston, Xiaoxuan Liu