Access to accurate predictions of patients’ outcomes can enhance decision making within healthcare institutions. Hartford HealthCare has been collaborating with academics and consultants to predict short- and medium-term outcomes for all inpatients across their seven hospitals. We develop machine learning models that predict the probabilities of next 24-hour/48-hour discharge and intensive care unit transfers, end-of-stay mortality, and discharge dispositions. All models achieve high out-of-sample area under the receiver operating characteristic curve (75.7%–92.5%) and are well calibrated. In addition, combining 48-hour discharge predictions with doctors’ predictions simultaneously enables more patient discharges (10%–28.7%) and fewer 7-day/30-day readmissions (p < 0.001). We implement an automated pipeline that extracts data and updates predictions every morning, as well as user-friendly software and a color-coded alert system to communicate these patient-level predictions to clinical teams. Since its deployment, more than 200 doctors, nurses, and case managers across seven hospitals have been using the tool in their daily patient review process. With our tool, we find that doctors start the administrative discharge process earlier, leading to a significant reduction in the average length of stay (0.63 days per patient). We anticipate substantial financial benefits (between $52 and $67 million annually) for the healthcare system.
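As a rough illustration of the evaluation summarized above, the sketch below computes AUROC and a simple per-bin calibration check for a binary discharge classifier. This is not the deployed pipeline; the labels and predicted probabilities are illustrative stand-ins.

```python
# Minimal sketch (not the deployed Hartford HealthCare pipeline):
# evaluating a binary 48-hour-discharge classifier with AUROC and a
# simple calibration check. Labels and scores below are made up.

def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_bins(labels, scores, n_bins=4):
    """Mean predicted probability vs. observed event rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, s in zip(labels, scores):
        bins[min(int(s * n_bins), n_bins - 1)].append((y, s))
    return [(sum(s for _, s in b) / len(b), sum(y for y, _ in b) / len(b))
            for b in bins if b]

labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.4, 0.2, 0.8, 0.9, 0.5, 0.7]
print(auroc(labels, scores))          # 0.9375 on this toy data
print(calibration_bins(labels, scores))
```

A well-calibrated model keeps the two numbers in each bin close to one another, which is what "well calibrated" refers to above.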
Contributors: Liangyuan Na, Kimberly Villalobos Carballo, Jean Pauphilet, Ali Haddad-Sisakht, Daniel Kombert, Melissa Boisjoli-Langlois, Andrew Castiglione, Maram Khalifa, Pooja Hebbal, Barry Stein, Dimitris Bertsimas
Cancer patients often lack timely education and personalized support due to clinician workload. This quality improvement study develops and evaluates a Large Language Model (LLM) agent, MedEduChat, which is integrated with the clinic’s electronic health records (EHR) and designed to enhance prostate cancer patient education. Fifteen non-metastatic prostate cancer patients and three clinicians recruited from the Mayo Clinic interacted with the agent between May 2024 and April 2025. Findings showed that MedEduChat has a high usability score (UMUX = 83.7/100) and improves patients’ health confidence (Health Confidence Score rose from 9.9 to 13.9). Clinicians evaluated the patient-chat interaction history and rated MedEduChat as highly correct (2.9/3), complete (2.7/3), and safe (2.7/3), with moderate personalization (2.3/3). This study highlights the potential of LLM agents to improve patient engagement and health education.
Contributors: Yuexing Hao, Jason Holmes, Mark R. Waddle, Brian J. Davis, Nathan Y. Yu, Kristin S. Vickers, Heather Preston, Drew Margolin, Corinna E. Löckenhoff, Aditya Vashistha, Saleh Kalantari, Marzyeh Ghassemi, and Wei Liu
For an LLM to correctly respond to an instruction, it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that "syntactic templates", frequent sequences of Part-of-Speech (PoS) tags, are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick) and closed (GPT-4o) models. Finally, we present a case study on the implications for LLM security, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
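To make the "syntactic template" notion concrete, the toy sketch below counts PoS-tag n-grams that recur across a corpus. The PoS sequences are hand-written stand-ins; an actual pipeline would first run a PoS tagger over the text.

```python
# Illustrative sketch of the "syntactic template" idea: find
# Part-of-Speech (PoS) n-grams that recur across a corpus.
# The tag sequences below are hand-written stand-ins for tagger output.
from collections import Counter

def frequent_templates(pos_sequences, n=3, min_count=2):
    """Return every PoS n-gram seen at least min_count times."""
    counts = Counter()
    for seq in pos_sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return {t: c for t, c in counts.items() if c >= min_count}

corpus = [
    ["DT", "NN", "VBZ", "DT", "JJ", "NN"],  # e.g. "The model shows a clear trend"
    ["DT", "NN", "VBZ", "DT", "NN"],        # e.g. "The method uses a template"
    ["PRP", "VBP", "DT", "JJ", "NN"],       # e.g. "We see a strong effect"
]
print(frequent_templates(corpus))
```

If one domain's training examples overwhelmingly share such templates, a model can latch onto the template as a proxy for the domain, which is the spurious correlation studied above.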
Contributors: Chantal Shaib, Vinith Suriyakumar, Byron Wallace, Marzyeh Ghassemi
Deep learning models leveraging electronic health records (EHR) for opportunistic screening of type 2 diabetes (T2D) can improve current practices by identifying individuals who may need further glycemic testing. Accurate onset prediction and subtyping are crucial for targeted interventions, but existing methods treat the tasks separately, thus limiting clinical utility. In this paper, we introduce a novel deep metric learning (DML) model that unifies both tasks by learning a latent space based on sample similarity. In onset prediction, the DML model predicts the onset of T2D 7 years later with an AUC of 0.754, outperforming logistic regression (AUC 0.706), clinical risk factors (AUC 0.693), and glycemic measures (AUC 0.632). For subtyping, we identify three subtypes with varying prevalences of obesity-related, cardiovascular, and mental health conditions. Additionally, the subtype with fewer comorbidities shows earlier metformin initiation and a greater reduction in HbA1c. We validated these findings using data from 300 U.S. hospitals in the All of Us program (T2D, n = 7567) and the Massachusetts General Brigham Biobank (T2D, n = 3298), demonstrating the transferability of our model and subtypes across cohorts.
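As a toy illustration of how one latent space can serve both tasks, the sketch below assigns a subtype by nearest centroid; patients embedded close together would likewise share outcomes. The embeddings, centroids, and subtype names are illustrative stand-ins for what a trained deep metric learning model would produce.

```python
# Toy sketch of the shared-latent-space idea: a patient embedding is
# assigned the subtype whose centroid is closest. Embeddings, centroids,
# and subtype labels are invented stand-ins, not the paper's model.
import math

def nearest_subtype(embedding, centroids):
    """Return the subtype whose centroid is closest in the latent space."""
    return min(centroids, key=lambda k: math.dist(embedding, centroids[k]))

centroids = {
    "obesity-related": (0.9, 0.1),
    "cardiovascular": (0.1, 0.9),
    "fewer-comorbidities": (0.1, 0.1),
}
patient = (0.8, 0.2)  # a hypothetical patient's learned embedding
print(nearest_subtype(patient, centroids))
```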
Contributors: Qixuan Jin, Haoran Zhang, Lukasz Szczerbinski, Jiacheng Zhu, Walter Gerych, Xuhai Xu, Kai Wang, Sarah Hsu, Ravi Mandla, Aaron J. Deutsch, Alisa Manning, Josep M. Mercader, Thomas Hartvigsen, Miriam S. Udler, Marzyeh Ghassemi
While treatment remains essential, disease prevention often proves more effective in improving outcomes, enhancing well-being and reducing healthcare costs. Despite this understanding, preventative medical practices are still underutilized. Continuous monitoring technologies can help to address this gap by enabling early symptom detection, tracking disease recurrence and assessing treatment responses, yet few of the technologies have been integrated into clinical practice. In this Review, we discuss notable advances in continuous monitoring and the barriers to their translation. We focus on technologies that enable either continuous measurement for at least one week or periodic measurements for at least one month, including remotely interfacing technologies, wearables and other directly interfacing systems, and internally interfacing implanted devices. Continuous monitoring improves disease-risk assessment, tracks disease progression and enhances overall health management. However, broader and more reliable datasets from diverse clinical trials, alongside supportive policies and financial incentives, will be essential to overcoming translational barriers and to integrating these technologies into healthcare.
Contributors: Jack Chen, Patricia Jastrzebska-Perfect, Peter Chai, Mehmet Girayhan Say, Jiaobing Tu, Wei Gao, Florencia Halperin, Joshua Korzenik, Hen-Wei Huang, Dina Katabi, Giovanni Traverso
Benchmarks for out-of-distribution (OOD) generalization frequently show a strong positive correlation between in-distribution (ID) and OOD accuracy across models, termed "accuracy-on-the-line." This pattern is often taken to imply that spurious correlations (correlations that improve ID but reduce OOD performance) are rare in practice. We find that this positive correlation is often an artifact of aggregating heterogeneous OOD examples. Using a simple gradient-based method, OODSelect, we identify semantically coherent OOD subsets where accuracy-on-the-line does not hold. Across widely used distribution shift benchmarks, OODSelect uncovers subsets, sometimes over half of the standard OOD set, where higher ID accuracy predicts lower OOD accuracy. Our findings indicate that aggregate metrics can obscure important failure modes of OOD robustness. We release code and the identified subsets to facilitate further research.
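A simplified, non-gradient illustration of the failure mode described above: across several models, flag OOD examples whose correctness decreases as ID accuracy increases. This is not the OODSelect algorithm itself, and all numbers are made up.

```python
# Simplified illustration (not the gradient-based OODSelect): flag OOD
# examples whose correctness anticorrelates with models' ID accuracy.
# All accuracies and correctness values below are invented.

def pearson(xs, ys):
    """Pearson correlation for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

id_accuracy = [0.70, 0.80, 0.90]  # three models, ranked by ID accuracy
ood_correct = [                   # per-example OOD correctness per model
    [1, 1, 1],  # example 0: constant, uninformative -> skipped
    [1, 1, 0],  # example 1: the best ID model fails OOD
    [0, 1, 1],  # example 2: tracks ID accuracy
]
flagged = [i for i, row in enumerate(ood_correct)
           if len(set(row)) > 1 and pearson(id_accuracy, row) < 0]
print(flagged)  # examples where "accuracy-on-the-line" reverses
```

Aggregating all three examples would show OOD accuracy rising with ID accuracy, which is how heterogeneous OOD sets can hide such subsets.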
Contributors: Olawale Salaudeen, Haoran Zhang, Kumail Alhamoud, Sara Beery, Marzyeh Ghassemi
There are known racial disparities in the organ transplant allocation system in the United States. While recent research has focused on designing scores and matching algorithms for organ allocation, prior work has yet to study how transplant center physician decisions on offer acceptance—the final step in the allocation process—contribute to the observed disparities. In this paper, we use data from the Scientific Registry of Transplant Recipients to examine the role of candidate race in the acceptance of heart, liver, and lung transplant offers. We find that Black race was associated with significantly lower odds of offer acceptance for livers and lungs. Further, existing allocation scores such as MELD and LAS did not account for clinical factors that made Black patients harder to match. Our analysis also revealed that donor-candidate race match was associated with significantly higher odds of offer acceptance for hearts, livers, and lungs. Finally, we found that rejecting an offer was associated with lower survival times for all three organs. Our findings demonstrate the additional barriers that Black patients face in accessing organ transplants and the consequences of these barriers on patient survival. Overall, our work highlights the limitations of technical solutions to socio-technical problems; new allocation scores and other algorithmic updates will not improve equity if they do not explicitly account for gaps in the ensuing human decisions.
Contributors: Hammaad Adam, Rene S. Bermea, Ming Ying Yang, Leo Anthony Celi
Purpose
Sybil is a validated publicly available deep learning–based algorithm that can accurately predict lung cancer risk from a single low-dose computed tomography (LDCT) scan. We aimed to study the effect of image reconstruction parameters and CT scanner manufacturer on Sybil's performance.
Materials and Methods
Using LDCTs of a subset of the National Lung Screening Trial participants, which we previously used for internal validation of the Sybil algorithm (test set), we ran the Sybil algorithm on LDCT series pairs matched on kilovoltage peak, milliampere-seconds, reconstruction interval, reconstruction diameter, and either reconstruction filter or axial slice thickness. We also evaluated the cumulative effect of these parameters by combining the best- and the worst-performing parameters. A subanalysis compared Sybil's performance by CT manufacturer. We considered any LDCT positive if future lung cancer was subsequently confirmed by biopsy or surgical resection. The areas under the curve (AUCs) for each series pair were compared using DeLong's test.
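The matched-series comparison above can be illustrated with a paired bootstrap, a simpler stand-in for DeLong's test that likewise respects the pairing of series from the same patients. All labels and scores below are invented.

```python
# Hedged sketch: comparing AUCs of two matched LDCT series with a paired
# bootstrap (a simpler stand-in for DeLong's test). Data are invented.
import random

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_p(labels, a, b, n_boot=1000, seed=0):
    """Two-sided percentile-bootstrap p-value for AUC(a) != AUC(b),
    resampling patients so the series pairing is preserved."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # a resample needs both classes for an AUC
            continue
        diffs.append(auc(ys, [a[i] for i in idx]) - auc(ys, [b[i] for i in idx]))
    lo = sum(d <= 0 for d in diffs) / n_boot
    hi = sum(d >= 0 for d in diffs) / n_boot
    return min(1.0, 2 * min(lo, hi))

labels   = [0, 0, 0, 0, 1, 1, 1, 1]
series_a = [0.2, 0.3, 0.1, 0.4, 0.9, 0.8, 0.7, 0.6]  # e.g. one filter
series_b = [0.2, 0.6, 0.1, 0.4, 0.9, 0.5, 0.7, 0.6]  # e.g. the other
print(auc(labels, series_a), auc(labels, series_b))
print(paired_bootstrap_p(labels, series_a, series_b))
```

DeLong's test, used in the actual analysis, computes this comparison analytically rather than by resampling.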
Results
There was no difference in Sybil's performance between 1049 pairs of standard versus bone reconstruction filter (AUC at 1 year 0.84 [95% confidence interval (CI): 0.70–0.99] vs 0.86 [95% CI: 0.75–0.98], P = 0.87) and 1961 pairs of standard versus lung reconstruction filter (AUC at 1 year 0.98 [95% CI: 0.97–0.99] vs 0.98 [95% CI: 0.96–0.99], P = 0.81). Similarly, there was no difference in 1288 pairs comparing 2-mm versus 5-mm axial slice thickness (AUC at 1 year 0.98 [95% CI: 0.94–1.00] vs 0.99 [95% CI: 0.97–0.99], P = 0.68). The best-case scenario combining a lung reconstruction filter with 2-mm slice thickness compared with the worst-case scenario combining a bone reconstruction filter with 2.5-mm slice thickness uncovered a significantly different performance at years 2–4 (P = 0.03). Subanalysis showed no significant difference in performance between Siemens and Toshiba scanners.
Conclusions
Sybil's predictive performance for future lung cancer risk is robust across different reconstruction filters and axial slice thicknesses, demonstrating its versatility in various imaging settings. Combining favorable reconstruction parameters can significantly enhance predictive ability at years 2–4. The absence of significant differences between Siemens and Toshiba scanners further supports Sybil's versatility.
Contributors: Judit Simon, Alexander Graur, Allison Chang, Steven J. Skates, Raymond Osarogiagbon
Background
The ability to non-invasively measure left atrial pressure would facilitate the identification of patients at risk of pulmonary congestion and guide proactive heart failure care. Wearable cardiac monitors, which record single-lead electrocardiogram data, provide information that can be leveraged to infer left atrial pressures.
Methods
We developed a deep neural network using single-lead electrocardiogram data to determine when the left atrial pressure is elevated. The model was developed and internally evaluated using a cohort of 6739 samples from the Massachusetts General Hospital (MGH) and externally validated on a cohort of 4620 samples from a second institution. We then evaluated the model on patch-monitor electrocardiographic data from a small prospective cohort.
Results
The model achieves an area under the receiver operating characteristic curve of 0.80 for detecting elevated left atrial pressures on an internal holdout dataset from MGH and 0.76 on an external validation set from a second institution. A further prospective dataset was obtained using single-lead electrocardiogram data with a patch-monitor from patients who underwent right heart catheterization at MGH. Evaluation of the model on this dataset yielded an area under the receiver operating characteristic curve of 0.875 for identifying elevated left atrial pressures for electrocardiogram signals acquired close to the time of the right heart catheterization procedure.
Conclusions
These results demonstrate the utility and the potential of ambulatory cardiac hemodynamic monitoring with electrocardiogram patch-monitors.
Contributors: Daphne E. Schlesinger, Ridwan Alam, Roey Ringel, Eugene Pomerantsev, Srikanth Devireddy, Pinak Shah, Joseph Garasic
Without careful dissection of the ways in which biases can be encoded into artificial intelligence (AI) health technologies, there is a risk of perpetuating existing health inequalities at scale. One major source of bias is the data that underpins such technologies. The STANDING Together recommendations aim to encourage transparency regarding limitations of health datasets and proactive evaluation of their effect across population groups. Draft recommendation items were informed by a systematic review and stakeholder survey. The recommendations were developed using a Delphi approach, supplemented by a public consultation and international interview study. Overall, more than 350 representatives from 58 countries provided input into this initiative. 194 Delphi participants from 25 countries voted and provided comments on 32 candidate items across three electronic survey rounds and one in-person consensus meeting. The 29 STANDING Together consensus recommendations are presented here in two parts. Recommendations for Documentation of Health Datasets provide guidance for dataset curators to enable transparency around data composition and limitations. Recommendations for Use of Health Datasets aim to enable identification and mitigation of algorithmic biases that might exacerbate health inequalities. These recommendations are intended to prompt proactive inquiry rather than acting as a checklist. We hope to raise awareness that no dataset is free of limitations, so transparent communication of data limitations should be perceived as valuable, and absence of this information as a limitation. We hope that adoption of the STANDING Together recommendations by stakeholders across the AI health technology lifecycle will enable everyone in society to benefit from technologies which are safe and effective.
Contributors: Joseph E Alderman, Joanne Palmer, Elinor Laws, Melissa D McCradden, Johan Ordish, Marzyeh Ghassemi, Stephen R Pfohl, Negar Rostamzadeh, Heather Cole-Lewis, Ben Glocker, Melanie Calvert, Tom J Pollard, Jaspret Gill, Jacqui Gath, Adewale Adebajo, Jude Beng, Cassandra H Leung, Stephanie Kuku, Lesley-Anne Farmer, Rubeta N Matin, Bilal A Mateen, Francis McKay, Katherine Heller, Alan Karthikesalingam, Darren Treanor, Maxine Mackintosh, Lauren Oakden-Rayner, Russell Pearson, Arjun K Manrai, Puja Myles, Judit Kumuthini, Zoher Kapacee, Neil J Sebire, Lama H Nazer, Jarrel Seah, Ashley Akbari, Lew Berman, Judy W Gichoya, Lorenzo Righetto, Diana Samuel, William Wasswa, Maria Charalambides, Anmol Arora, Sameer Pujari, Charlotte Summers, Elizabeth Sapey, Sharon Wilkinson, Vishal Thakker, Alastair Denniston, Xiaoxuan Liu