
Focus Area: Clinical AI

Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

For an LLM to correctly respond to an instruction, it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that syntactic templates, frequent sequences of Part-of-Speech (PoS) tags, are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this association can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 ± 0.06) on entity knowledge tasks in OLMo-2 models (1B–13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick) and closed (GPT-4o) models. Finally, we present a case study on the implications for LLM security, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
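To make the notion of a syntactic template concrete, the sketch below counts frequent Part-of-Speech n-grams across a handful of prompts. It is a minimal illustration, assuming spaCy with the en_core_web_sm model is installed; the template length of 4 and the example prompts are illustrative choices, not the paper's setup.

```python
# Minimal sketch: count "syntactic templates" (frequent PoS n-grams) in prompts.
# Assumes spaCy + en_core_web_sm are installed; n=4 is an illustrative choice.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def pos_templates(texts, n=4):
    """Count PoS-tag n-grams across a collection of texts."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        tags = [tok.pos_ for tok in doc]
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts

prompts = [
    "Summarize the patient's discharge note in two sentences.",
    "List the side effects of the prescribed medication.",
]
for template, freq in pos_templates(prompts).most_common(5):
    print(" ".join(template), freq)
```

Templates that recur almost exclusively within a single domain would be candidates for the kind of syntactic-domain correlation described above.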

Contributors: Chantal Shaib, Vinith Suriyakumar, Byron Wallace, Marzyeh Ghassemi

Opportunistic screening of type 2 diabetes with deep metric learning using electronic health records

Deep learning models leveraging electronic health records (EHR) for opportunistic screening of type 2 diabetes (T2D) can improve current practices by identifying individuals who may need further glycemic testing. Accurate onset prediction and subtyping are crucial for targeted interventions, but existing methods treat the tasks separately, thus limiting clinical utility. In this paper, we introduce a novel deep metric learning (DML) model that unifies both tasks by learning a latent space based on sample similarity. In onset prediction, the DML model predicts the onset of T2D 7 years later with an AUC of 0.754, outperforming logistic regression (AUC 0.706), clinical risk factors (AUC 0.693), and glycemic measures (AUC 0.632). For subtyping, we identify three subtypes with varying prevalences of obesity-related, cardiovascular, and mental health conditions. Additionally, the subtype with fewer comorbidities shows earlier metformin initiation and a greater reduction in HbA1c. We validated these findings using data from 300 U.S. hospitals in the All of Us program (T2D, n = 7567) and the Massachusetts General Brigham Biobank (T2D, n = 3298), demonstrating the transferability of our model and subtypes across cohorts.
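As a rough illustration of the deep metric learning idea (not the paper's architecture), the sketch below embeds EHR feature vectors into a latent space with a triplet loss so that similar patients land close together; the same embeddings can then feed an onset classifier or a clustering step for subtyping. The input dimension, encoder widths, and margin are assumptions.

```python
# Minimal deep-metric-learning sketch: embed EHR feature vectors so that
# similar patients are close in a shared latent space. Sizes are illustrative.
import torch
import torch.nn as nn

class EHREncoder(nn.Module):
    def __init__(self, in_dim=128, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim)
        )

    def forward(self, x):
        # L2-normalize so distances are comparable across the batch
        return nn.functional.normalize(self.net(x), dim=-1)

encoder = EHREncoder()
loss_fn = nn.TripletMarginLoss(margin=0.5)

# anchor/positive share a label (e.g., same T2D outcome); negative differs
anchor, positive, negative = (torch.randn(16, 128) for _ in range(3))
loss = loss_fn(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```

Once trained, a lightweight classifier on the embeddings can serve onset prediction, while clustering the same embeddings (e.g., k-means) yields candidate subtypes.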

Contributors: Qixuan Jin, Haoran Zhang, Lukasz Szczerbinski, Jiacheng Zhu, Walter Gerych, Xuhai Xu, Kai Wang, Sarah Hsu, Ravi Mandla, Aaron J. Deutsch, Alisa Manning, Josep M. Mercader, Thomas Hartvigsen, Miriam S. Udler, Marzyeh Ghassemi

Barriers to translating continuous monitoring technologies for preventative medicine

While treatment remains essential, disease prevention often proves more effective in improving outcomes, enhancing well-being and reducing healthcare costs. Despite this understanding, preventative medical practices are still underutilized. Continuous monitoring technologies can help to address this gap by enabling early symptom detection, tracking disease recurrence and assessing treatment responses, yet few of these technologies have been integrated into clinical practice. In this Review, we discuss notable advances in continuous monitoring and the barriers to their translation. We focus on technologies that enable either continuous measurement for at least one week or periodic measurements for at least one month, including remotely interfacing technologies, wearables and other directly interfacing systems, and internally interfacing implanted devices. Continuous monitoring improves disease-risk assessment, tracks disease progression and enhances overall health management. However, broader and more reliable datasets from diverse clinical trials, alongside supportive policies and financial incentives, will be essential to overcoming translational barriers and to integrating these technologies into healthcare.

Contributors: Jack Chen, Patricia Jastrzebska-Perfect, Peter Chai, Mehmet Girayhan Say, Jiaobing Tu, Wei Gao, Florencia Halperin, Joshua Korzenik, Hen-Wei Huang, Dina Katabi, Giovanni Traverso

Lost in Transplantation: Characterizing Racial Gaps in Physician Organ Offer Acceptance

There are known racial disparities in the organ transplant allocation system in the United States. While recent research has focused on designing scores and matching algorithms for organ allocation, prior work has yet to study how transplant center physician decisions on offer acceptance, the final step in the allocation process, contribute to the observed disparities. In this paper, we use data from the Scientific Registry of Transplant Recipients to examine the role of candidate race in the acceptance of heart, liver, and lung transplant offers. We found that Black race was associated with significantly lower odds of offer acceptance for livers and lungs. Further, existing allocation scores such as MELD and LAS did not account for clinical factors that made Black patients harder to match. Our analysis also revealed that a donor-candidate race match was associated with significantly higher odds of offer acceptance for hearts, livers, and lungs. Finally, we found that rejecting an offer was associated with lower survival times for all three organs. Our findings demonstrate the additional barriers that Black patients face in accessing organ transplants and the consequences of these barriers on patient survival. Overall, our work highlights the limitations of technical solutions to socio-technical problems; new allocation scores and other algorithmic updates will not improve equity if they do not explicitly account for gaps in the ensuing human decisions.
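The associations reported above are the kind of quantities a covariate-adjusted logistic regression produces. The sketch below shows the general shape of such an analysis on synthetic data; the column names, covariates, and data are hypothetical and do not reproduce the SRTR analysis.

```python
# Hedged sketch: logistic regression of offer acceptance on candidate race
# plus clinical covariates, reported as odds ratios. Data are synthetic and
# the variable names are hypothetical stand-ins.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "accepted": rng.binomial(1, 0.3, 500),     # offer accepted (1) or declined (0)
    "black": rng.binomial(1, 0.25, 500),       # candidate race indicator
    "meld": rng.normal(20, 8, 500),            # example clinical covariate
    "donor_age": rng.normal(40, 12, 500),      # example donor covariate
})

model = smf.logit("accepted ~ black + meld + donor_age", data=df).fit()
odds_ratios = np.exp(model.params)             # OR < 1 for `black` would indicate lower acceptance odds
print(odds_ratios)
```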

Co-authors: Hammaad Adam, Rene S. Bermea, Ming Ying Yang, Leo Anthony Celi

Significance of Image Reconstruction Parameters for Future Lung Cancer Risk Prediction Using Low-Dose Chest Computed Tomography and the Open-Access Sybil Algorithm

Purpose

Sybil is a validated, publicly available deep learning–based algorithm that can accurately predict lung cancer risk from a single low-dose computed tomography (LDCT) scan. We aimed to study the effect of image reconstruction parameters and CT scanner manufacturer on Sybil's performance.

Materials and Methods

Using LDCTs of a subset of the National Lung Screening Trial participants, which we previously used for internal validation of the Sybil algorithm (test set), we ran the Sybil algorithm on LDCT series pairs matched on kilovoltage peak, milliampere-seconds, reconstruction interval, reconstruction diameter, and either reconstruction filter or axial slice thickness. We also evaluated the cumulative effect of these parameters by combining the best- and the worst-performing parameters. A subanalysis compared Sybil's performance by CT manufacturer. We considered any LDCT positive if future lung cancer was subsequently confirmed by biopsy or surgical resection. The areas under the curve (AUCs) for each series pair were compared using DeLong's test.
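For readers unfamiliar with comparing AUCs on matched series, the sketch below illustrates the idea with a paired bootstrap on synthetic scores. This is a simpler stand-in for DeLong's test, which is what the study actually uses; the labels and scores are synthetic.

```python
# Minimal sketch: compare two AUCs computed on the same cases via a paired
# bootstrap. Not DeLong's test (used in the study); data are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.1, 1000)                    # future-cancer labels
scores_a = y * 0.6 + rng.normal(0, 0.3, 1000)     # e.g., lung-filter series
scores_b = y * 0.55 + rng.normal(0, 0.3, 1000)    # e.g., bone-filter series

diffs = []
for _ in range(2000):
    idx = rng.integers(0, len(y), len(y))         # resample cases, keeping pairs
    if y[idx].min() == y[idx].max():
        continue                                  # need both classes in the resample
    diffs.append(roc_auc_score(y[idx], scores_a[idx]) -
                 roc_auc_score(y[idx], scores_b[idx]))

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"AUC difference 95% CI: [{lo:.3f}, {hi:.3f}]")
```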

Results

There was no difference in Sybil's performance between 1049 pairs of standard versus bone reconstruction filter (AUC at 1 year 0.84 [95% confidence interval (CI): 0.70–0.99] vs 0.86 [95% CI: 0.75–0.98], P = 0.87) and 1961 pairs of standard versus lung reconstruction filter (AUC at 1 year 0.98 [95% CI: 0.97–0.99] vs 0.98 [95% CI: 0.96–0.99], P = 0.81). Similarly, there was no difference in 1288 pairs comparing 2-mm versus 5-mm axial slice thickness (AUC at 1 year 0.98 [95% CI: 0.94–1.00] vs 0.99 [95% CI: 0.97–0.99], P = 0.68). The best-case scenario combining a lung reconstruction filter with 2-mm slice thickness compared with the worst-case scenario combining a bone reconstruction filter with 2.5-mm slice thickness uncovered a significantly different performance at years 2–4 (P = 0.03). Subanalysis showed no significant difference in performance between Siemens and Toshiba scanners.

Conclusions

Sybil's predictive performance for future lung cancer risk is robust across different reconstruction filters and axial slice thicknesses, demonstrating its versatility in various imaging settings. Combining favorable reconstruction parameters can significantly enhance predictive ability at years 2–4. The absence of significant differences between Siemens and Toshiba scanners further supports Sybil's versatility.

Contributors: Judit Simon, Alexander Graur, Allison Chang, Steven J. Skates, Raymond Osarogiagbon

Artificial intelligence for hemodynamic monitoring with a wearable electrocardiogram monitor

Background

The ability to non-invasively measure left atrial pressure would facilitate the identification of patients at risk of pulmonary congestion and guide proactive heart failure care. Wearable cardiac monitors, which record single-lead electrocardiogram data, provide information that can be leveraged to infer left atrial pressures.

Methods

We developed a deep neural network using single-lead electrocardiogram data to determine when the left atrial pressure is elevated. The model was developed and internally evaluated using a cohort of 6739 samples from the Massachusetts General Hospital (MGH) and externally validated on a cohort of 4620 samples from a second institution. We then evaluated the model on patch-monitor electrocardiographic data in a small prospective cohort.
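As a minimal sketch of this kind of model (not the study's architecture), a small 1D convolutional network can map a single-lead ECG window to a probability of elevated left atrial pressure; the layer sizes, window length, and sampling rate below are assumptions.

```python
# Minimal sketch: 1D CNN mapping a single-lead ECG window to the probability
# of elevated left atrial pressure. Architecture and window are illustrative.
import torch
import torch.nn as nn

class ECGNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, x):                   # x: (batch, 1, samples)
        z = self.features(x).squeeze(-1)
        return torch.sigmoid(self.head(z))  # P(elevated left atrial pressure)

ecg = torch.randn(4, 1, 2500)               # e.g., 10 s of one lead at 250 Hz
print(ECGNet()(ecg).shape)                   # torch.Size([4, 1])
```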

Results

The model achieves an area under the receiver operating characteristic curve of 0.80 for detecting elevated left atrial pressures on an internal holdout dataset from MGH and 0.76 on an external validation set from a second institution. A further prospective dataset was obtained using single-lead electrocardiogram data with a patch-monitor from patients who underwent right heart catheterization at MGH. Evaluation of the model on this dataset yielded an area under the receiver operating characteristic curve of 0.875 for identifying elevated left atrial pressures for electrocardiogram signals acquired close to the time of the right heart catheterization procedure.

Conclusions

These results demonstrate the utility and the potential of ambulatory cardiac hemodynamic monitoring with electrocardiogram patch-monitors.

Contributors: Daphne E. Schlesinger, Ridwan Alam, Roey Ringel, Eugene Pomerantsev, Srikanth Devireddy, Pinak Shah, Joseph Garasic

Tackling algorithmic bias and promoting transparency in health datasets: the STANDING Together consensus recommendations

Without careful dissection of the ways in which biases can be encoded into artificial intelligence (AI) health technologies, there is a risk of perpetuating existing health inequalities at scale. One major source of bias is the data that underpins such technologies. The STANDING Together recommendations aim to encourage transparency regarding limitations of health datasets and proactive evaluation of their effect across population groups. Draft recommendation items were informed by a systematic review and stakeholder survey. The recommendations were developed using a Delphi approach, supplemented by a public consultation and international interview study. Overall, more than 350 representatives from 58 countries provided input into this initiative. 194 Delphi participants from 25 countries voted and provided comments on 32 candidate items across three electronic survey rounds and one in-person consensus meeting. The 29 STANDING Together consensus recommendations are presented here in two parts. Recommendations for Documentation of Health Datasets provide guidance for dataset curators to enable transparency around data composition and limitations. Recommendations for Use of Health Datasets aim to enable identification and mitigation of algorithmic biases that might exacerbate health inequalities. These recommendations are intended to prompt proactive inquiry rather than acting as a checklist. We hope to raise awareness that no dataset is free of limitations, so transparent communication of data limitations should be perceived as valuable, and absence of this information as a limitation. We hope that adoption of the STANDING Together recommendations by stakeholders across the AI health technology lifecycle will enable everyone in society to benefit from technologies which are safe and effective.

Contributors: Joseph E Alderman, Joanne Palmer, Elinor Laws, Melissa D McCradden, Johan Ordish, Marzyeh Ghassemi, Stephen R Pfohl, Negar Rostamzadeh, Heather Cole-Lewis, Ben Glocker, Melanie Calvert, Tom J Pollard, Jaspret Gill, Jacqui Gath, Adewale Adebajo, Jude Beng, Cassandra H Leung, Stephanie Kuku, Lesley-Anne Farmer, Rubeta N Matin, Bilal A Mateen, Francis McKay, Katherine Heller, Alan Karthikesalingam, Darren Treanor, Maxine Mackintosh, Lauren Oakden-Rayner, Russell Pearson, Arjun K Manrai, Puja Myles, Judit Kumuthini, Zoher Kapacee, Neil J Sebire, Lama H Nazer, Jarrel Seah, Ashley Akbari, Lew Berman, Judy W Gichoya, Lorenzo Righetto, Diana Samuel, William Wasswa, Maria Charalambides, Anmol Arora, Sameer Pujari, Charlotte Summers, Elizabeth Sapey, Sharon Wilkinson, Vishal Thakker, Alastair Denniston, Xiaoxuan Liu

Faster Machine Unlearning via Natural Gradient Descent

We address the challenge of efficiently and reliably deleting data from machine learning models trained using Empirical Risk Minimization (ERM), a process known as machine unlearning. To avoid retraining models from scratch, we propose a novel algorithm leveraging Natural Gradient Descent (NGD). Our theoretical framework ensures strong privacy guarantees for convex models, while a practical Min/Max optimization algorithm is developed for non-convex models. Comprehensive evaluations show significant improvements in privacy, computational efficiency, and generalization compared to state-of-the-art methods, advancing both the theoretical and practical aspects of machine unlearning.
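To give a flavor of unlearning without retraining, the sketch below applies a single influence-function-style Newton step that removes a set of points from an L2-regularized logistic regression. This is a simplified stand-in: the paper's algorithm uses Natural Gradient Descent (a Fisher-preconditioned update) and comes with privacy guarantees that this sketch does not provide; the data, regularization strength, and forget set are synthetic.

```python
# Hedged sketch: approximate unlearning for a convex model via one Newton step
# that adds back the forgotten points' gradient, preconditioned by the full
# Hessian. Illustrative only; not the paper's NGD algorithm and no DP noise.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
lam = 1.0                                         # L2 strength (sklearn C = 1/lam)

clf = LogisticRegression(C=1.0 / lam, fit_intercept=False).fit(X, y)
theta = clf.coef_.ravel()

def unlearn(theta, X, y, forget_idx, lam):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(X.shape[1])
    g_forget = X[forget_idx].T @ (p[forget_idx] - y[forget_idx])
    return theta + np.linalg.solve(H, g_forget)   # single Newton correction

theta_unlearned = unlearn(theta, X, y, forget_idx=np.arange(50), lam=lam)
```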

Contributors: Omri Lev

MicrobioRaman: an open-access web repository for microbiological Raman spectroscopy data

Here we present the establishment of an open-access web-based repository for microbiological Raman spectroscopy data. The data collection, called ‘MicrobioRaman’ (https://www.ebi.ac.uk/biostudies/MicrobioRaman/studies), was inspired by the great success and usefulness of research databases such as GenBank and UniProt. This centralized repository, residing within the BioStudies database (which is maintained by a public institution, the European Bioinformatics Institute), minimizes the risk of data loss or eventual abandonment, offering a long-term common reference for analysis with advantages in accessibility and transparency over commercial data analysis tools. We feel that MicrobioRaman will provide a foundation for this growing field by serving as an open-access repository for sharing microbiological Raman data and through the codification of a set of reporting standards.

Contributors: Kang Soo Lee, Zachary Landry, Awais Athar, Uria Alcolombri, Pratchaya Pramoj Na Ayutthaya, David Berry, Philippe de Bettignies, Ji-Xin Cheng, Gabor Csucs, Li Cui, Volker Deckert, Thomas Dieing, Jennifer Dionne, Ondrej Doskocil, Glen D’Souza, Cristina García-Timermans, Notburga Gierlinger, Keisuke Goda, Roland Hatzenpichler, Richard Henshaw, Wei Huang, Ievgeniia Iermak, Natalia Ivleva, Janina Kneipp, Patrick Kubryk, Kirsten Küsel, Tae Kwon Lee, Sung Sik Lee, Bo Ma, Clara Martínez-Pérez, Pavel Matousek, Rainer U. Meckenstock, Wei Min, Peter Mojzeš, Oliver Müller, Naresh Kumar, Per Halkjær Nielsen, Ioan Notingher, Márton Palatinszky, Fátima C. Pereira, Giuseppe Pezzotti, Zdenek Pilat, Filip Plesinger, Jürgen Popp, Alexander Probst, Alessandra Riva, Amr. Saleh, Ota Samek, Haley Sapers, Olga Schubert, Astrid Stubbusch, Gordon Taylor, Michael Wagner, Jing Wang, Huabing Yin, Yang Yue, Renato Zenobi, Jacopo Zini, Ugis Sarkans & Roman Stocker.

Sharpness-Aware Minimization (SAM) Improves Classification Accuracy of Bacterial Raman Spectral Data Enabling Portable Diagnostics

Antimicrobial resistance is expected to claim 10 million lives per year by 2050, and resource-limited regions are most affected. Raman spectroscopy is a novel pathogen diagnostic approach promising rapid and portable antibiotic resistance testing within a few hours, compared to days when using gold standard methods. However, current algorithms for Raman spectral analysis (1) are unable to generalize well on limited datasets across diverse patient populations and (2) require increased complexity due to the necessity of non-trivial pre-processing steps, such as feature extraction, which are essential to mitigate the low-quality nature of Raman spectral data. In this work, we address these limitations using Sharpness-Aware Minimization (SAM) to enhance model generalization across a diverse array of hyperparameters in clinical bacterial isolate classification tasks. We demonstrate that SAM achieves accuracy improvements of up to 10.7% on a single split, and an increase in average accuracy of 2.5% across all splits, in spectral classification tasks compared with the traditional Adam optimizer. These results demonstrate the potential of SAM to advance the clinical application of AI-powered Raman spectroscopy tools.
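For reference, a single SAM update wraps a base optimizer in two gradient evaluations: climb to a nearby worst-case point, then descend using the gradient measured there. The sketch below shows one such step in PyTorch on synthetic spectra; the perturbation radius rho, the model, and the data are illustrative assumptions rather than the paper's setup.

```python
# Minimal sketch of one Sharpness-Aware Minimization (SAM) step around Adam.
# Model, rho, and data are illustrative; not the paper's training setup.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1000, 64), nn.ReLU(), nn.Linear(64, 5))
base_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
rho = 0.05

def sam_step(x, y):
    # 1) gradient at the current weights
    loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))

    # 2) move to the nearby worst-case point w + rho * g / ||g||
    with torch.no_grad():
        eps = [rho * g / (grad_norm + 1e-12) for g in grads]
        for p, e in zip(model.parameters(), eps):
            p.add_(e)

    # 3) gradient at the perturbed weights, then restore weights and step
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    base_opt.step()
    base_opt.zero_grad()

x = torch.randn(8, 1000)         # e.g., 8 Raman spectra with 1000 wavenumber bins
y = torch.randint(0, 5, (8,))    # 5 hypothetical isolate classes
sam_step(x, y)
```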

Contributors: Kaitlin Zareno, Jarett Dewbury, Siamak Sorooshyari, Hossein Mobahi