Focus Area: Clinical AI

Sharpness-Aware Minimization (SAM) Improves Classification Accuracy of Bacterial Raman Spectral Data Enabling Portable Diagnostics

Antimicrobial resistance is expected to claim 10 million lives per year by 2050, and resource-limited regions are most affected. Raman spectroscopy is a novel pathogen diagnostic approach promising rapid and portable antibiotic resistance testing within a few hours, compared to days when using gold standard methods. However, current algorithms for Raman spectra analysis 1) are unable to generalize well on limited datasets across diverse patient populations and 2) require increased complexity due to the necessity of non-trivial pre-processing steps, such as feature extraction, which are essential to mitigate the low-quality nature of Raman spectral data. In this work, we address these limitations using Sharpness-Aware Minimization (SAM) to enhance model generalization across a diverse array of hyperparameters in clinical bacterial isolate classification tasks. We demonstrate that SAM achieves accuracy improvements of up to 10.7% on a single split, and an increase in average accuracy of 2.5% across all splits in spectral classification tasks over the traditional optimizer, Adam. These results display the capability of SAM to advance the clinical application of AI-powered Raman spectroscopy tools.
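The core SAM update can be illustrated with a minimal sketch. It is not the authors' model or training code: it assumes a toy linear least-squares problem, plain gradient descent as the base optimizer (rather than Adam), and hypothetical values for the learning rate `lr` and neighborhood radius `rho`. Each step first moves the weights toward higher loss within a ball of radius `rho`, then applies the gradient computed at that perturbed point to the original weights.

```python
import math

def loss_grad(w, data):
    """Gradient of the mean squared error of a linear model y ~ w . x."""
    g = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += 2.0 * err * xj / len(data)
    return g

def sam_step(w, data, lr=0.1, rho=0.05):
    """One SAM update using plain gradient descent as the base optimizer."""
    g = loss_grad(w, data)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # stationary point: nothing to perturb
    # Ascent step: approximate worst-case weights inside an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: gradient taken at the perturbed weights, applied to the originals.
    g_adv = loss_grad(w_adv, data)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy data (hypothetical) generated exactly by w = [2, -1].
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = [0.0, 0.0]
for _ in range(200):
    w = sam_step(w, data)
```

With a fixed `rho`, the iterates settle in a small neighborhood of the minimizer rather than exactly on it, because every update uses the gradient at a slightly perturbed point.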

Contributors: Kaitlin Zareno, Jarett Dewbury, Siamak Sorooshyari, Hossein Mobahi Learn more

TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods

The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement was published in 2015 to provide the minimum reporting recommendations for studies developing or evaluating the performance of a prediction model. Methodological advances in the field of prediction have since included the widespread use of artificial intelligence (AI) powered by machine learning methods to develop prediction models. An update to the TRIPOD statement is thus needed. TRIPOD+AI provides harmonised guidance for reporting prediction model studies, irrespective of whether regression modelling or machine learning methods have been used. The new checklist supersedes the TRIPOD 2015 checklist, which should no longer be used. This article describes the development of TRIPOD+AI and presents the expanded 27 item checklist with more detailed explanation of each reporting recommendation, and the TRIPOD+AI for Abstracts checklist. TRIPOD+AI aims to promote the complete, accurate, and transparent reporting of studies that develop a prediction model or evaluate its performance. Complete reporting will facilitate study appraisal, model evaluation, and model implementation.

Contributors: Gary S Collins, Karel G M Moons, Paula Dhiman, Richard D Riley, Andrew L Beam, Ben Van Calster, Xiaoxuan Liu, Johannes B Reitsma, Maarten van Smeden, Anne-Laure Boulesteix, Jennifer Catherine Camaradou, Leo Anthony Celi, Spiros Denaxas, Alastair K Denniston, Ben Glocker, Robert M Golub, Hugh Harvey, Georg Heinze, Michael M Hoffman, André Pascal Kengne, Emily Lam, Naomi Lee, Elizabeth W Loder, Lena Maier-Hein, Bilal A Mateen, Melissa D McCradden, Lauren Oakden-Rayner, Johan Ordish, Richard Parnell, Sherri Rose, Karandeep Singh, Laure Wynants, Patricia Logullo Learn more

Integrating Technology into Undergraduate Medical Education: Can Affective Computing Help Teach Empathy?

To the Editor:

Substance use disorders (SUDs) and overdose deaths continue at record levels in the USA. One major barrier to adequate treatment is the stigma attached to the condition. Evidence suggests that clinicians have more negative attitudes and less empathy toward patients with SUDs compared to other medical and mental health conditions, thereby affecting the overall quality of care these patients receive [1]. Stigma can become apparent during clinical interactions where providers may unintentionally convey negative emotions or judgments through their facial expressions.

Until recently, empathy toward this patient population was thought of as an inherent trait that could not be taught. However, studies in the medical literature have shown that medical trainees do have the capability to improve their empathy toward patients [2]. Given that a physician’s ability to communicate effectively is associated with better patient outcomes, it is imperative to educate future physicians about how stigma manifests in the clinical setting and the importance of empathetic communication.

A promising approach to achieving this goal is through a technology called affective computing, also called emotional artificial intelligence. Affective computing enables computers to recognize, interpret, process, and simulate human emotion. Researchers from the MIT Media Lab at the Massachusetts Institute of Technology and Weill Cornell Medical College have developed Medship, a computerized training module. Medship leverages affective computing to educate future medical providers about the stigma toward patients with SUDs. It offers interactions with virtual (i.e., computerized) patients who have a SUD, records the user during each interaction, and simultaneously analyzes the user’s facial expressions to provide real-time feedback on them. The software used was OpenFace, a lightweight, open source toolkit for facial behavior analysis.

This project is being split into two studies. The initial study aimed to evaluate the usability and acceptability of Medship among medical students. Given the multitude of educational options currently available to medical students, their willingness to adopt the application is pivotal to its success. The second part of this project will be a randomized controlled trial to assess the module’s impact on decreasing negative attitudes toward this patient population and will be critical in evaluating its efficacy.

The initial study of this project used a quantitative interventional design, including a cross-sectional survey following a single session of using Medship. The Institutional Review Board at Weill Cornell Medical College granted approval for the study protocol. The online link for Medship was emailed to medical students during the course of their regular education. All feedback was contained in the module. A total of 26 students at Weill Cornell Medical College participated, providing anonymous responses to demographic questions, a System Usability Scale [3] and a System Quality Scale [4]. Usability refers to the ease of using the module, while acceptability gauges students’ willingness to integrate the module into their medical curriculum.

The results from this pilot study demonstrated positive feedback. Regarding usability, all students found it easy to learn and navigate the module. Most students reported that the module was both enjoyable and user-friendly (n = 20; 77%) and found the graphics to be of high quality and resolution (n = 25; 96%). Participants assigned an average of 85 on the System Usability Scale, where a score of 73 or above indicates satisfactory usability. Regarding acceptability, each student believed that their medical institution should offer Medship as part of the educational curriculum, and a substantial portion felt that medical students would greatly benefit from using the module (n = 20; 77%). On the System Quality Scale, participants rated the module an average of 4, where a score of 3 or higher indicates satisfactory acceptability.
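For readers unfamiliar with how a System Usability Scale score on the 0 to 100 range arises, the standard scoring rule can be sketched as follows. The function name and the example responses are hypothetical and are not data from this study; the sketch assumes the usual ten-item questionnaire answered on a 1 to 5 scale, with odd-numbered items worded positively and even-numbered items worded negatively.

```python
def sus_score(responses):
    """Score one 10-item SUS questionnaire; each response is an integer from 1 to 5."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS expects ten responses on a 1-5 scale")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, ... contribute (response - 1);
        # items 2, 4, 6, ... contribute (5 - response).
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5  # rescale the 0-40 sum to 0-100

# Hypothetical single respondent, for illustration only.
example = [5, 1, 4, 2, 5, 1, 4, 2, 5, 2]
score = sus_score(example)  # 35 * 2.5 = 87.5
```

A group-level figure such as the average reported above would then be the mean of these per-participant scores.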

One limitation of Medship is its potential lack of cultural diversity in the inputs that it receives to use in its algorithms that analyze facial action units. Expression of empathy in Western culture often assumes a “one-size-fits-all” approach without taking into consideration intercultural contexts. The current version of Medship is limited from a diversity standpoint in terms of the number of inputs it has from users coming from different backgrounds, cultures, and ethnicities. Future iterations of Medship must address this to enhance external validity.

Previous research has revealed that patients with major depression perceive neutral faces as sad compared to healthy participants who interpret them as happy [5]. It raises the question of whether patients with SUDs might exhibit distinct perceptions of neutral faces, particularly in light of the common comorbidity of SUDs and mood disorders. While one approach could be controlling for these comorbidities, a more clinically valuable direction could be to develop a unique version of Medship addressing patients with SUDs and specific comorbidities.

SUDs are becoming increasingly prevalent, remain significantly undertreated, and are stigmatized by clinicians more so than other medical and psychiatric illnesses. Affective computing is gaining prominence across industries, and the field of medicine is now exploring both its safety and efficacy in enhancing patient care. Medship has the capability of improving empathetic communication between providers and their patients. The first iteration of this study has revealed positive results in terms of the technology’s usability and acceptability by medical students, and the next portion of this study will focus on assessing Medship’s efficacy as an application.

Contributors: Michael Woods, Giselle Appel, Aidana Daulbayeva, Caleb Harris, Julia Iyasere, Jonathan Avery Learn more

Adaptive Optimization for Prediction with Missing Data

When training predictive models on data with missing entries, the most widely used and versatile approach is a pipeline technique where we first impute missing entries and then compute predictions. In this paper, we view prediction with missing data as a two-stage adaptive optimization problem and propose a new class of models, adaptive linear regression models, where the regression coefficients adapt to the set of observed features. We show that some adaptive linear regression models are equivalent to learning an imputation rule and a downstream linear regression model simultaneously instead of sequentially. We leverage this joint-impute-then-regress interpretation to generalize our framework to non-linear models. In settings where data is strongly not missing at random, our methods achieve a 2–10% improvement in out-of-sample accuracy.
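One simple way to picture coefficients that adapt to the set of observed features is to let them depend affinely on missingness indicators. The sketch below assumes that form; the function name, the base coefficients `w0`, and the adjustment matrix `W` are hypothetical values chosen for illustration, not fitted parameters or the authors' implementation.

```python
def adaptive_predict(x, w0, W):
    """Linear prediction whose coefficients shift with the missingness pattern.

    x  : feature values, with None marking a missing entry
    w0 : base coefficients, used as-is when every feature is observed
    W  : W[j][k] is added to the coefficient of feature j when feature k is missing
    """
    m = [1 if v is None else 0 for v in x]  # missingness indicators
    prediction = 0.0
    for j, v in enumerate(x):
        if m[j]:
            continue  # a missing feature has no value to multiply
        coef = w0[j] + sum(W[j][k] * m[k] for k in range(len(x)))
        prediction += coef * v
    return prediction

# Hypothetical parameters for a two-feature model.
w0 = [1.0, 2.0]
W = [[0.0, 0.5],
     [0.3, 0.0]]

full = adaptive_predict([3.0, 4.0], w0, W)      # 1.0*3.0 + 2.0*4.0
partial = adaptive_predict([3.0, None], w0, W)  # (1.0 + 0.5)*3.0
```

In this toy case, raising the first coefficient by 0.5 when the second feature is missing gives the same prediction as imputing the second feature as 0.25 times the first and then applying the base coefficients (1.0 × 3.0 + 2.0 × 0.75 = 4.5), which is the joint impute-then-regress reading described above.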

Contributors: Arthur Delarue, Jean Pauphilet Learn more

Probing the Limits and Capabilities of Diffusion Models for the Anatomic Editing of Digital Twins

Numerical simulations can model the physical processes that govern cardiovascular device deployment. When such simulations incorporate digital twins (computational models of patient-specific anatomy), they can expedite and de-risk the device design process. Nonetheless, the exclusive use of patient-specific data constrains the anatomic variability which can be precisely or fully explored. In this study, we investigate the capacity of Latent Diffusion Models (LDMs) to edit digital twins to create anatomic variants, which we term digital siblings. Digital twins and their corresponding siblings can serve as the basis for comparative simulations, enabling the study of how subtle anatomic variations impact the simulated deployment of cardiovascular devices, as well as the augmentation of virtual cohorts for device assessment. However, while diffusion models have been characterized in their ability to edit natural images, their capacity to anatomically edit digital twins has yet to be studied. Using a case example centered on 3D digital twins of cardiac anatomy, we implement various methods for generating digital siblings and characterize them through morphological and topological analyses. We specifically edit digital twins to introduce anatomic variation at different spatial scales and within localized regions, demonstrating the existence of bias towards common anatomic features. We further show that such anatomic bias can be leveraged for virtual cohort augmentation through selective editing, partially alleviating issues related to dataset imbalance and lack of diversity. Our experimental framework thus delineates the limits and capabilities of using latent diffusion models in synthesizing anatomic variation for in silico trials.

Contributors: Karim Kadry, Shreya Gupta, Farhad R. Nezami Learn more

Wearable Technology in Clinical Practice for Depressive Disorder

Joe, who has received a diagnosis of major depressive disorder, is meeting every 2 weeks with his psychiatrist, Sandy.

“How have you been feeling since we last met?” asks Sandy.

“Much better,” says Joe. “I’ve been much more active and social, and I’m sleeping great!”

“That’s wonderful,” says Sandy. “But…I think your wearable must be broken. The data from it looks very irregular for your sleep these past 2 weeks.”

“Oh,” says Joe, “it’s not broken. Actually, now that you mention it, my sleep has been really messed up. I slept well only yesterday.”

“Well,” asks Sandy, “should we talk more about how we can improve your sleep?”

This conversation is based on a real patient–therapist interaction. In this case, the data from wearable technology served as a prompt to obtain details of the patient’s life that might have otherwise been missed. Traditional clinical assessments depend on patient recall. Although such recall can include important factors that wearable technology (often termed “wearables”) does not detect, such as patients’ reports of distress, the assessments by wearables of longitudinal data from daily life may augment methods of monitoring and treating depression, providing objective complements to subjective information from patients.

Contributors: Szymon Fedor, Robert Lewis, Paola Pedrelli, David Mischoulon, Joshua Curtiss Learn more

Effective Human-AI Teams via Learned Natural Language Rules and Onboarding

People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space that corrects the human prior. Each region is then described using an iterative and contrastive procedure where a large language model describes the region. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.

Contributors: Hussein Mozannar, Jimin J Lee, Dennis Wei, Prasanna Sattigeri, Subhro Das Learn more

A Deep Dive into Single-Cell RNA Sequencing Foundation Models

Large-scale foundation models, which are pre-trained on massive, unlabeled datasets and subsequently fine-tuned on specific tasks, have recently achieved unparalleled success on a wide array of applications, including in healthcare and biology. In this paper, we explore two foundation models recently developed for single-cell RNA sequencing data, scBERT and scGPT. Focusing on the fine-tuning task of cell type annotation, we explore the relative performance of pre-trained models compared to a simple baseline, L1-regularized logistic regression, including in the few-shot setting. We perform ablation studies to understand whether pretraining improves model performance and to better understand the difficulty of the pre-training task in scBERT. Finally, using scBERT as an example, we demonstrate the potential sensitivity of fine-tuning to hyperparameter settings and parameter initializations. Taken together, our results highlight the importance of rigorously testing foundation models against well established baselines, establishing challenging fine-tuning tasks on which to benchmark foundation models, and performing deep introspection into the embeddings learned by the model in order to more effectively harness these models to transform single-cell data analysis. Code is available at

Contributors: Rebecca Boiarsky, Nalini Singh, Alejandro Buendia, Gad Getz Learn more

Taking Off with AI: Lessons from Aviation for Healthcare

Artificial intelligence (AI) stands to improve healthcare through innovative new systems ranging from diagnosis aids to patient tools. However, such “Health AI” systems are complicated and challenging to integrate into standing clinical practice. With advancing AI, regulations, practice, and policies must adapt to a wide range of new risks while experts learn to interact with complex automated systems. Even in the early stages of Health AI, risks and gaps are being identified, like severe underperformance of models for minority groups and catastrophic model failures when input data shift over time. In the face of such gaps, we find inspiration in aviation, a field that went from highly dangerous to largely safe. We draw three main lessons from aviation safety that can apply to Health AI: 1) Build regulatory feedback loops to learn from mistakes and improve practices, 2) Establish a culture of safety and openness where stakeholders have incentives to report failures and communicate across the healthcare system, and 3) Extensively train, retrain, and accredit experts for interacting with Health AI, especially to help address automation bias and foster trust. Finally, we discuss remaining limitations in Health AI with less guidance from aviation.

Contributors: Elizabeth Bondi-Kelly, Thomas Hartvigsen, Lindsay M Sanneman, Swami Sankaranarayanan, Zach Harned, Grace Wickerson, Judy Wawira Gichoya, Lauren Oakden-Rayner, Leo Anthony Celi, Matthew P Lungren, Julie A Shah, Marzyeh Ghassemi Learn more

Making the End-User a Priority in Benchmarking: OrionBench for Unsupervised Time Series Anomaly Detection

Time series anomaly detection is a prevalent problem in many application domains such as patient monitoring in healthcare, forecasting in finance, or predictive maintenance in energy. This has led to the emergence of a plethora of anomaly detection methods, including, more recently, deep learning-based methods. Although several benchmarks have been proposed to compare newly developed models, they usually rely on one-time execution over a limited set of datasets, and the comparison is restricted to a few models. We propose OrionBench, a user-centric, continuously maintained benchmark for unsupervised time series anomaly detection. The framework provides universal abstractions to represent models, extensibility to add new pipelines and datasets, hyperparameter standardization, pipeline verification, and frequent releases with published benchmarks. We demonstrate the usage of OrionBench, and the progression of pipelines across 15 releases published over the course of three years. Moreover, we walk through two real scenarios we experienced with OrionBench that highlight the importance of continuous benchmarks in unsupervised time series anomaly detection.
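The idea of running several pipelines over several datasets behind one common interface can be illustrated with a small sketch. This is not OrionBench's API: the z-score detector, the function names, the F1 scoring, and the toy signals are all assumptions made for illustration. Each pipeline is simply a function from a series to a set of flagged indices, so new pipelines or datasets can be added without changing the benchmark loop.

```python
import statistics

def zscore_pipeline(threshold):
    """Build a detector that flags points more than `threshold` std devs from the mean."""
    def detect(series):
        mean = statistics.fmean(series)
        std = statistics.pstdev(series)
        if std == 0:
            return set()
        return {i for i, v in enumerate(series) if abs(v - mean) / std > threshold}
    return detect

def f1(predicted, truth):
    """F1 score between predicted and true anomaly index sets."""
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

def benchmark(pipelines, datasets):
    """Average F1 of each pipeline over all (series, true_anomaly_indices) pairs."""
    return {
        name: statistics.fmean([f1(detect(series), truth) for series, truth in datasets])
        for name, detect in pipelines.items()
    }

# Hypothetical toy signals, each with one labeled anomaly.
datasets = [
    ([1.0, 1.1, 0.9, 1.0, 9.0, 1.0, 1.1, 0.9], {4}),
    ([5.0, 5.2, 4.9, 5.1, 5.0, -3.0, 5.1, 5.0], {5}),
]
pipelines = {
    "zscore_2": zscore_pipeline(2.0),      # reasonable threshold
    "zscore_0.1": zscore_pipeline(0.1),    # overly sensitive threshold
}
scores = benchmark(pipelines, datasets)
```

Here the overly sensitive pipeline flags every point and scores far lower than the stricter one. Rerunning the same loop whenever a pipeline or dataset changes is the kind of continuous comparison the abstract argues for.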

Contributors: Sarah Alnegheimish, Laure Berti-Equille Learn more