Focus Area: Clinical AI

Integrating Technology into Undergraduate Medical Education: Can Affective Computing Help Teach Empathy?

To the Editor:

Substance use disorders (SUDs) and overdose deaths continue at record levels in the USA. One major barrier to adequate treatment is the stigma attached to the condition. Evidence suggests that clinicians have more negative attitudes and less empathy toward patients with SUDs compared to other medical and mental health conditions, thereby affecting the overall quality of care these patients receive [1]. Stigma can become apparent during clinical interactions where providers may unintentionally convey negative emotions or judgments through their facial expressions.

Until recently, empathy toward this patient population was thought to be an inherent trait that could not be taught. However, studies in the medical literature have shown that medical trainees can improve their empathy toward patients [2]. Given that a physician’s ability to communicate effectively is associated with better patient outcomes, it is imperative to educate future physicians about how stigma manifests in the clinical setting and the importance of empathetic communication.

A promising approach to achieving this goal is affective computing, also called emotional artificial intelligence. Affective computing enables computers to recognize, interpret, process, and simulate human emotion. Researchers from the MIT Media Lab at the Massachusetts Institute of Technology and Weill Cornell Medical College have developed Medship, a computerized training module. Medship leverages affective computing to educate future medical providers about the stigma toward patients with SUDs. It offers interactions with virtual (i.e., computerized) patients who have a SUD, records the user during the interaction, and analyzes the user’s facial expressions to provide feedback on them in real time. The facial analysis software was OpenFace, a lightweight, open-source toolkit for facial behavior analysis.

This project is being split into two studies. The initial study aimed to evaluate the usability and acceptability of Medship among medical students. Given the multitude of educational options currently available to medical students, their willingness to adopt the application is pivotal to its success. The second part of this project will be a randomized controlled trial to assess the module’s impact on decreasing negative attitudes toward this patient population and will be critical in evaluating its efficacy.

The initial study of this project used a quantitative interventional design, including a cross-sectional survey following a single session of using Medship. The Institutional Review Board at Weill Cornell Medical College granted approval for the study protocol. The online link for Medship was emailed to medical students during the course of their regular education. All feedback was contained in the module. A total of 26 students at Weill Cornell Medical College participated, providing anonymous responses to demographic questions, a System Usability Scale [3], and a System Quality Scale [4]. Usability refers to the ease of using the module, while acceptability gauges students’ willingness to integrate the module into their medical curriculum.

The results from this pilot study demonstrated positive feedback. Regarding usability, all students found it easy to learn and navigate the module. Most students reported that the module was both enjoyable and user-friendly (n = 20; 77%) and found the graphics to be of high quality and resolution (n = 25; 96%). Participants assigned an average of 85 on the System Usability Scale, where a score of 73 or above indicates satisfactory usability. Regarding acceptability, each student believed that their medical institution should offer Medship as part of the educational curriculum, and a substantial portion felt that medical students would greatly benefit from using the module (n = 20; 77%). On the System Quality Scale, participants rated the module an average of 4, where a score of 3 or higher indicates satisfactory acceptability.
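For readers unfamiliar with how a System Usability Scale score is derived, the following minimal sketch applies the standard SUS scoring rule: ten items rated 1 to 5, where odd-numbered items contribute the rating minus 1, even-numbered items contribute 5 minus the rating, and the sum is multiplied by 2.5 to give a 0 to 100 score. The function name and the sample ratings are hypothetical and are not data from the study.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one respondent's ten 1-5 ratings."""
    if len(responses) != 10:
        raise ValueError("SUS uses exactly ten items")
    if any(r < 1 or r > 5 for r in responses):
        raise ValueError("each rating must be between 1 and 5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded.
            total += rating - 1
        else:
            # Even-numbered items are negatively worded, so they are reversed.
            total += 5 - rating
    return total * 2.5


# Hypothetical respondent, not study data.
example = [5, 1, 4, 2, 5, 1, 4, 2, 5, 2]
print(sus_score(example))
```

A cohort average such as the one reported above would then be the mean of the per-respondent scores.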

One limitation of Medship is its potential lack of cultural diversity in the inputs that it receives to use in its algorithms that analyze facial action units. Expression of empathy in Western culture often assumes a “one-size-fits-all” approach without taking into consideration intercultural contexts. The current version of Medship is limited from a diversity standpoint in terms of the number of inputs it has from users coming from different backgrounds, cultures, and ethnicities. Future iterations of Medship must address this to enhance external validity.

Previous research has revealed that patients with major depression perceive neutral faces as sad, compared to healthy participants, who interpret them as happy [5]. This raises the question of whether patients with SUDs might exhibit distinct perceptions of neutral faces, particularly in light of the common comorbidity of SUDs and mood disorders. While one approach could be controlling for these comorbidities, a more clinically valuable direction could be to develop a unique version of Medship addressing patients with SUDs and specific comorbidities.

SUDs are becoming increasingly prevalent, remain significantly undertreated, and are stigmatized by clinicians more so than other medical and psychiatric illnesses. Affective computing is gaining prominence across industries, and the field of medicine is now exploring both its safety and efficacy in enhancing patient care. Medship has the capability of improving empathetic communication between providers and their patients. The first iteration of this study has revealed positive results in terms of the technology’s usability and acceptability by medical students, and the next portion of this study will focus on assessing Medship’s efficacy as an application.

Contributors: Michael Woods, Giselle Appel, Aidana Daulbayeva, Caleb Harris, Julia Iyasere, Jonathan Avery

Adaptive Optimization for Prediction with Missing Data

When training predictive models on data with missing entries, the most widely used and versatile approach is a pipeline technique where we first impute missing entries and then compute predictions. In this paper, we view prediction with missing data as a two-stage adaptive optimization problem and propose a new class of models, adaptive linear regression models, where the regression coefficients adapt to the set of observed features. We show that some adaptive linear regression models are equivalent to learning an imputation rule and a downstream linear regression model simultaneously instead of sequentially. We leverage this joint-impute-then-regress interpretation to generalize our framework to non-linear models. In settings where data is strongly not missing at random, our methods achieve a 2–10% improvement in out-of-sample accuracy.
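To make the idea of coefficients that "adapt to the set of observed features" concrete, here is a minimal sketch under stated assumptions. It assumes one simple form in which the coefficient of each observed feature is shifted by a fixed amount whenever another feature is missing; the function names, the parameter values, and the toy imputation rule are all hypothetical and are not taken from the paper.

```python
def adaptive_predict(x, base, shift, intercept=0.0):
    """Predict with coefficients that depend on which features are observed.

    x:     list of feature values, with None marking a missing entry
    base:  base[j] is the coefficient of feature j when nothing is missing
    shift: shift[j][k] is added to the coefficient of feature j
           when feature k is missing
    """
    missing = [1.0 if v is None else 0.0 for v in x]
    prediction = intercept
    for j, value in enumerate(x):
        if value is None:
            continue  # a missing feature contributes nothing directly
        coef = base[j] + sum(shift[j][k] * missing[k] for k in range(len(x)))
        prediction += coef * value
    return prediction


def impute_then_regress(x, base, intercept=0.0):
    """Comparison pipeline for two features: fill a missing feature with
    the other feature's value, then apply fixed coefficients."""
    filled = [x[1 - j] if v is None else v for j, v in enumerate(x)]
    return intercept + sum(b * v for b, v in zip(base, filled))


# Hypothetical parameters: each feature's coefficient grows by 1
# when the other feature is missing.
base = [1.0, 1.0]
shift = [[0.0, 1.0],
         [1.0, 0.0]]

full = adaptive_predict([2.0, 3.0], base, shift)
partial = adaptive_predict([2.0, None], base, shift)
print(full, partial)
```

With these particular values, the adaptive prediction for a row with a missing second feature matches the output of the toy impute-then-regress pipeline, which illustrates, in a very small case, how an adaptive model can correspond to learning an imputation rule and a regression together.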

Contributors: Arthur Delarue, Jean Pauphilet

Probing the Limits and Capabilities of Diffusion Models for the Anatomic Editing of Digital Twins

Numerical simulations can model the physical processes that govern cardiovascular device deployment. When such simulations incorporate digital twins, computational models of patient-specific anatomy, they can expedite and de-risk the device design process. Nonetheless, the exclusive use of patient-specific data constrains the anatomic variability which can be precisely or fully explored. In this study, we investigate the capacity of Latent Diffusion Models (LDMs) to edit digital twins to create anatomic variants, which we term digital siblings. Digital twins and their corresponding siblings can serve as the basis for comparative simulations, enabling the study of how subtle anatomic variations impact the simulated deployment of cardiovascular devices, as well as the augmentation of virtual cohorts for device assessment. However, while diffusion models have been characterized in their ability to edit natural images, their capacity to anatomically edit digital twins has yet to be studied. Using a case example centered on 3D digital twins of cardiac anatomy, we implement various methods for generating digital siblings and characterize them through morphological and topological analyses. We specifically edit digital twins to introduce anatomic variation at different spatial scales and within localized regions, demonstrating the existence of bias towards common anatomic features. We further show that such anatomic bias can be leveraged for virtual cohort augmentation through selective editing, partially alleviating issues related to dataset imbalance and lack of diversity. Our experimental framework thus delineates the limits and capabilities of using latent diffusion models in synthesizing anatomic variation for in silico trials.

Contributors: Karim Kadry, Shreya Gupta, Farhad R. Nezami

Wearable Technology in Clinical Practice for Depressive Disorder

Joe, who has received a diagnosis of major depressive disorder, is meeting every 2 weeks with his psychiatrist, Sandy.

“How have you been feeling since we last met?” asks Sandy.

“Much better,” says Joe. “I’ve been much more active and social, and I’m sleeping great!”

“That’s wonderful,” says Sandy. “But…I think your wearable must be broken. The data from it looks very irregular for your sleep these past 2 weeks.”

“Oh,” says Joe, “it’s not broken. Actually, now that you mention it, my sleep has been really messed up. I slept well only yesterday.”

“Well,” asks Sandy, “should we talk more about how we can improve your sleep?”

This conversation is based on a real patient–therapist interaction. In this case, the data from wearable technology served as a prompt to obtain details of the patient’s life that might have otherwise been missed. Traditional clinical assessments depend on patient recall. Although such recall can include important factors that wearable technology (often termed “wearables”) does not detect, such as patients’ reports of distress, the assessments by wearables of longitudinal data from daily life may augment methods of monitoring and treating depression, providing objective complements to subjective information from patients.

Contributors: Szymon Fedor, Robert Lewis, Paola Pedrelli, David Mischoulon, Joshua Curtiss

Effective Human-AI Teams via Learned Natural Language Rules and Onboarding

People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space that corrects the human prior. Each region is then described using an iterative and contrastive procedure where a large language model describes the region. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.

Contributors: Hussein Mozannar, Jimin J Lee, Dennis Wei, Prasanna Sattigeri, Subhro Das

A Deep Dive into Single-Cell RNA Sequencing Foundation Models

Large-scale foundation models, which are pre-trained on massive, unlabeled datasets and subsequently fine-tuned on specific tasks, have recently achieved unparalleled success on a wide array of applications, including in healthcare and biology. In this paper, we explore two foundation models recently developed for single-cell RNA sequencing data, scBERT and scGPT. Focusing on the fine-tuning task of cell type annotation, we explore the relative performance of pre-trained models compared to a simple baseline, L1-regularized logistic regression, including in the few-shot setting. We perform ablation studies to understand whether pre-training improves model performance and to better understand the difficulty of the pre-training task in scBERT. Finally, using scBERT as an example, we demonstrate the potential sensitivity of fine-tuning to hyperparameter settings and parameter initializations. Taken together, our results highlight the importance of rigorously testing foundation models against well-established baselines, establishing challenging fine-tuning tasks on which to benchmark foundation models, and performing deep introspection into the embeddings learned by the model in order to more effectively harness these models to transform single-cell data analysis. Code is available at https://github.com/clinicalml/sc-foundation-eval.

Contributors: Rebecca Boiarsky, Nalini Singh, Alejandro Buendia, Gad Getz

Taking Off with AI: Lessons from Aviation for Healthcare

Artificial intelligence (AI) stands to improve healthcare through innovative new systems ranging from diagnosis aids to patient tools. However, such “Health AI” systems are complicated and challenging to integrate into standing clinical practice. With advancing AI, regulations, practice, and policies must adapt to a wide range of new risks while experts learn to interact with complex automated systems. Even in the early stages of Health AI, risks and gaps are being identified, like severe underperformance of models for minority groups and catastrophic model failures when input data shift over time. In the face of such gaps, we find inspiration in aviation, a field that went from highly dangerous to largely safe. We draw three main lessons from aviation safety that can apply to Health AI: 1) Build regulatory feedback loops to learn from mistakes and improve practices, 2) Establish a culture of safety and openness where stakeholders have incentives to report failures and communicate across the healthcare system, and 3) Extensively train, retrain, and accredit experts for interacting with Health AI, especially to help address automation bias and foster trust. Finally, we discuss remaining limitations in Health AI with less guidance from aviation.

Contributors: Elizabeth Bondi-Kelly, Thomas Hartvigsen, Lindsay M Sanneman, Swami Sankaranarayanan, Zach Harned, Grace Wickerson, Judy Wawira Gichoya, Lauren Oakden-Rayner, Leo Anthony Celi, Matthew P Lungren, Julie A Shah, Marzyeh Ghassemi

Making the End-User a Priority in Benchmarking: OrionBench for Unsupervised Time Series Anomaly Detection

Time series anomaly detection is a prevalent problem in many application domains such as patient monitoring in healthcare, forecasting in finance, or predictive maintenance in energy. This has led to the emergence of a plethora of anomaly detection methods, including, more recently, deep-learning-based methods. Although several benchmarks have been proposed to compare newly developed models, they usually rely on one-time execution over a limited set of datasets, and the comparison is restricted to a few models. We propose OrionBench, a user-centric, continuously maintained benchmark for unsupervised time series anomaly detection. The framework provides universal abstractions to represent models, extensibility to add new pipelines and datasets, hyperparameter standardization, pipeline verification, and frequent releases with published benchmarks. We demonstrate the usage of OrionBench, and the progression of pipelines across 15 releases published over the course of three years. Moreover, we walk through two real scenarios we experienced with OrionBench that highlight the importance of continuous benchmarks in unsupervised time series anomaly detection.

Contributors: Sarah Alnegheimish, Laure Berti-Equille

An interpretable AI model for recurrence prediction after surgery in gastrointestinal stromal tumour: an observational cohort study

Background There are several models that predict the risk of recurrence following resection of localised, primary gastrointestinal stromal tumour (GIST). However, assessment of calibration is not always feasible and when performed, calibration of current GIST models appears to be suboptimal. We aimed to develop a prognostic model to predict the recurrence of GIST after surgery with both good discrimination and calibration by uncovering and harnessing the non-linear relationships among variables that predict recurrence.

Methods In this observational cohort study, the data of 395 adult patients who underwent complete resection (R0 or R1) of a localised, primary GIST in the pre-imatinib era at Memorial Sloan Kettering Cancer Center (NY, USA) (recruited 1982–2001) and a European consortium (Spanish Group for Research in Sarcomas, 80 sites) (recruited 1987–2011) were used to train an interpretable Artificial Intelligence (AI)-based model called Optimal Classification Trees (OCT). The OCT predicted the probability of recurrence after surgery by capturing non-linear relationships among predictors of recurrence. The data of an additional 556 patients from another European consortium (Polish Clinical GIST Registry, 7 sites) (recruited 1981–2013) who were also treated in the pre-imatinib era were used to externally validate the OCT predictions with regard to discrimination (Harrell's C-index and Brier score) and calibration (calibration curve, Brier score, and Hosmer-Lemeshow test). The calibration of the Memorial Sloan Kettering (MSK) GIST nomogram was used as a comparative gold standard. We also evaluated the clinical utility of the OCT and the MSK nomogram by performing a Decision Curve Analysis (DCA).
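As a rough illustration of two of the validation metrics named above, the sketch below computes a Brier score and a concordance index for plain binary outcomes. It is a simplification under stated assumptions: it ignores follow-up time and censoring, which a survival analysis such as this study would need to handle, and the function names and sample values are hypothetical rather than taken from the study.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes.
    Lower is better; always predicting 0.5 scores 0.25."""
    if len(predicted_probs) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted_probs, outcomes)]
    return sum(squared_errors) / len(outcomes)


def concordance_index(predicted_probs, outcomes):
    """Fraction of (event, non-event) pairs in which the event received the
    higher predicted risk; ties count as half. This is the binary-outcome,
    uncensored special case of a concordance index."""
    events = [p for p, y in zip(predicted_probs, outcomes) if y == 1]
    non_events = [p for p, y in zip(predicted_probs, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                score += 1.0
            elif e == n:
                score += 0.5
    return score / (len(events) * len(non_events))


# Hypothetical predictions and outcomes (1 = recurrence), not study data.
probs = [0.9, 0.2, 0.7, 0.1]
observed = [1, 0, 1, 0]
print(brier_score(probs, observed), concordance_index(probs, observed))
```

The Brier score reflects both discrimination and calibration, which is why it appears under both headings in the paragraph above, whereas the concordance index depends only on how the predictions are ranked.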

Findings The internal cohort included 395 patients (median [IQR] age, 63 [54–71] years; 214 men [54.2%]) and the external cohort included 556 patients (median [IQR] age, 60 [52–68] years; 308 men [55.4%]). The Harrell's C-index of the OCT in the external validation cohort was greater than that of the MSK nomogram (0.805 (95% CI: 0.803–0.808) vs 0.788 (95% CI: 0.786–0.791), respectively). In the external validation cohort, the slope and intercept of the calibration curve of the main OCT were 1.041 and 0.038, respectively. In comparison, the slope and intercept of the calibration curve for the MSK nomogram were 0.681 and 0.032, respectively. The MSK nomogram overestimated the recurrence risk throughout the entire calibration curve. Of note, the Brier score was lower for the OCT compared to the MSK nomogram (0.147 vs 0.564, respectively), and the Hosmer-Lemeshow test was insignificant (P = 0.087) for the OCT model but significant (P […] 50% risk of recurrence.

Interpretation We present the first prognostic models of recurrence risk in GIST that demonstrate excellent discrimination, calibration, and clinical utility on external validation. Additional studies for further validation are warranted. With further validation, these tools could potentially improve patient counseling and selection for adjuvant therapy.

Funding The NCI SPORE in Soft Tissue Sarcoma and NCI Cancer Center Support Grants.

Contributors: Georgios Antonios Margonis, Seehanah Tang, Angelos Koulouras, Cristina R. Antonescu, Murray F. Brennan, Javier Martin-Broto, Piotr Rutkowski, Georgios Stasinos, Jane Wang, Emmanouil Pikoulis, Elzbieta Bylina, Pawel Sobczuk, Antonio Gutierrez, Bhumika Jadeja, William D. Tap, Ping Chi, Samuel Singer

Considering Biased Data as Informative Artifacts in AI-Assisted Health Care

Artificial intelligence (AI) tools used in medicine, like AI used in other fields, work by detecting patterns in large volumes of data. AI tools are able to detect these patterns because they can “learn,” or be trained to recognize, certain features in the data. However, medical AI tools trained with data that are skewed in some way can exhibit bias, and when that bias matches patterns of injustice, the use of the tools can lead to inequity and discrimination. Technical solutions such as attempting to fix biased clinical data used for AI training are well intentioned, but what undergirds all these initiatives is the notion that skewed clinical data are “garbage,” as in the computer science adage “garbage in, garbage out.” Instead, we propose thinking of clinical data as artifacts that, when examined, can be informative of societies and institutions in which they are found.

Contributors: Kadija Ferryman, Maxine Mackintosh