Artificial intelligence (AI) stands to improve healthcare through innovative new systems ranging from diagnostic aids to patient tools. However, such “Health AI” systems are complicated and challenging to integrate into existing clinical practice. As AI advances, regulations, practice, and policies must adapt to a wide range of new risks while experts learn to interact with complex automated systems. Even in the early stages of Health AI, risks and gaps are being identified, such as severe underperformance of models for minority groups and catastrophic model failures when input data shift over time. In the face of such gaps, we find inspiration in aviation, a field that went from highly dangerous to largely safe. We draw three main lessons from aviation safety that can apply to Health AI: 1) build regulatory feedback loops to learn from mistakes and improve practices, 2) establish a culture of safety and openness in which stakeholders have incentives to report failures and communicate across the healthcare system, and 3) extensively train, retrain, and accredit experts for interacting with Health AI, especially to help address automation bias and foster trust. Finally, we discuss remaining challenges in Health AI for which aviation offers less guidance.
Contributors: Elizabeth Bondi-Kelly, Thomas Hartvigsen, Lindsay M Sanneman, Swami Sankaranarayanan, Zach Harned, Grace Wickerson, Judy Wawira Gichoya, Lauren Oakden-Rayner, Leo Anthony Celi, Matthew P Lungren, Julie A Shah, Marzyeh Ghassemi
A validated open-source deep-learning algorithm called Sybil can accurately predict long-term lung cancer risk from a single low-dose chest computed tomography (LDCT). However, Sybil was trained on a majority-male cohort. Use of artificial intelligence algorithms trained on imbalanced cohorts may lead to inequitable outcomes in real-world settings. We aimed to study whether Sybil predicts lung cancer risk equally regardless of sex. We analyzed 10,573 LDCTs from 6,127 consecutive lung cancer screening participants across a health system between 2015 and 2021. Sybil achieved an AUC of 0.89 (95% CI: 0.85–0.93) for females and 0.89 (95% CI: 0.85–0.94) for males at 1 year (p = 0.92). At 6 years, the AUC was 0.87 (95% CI: 0.83–0.93) for females and 0.79 (95% CI: 0.72–0.86) for males (p = 0.01). In conclusion, Sybil can accurately predict future lung cancer risk in females and males in a real-world setting and performs better in females than in males for predicting 6-year lung cancer risk.
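The subgroup analysis above reduces to computing an AUC with a confidence interval separately per sex. Below is a minimal sketch of that evaluation, not the study's actual code: y (outcome within the horizon), risk (model scores), and sex are hypothetical NumPy arrays, and the 95% CI comes from a simple bootstrap.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_ci(y_true, y_score, n_boot=1000, seed=0):
    """Point-estimate AUC plus a 95% bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    point = roc_auc_score(y_true, y_score)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # resample must contain both classes
            continue
        boots.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return point, lo, hi

# Hypothetical arrays: y (0/1 outcome), risk (model score), sex ("F"/"M")
for group in ["F", "M"]:
    mask = sex == group
    auc, lo, hi = auc_with_ci(y[mask], risk[mask])
    print(f"{group}: AUC {auc:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```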
Contributors: Judit Simon, Ismail Tahir, Alexander Graur, Stefan Ringer, Amanda Fata, Chi-Fu Jeffrey Yang, Jo-Anne Shepard, Francine Jacobson, Lecia V. Sequist, Lydia E. Pace
Time series anomaly detection is a prevalent problem in many application domains, such as patient monitoring in healthcare, forecasting in finance, and predictive maintenance in energy. This has led to the emergence of a plethora of anomaly detection methods, including, more recently, deep-learning-based methods. Although several benchmarks have been proposed to compare newly developed models, they usually rely on one-time execution over a limited set of datasets, and the comparison is restricted to a few models. We propose OrionBench, a user-centric, continuously maintained benchmark for unsupervised time series anomaly detection. The framework provides universal abstractions to represent models, extensibility to add new pipelines and datasets, hyperparameter standardization, pipeline verification, and frequent releases with published benchmarks. We demonstrate the usage of OrionBench and the progression of pipelines across 15 releases published over the course of three years. Moreover, we walk through two real scenarios we experienced with OrionBench that highlight the importance of continuous benchmarking in unsupervised time series anomaly detection.
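A continuously maintained benchmark of this kind is, at its core, a loop that re-runs every registered pipeline on every registered dataset at each release. Below is a simplified sketch of that loop, not the actual OrionBench API: the fit/detect interface, the loaders, and the scoring function are all assumptions.

```python
import pandas as pd

def run_benchmark(pipelines, datasets, score):
    """One benchmark release: evaluate every pipeline on every dataset.

    pipelines: dict name -> factory returning an object with fit()/detect()
    datasets:  dict name -> loader returning (signal, known_anomalies)
    score:     fn(detected, known) -> float, e.g. an interval-overlap F1
    """
    rows = []
    for d_name, load in datasets.items():
        signal, known = load()
        for p_name, make_pipeline in pipelines.items():
            model = make_pipeline()
            model.fit(signal)                 # unsupervised: labels never seen
            detected = model.detect(signal)   # list of (start, end) intervals
            rows.append({"dataset": d_name, "pipeline": p_name,
                         "score": score(detected, known)})
    return pd.DataFrame(rows)  # snapshot published with each release
```

Standardizing hyperparameters and verifying pipelines before each release is what makes successive snapshots comparable across the 15 releases mentioned above.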
Contributors: Sarah Alnegheimish, Laure Berti-Equille
Background
There are several models that predict the risk of recurrence following resection of localised, primary gastrointestinal stromal tumour (GIST). However, assessment of calibration is not always feasible, and when performed, calibration of current GIST models appears to be suboptimal. We aimed to develop a prognostic model to predict the recurrence of GIST after surgery with both good discrimination and calibration by uncovering and harnessing the non-linear relationships among variables that predict recurrence.
Methods
In this observational cohort study, the data of 395 adult patients who underwent complete resection (R0 or R1) of a localised, primary GIST in the pre-imatinib era at Memorial Sloan Kettering Cancer Center (NY, USA) (recruited 1982–2001) and a European consortium (Spanish Group for Research in Sarcomas, 80 sites) (recruited 1987–2011) were used to train an interpretable Artificial Intelligence (AI)-based model called Optimal Classification Trees (OCT). The OCT predicted the probability of recurrence after surgery by capturing non-linear relationships among predictors of recurrence. The data of an additional 596 patients from another European consortium (Polish Clinical GIST Registry, 7 sites) (recruited 1981–2013) who were also treated in the pre-imatinib era were used to externally validate the OCT predictions with regard to discrimination (Harrell's C-index and Brier score) and calibration (calibration curve, Brier score, and Hosmer-Lemeshow test). The calibration of the Memorial Sloan Kettering (MSK) GIST nomogram was used as a comparative gold standard. We also evaluated the clinical utility of the OCT and the MSK nomogram by performing a Decision Curve Analysis (DCA).
Findings
The internal cohort included 395 patients (median [IQR] age, 63 [54–71] years; 214 men [54.2%]) and the external cohort included 556 patients (median [IQR] age, 60 [52–68] years; 308 men [55.4%]). The Harrell's C-index of the OCT in the external validation cohort was greater than that of the MSK nomogram (0.805 [95% CI: 0.803–0.808] vs 0.788 [95% CI: 0.786–0.791]). In the external validation cohort, the slope and intercept of the calibration curve of the main OCT were 1.041 and 0.038, respectively. In comparison, the slope and intercept of the calibration curve for the MSK nomogram were 0.681 and 0.032, respectively. The MSK nomogram overestimated the recurrence risk throughout the entire calibration curve. Of note, the Brier score was lower for the OCT than for the MSK nomogram (0.147 vs 0.564), and the Hosmer-Lemeshow test was insignificant (P = 0.087) for the OCT model but significant for the MSK nomogram. The DCA showed greater clinical utility for the OCT than for the MSK nomogram, including in patients with >50% risk of recurrence.
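For readers less familiar with these metrics: the calibration slope and intercept come from regressing the observed outcome on the logit of the predicted probability, and the Brier score is the mean squared error of the predicted probabilities. A minimal sketch, assuming a binary recurrence outcome at a fixed horizon (the study works with time-to-event data, so this is a simplification; y and p are hypothetical arrays):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def calibration_slope_intercept(y_true, p_pred):
    """Regress outcome on logit(p): perfect calibration gives slope 1, intercept 0."""
    logit = np.log(p_pred / (1.0 - p_pred)).reshape(-1, 1)
    fit = LogisticRegression(penalty=None).fit(logit, y_true)
    return fit.coef_[0][0], fit.intercept_[0]

# y: observed recurrence (0/1), p: predicted recurrence probabilities
# slope, intercept = calibration_slope_intercept(y, p)
# brier = brier_score_loss(y, p)   # lower is better (0.147 vs 0.564 above)
```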
Interpretation
We present the first prognostic models of recurrence risk in GIST that demonstrate excellent discrimination, calibration, and clinical utility on external validation. Additional validation studies are warranted; with such validation, these tools could potentially improve patient counseling and selection for adjuvant therapy.
Funding
The NCI SPORE in Soft Tissue Sarcoma and NCI Cancer Center Support Grants.
Contributors: Georgios Antonios Margonis, Seehanah Tang, Angelos Koulouras, Cristina R. Antonescu, Murray F. Brennan, Javier Martin-Broto, Piotr Rutkowski, Georgios Stasinos, Jane Wang, Emmanouil Pikoulis, Elzbieta Bylina, Pawel Sobczuk, Antonio Gutierrez, Bhumika Jadeja, William D. Tap, Ping Chi, Samuel Singer
Artificial intelligence (AI) tools used in medicine, like AI used in other fields, work by detecting patterns in large volumes of data. AI tools are able to detect these patterns because they can “learn,” or be trained to recognize, certain features in the data. However, medical AI tools trained with data that are skewed in some way can exhibit bias, and when that bias matches patterns of injustice, the use of the tools can lead to inequity and discrimination. Technical solutions such as attempting to fix biased clinical data used for AI training are well intentioned, but what undergirds all these initiatives is the notion that skewed clinical data are “garbage,” as in the computer science adage “garbage in, garbage out.” Instead, we propose thinking of clinical data as artifacts that, when examined, can be informative of societies and institutions in which they are found.
Contributors: Kadija Ferryman, Maxine Mackintosh
The large amount of time clinicians spend sifting through patient notes and documenting in electronic health records (EHRs) is a leading cause of clinician burnout. By proactively and dynamically retrieving relevant notes during the documentation process, we can reduce the effort required to find relevant patient history. In this work, we conceptualize the use of EHR audit logs for machine learning as a source of supervision of note relevance in a specific clinical context, at a particular point in time. Our evaluation focuses on dynamic retrieval in the emergency department, a high-acuity setting with unique patterns of information retrieval and note writing. We show that our methods can achieve an AUC of 0.963 for predicting which notes will be read in an individual note-writing session. We additionally conduct a user study with several clinicians and find that our framework can help clinicians retrieve relevant information more efficiently. Demonstrating that our framework and methods perform well in this demanding setting is a promising proof of concept that they will translate to other clinical settings and data modalities (e.g., labs, medications, imaging).
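To make the audit-log idea concrete, here is a minimal sketch, not the paper's actual pipeline: each (writing session, candidate note) pair gets a label from the audit log (was the note opened?) and a feature vector, and a simple classifier ranks the candidates; the feature choices named in the comments are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_relevance_model(X_train, y_train, X_test, y_test):
    """Audit-log supervision: y = 1 iff the note was opened in the session.

    X rows are (session, note) features, e.g. note age, note type, author
    role, and term overlap with the note being written (all assumed here).
    """
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]   # rank candidate notes
    return model, roc_auc_score(y_test, scores)
```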
In this paper, we propose a novel approach to conformal prediction for generative language models (LMs). Standard conformal prediction produces prediction sets, in place of single predictions, that have rigorous, statistical performance guarantees. LM responses are typically sampled from the model’s predicted distribution over the large, combinatorial output space of natural language. Translating this process to conformal prediction, we calibrate a stopping rule for sampling different outputs from the LM that get added to a growing set of candidates until we are confident that the output set is sufficient. Since some samples may be low quality, we also simultaneously calibrate and apply a rejection rule for removing candidates from the output set to reduce noise. Similar to conformal prediction, we prove that the sampled set returned by our procedure contains at least one acceptable answer with high probability, while still being empirically precise (i.e., small) on average. Furthermore, within this set of candidate responses, we show that we can also accurately identify subsets of individual components, such as phrases or sentences, that are each independently correct (e.g., that are not “hallucinations”), again with statistical guarantees. We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants.
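The sample-then-reject procedure can be sketched in a few lines. This is a simplification under stated assumptions: the LM sampling interface is hypothetical, and the paper calibrates the stopping rule itself, whereas here it is reduced to a calibrated sampling budget.

```python
def conformal_sample_set(lm_sample, quality, lambda_reject, k_budget):
    """Grow a candidate set by sampling, dropping low-quality responses.

    lm_sample:     fn() -> one sampled LM response (assumed interface)
    quality:       fn(response) -> score used by the rejection rule
    lambda_reject: quality threshold, calibrated on held-out data
    k_budget:      calibrated sampling budget standing in for the stopping
                   rule, chosen so the returned set contains at least one
                   acceptable answer with probability >= 1 - alpha
    """
    candidates = []
    for _ in range(k_budget):
        response = lm_sample()
        if quality(response) >= lambda_reject:   # rejection rule: reduce noise
            candidates.append(response)
    return candidates
```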
Contributors: Victor Quach, Adam Fisch, Tal Schuster, Jae Ho Sohn
Self-supervised learning (SSL) for clinical time series data has received significant attention in recent literature, since these data are highly rich and provide important information about a patient’s physiological state. However, most existing SSL methods for clinical time series are limited in that they are designed for unimodal time series, such as a sequence of structured features (e.g., lab values and vital signs) or an individual high-dimensional physiological signal (e.g., an electrocardiogram). These existing methods cannot be readily extended to model time series that exhibit multimodality, with structured features and high-dimensional data being recorded at each timestep in the sequence. In this work, we address this gap and propose a new SSL method, Sequential Multi-Dimensional SSL, where an SSL loss is applied both at the level of the entire sequence and at the level of the individual high-dimensional data points in the sequence in order to better capture information at both scales. Our strategy is agnostic to the specific form of loss function used at each level: it can be contrastive, as in SimCLR, or non-contrastive, as in VICReg. We evaluate our method on two real-world clinical datasets, where the time series contain sequences of (1) high-frequency electrocardiograms and (2) structured data from lab values and vital signs. Our experimental results indicate that pre-training with our method and then fine-tuning on downstream tasks improves performance over baselines on both datasets, and in several settings, can lead to improvements across different self-supervised loss functions.
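The two-level structure can be sketched as follows; the encoder modules, the augmentation, and the way modalities are fused are all assumptions here, while ssl_loss can be any two-view objective such as SimCLR or VICReg, reflecting the loss-agnostic design.

```python
def sequential_multidim_ssl_loss(step_encoder, seq_encoder, ssl_loss,
                                 augment, signals, structured):
    """Apply one SSL loss per high-dimensional point and one per sequence.

    signals:    (batch, time, signal_dim) high-dimensional data per timestep
    structured: (batch, time, feat_dim) labs and vital signs per timestep
    """
    # Level 1: individual high-dimensional points (e.g., one ECG segment)
    z1 = step_encoder(augment(signals))
    z2 = step_encoder(augment(signals))
    loss_points = ssl_loss(z1, z2)

    # Level 2: the entire multimodal sequence, fusing both modalities
    h1 = seq_encoder(z1, structured)
    h2 = seq_encoder(z2, structured)
    loss_sequence = ssl_loss(h1, h2)

    return loss_points + loss_sequence
```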
Contributors: Aniruddh Raghu, Payal Chandak, Ridwan Alam, John Guttag
Social determinants of health (SDOH), the conditions in which people live, grow, and age, play a crucial role in a person’s health and well-being. There is a large, compelling body of evidence in population health studies showing that a wide range of SDOH is strongly correlated with health outcomes. Yet, a majority of the risk prediction models based on electronic health records (EHR) do not incorporate a comprehensive set of SDOH features, as they are often noisy or simply unavailable. Our work links a publicly available EHR database, MIMIC-IV, to well-documented SDOH features. We investigate the impact of such features on common EHR prediction tasks across different patient populations. We find that community-level SDOH features do not improve model performance for a general patient population, but can improve data-limited model fairness for specific subpopulations. We also demonstrate that SDOH features are vital for conducting thorough audits of algorithmic biases beyond protected attributes. We hope the new integrated EHR-SDOH database will enable studies on the relationship between community health and individual outcomes and provide new benchmarks to study algorithmic biases beyond race, gender, and age.
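The linkage itself is a geography-level join between patient records and community indicators. A minimal sketch with hypothetical file and column names (not the released database schema):

```python
import pandas as pd

patients = pd.read_csv("mimic_iv_cohort.csv")   # subject_id, geo_code, ...
sdoh = pd.read_csv("community_sdoh.csv")        # geo_code, median_income, ...

ehr_sdoh = patients.merge(sdoh, on="geo_code", how="left")

# Community features can feed prediction tasks and also define subgroups for
# bias audits beyond race, gender, and age, e.g. by community income quartile:
ehr_sdoh["income_quartile"] = pd.qcut(ehr_sdoh["median_income"], 4, labels=False)
```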
Contributors: Ming Ying Yang, Gloria Hyunjung Kwak, Tom Pollard, Leo Anthony Celi
Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet, little is understood about the mechanisms that cause subpopulation shifts and how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. First, existing algorithms only improve subgroup robustness over certain types of shifts but not others. Moreover, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that solely aim to improve worst-group accuracy (WGA), we demonstrate the fundamental tradeoff between WGA and other important metrics, highlighting the need to carefully choose testing metrics. Code and data are available at: https://github.com/YyzHarry/SubpopBench.
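The two metrics discussed above are straightforward to compute from predictions. A small sketch (hypothetical NumPy arrays; not the benchmark's code): worst-group accuracy needs group annotations, while the worst-class selection criterion does not.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """WGA: minimum accuracy over annotated subgroups."""
    return min(np.mean(y_pred[groups == g] == y_true[groups == g])
               for g in np.unique(groups))

def worst_class_accuracy(y_true, y_pred):
    """Selection criterion requiring no group labels: min per-class accuracy."""
    return min(np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true))
```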
Contributors: Yuzhe Yang, Haoran Zhang, Dina Katabi