November 03,2023

Effective Human-AI Teams via Learned Natural Language Rules and Onboarding

People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work, we propose to learn rules grounded in data regions and described in natural language that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data as neighborhoods in an embedding space that corrects the human prior. Each region is then described using an iterative and contrastive procedure where a large language model describes the region. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.

Contributors: Hussein Mozannar, Jimin J Lee, Dennis Wei, Prasanna Sattigeri, Subhro Das Learn more

November 02,2023

A Deep Dive into Single-Cell RNA Sequencing Foundation Models

Large-scale foundation models, which are pre-trained on massive, unlabeled datasets and subsequently fine-tuned on specific tasks, have recently achieved unparalleled success on a wide array of applications, including in healthcare and biology. In this paper, we explore two foundation models recently developed for single-cell RNA sequencing data, scBERT and scGPT. Focusing on the fine-tuning task of cell type annotation, we explore the relative performance of pre-trained models compared to a simple baseline, L1-regularized logistic regression, including in the few-shot setting. We perform ablation studies to understand whether pretraining improves model performance and to better understand the difficulty of the pre-training task in scBERT. Finally, using scBERT as an example, we demonstrate the potential sensitivity of fine-tuning to hyperparameter settings and parameter initializations. Taken together, our results highlight the importance of rigorously testing foundation models against well established baselines, establishing challenging fine-tuning tasks on which to benchmark foundation models, and performing deep introspection into the embeddings learned by the model in order to more effectively harness these models to transform single-cell data analysis. Code is available at https://github.com/clinicalml/sc-foundation-eval.

Contributors: Rebecca Boiarsky, Nalini Singh, Alejandro Buendia, Gad Getz Learn more

November 01,2023

Taking Off with AI: Lessons from Aviation for Healthcare

Artificial intelligence (AI) stands to improve healthcare through innovative new systems ranging from diagnosis aids to patient tools. However, such “Health AI” systems are complicated and challenging to integrate into standing clinical practice. With advancing AI, regulations, practice, and policies must adapt to a wide range of new risks while experts learn to interact with complex automated systems. Even in the early stages of Health AI, risks and gaps are being identified, like severe underperformance of models for minority groups and catastrophic model failures when input data shift over time. In the face of such gaps, we find inspiration in aviation, a field that went from highly dangerous to largely safe. We draw three main lessons from aviation safety that can apply to Health AI: 1) Build regulatory feedback loops to learn from mistakes and improve practices, 2) Establish a culture of safety and openness where stakeholders have incentives to report failures and communicate across the healthcare system, and 3) Extensively train, retrain, and accredit experts for interacting with Health AI, especially to help address automation bias and foster trust. Finally, we discuss remaining limitations in Health AI with less guidance from aviation.

ContributorsElizabeth Bondi-Kelly, Thomas Hartvigsen, Lindsay M Sanneman, Swami Sankaranarayanan, Zach Harned, Grace Wickerson, Judy Wawira Gichoya, Lauren Oakden-Rayner, Leo Anthony Celi, Matthew P Lungren, Julie A Shah, Marzyeh Ghassemi Learn more

October 30,2023

Role of sex in lung cancer risk prediction based on single low-dose chest computed tomography

A validated open-source deep-learning algorithm called Sybil can accurately predict long-term lung cancer risk from a single low-dose chest computed tomography (LDCT). However, Sybil was trained on a majority-male cohort. Use of artificial intelligence algorithms trained on imbalanced cohorts may lead to inequitable outcomes in real-world settings. We aimed to study whether Sybil predicts lung cancer risk equally regardless of sex. We analyzed 10,573 LDCTs from 6127 consecutive lung cancer screening participants across a health system between 2015 and 2021. Sybil achieved AUCs of 0.89 (95% CI: 0.85–0.93) for females and 0.89 (95% CI: 0.85–0.94) for males at 1 year, p = 0.92. At 6 years, the AUC was 0.87 (95% CI: 0.83–0.93) for females and 0.79 (95% CI: 0.72–0.86) for males, p = 0.01. In conclusion, Sybil can accurately predict future lung cancer risk in females and males in a real-world setting and performs better in females than in males for predicting 6-year lung cancer risk.

Contributors: Judit Simon, Ismail Tahir, Alexander Graur, Stefan Ringer, Amanda Fata, Yang Chi-Fu Jeffrey, Jo-Anne Shepard, Francine Jacobson, Lecia V. Sequist, Lydia E. Pace Learn more

October 30,2023

Role of sex in lung cancer risk prediction based on single low-dose chest computed tomography

A validated open-source deep-learning algorithm called Sybil can accurately predict long-term lung cancer risk from a single low-dose chest computed tomography (LDCT). However, Sybil was trained on a majority-male cohort. Use of artificial intelligence algorithms trained on imbalanced cohorts may lead to inequitable outcomes in real-world settings. We aimed to study whether Sybil predicts lung cancer risk equally regardless of sex. We analyzed 10,573 LDCTs from 6127 consecutive lung cancer screening participants across a health system between 2015 and 2021. Sybil achieved AUCs of 0.89 (95% CI: 0.85–0.93) for females and 0.89 (95% CI: 0.85–0.94) for males at 1 year, p = 0.92. At 6 years, the AUC was 0.87 (95% CI: 0.83–0.93) for females and 0.79 (95% CI: 0.72–0.86) for males, p = 0.01. In conclusion, Sybil can accurately predict future lung cancer risk in females and males in a real-world setting and performs better in females than in males for predicting 6-year lung cancer risk.

Co-authors: Judit Simon, Peter Mikhael, Ismail Tahir, Alexander Graur, Stefan Ringer, Amanda Fata, Yang Chi-Fu Jeffrey, Jo-Anne Shepard, Francine Jacobson, Regina Barzilay, Lecia V. Sequist, Lydia E. Pace, and Florian J. Fintelmann Learn more

October 26,2023

Making the End-User a Priority in Benchmarking: OrionBench for Unsupervised Time Series Anomaly Detection

Time series anomaly detection is a prevalent problem in many application domains such as patient monitoring in healthcare, forecasting in finance, or predictive maintenance in energy. This has led to the emergence of a plethora of anomaly detection methods, including more recently, deep learning based methods. Although several benchmarks have been proposed to compare newly developed models, they usually rely on one-time execution over a limited set of datasets and the comparison is restricted to a few models. We propose OrionBench -- a user centric continuously maintained benchmark for unsupervised time series anomaly detection. The framework provides universal abstractions to represent models, extensibility to add new pipelines and datasets, hyperparameter standardization, pipeline verification, and frequent releases with published benchmarks. We demonstrate the usage of OrionBench, and the progression of pipelines across 15 releases published over the course of three years. Moreover, we walk through two real scenarios we experienced with OrionBench that highlight the importance of continuous benchmarks in unsupervised time series anomaly detection.

Contributors: Sarah Alnegheimish, Laure Berti-Equille Learn more

September 08,2023

An interpretable AI model for recurrence prediction after surgery in gastrointestinal stromal tumour: an observational cohort study

Background There are several models that predict the risk of recurrence following resection of localised, primary gastrointestinal stromal tumour (GIST). However, assessment of calibration is not always feasible and when performed, calibration of current GIST models appears to be suboptimal. We aimed to develop a prognostic model to predict the recurrence of GIST after surgery with both good discrimination and calibration by uncovering and harnessing the non-linear relationships among variables that predict recurrence.

Methods In this observational cohort study, the data of 395 adult patients who underwent complete resection (R0 or R1) of a localised, primary GIST in the pre-imatinib era at Memorial Sloan Kettering Cancer Center (NY, USA) (recruited 1982–2001) and a European consortium (Spanish Group for Research in Sarcomas, 80 sites) (recruited 1987–2011) were used to train an interpretable Artificial Intelligence (AI)-based model called Optimal Classification Trees (OCT). The OCT predicted the probability of recurrence after surgery by capturing non-linear relationships among predictors of recurrence. The data of an additional 596 patients from another European consortium (Polish Clinical GIST Registry, 7 sites) (recruited 1981–2013) who were also treated in the pre-imatinib era were used to externally validate the OCT predictions with regard to discrimination (Harrell's C-index and Brier score) and calibration (calibration curve, Brier score, and Hosmer-Lemeshow test). The calibration of the Memorial Sloan Kettering (MSK) GIST nomogram was used as a comparative gold standard. We also evaluated the clinical utility of the OCT and the MSK nomogram by performing a Decision Curve Analysis (DCA).

Findings The internal cohort included 395 patients (median [IQR] age, 63 [54–71] years; 214 men [54.2%]) and the external cohort included 556 patients (median [IQR] age, 60 [52–68] years; 308 men [55.4%]). The Harrell's C-index of the OCT in the external validation cohort was greater than that of the MSK nomogram (0.805 (95% CI: 0.803–0.808) vs 0.788 (95% CI: 0.786–0.791), respectively). In the external validation cohort, the slope and intercept of the calibration curve of the main OCT were 1.041 and 0.038, respectively. In comparison, the slope and intercept of the calibration curve for the MSK nomogram was 0.681 and 0.032, respectively. The MSK nomogram overestimated the recurrence risk throughout the entire calibration curve. Of note, the Brier score was lower for the OCT compared to the MSK nomogram (0.147 vs 0.564, respectively), and the Hosmer-Lemeshow test was insignificant (P = 0.087) for the OCT model but significant (P 50% risk of recurrence. Interpretation

We present the first prognostic models of recurrence risk in GIST that demonstrate excellent discrimination, calibration, and clinical utility on external validation. Additional studies for further validation are warranted. With further validation, these tools could potentially improve patient counseling and selection for adjuvant therapy.

Funding The NCI SPORE in Soft Tissue Sarcoma and NCI Cancer Center Support Grants.

Contributors: Georgios Antonios Margonis, Seehanah Tang, Angelos Koulouras, Cristina R. Antonescu, Murray F. Brennan, Javier Martin-Broto, Piotr Rutkowski, Georgios Stasinos, Jane Wang, Emmanouil Pikoulis, Elzbieta Bylina, Pawel Sobczuk, Antonio Gutierrez, Bhumika Jadeja, William D. Tap, Ping Chi, Samuel Singer Learn more

August 31,2023

Considering Biased Data as Informative Artifacts in AI-Assisted Health Care

Artificial intelligence (AI) tools used in medicine, like AI used in other fields, work by detecting patterns in large volumes of data. AI tools are able to detect these patterns because they can “learn,” or be trained to recognize, certain features in the data. However, medical AI tools trained with data that are skewed in some way can exhibit bias, and when that bias matches patterns of injustice, the use of the tools can lead to inequity and discrimination. Technical solutions such as attempting to fix biased clinical data used for AI training are well intentioned, but what undergirds all these initiatives is the notion that skewed clinical data are “garbage,” as in the computer science adage “garbage in, garbage out.” Instead, we propose thinking of clinical data as artifacts that, when examined, can be informative of societies and institutions in which they are found.

Contributors: Kadija Ferryman, Maxine Mackintosh Learn more

August 19,2023

Conceptualizing machine learning for dynamic information retrieval of electronic health record notes

The large amount of time clinicians spend sifting through patient notes and documenting in electronic health records(EHRs) is a leading cause of clinician burnout. By proactively and dynamically retrieving relevant notes during the documentation process, we can reduce the effort required to find relevant patient history. In this work, we conceptualize the use of EHR audit logs for machine learning as a source of supervision of note relevance in a specific clinical context, at a particular point in time. Our evaluation focuses on the dynamic retrieval in the emergency department, a high acuity setting with unique patterns of information retrieval and note writing. We show that our methods can achieve an AUC of 0.963 for predicting which notes will be read in an individual note writing session. We additionally conduct a user study with several clinicians and find that our framework can help clinicians retrieve relevant information more efficiently. Demonstrating that our framework and methods can perform well in this demanding setting is a promising proof of concept that they will translate to other clinical settings and data modalities(e.g., labs, medications, imaging). Learn more

June 22,2023

Conformal Language Modeling

In this paper, we propose a novel approach to conform al prediction for generative language models (LMs). Standard conform al prediction produces prediction sets—in place of single predictions—that have rigorous, statistical performance guarantees. LM responses are typically sampled from the model’s predicted distribution over the large, combinatorial output space of natural language. Translating this process to conform al prediction, we calibrate a stopping rule for sampling different outputs from the LM that get added to a growing set of candidates until we are confident that the output set is sufficient. Since some samples maybe low quality, we also simultaneously calibrate and apply a rejection rule for removing candidates from the output set to reduce noise. Similar to conformal prediction, we prove that the sampled set returned by our procedure contains at least one acceptable answer with high probability, while still being empirically precise (i.e.,small) on average. Furthermore, within this set of candidate responses, we show that we can also accurately identify subsets individual components—such as phrases or sentences—that are each independently correct(e.g.,thatarenot“hallucinations”), again with statistical guarantees. We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants.

Contributors: Victor Quach, Adam Fisch, Tal Schuster, Jae Ho Sohn Learn more