People are relying on AI agents to assist them with various tasks. The human must know when to rely on the agent, when to collaborate with it, and when to ignore its suggestions. In this work, we propose to learn rules, grounded in data regions and described in natural language, that illustrate how the human should collaborate with the AI. Our novel region discovery algorithm finds local regions in the data, as neighborhoods in an embedding space, that correct the human's prior. Each region is then described in natural language by a large language model using an iterative and contrastive procedure. We then teach these rules to the human via an onboarding stage. Through user studies on object detection and question-answering tasks, we show that our method can lead to more accurate human-AI teams. We also evaluate our region discovery and description algorithms separately.
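As a rough illustration of the region-discovery idea, the sketch below flags embedding-space neighborhoods where the AI's local accuracy diverges from the human's. The k-nearest-neighbor criterion, the gap threshold, and all data are illustrative assumptions, not the paper's actual algorithm or objective.

```python
import numpy as np

def find_candidate_regions(embeddings, human_correct, ai_correct, k=10, gap=0.2):
    """Toy criterion: flag each point whose k-NN neighborhood (including
    itself) shows a local AI-vs-human accuracy gap larger than `gap`."""
    regions = []
    for i in range(len(embeddings)):
        # k nearest neighbors by Euclidean distance in embedding space
        d = np.linalg.norm(embeddings - embeddings[i], axis=1)
        nbrs = np.argsort(d)[:k]
        diff = ai_correct[nbrs].mean() - human_correct[nbrs].mean()
        if abs(diff) > gap:
            regions.append((i, diff))
    return regions

# Synthetic example: AI is accurate in one cluster, the human in the other
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
ai = np.array([1] * 20 + [0] * 20)   # AI correct only in cluster A
hum = np.array([0] * 20 + [1] * 20)  # human correct only in cluster B
regions = find_candidate_regions(emb, hum, ai)
```

Each flagged region would then be handed to a language model for description; a positive gap suggests "rely on the AI here," a negative one "rely on yourself."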
Contributors: Hussein Mozannar, Jimin J Lee, Dennis Wei, Prasanna Sattigeri, Subhro Das
Large-scale foundation models, which are pre-trained on massive, unlabeled datasets and subsequently fine-tuned on specific tasks, have recently achieved unparalleled success on a wide array of applications, including in healthcare and biology. In this paper, we explore two foundation models recently developed for single-cell RNA sequencing data, scBERT and scGPT. Focusing on the fine-tuning task of cell type annotation, we explore the relative performance of pre-trained models compared to a simple baseline, L1-regularized logistic regression, including in the few-shot setting. We perform ablation studies to understand whether pretraining improves model performance and to better understand the difficulty of the pre-training task in scBERT. Finally, using scBERT as an example, we demonstrate the potential sensitivity of fine-tuning to hyperparameter settings and parameter initializations. Taken together, our results highlight the importance of rigorously testing foundation models against well-established baselines, establishing challenging fine-tuning tasks on which to benchmark foundation models, and performing deep introspection into the embeddings learned by the model in order to more effectively harness these models to transform single-cell data analysis. Code is available at https://github.com/clinicalml/sc-foundation-eval.
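For readers unfamiliar with the baseline, the snippet below fits an L1-regularized logistic regression on a synthetic stand-in for a normalized cell-by-gene expression matrix. The data, hyperparameters, and "marker gene" setup are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for cell type annotation: rows are "cells", columns are
# "gene expression" features, labels are cell types.
rng = np.random.default_rng(0)
n_cells, n_genes = 300, 50
X = rng.normal(size=(n_cells, n_genes))
# Make two hypothetical "marker genes" determine the cell type
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The L1 penalty drives most gene coefficients to exactly zero,
# yielding a sparse, interpretable marker-gene signature.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X[:200], y[:200])
acc = clf.score(X[200:], y[200:])       # held-out accuracy
n_nonzero = np.count_nonzero(clf.coef_)  # genes retained by the model
```

The appeal of this baseline is exactly what the sparsity shows: the selected nonzero coefficients read directly as a gene signature, with no embedding introspection required.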
There are several models that predict the risk of recurrence following resection of localised, primary gastrointestinal stromal tumour (GIST). However, assessment of calibration is not always feasible and when performed, calibration of current GIST models appears to be suboptimal. We aimed to develop a prognostic model to predict the recurrence of GIST after surgery with both good discrimination and calibration by uncovering and harnessing the non-linear relationships among variables that predict recurrence.
In this observational cohort study, the data of 395 adult patients who underwent complete resection (R0 or R1) of a localised, primary GIST in the pre-imatinib era at Memorial Sloan Kettering Cancer Center (NY, USA) (recruited 1982–2001) and a European consortium (Spanish Group for Research in Sarcomas, 80 sites) (recruited 1987–2011) were used to train an interpretable Artificial Intelligence (AI)-based model called Optimal Classification Trees (OCT). The OCT predicted the probability of recurrence after surgery by capturing non-linear relationships among predictors of recurrence. The data of an additional 596 patients from another European consortium (Polish Clinical GIST Registry, 7 sites) (recruited 1981–2013) who were also treated in the pre-imatinib era were used to externally validate the OCT predictions with regard to discrimination (Harrell's C-index and Brier score) and calibration (calibration curve, Brier score, and Hosmer-Lemeshow test). The calibration of the Memorial Sloan Kettering (MSK) GIST nomogram was used as a comparative gold standard. We also evaluated the clinical utility of the OCT and the MSK nomogram by performing a Decision Curve Analysis (DCA).
The internal cohort included 395 patients (median [IQR] age, 63 [54–71] years; 214 men [54.2%]) and the external cohort included 556 patients (median [IQR] age, 60 [52–68] years; 308 men [55.4%]). The Harrell's C-index of the OCT in the external validation cohort was greater than that of the MSK nomogram (0.805 (95% CI: 0.803–0.808) vs 0.788 (95% CI: 0.786–0.791), respectively). In the external validation cohort, the slope and intercept of the calibration curve of the main OCT were 1.041 and 0.038, respectively. In comparison, the slope and intercept of the calibration curve for the MSK nomogram were 0.681 and 0.032, respectively. The MSK nomogram overestimated the recurrence risk throughout the entire calibration curve. Of note, the Brier score was lower for the OCT compared to the MSK nomogram (0.147 vs 0.564, respectively), and the Hosmer-Lemeshow test was insignificant (P = 0.087) for the OCT model but significant for the MSK nomogram. The DCA showed that the OCT provided net clinical benefit for patients with a >50% risk of recurrence.
We present the first prognostic models of recurrence risk in GIST to demonstrate excellent discrimination, calibration, and clinical utility on external validation. Additional studies for further validation are warranted; with such validation, these tools could improve patient counseling and selection for adjuvant therapy.
The NCI SPORE in Soft Tissue Sarcoma and NCI Cancer Center Support Grants.
Georgios Antonios Margonis, Seehanah Tang, Angelos Koulouras, Cristina R. Antonescu, Murray F. Brennan, Javier Martin-Broto, Piotr Rutkowski, Georgios Stasinos, Jane Wang, Emmanouil Pikoulis, Elzbieta Bylina, Pawel Sobczuk, Antonio Gutierrez, Bhumika Jadeja, William D. Tap, Ping Chi, Samuel Singer
Artificial intelligence (AI) tools used in medicine, like AI used in other fields, work by detecting patterns in large volumes of data. AI tools are able to detect these patterns because they can “learn,” or be trained to recognize, certain features in the data. However, medical AI tools trained with data that are skewed in some way can exhibit bias, and when that bias matches patterns of injustice, the use of the tools can lead to inequity and discrimination. Technical solutions such as attempting to fix biased clinical data used for AI training are well intentioned, but what undergirds all these initiatives is the notion that skewed clinical data are “garbage,” as in the computer science adage “garbage in, garbage out.” Instead, we propose thinking of clinical data as artifacts that, when examined, can be informative of societies and institutions in which they are found.
Contributors: Kadija Ferryman, Maxine Mackintosh
The large amount of time clinicians spend sifting through patient notes and documenting in electronic health records (EHRs) is a leading cause of clinician burnout. By proactively and dynamically retrieving relevant notes during the documentation process, we can reduce the effort required to find relevant patient history. In this work, we conceptualize the use of EHR audit logs for machine learning as a source of supervision of note relevance in a specific clinical context, at a particular point in time. Our evaluation focuses on dynamic retrieval in the emergency department, a high-acuity setting with unique patterns of information retrieval and note writing. We show that our methods can achieve an AUC of 0.963 for predicting which notes will be read in an individual note-writing session. We additionally conduct a user study with several clinicians and find that our framework can help clinicians retrieve relevant information more efficiently. Demonstrating that our framework and methods can perform well in this demanding setting is a promising proof of concept that they will translate to other clinical settings and data modalities (e.g., labs, medications, imaging).
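The core supervision idea is simple to sketch: notes a clinician opens while writing a new note become positive relevance labels for that session. The event schema, note IDs, and session window below are hypothetical, chosen only to illustrate the labeling step.

```python
from datetime import datetime

# Hypothetical audit-log events: (note_id, action, timestamp)
log = [
    ("note_A", "view", datetime(2023, 1, 1, 9, 0)),
    ("note_B", "view", datetime(2023, 1, 1, 9, 5)),
    ("note_C", "view", datetime(2023, 1, 2, 14, 0)),  # different session
]
session_start = datetime(2023, 1, 1, 8, 55)
session_end = datetime(2023, 1, 1, 9, 30)

def label_session(log, start, end):
    """Notes viewed within the note-writing session window become
    positive labels; other notes in the chart are implicit negatives."""
    positives = {nid for nid, action, t in log
                 if action == "view" and start <= t <= end}
    all_notes = {nid for nid, _, _ in log}
    return {nid: int(nid in positives) for nid in sorted(all_notes)}

labels = label_session(log, session_start, session_end)
```

A retrieval model trained against such labels needs no manual annotation, which is what makes audit logs attractive as a supervision source.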
In this paper, we propose a novel approach to conformal prediction for generative language models (LMs). Standard conformal prediction produces prediction sets—in place of single predictions—that have rigorous, statistical performance guarantees. LM responses are typically sampled from the model's predicted distribution over the large, combinatorial output space of natural language. Translating this process to conformal prediction, we calibrate a stopping rule for sampling different outputs from the LM that get added to a growing set of candidates until we are confident that the output set is sufficient. Since some samples may be low quality, we also simultaneously calibrate and apply a rejection rule for removing candidates from the output set to reduce noise. Similar to conformal prediction, we prove that the sampled set returned by our procedure contains at least one acceptable answer with high probability, while still being empirically precise (i.e., small) on average. Furthermore, within this set of candidate responses, we show that we can also accurately identify subsets of individual components—such as phrases or sentences—that are each independently correct (e.g., that are not "hallucinations"), again with statistical guarantees. We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation using different LM variants.
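The sample-reject-stop loop can be sketched as below. This is a heavy simplification: in the paper both thresholds are calibrated to yield the coverage guarantee, whereas here they are fixed by hand, the "LM" is a toy sampler, and the stopping rule is a placeholder confidence score.

```python
import random

def conformal_sample_set(sample_fn, quality_fn, k_max,
                         reject_thresh, stop_thresh):
    """Simplified sketch: sample candidates, reject low-quality ones,
    and stop once a (toy) set-confidence score passes a threshold."""
    accepted = []
    for _ in range(k_max):
        candidate = sample_fn()
        if quality_fn(candidate) < reject_thresh:
            continue                    # rejection rule: drop noisy samples
        if candidate not in accepted:
            accepted.append(candidate)  # grow the candidate set
        # toy stopping rule: confidence = best quality seen so far
        if max(quality_fn(c) for c in accepted) >= stop_thresh:
            break
    return accepted

# Toy "LM": samples answers that carry hypothetical quality scores
random.seed(0)
answers = ["paris", "lyon", "marseille", "PARIS!!"]
quality = {"paris": 0.95, "lyon": 0.4, "marseille": 0.35, "PARIS!!": 0.1}
out = conformal_sample_set(lambda: random.choice(answers),
                           quality.get, k_max=50,
                           reject_thresh=0.3, stop_thresh=0.9)
```

The guarantee in the paper is over this kind of returned set: with high probability it contains at least one acceptable answer, while the rejection rule keeps it small.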
Contributors: Victor Quach, Adam Fisch, Tal Schuster, Jae Ho Sohn
Social determinants of health (SDOH)–the conditions in which people live, grow, and age–play a crucial role in a person's health and well-being. There is a large, compelling body of evidence in population health studies showing that a wide range of SDOH is strongly correlated with health outcomes. Yet, a majority of the risk prediction models based on electronic health records (EHR) do not incorporate a comprehensive set of SDOH features as they are often noisy or simply unavailable. Our work links a publicly available EHR database, MIMIC-IV, to well-documented SDOH features. We investigate the impact of such features on common EHR prediction tasks across different patient populations. We find that community-level SDOH features do not improve model performance for a general patient population, but can improve data-limited model fairness for specific subpopulations. We also demonstrate that SDOH features are vital for conducting thorough audits of algorithmic biases beyond protected attributes. We hope the new integrated EHR-SDOH database will enable studies on the relationship between community health and individual outcomes and provide new benchmarks to study algorithmic biases beyond race, gender, and age.
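Mechanically, linking community-level SDOH features to an EHR table is a geographic join. The sketch below uses hypothetical column names and a made-up 3-digit ZIP key purely for illustration; the actual linkage keys and features in the released database may differ.

```python
import pandas as pd

# Hypothetical MIMIC-IV-style admissions keyed by a geographic code
admissions = pd.DataFrame({
    "hadm_id": [1, 2, 3],
    "zip3": ["021", "021", "100"],
    "mortality": [0, 1, 0],
})
# Hypothetical community-level SDOH features keyed by the same code
sdoh = pd.DataFrame({
    "zip3": ["021", "100"],
    "median_income": [78000, 65000],
    "uninsured_rate": [0.03, 0.08],
})

# Left join keeps every admission; unmatched areas would get NaN
# features, which downstream models must handle explicitly.
linked = admissions.merge(sdoh, on="zip3", how="left")
```

Once linked, the SDOH columns can be added to any EHR prediction task's feature set, or used as audit variables when checking model performance across communities.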
Contributors: Ming Ying Yang, Gloria Hyunjung Kwak, Tom Pollard, Leo Anthony Celi
Machine learning models often perform poorly on subgroups that are underrepresented in the training data. Yet, little is understood about the variation in mechanisms that cause subpopulation shifts, and how algorithms generalize across such diverse shifts at scale. In this work, we provide a fine-grained analysis of subpopulation shift. We first propose a unified framework that dissects and explains common shifts in subgroups. We then establish a comprehensive benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains. With results obtained from training over 10,000 models, we reveal intriguing observations for future progress in this space. First, existing algorithms only improve subgroup robustness over certain types of shifts but not others. Moreover, while current algorithms rely on group-annotated validation data for model selection, we find that a simple selection criterion based on worst-class accuracy is surprisingly effective even without any group information. Finally, unlike existing works that solely aim to improve worst-group accuracy (WGA), we demonstrate the fundamental tradeoff between WGA and other important metrics, highlighting the need to carefully choose testing metrics. Code and data are available at: https://github.com/YyzHarry/SubpopBench.
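Worst-class accuracy, the group-free selection criterion highlighted above, is just the minimum per-class accuracy on a validation set. A minimal sketch, with made-up predictions:

```python
import numpy as np

def worst_class_accuracy(y_true, y_pred):
    """Minimum per-class accuracy; usable for model selection on a
    validation set without any group annotations."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accs = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return min(accs)

# A model can look good on average while failing a minority class:
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 100)             # always predicts the majority class
avg_acc = float(np.mean(y_pred == y_true))   # 0.9
wca = worst_class_accuracy(y_true, y_pred)   # 0.0
```

Selecting checkpoints by `wca` rather than `avg_acc` penalizes exactly this failure mode, which is why it can serve as a proxy for worst-group accuracy when group labels are unavailable.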
Contributors: Yuzhe Yang, Haoran Zhang, Dina Katabi
Motion artifacts are a pervasive problem in MRI, leading to misdiagnosis or mischaracterization in population-level imaging studies. Current retrospective rigid intra-slice motion correction techniques jointly optimize estimates of the image and the motion parameters. In this paper, we use a deep network to reduce the joint image-motion parameter search to a search over rigid motion parameters alone. Our network produces a reconstruction as a function of two inputs: corrupted k-space data and motion parameters. We train the network using simulated, motion-corrupted k-space data generated from known motion parameters. At test-time, we estimate unknown motion parameters by minimizing a data consistency loss between the motion parameters, the network-based image reconstruction given those parameters, and the acquired measurements. Intra-slice motion correction experiments on simulated and realistic 2D fast spin echo brain MRI achieve high reconstruction fidelity while retaining the benefits of explicit data consistency-based optimization.
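The test-time idea, searching over motion parameters to minimize a data-consistency loss, can be illustrated in a toy 1D setting. Everything below is a simplification: the "motion" is a circular shift between two k-space shots, and the paper's trained network is replaced by a crude stand-in (a phase correction followed by projection onto real, nonnegative signals).

```python
import numpy as np

n = 32
x_true = np.zeros(n)
x_true[4:9] = 1.0   # asymmetric toy "image"
x_true[10] = 0.5
s_true = 5          # unknown shift ("motion") between the two shots

def acquire(x, s):
    """Two-shot acquisition: first half of k-space before motion,
    second half after a circular shift of s samples."""
    k1 = np.fft.fft(x)[: n // 2]
    k2 = np.fft.fft(np.roll(x, s))[n // 2 :]
    return k1, k2

def reconstruct(k1, k2, s):
    """Stand-in for the trained network: undo the assumed shift's phase
    ramp on the second shot, then project onto real, nonnegative
    signals (a crude image prior)."""
    ks = np.arange(n // 2, n)
    k_hat = np.concatenate([k1, k2 * np.exp(2j * np.pi * ks * s / n)])
    return np.clip(np.fft.ifft(k_hat).real, 0, None)

def dc_loss(x_hat, k1, k2, s):
    """Data-consistency loss: re-simulate the acquisition from the
    reconstruction under motion s and compare with the measurements."""
    k1_hat, k2_hat = acquire(x_hat, s)
    return np.sum(np.abs(k1_hat - k1) ** 2) + np.sum(np.abs(k2_hat - k2) ** 2)

k1, k2 = acquire(x_true, s_true)
losses = [dc_loss(reconstruct(k1, k2, s), k1, k2, s) for s in range(n)]
s_est = int(np.argmin(losses))
```

Only the correct motion parameter yields a reconstruction whose simulated acquisition matches the measurements, so the grid search recovers the shift; the paper's method does the analogous continuous optimization over rigid motion parameters with the network in the loop.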
Contributors: Nalini M. Singh, Neel Dey, Malte Hoffmann, Bruce Fischl, Elfar Adalsteinsson, Robert Frost, Adrian V. Dalca
Low-dose computed tomography (LDCT) for lung cancer screening is effective, although most eligible people are not being screened. Tools that provide personalized future cancer risk assessment could focus approaches toward those most likely to benefit. We hypothesized that a deep learning model assessing the entire volumetric LDCT data could be built to predict individual risk without requiring additional demographic or clinical data.
We developed a model called Sybil using LDCTs from the National Lung Screening Trial (NLST). Sybil requires only one LDCT and does not require clinical data or radiologist annotations; it can run in real time in the background on a radiology reading station. Sybil was validated on three independent data sets: a heldout set of 6,282 LDCTs from NLST participants, 8,821 LDCTs from Massachusetts General Hospital (MGH), and 12,280 LDCTs from Chang Gung Memorial Hospital (CGMH, which included people with a range of smoking history including nonsmokers).
Sybil achieved an area under the receiver operating characteristic curve for lung cancer prediction at 1 year of 0.92 (95% CI, 0.88 to 0.95) on NLST, 0.86 (95% CI, 0.82 to 0.90) on MGH, and 0.94 (95% CI, 0.91 to 1.00) on CGMH external validation sets. Concordance indices over 6 years were 0.75 (95% CI, 0.72 to 0.78), 0.81 (95% CI, 0.77 to 0.85), and 0.80 (95% CI, 0.75 to 0.86) for NLST, MGH, and CGMH, respectively.
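The concordance index reported above (Harrell's C) measures, over comparable patient pairs, how often the higher predicted risk belongs to the patient with the earlier observed event. A minimal sketch with made-up times, event indicators, and risks:

```python
def harrell_c_index(times, events, risks):
    """Fraction of comparable pairs where higher predicted risk matches
    the earlier event. A pair (i, j) is comparable when i has an
    observed event (not censoring) strictly before j's time."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5   # ties count as half
    return concordant / comparable

# Toy example: risk ordering matches event ordering except one pair
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]          # patient at t=6 is censored
risks = [0.9, 0.3, 0.2, 0.4]
c = harrell_c_index(times, events, risks)   # 4 of 5 comparable pairs
```

A C-index of 0.5 is chance-level ranking and 1.0 is perfect; the 6-year values of 0.75 to 0.81 above sit well clear of chance on all three validation sets.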
Sybil can accurately predict an individual's future lung cancer risk from a single LDCT scan to further enable personalized screening. Future study is required to understand Sybil's clinical applications. Our model and annotations are publicly available.
Contributors: Peter G. Mikhael, Jeremy Wohlwend, Ludvig Karstens, Justin Xiang, Angelo K. Takigami, Patrick P. Bourgouin, PuiYee Chan, Sofiane Mrah, Wael Amayri, Yu-Hsiang Juan, Cheng-Ta Yang, Yung-Liang Wan, Gigin Lin, Lecia V. Sequist, Florian J. Fintelmann