Weakly-Supervised Evaluation of Medical AI Systems

Pugh, Sydney

Files

Pugh_upenngdas_0175C_16610.pdf (2.61 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Computer and Information Science

Discipline

Data Science
Computer Sciences

Subject

artificial intelligence
data programming
evaluation
machine learning
medicine
weak supervision

Copyright date

01/01/2024

Permalink

https://repository.upenn.edu/handle/20.500.14332/60449

Author

Pugh, Sydney

Abstract

Medical artificial intelligence (AI) systems must undergo comprehensive clinical evaluation to ensure their safety and efficacy before they are deployed in clinical practice. Clinical trials are the gold standard for evaluating medical AI systems, but they require a substantial commitment of data collection resources. Consequently, there is a growing need for cheap, fast evaluation methods that can approximate the outcomes of clinical trials. Engineers can use these methods to conduct early-stage, low-cost assessments of novel medical AI systems, enabling better-informed and more efficient use of data collection resources. This dissertation presents weakly-supervised evaluation methods for medical AI systems. These methods are lightweight: they leverage existing (previously collected) unlabeled trial data and programmatic weak supervision (PWS), a data labeling paradigm that combines noisy, cheap-to-obtain labeling heuristics defined by domain experts (e.g., clinicians). We propose two distinct weakly-supervised performance evaluation methods. The first method estimates the sensitivity/specificity of a system in the form of confidence bounds by leveraging samples labeled with high confidence via PWS. We apply this method to several clinical alarm suppression systems and demonstrate that it yields confidence bounds that contain the true sensitivity/specificity. The second method estimates the robustness of a system by observing the trend in its performance on a sequence of adversarially ordered datasets. These datasets are constructed from an adversarial ordering of the input data based on a Clopper-Pearson confidence interval for PWS label confidences. We demonstrate the utility of this method by evaluating synthetic alarm suppression systems designed to have varying levels of accuracy across five clinical alarm datasets.
An inherent challenge of these methods is the need for effective labeling heuristics, which clinicians often find difficult to design for high-dimensional medical data (e.g., images and time series). To address this, we propose an automated, clinician-in-the-loop method that uses distance functions to generate weak labels for PWS. We demonstrate that using these generated weak labels in PWS yields labels that generally outperform those obtained with clinician-supplied labeling heuristics.
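
For readers unfamiliar with the Clopper-Pearson interval named in the abstract, the following is a minimal, generic sketch of how such an exact binomial confidence interval can be computed. It is a textbook construction and not the dissertation's procedure: the function names (`binom_cdf`, `solve_increasing`, `clopper_pearson`), the bisection solver, and the example counts are all illustrative assumptions, and the abstract does not specify how the interval is applied to PWS label confidences.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(f, target, iters=60):
    """Bisection on [0, 1] for a function f that is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion.

    k: number of successes, n: number of trials (hypothetical inputs).
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2,
    # i.e. P(X > k) equals 1 - alpha / 2.
    upper = 1.0 if k == n else solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2
    )
    return lower, upper


# Illustrative counts only (not from the dissertation): 45 of 50 cases
# correct gives an interval around the observed proportion 0.9.
low, high = clopper_pearson(45, 50)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

The sketch uses only the standard library so it can be run as-is. In practice, the same bounds are usually obtained from beta-distribution quantiles in a statistics package.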

Advisor

Lee, Insup
Weimer, James

Date of degree

2024

Collection

Dissertations and Theses