Weakly-Supervised Evaluation of Medical AI Systems
Discipline
Computer Sciences
Subject
data programming
evaluation
machine learning
medicine
weak supervision
Abstract
Medical artificial intelligence (AI) systems must undergo comprehensive clinical evaluation to ensure their safety and efficacy before they are deployed in clinical practice. Clinical trials are the gold standard for evaluating medical AI systems, but they require a substantial commitment of data collection resources. Consequently, there is a growing need for cheap and fast evaluation methods that can approximate the outcomes of clinical trials. Engineers can use these methods to conduct early-stage, low-cost assessments of novel medical AI systems, enabling informed and more efficient use of data collection resources. This dissertation presents weakly-supervised evaluation methods for medical AI systems. These methods are lightweight: they leverage existing (previously collected) unlabeled trial data and programmatic weak supervision (PWS), a data labeling paradigm that combines noisy, cheap-to-obtain labeling heuristics defined by domain experts (e.g., clinicians). We propose two distinct weakly-supervised performance evaluation methods. The first method estimates the sensitivity/specificity of a system in the form of confidence bounds by leveraging samples labeled with high confidence via PWS. We apply this method to several clinical alarm suppression systems and demonstrate that it yields confidence bounds that fully contain the true sensitivity/specificity. The second method estimates the robustness of a system by observing the trend in its performance on a sequence of adversarially ordered datasets. These datasets are constructed from an adversarial ordering of the input data based on a Clopper-Pearson confidence interval for PWS label confidences. We demonstrate the utility of this method by evaluating synthetic alarm suppression systems designed to have varying levels of accuracy across five clinical alarm datasets.
An inherent challenge of these methods is the need for effective labeling heuristics, which clinicians often find difficult to design for high-dimensional medical data (e.g., images and time series). To address this, we propose an automated clinician-in-the-loop method that uses distance functions to generate weak labels for PWS. We demonstrate that using these generated weak labels in PWS produces labels that generally outperform those obtained using clinician-supplied labeling heuristics.
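For readers unfamiliar with the Clopper-Pearson interval mentioned in the abstract, the following is a minimal, generic sketch of how such an exact binomial confidence interval can be computed. It is not the dissertation's implementation: the function names (`binom_cdf`, `clopper_pearson`), the bisection approach, and the example counts are all assumptions made for illustration, and the abstract does not specify how the interval is applied to PWS label confidences.

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05, tol=1e-10):
    """Exact two-sided (1 - alpha) interval for a proportion with k successes in n trials.

    Hypothetical helper: each bound is found by bisection on the binomial tail
    probability, which avoids any dependency outside the standard library.
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")

    def bisect(too_low):
        # Shrink [lo, hi] until it brackets the bound to within tol.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if too_low(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k | p) equals alpha / 2.
    lower = 0.0 if k == 0 else bisect(
        lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2
    )
    # Upper bound: the p at which P(X <= k | p) equals alpha / 2.
    upper = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p) > alpha / 2
    )
    return lower, upper


# Illustrative counts only (not from the dissertation): 45 successes in 50 trials.
low, high = clopper_pearson(45, 50)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

The interval is conservative by construction: its coverage is at least the nominal level, which is why it is often preferred when sample sizes are small.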
Advisor
Weimer, James