Date of Award
Doctor of Philosophy (PhD)
Epidemiology & Biostatistics
Electronic health record (EHR)-based phenotyping requires fully labeled cases and controls for model training and testing. Because clinical workflows are asymmetric, labeled cases are much easier to identify than labeled controls. As a result, data consisting of a set of labeled cases and a large number of unlabeled patients, referred to as “positive-only” data, are frequently available with minimal labeling effort. This dissertation focuses on statistical methods for training and validating phenotyping models using such positive-only EHR data when the labeled cases can be regarded as a representative subset of all cases. In project I, we developed an anchor-variable framework and proposed an accompanying maximum likelihood approach for training a logistic phenotyping model. In project II, we developed a chi-squared test that assesses model calibration by comparing the model-free and model-based estimates of the number of cases among the unlabeled patients. We also proposed consistent estimators of predictive performance measures and studied their large-sample properties. Together, these methods provide the methodological foundation for the routine use of positive-only data in training and validating phenotyping models. In project III, we extended the maximum likelihood method of project I to accommodate high-dimensional predictors by enabling automated feature selection through a proxy phenotype that is available for all patients. We performed extensive simulation studies to assess the performance of the proposed methods and applied them to Penn Medicine EHR data to phenotype primary aldosteronism.
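To make the idea of fitting a logistic phenotyping model by maximum likelihood on positive-only data concrete, the following is a minimal sketch and not the dissertation's implementation. It assumes one common formalization of "labeled cases are a representative subset of all cases": every case is labeled with the same unknown probability c, so that P(labeled | x) = c * expit(b0 + b1 * x). The single predictor, the simulated data, the true parameter values, and all function names are hypothetical, and the simple gradient ascent with backtracking stands in for whatever optimizer a real analysis would use.

```python
import math
import random


def expit(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_lik(theta, xs, ss):
    """Positive-only log-likelihood; theta = (b0, b1, gamma), c = expit(gamma)."""
    b0, b1, gamma = theta
    c = expit(gamma)
    total = 0.0
    for x, s in zip(xs, ss):
        p = c * expit(b0 + b1 * x)             # assumed P(labeled | x)
        p = min(max(p, 1e-12), 1.0 - 1e-12)    # guard against log(0)
        total += math.log(p) if s else math.log(1.0 - p)
    return total


def gradient(theta, xs, ss):
    """Gradient of log_lik with respect to (b0, b1, gamma)."""
    b0, b1, gamma = theta
    c = expit(gamma)
    g = [0.0, 0.0, 0.0]
    for x, s in zip(xs, ss):
        q = expit(b0 + b1 * x)                 # P(case | x)
        p = min(max(c * q, 1e-12), 1.0 - 1e-12)
        w = (1.0 / p) if s else (-1.0 / (1.0 - p))   # d loglik / d p
        dq = c * q * (1.0 - q)                 # d p / d (linear predictor)
        g[0] += w * dq
        g[1] += w * dq * x
        g[2] += w * q * c * (1.0 - c)          # d p / d gamma
    return g


def fit(xs, ss, n_iter=500):
    """Gradient ascent with backtracking; the log-likelihood never decreases."""
    n = len(xs)
    theta = [0.0, 0.0, 0.0]
    ll = log_lik(theta, xs, ss)
    for _ in range(n_iter):
        g = [gi / n for gi in gradient(theta, xs, ss)]
        step = 1.0
        while step > 1e-10:
            cand = [t + step * gi for t, gi in zip(theta, g)]
            cand_ll = log_lik(cand, xs, ss)
            if cand_ll > ll:
                theta, ll = cand, cand_ll
                break
            step /= 2.0
        else:
            break                              # no improving step was found
    return theta, ll


# Hypothetical simulated cohort: true status ys is never shown to the fitter.
random.seed(0)
n = 2000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [random.random() < expit(-1.0 + 1.5 * x) for x in xs]
ss = [y and random.random() < 0.4 for y in ys]   # only some cases get labeled

theta, ll = fit(xs, ss)
b0_hat, b1_hat, c_hat = theta[0], theta[1], expit(theta[2])
```

In this sketch, `expit(b0_hat + b1_hat * x)` is the estimated probability that a patient with predictor value `x` is a case, and `c_hat` is the estimated labeling probability among cases. The intercept and c are separated only through the assumed logistic form, so estimates from small samples can be unstable.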
Zhang, Lingjiao, "Statistical Methods For Phenotyping With Positive-Only Electronic Health Record Data" (2020). Publicly Accessible Penn Dissertations. 4592.