Statistical Methods For Phenotyping With Positive-Only Electronic Health Record Data

Zhang, Lingjiao

Files

Zhang_upenngdas_0175C_14271.pdf (1.25 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Epidemiology & Biostatistics

Subject

Anchor
Calibration
Electronic Health Records
Phenotyping
Positive only
Prediction Accuracy
Biostatistics

Copyright date

2022-09-09T20:20:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/31613

Abstract

Electronic health record (EHR)-based phenotyping requires fully labeled cases and controls for model training and testing. Because clinical workflows are asymmetric, labeled cases are much easier to identify than labeled controls. As a result, data consisting of a group of labeled cases and a large number of unlabeled patients, referred to as “positive-only” data, are frequently accessible with minimal labeling effort. This dissertation focuses on statistical methods for training and validating phenotyping models using such positive-only EHR data when the labeled cases can be regarded as a representative subset of all cases. In project I, we developed an anchor-variable framework and proposed an accompanying maximum likelihood approach for training a logistic phenotyping model. In project II, we developed a chi-squared test that assesses model calibration by comparing the model-free and model-based estimates of the number of cases among the unlabeled patients. We also proposed consistent estimators of predictive performance measures and studied their large-sample properties. These methods provide the methodological foundation for positive-only data to be used routinely for training and validating phenotyping models. In project III, we extended the maximum likelihood estimation (MLE) method of project I to accommodate high-dimensional predictors by enabling automated feature selection through a proxy phenotype that is available for all patients. We performed extensive simulation studies to assess the performance of the proposed methods and applied them to Penn Medicine EHR data to phenotype primary aldosteronism.
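
As a rough illustration of the kind of likelihood the abstract describes for project I, the sketch below writes out a standard positive-only formulation. The notation is assumed here and is not taken from the dissertation: Y is the true phenotype, S indicates a labeled case (the anchor), X is the predictor vector, and c is the constant probability that a true case is labeled. The two assumptions correspond to "every labeled patient is a true case" and "the labeled cases are a representative subset of all cases"; the dissertation's actual parameterization may differ.

```latex
% Assumed notation (hypothetical, for illustration only):
%   Y_i = true phenotype, S_i = 1 if patient i is a labeled case, X_i = predictors
% Assumption 1: labeled patients are true cases,  P(Y = 1 | S = 1, X) = 1
% Assumption 2: labeled cases are representative, P(S = 1 | Y = 1, X) = c
\begin{align*}
  p(X;\beta) &= P(Y = 1 \mid X) = \frac{\exp(X^\top \beta)}{1 + \exp(X^\top \beta)}, \\
  P(S = 1 \mid X) &= P(S = 1 \mid Y = 1, X)\, P(Y = 1 \mid X) = c\, p(X;\beta), \\
  \ell(\beta, c) &= \sum_{i=1}^{n} \Big[ S_i \log\{ c\, p(X_i;\beta) \}
      + (1 - S_i) \log\{ 1 - c\, p(X_i;\beta) \} \Big].
\end{align*}
```

Under these assumptions, maximizing the log-likelihood jointly over the regression coefficients and c uses only the observed labels S and predictors X, which is why no labeled controls are needed.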

Advisor

Jinbo Chen

Date of degree

2020-01-01

Collection

Dissertations and Theses