Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Epidemiology & Biostatistics

First Advisor

Jinbo Chen


A major challenge in using Electronic Health Record (EHR) data for clinical research is accurately identifying patients' true phenotype status. This dissertation focuses on the development of novel study designs and statistical methods to address this challenge. In the first project, we proposed an estimating equation approach to correct case contamination bias in EHR-based case-control studies. The stringency of the rules for identifying cases must be balanced against the sample size and representativeness of the cases selected, and inaccurate identification results in a candidate case pool that mixes genuine cases with non-case subjects who do not satisfy the control definition. Ignoring case contamination leads to biased estimation of odds ratio parameters. We showed that our proposed method yielded estimates that were consistent and asymptotically normally distributed, and we evaluated its efficiency and robustness through simulation studies and an application to an EHR-based study of aortic stenosis. We also explored how the balance between identification accuracy and the size of the case pool affects study power. In the second project, we proposed a novel two-stage approach to developing risk prediction models that support clinical decision-making. We first modeled physicians' clinical decisions on similar conditions as recorded in the EHRs. We then developed the risk prediction model in an EHR dataset with gold-standard phenotype status, using the clinical decision projected by the first-stage model as a predictor. We developed large-sample theory for the estimated risk and for summary measures of predictive accuracy. Results from extensive simulation studies and an analysis of a Penn Medicine EHR cohort of primary aldosteronism showed that our two-stage approach can substantially improve predictive accuracy. Our approach therefore effectively addressed a major hurdle in EHR-based risk prediction, where the number of patients with gold-standard phenotype status was most often limited. In the third project, to characterize associations discovered in phenome-wide association studies, we proposed a genotype-stratified case-control sampling strategy for selecting patients for phenotype validation. We developed a closed-form maximum-likelihood estimator for odds ratio association parameters and a score statistic for testing association, and showed that our sampling strategy led to increased test power.

Included in

Biostatistics Commons