Statistical Methods For Analyzing The Electronic Health Records Data

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology & Biostatistics
Subject
Biostatistics
Copyright date
2019-10-23T20:19:00-07:00
Abstract

A major challenge in using Electronic Health Record (EHR) data for clinical research is accurately identifying patients' true phenotype status. This dissertation develops novel study designs and statistical methods to address this challenge. In the first project, we proposed an estimating equation approach to correcting for case contamination bias in EHR-based case-control studies. The stringency of the rules for identifying cases must be balanced against the sample size and representativeness of the patients selected, and inaccurate identification results in a candidate case pool that mixes genuine cases with non-case subjects who do not satisfy the control definition. Ignoring case contamination would lead to biased estimates of odds ratio parameters. We showed that our proposed method yields estimates that are consistent and asymptotically normally distributed, and we evaluated its efficiency and robustness through simulation studies and an application to an EHR-based study of aortic stenosis. We also explored how the balance between identification accuracy and the size of the case pool affects study power. In the second project, we proposed a novel two-stage approach to developing risk prediction models that support clinical decision-making. We first modeled physicians' clinical decisions on similar conditions as recorded in the EHRs. We then developed the risk prediction model in an EHR dataset with gold-standard phenotype status, using the clinical decision projected by the first-stage model as a predictor. We developed the large-sample theory for the estimated risk and for summary measures of predictive accuracy. Results from extensive simulation studies and an analysis of a Penn Medicine EHR cohort of primary aldosteronism showed that our two-stage approach can substantially improve predictive accuracy. Our approach therefore addresses a major hurdle in EHR-based risk prediction: the number of patients with gold-standard phenotype status is most often limited. In the third project, to characterize associations discovered in phenome-wide association studies, we proposed a genotype-stratified case-control sampling strategy for selecting patients for phenotype validation. We developed a closed-form maximum-likelihood estimator for the odds ratio association parameters and a score statistic for testing association, and we showed that our sampling strategy increases the power of the test.
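
The case contamination bias described in the first project can be illustrated with a small calculation. The sketch below is not the dissertation's estimating equation method; it only shows how a naive odds ratio shrinks toward 1 as the candidate case pool becomes less pure. The function names, the exposure prevalences, and the default assumption that contaminating non-cases share the controls' exposure prevalence are all hypothetical choices made for illustration.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def naive_odds_ratio(p_case, p_control, ppv, p_contaminant=None):
    """Odds ratio obtained when the whole candidate pool is treated as cases.

    p_case        exposure prevalence among genuine cases (assumed value)
    p_control     exposure prevalence among controls (assumed value)
    ppv           fraction of the candidate case pool that are genuine cases
    p_contaminant exposure prevalence among contaminating non-cases;
                  defaults to p_control, a simplifying assumption
    """
    if p_contaminant is None:
        p_contaminant = p_control
    # The pool is a mixture of genuine cases and contaminating non-cases.
    p_pool = ppv * p_case + (1.0 - ppv) * p_contaminant
    return odds(p_pool) / odds(p_control)


if __name__ == "__main__":
    p_case, p_control = 0.5, 0.2  # illustrative values only
    print("true OR:", odds(p_case) / odds(p_control))
    for ppv in (1.0, 0.9, 0.7, 0.5):
        print("PPV", ppv, "naive OR:", round(naive_odds_ratio(p_case, p_control, ppv), 3))
```

Under these assumptions a perfectly pure pool recovers the true odds ratio, and any contamination pulls the naive estimate toward the null, which is the bias a correction method has to undo.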

Advisor
Jinbo Chen
Date of degree
2019-01-01