Statistical Methods for Developing and Evaluating Risk Prediction Models Using Integrated Electronic Health Record Data

Degree type
Doctor of Philosophy (PhD)
Graduate group
Epidemiology and Biostatistics
Discipline
Biology
Subject
Area Under the ROC Curve (AUC)
Electronic Health Records
Integrated EHR Data
Risk Prediction
Two-Phase Design
Two-Stage Modeling
Copyright date
2022
Author
Hasler, Jill, Schnall
Abstract

When building risk prediction models using electronic health records (EHRs), additional data that improves predictive accuracy can often be obtained from external sources. The external data is commonly available only for a small subset of patients, and the integrated EHR and external data follows a monotone missingness pattern. This dissertation focuses on building and evaluating risk prediction models using data with this structure, leveraging ideas from the two-phase design framework. In the first project, considering low-dimensional EHR predictors, we propose efficient and robust methods for building and evaluating risk prediction models for binary outcomes. Our method uses the EHR data efficiently by modeling the availability of patients’ external data as a function of an EHR-based preliminary predictive score, leading to increased efficiency in estimating odds ratio association parameters and the area under the ROC curve (AUC). In the second project, we propose a two-stage model to accommodate high-dimensional EHR predictors, which re-calibrates an adaptive lasso model built using the EHR predictors only and then adds the external variables as predictors. We propose a novel semiparametric method for fitting the two-stage model and estimating the AUC while flexibly accounting for differential availability of the external data. Our framework makes it feasible to conduct rigorous statistical inference with high-dimensional two-phase data by summarizing the EHR predictors into an adaptive lasso-predicted score. This score can alternatively be generated using any sensible machine learning algorithm, suggesting potential extensions of our two-stage modeling approach. In the third project, we consider such an extension using the random forest algorithm. We focus on estimating the value of external predictors for increasing the AUC and highlight practical considerations for variance estimation, given the challenge of quantifying the uncertainty contributed by fitting the random forest algorithm. We apply the proposed methods to predict mortality risk in oncology patients using EHR data from the University of Pennsylvania Health System (UPHS) and external patient survey data.

Advisor
Chen, Jinbo
Date of degree
2022