Statistical Methods for Developing and Evaluating Risk Prediction Models Using Integrated Electronic Health Record Data

Hasler, Jill, Schnall

Files

Hasler_upenngdas_0175C_15497.pdf (969.93 KB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Epidemiology and Biostatistics

Discipline

Biology

Subject

Area Under the ROC Curve (AUC)
Electronic Health Records
Integrated EHR Data
Risk Prediction
Two-Phase Design
Two-Stage Modeling

Copyright date

2022

Permalink

https://repository.upenn.edu/handle/20.500.14332/59653

Author

Hasler, Jill, Schnall

Abstract

When building risk prediction models using electronic health records (EHRs), additional data that improve predictive accuracy can often be obtained from external sources. The external data are commonly available only for a small subset of patients, so the integrated EHR and external data follow a monotone missingness pattern. This dissertation focuses on building and evaluating risk prediction models using data with this structure, leveraging ideas from the two-phase design framework. In the first project, considering low-dimensional EHR predictors, we propose efficient and robust methods for building and evaluating risk prediction models for binary outcomes. Our method efficiently uses the EHR data by modeling the availability of patients’ external data as a function of an EHR-based preliminary predictive score, leading to increased efficiency for estimating odds ratio association parameters and the area under the ROC curve (AUC). In the second project, we propose a two-stage model to accommodate high-dimensional EHR predictors, which re-calibrates an adaptive lasso model built using the EHR predictors only and further adds the external variables as predictors. We propose a novel semiparametric method for fitting the two-stage model and estimating the AUC while flexibly accounting for differential availability of the external data. Our framework makes it feasible to conduct rigorous statistical inference with high-dimensional two-phase data by summarizing the EHR predictors into an adaptive lasso-predicted score. This score can alternatively be generated using any sensible machine learning algorithm, suggesting potential extensions of our two-stage modeling approach. In the third project, we consider such an extension with the random forest algorithm. We focus on estimating the value of external predictors for increasing AUC and highlight practical considerations for variance estimation given the challenge of quantifying the uncertainty contributed by fitting the random forest algorithm. We apply the proposed methods to predict mortality risk in oncology patients using EHR data from the University of Pennsylvania Health System (UPHS) and external patient survey data.
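
To illustrate the idea of estimating the AUC when external data are available only for a subset of patients, here is a minimal sketch of a standard inverse-probability-weighted (IPW) AUC. It is not the semiparametric estimator proposed in the dissertation, only a simpler textbook technique. The function name `weighted_auc` and the toy inputs are hypothetical, and each weight is assumed to be the inverse of a patient's known or estimated probability of having external data.

```python
def weighted_auc(scores, outcomes, weights):
    """Weighted AUC: the weighted share of (case, control) pairs in which
    the case has the higher risk score; ties count as half.

    Assumption for illustration: weights[i] = 1 / P(patient i has external
    data), computed only for patients whose external data were observed.
    """
    cases = [(s, w) for s, y, w in zip(scores, outcomes, weights) if y == 1]
    controls = [(s, w) for s, y, w in zip(scores, outcomes, weights) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    numerator = 0.0
    denominator = 0.0
    for case_score, case_weight in cases:
        for control_score, control_weight in controls:
            pair_weight = case_weight * control_weight
            denominator += pair_weight
            if case_score > control_score:
                numerator += pair_weight        # correctly ranked pair
            elif case_score == control_score:
                numerator += 0.5 * pair_weight  # tie counts as half
    return numerator / denominator


# Toy data (hypothetical): four patients with observed external data.
scores = [0.9, 0.8, 0.3, 0.2]
outcomes = [1, 0, 1, 0]

# Equal weights reproduce the ordinary empirical AUC.
unweighted = weighted_auc(scores, outcomes, [1, 1, 1, 1])

# Up-weighting the low-scoring case, as if such patients were under-sampled.
reweighted = weighted_auc(scores, outcomes, [1, 1, 3, 1])
```

With equal weights the function reduces to the usual empirical AUC. Unequal weights shift the estimate toward the pairs that represent patients who were less likely to have external data.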

Advisor

Chen, Jinbo

Date of degree

2022

Collection

Dissertations and Theses