Date of Award: 2017
Doctor of Philosophy (PhD)
Epidemiology & Biostatistics
We developed statistical methods for evaluating the added value of biomarkers for predicting binary outcomes when biomarker data are available for only some study subjects. In the first project, we considered a cost-effective design, the two-phase study, in which data on the outcome and established risk predictors were collected for all study subjects in Phase I, while biomarkers were measured only for a judiciously selected subset in Phase II. Using a logistic regression model to describe the relationship between the binary outcome and the risk predictors, we developed three approaches to estimating the risk distribution and summary measures of predictive accuracy. We showed that all three estimators were consistent and asymptotically normally distributed, and compared their efficiency and robustness through extensive simulation studies and an application to an ongoing biomarker study of gestational diabetes. We also developed a novel sampling strategy for selecting Phase II subjects that improved efficiency in estimating measures of predictive accuracy. In the second project, we developed a statistical method to address the lack of independent data for validating biomarkers for prediction, focusing on model calibration. When a well-calibrated model with only standard predictors exists, we proposed calibrating the new model to the existing model during model development. For data collected under a case-control study design, we developed a novel constrained maximum likelihood approach to fitting logistic regression models that implemented this idea. We developed large-sample theory for this method and performed extensive simulation studies to assess the impact of the constraints on the odds ratio parameter estimates. We applied our method to a case-control study of breast cancer nested within the Breast Cancer Detection Demonstration Project to evaluate the added value of mammographic density for predicting the 5-year risk of breast cancer. In the third project, we extended the method developed in the second project to accommodate the cross-sectional study design. Through simulation studies and an analysis of the gestational diabetes data, we demonstrated that our method yielded a well-calibrated model.
Chai, Xinglei, "Semiparametric Approaches To Developing Models For Predicting Binary Outcomes Through Data And Information Integration" (2017). Publicly Accessible Penn Dissertations. 2208.