Statistical Methods For Outcome-Dependent Sampling Designs
Genetic Association Study
Odds Ratio Estimation
My dissertation work focuses on the development of novel outcome-dependent sampling designs and statistical methods of analysis. In a biomedical cohort study assessing the association between a binary outcome variable and a set of covariates, it is common that some covariates can only be measured on a subgroup of study subjects. An important design question is which subjects to select into the subgroup to increase statistical efficiency. Existing designs can achieve improved efficiency for estimating odds ratio parameters for the completely observed covariates. Our goal is to improve efficiency for the incomplete covariates, which is of great importance in studies where the covariates of interest cannot be fully collected. In the first two projects, we proposed a novel sampling design for a common scenario in which an external model is available relating the outcome to the complete covariates. Our design oversampled cases and controls whose observed outcomes were assigned low probability by the external model, while matching cases and controls on the complete covariates. We developed a pseudo-likelihood method for estimating odds ratio parameters. Through simulation studies and a real cohort study, we showed that our design led to reduced asymptotic variances of the odds ratio parameter estimates for both incomplete and complete covariates. In the third project, we developed a family-supplemented inverse-probability-weighted empirical likelihood approach to correcting for a type of outcome-dependent selection bias in case-control genetic association studies, where genotype data were incomplete for reasons related to the genotype itself. Genetic association analysis would be biased if such non-ignorable missingness were naively ignored. Our method exploited genetic data from family members to help infer missing genotype data.
It jointly estimated odds ratio parameters for genetic association and missingness, where a logistic regression model was used to relate missingness to genotype and other covariates. In the estimating equation for genetic association parameters, we weighted the empirical likelihood score function, based on subjects who had genotype data, by the inverse probabilities that their genotype data were available. We studied the large-sample and finite-sample performance of our method and applied it to a case-control study of breast cancer.
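
As a rough illustration of the inverse-probability weighting idea only, and not of the family-supplemented empirical likelihood method itself, the following Python sketch solves a weighted logistic score equation in which each subject with observed genotype is weighted by the inverse of the probability that the genotype was available. Everything in it is an assumption made for illustration: the simulated data, the function names `ipw_score` and `fit_ipw_logistic`, the parameter values, and above all the use of known availability probabilities, which in the dissertation are instead estimated jointly with the help of family data.

```python
import math
import random


def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def ipw_score(beta, g, y, observed, pi):
    """Weighted logistic score: sum over observed i of (1/pi_i) * (1, g_i) * (y_i - p_i)."""
    s0 = s1 = 0.0
    for gi, yi, ri, pii in zip(g, y, observed, pi):
        if not ri:
            continue  # genotype missing: subject contributes nothing
        resid = (yi - expit(beta[0] + beta[1] * gi)) / pii
        s0 += resid
        s1 += resid * gi
    return s0, s1


def fit_ipw_logistic(g, y, observed, pi, n_iter=50, tol=1e-10):
    """Solve the weighted score equation by Newton-Raphson (intercept + one covariate)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        s0, s1 = ipw_score((b0, b1), g, y, observed, pi)
        # Weighted information matrix [[h00, h01], [h01, h11]]
        h00 = h01 = h11 = 0.0
        for gi, ri, pii in zip(g, observed, pi):
            if not ri:
                continue
            p = expit(b0 + b1 * gi)
            v = p * (1.0 - p) / pii
            h00 += v
            h01 += v * gi
            h11 += v * gi * gi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * s0 - h01 * s1) / det
        d1 = (h00 * s1 - h01 * s0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical simulated data (all parameter values are illustrative assumptions).
random.seed(0)
n = 2000
g, y, observed, pi = [], [], [], []
for _ in range(n):
    gi = (random.random() < 0.3) + (random.random() < 0.3)  # genotype coded 0, 1, 2
    yi = 1 if random.random() < expit(-1.0 + 0.5 * gi) else 0
    # Availability depends on the genotype itself and on the outcome.
    pii = expit(1.0 - 0.8 * gi + 0.6 * gi * yi)
    g.append(gi)
    y.append(yi)
    pi.append(pii)
    observed.append(random.random() < pii)

beta_hat = fit_ipw_logistic(g, y, observed, pi)
print("IPW estimate (intercept, log odds ratio):", beta_hat)
```

The weights restore, in expectation, the contribution that subjects with missing genotype would have made to the score. When the availability probabilities must themselves be estimated and depend on the unobserved genotype, as in the setting described above, additional information such as family genotype data is needed to identify them.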