Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Epidemiology & Biostatistics

First Advisor

Justine Shults


Medical researchers strive to collect complete information, but most studies have some degree of missing data. We first study situations in which the well-accepted conditions under which inferences that ignore missing data are valid can be relaxed. We partition a data set into outcome, conditioning, and latent variables, all of which may affect the probability of a missing response. We describe sufficient conditions under which a complete-case estimate of the conditional cumulative distribution function of the outcome given the conditioning variable is unbiased. We use simulations on a renal transplant data set to illustrate the implications of these results.
After describing when missing data can be ignored, we provide a likelihood-based statistical approach that accounts for missing data in longitudinal studies by fitting correlation structures that are plausible for measurements that may be unbalanced and unequally spaced in time. Our approach can be viewed as an extension of generalized linear models to longitudinal data, in contrast to the semi-parametric generalized estimating equation (GEE) approach. Key assumptions of our method include first-order ante-dependence within subjects; independence between subjects; exponential family distributions for the first outcome on each subject and for the subsequent conditional distributions; and linearity of the expectations of the conditional distributions. Our approach is appropriate for data with overdispersion, which occurs when the variance is inflated relative to that implied by the assumed distribution. We consider a clinical trial that compares two treatments for seizures in patients, which we analyze using Poisson or negative binomial distributions. Next, we consider a study that evaluates the likelihood that a transplant center is flagged for poor performance, which we analyze using the binomial distribution. For both studies, we perform simulations to assess the properties of our estimators and to compare our approach with GEE. We demonstrate that our method outperforms GEE, especially as the degree of overdispersion increases. We also provide software in R so that interested readers can implement our method in their own analyses.
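As a rough illustration of the complete-case idea described in the abstract, the following Python sketch simulates one simple sufficient condition: the probability of a missing response depends only on the conditioning variable. This is not the dissertation's method or its R software. The data-generating model, the response probabilities, and the function names (`simulate`, `ecdf_at`, `conditional_cdfs`, `marginal_cdfs`) are all assumptions made for this example.

```python
import random

def simulate(n=20000, seed=1):
    """Simulate (x, y, observed) rows; missingness depends on x only (assumed)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randint(0, 1)              # binary conditioning variable
        y = rng.gauss(float(x), 1.0)       # outcome whose mean shifts with x
        p_obs = 0.9 if x == 0 else 0.4     # response probability depends on x only
        rows.append((x, y, rng.random() < p_obs))
    return rows

def ecdf_at(values, t):
    """Empirical CDF of `values` evaluated at threshold t."""
    return sum(v <= t for v in values) / len(values)

def conditional_cdfs(rows, x_value, t):
    """(full-data, complete-case) estimates of P(Y <= t | X = x_value)."""
    full = [y for x, y, _ in rows if x == x_value]
    cc = [y for x, y, obs in rows if x == x_value and obs]
    return ecdf_at(full, t), ecdf_at(cc, t)

def marginal_cdfs(rows, t):
    """(full-data, complete-case) estimates of the marginal P(Y <= t)."""
    full = [y for _, y, _ in rows]
    cc = [y for _, y, obs in rows if obs]
    return ecdf_at(full, t), ecdf_at(cc, t)

rows = simulate()
for x_value in (0, 1):
    full, cc = conditional_cdfs(rows, x_value, t=0.5)
    print(f"x={x_value}: full={full:.3f}, complete-case={cc:.3f}")
full, cc = marginal_cdfs(rows, t=0.5)
print(f"marginal: full={full:.3f}, complete-case={cc:.3f}")
```

Under these assumptions, the complete-case estimate of the conditional distribution function should track the full-data estimate within each level of the conditioning variable. The complete-case marginal estimate drifts, because subjects with `x == 0` respond more often and are over-represented among the complete cases.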

Included in

Biostatistics Commons