Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Epidemiology & Biostatistics

First Advisor

Sharon X. Xie


When a true survival endpoint cannot be assessed for some subjects, often because it is too invasive or costly to obtain, an alternative endpoint that measures the true endpoint with error may be collected instead. We develop nonparametric and semiparametric estimated likelihood functions that incorporate both uncertain endpoints, available for all participants, and true endpoints, available for only a subset of participants. We propose maximum estimated likelihood estimators of the discrete survival function of time to the true endpoint and of a hazard ratio representing the effect of a binary or continuous covariate under a proportional hazards model. We show that the proposed estimators are consistent and asymptotically normal, and we derive analytical forms of the variance estimators. Through extensive simulations, we also show that the proposed estimators have little bias compared with the naïve estimator, which uses only uncertain endpoints, and are more efficient under moderate missingness than the complete-case estimator, which uses only the available true endpoints. We illustrate the proposed method by estimating the risk of developing Alzheimer's disease using data from the Alzheimer's Disease Neuroimaging Initiative. Using our proposed semiparametric estimator, we develop optimal study design strategies for comparing survival across treatment groups in a new trial with these data characteristics. Using simulated data, we demonstrate how to calculate the optimal number of true events in the validation set for a desired power, assuming that the baseline distribution of the true event, the effect size, the correlation between outcomes, and the proportion of true outcomes that are missing can be estimated from pilot studies. We also propose a sample size formula that does not depend on the baseline distribution of the true event and show that the power calculated by the formula agrees well with simulation-based results.
Using results from the Ginkgo Evaluation of Memory study, we calculate the number of true events in the validation set that would need to be observed in new studies comparing the development of Alzheimer's disease between those with and without antihypertensive use, as well as the total number of subjects and the number of validation-set subjects to be recruited for these new trials.
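
The contrast between the two comparator estimators named above can be illustrated with a toy simulation. The sketch below is not the estimated likelihood method of the dissertation; it only shows, under assumptions made up for illustration, why a naïve estimate built from error-prone endpoints can be biased while a complete-case estimate built from the validated subset is not. The assumptions are: no censoring, a constant discrete hazard of 0.2, an uncertain endpoint that is delayed by one visit 30% of the time, and true endpoints missing completely at random for half of the subjects. All function and parameter names are hypothetical.

```python
import random


def simulate(n=20000, hazard=0.2, max_time=5, p_correct=0.7,
             p_validated=0.5, seed=0):
    """Simulate (true_time_or_None, uncertain_time) pairs on a discrete grid.

    A time of max_time + 1 means no event by the end of follow-up.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        # True event time: constant discrete hazard at each visit.
        t = 1
        while t <= max_time and rng.random() >= hazard:
            t += 1
        # Uncertain endpoint: correct with probability p_correct,
        # otherwise detected one visit late (capped at max_time + 1).
        u = t if rng.random() < p_correct else min(t + 1, max_time + 1)
        # True endpoint observed only in a random validation subset.
        validated = rng.random() < p_validated
        records.append((t if validated else None, u))
    return records


def survival(times, max_time):
    """Empirical survival P(time > t) for t = 1..max_time (no censoring)."""
    n = len(times)
    return [sum(1 for x in times if x > t) / n for t in range(1, max_time + 1)]


max_time = 5
data = simulate(max_time=max_time)

# Naive estimator: treats the uncertain endpoint as if it were the truth.
naive = survival([u for _, u in data], max_time)

# Complete-case estimator: uses only subjects with a validated true endpoint.
complete_case = survival([t for t, _ in data if t is not None], max_time)

# True survival under the assumed constant hazard.
truth = [(1 - 0.2) ** t for t in range(1, max_time + 1)]

for t in range(max_time):
    print(f"t={t + 1}  truth={truth[t]:.3f}  "
          f"naive={naive[t]:.3f}  complete-case={complete_case[t]:.3f}")
```

In this setup the naïve curve sits above the true curve because delayed detection pushes events later, while the complete-case curve tracks the truth but uses only about half of the sample. The estimators proposed in the dissertation are designed to combine both sources of information.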

Included in

Biostatistics Commons