Date of Award: 2012
Degree Name: Doctor of Philosophy (PhD)
Advisor: Lawrence D. Brown
In the classical theory of statistical inference, data are assumed to be generated from a known model, and the properties of that model's parameters are of interest. In applications, however, the model that generates the data is often unknown, and as a consequence a model is often chosen based on the data. In this dissertation, we study how to achieve valid inference when the model or hypotheses are data-driven. We study three scenarios, summarized in the three chapters.
In the first chapter, we study the common practice of performing data-driven variable selection and then deriving statistical inference from the resulting model. We find that such inference enjoys none of the guarantees that classical statistical theory provides for tests and confidence intervals when the model has been chosen a priori. We propose to produce valid "post-selection inference" by reducing the problem to one of simultaneous inference: simultaneity is required for all linear functions that arise as coefficient estimates in all submodels. By purchasing "simultaneity insurance" for all possible submodels, the resulting post-selection inference is rendered universally valid under all possible model selection procedures. This inference is therefore generally conservative for any particular selection procedure, but it is always more precise than full Scheffé protection. Importantly, it does not depend on the truth of the selected submodel, and hence it produces valid inference even under wrong models. We describe the structure of the simultaneous inference problem and give some asymptotic results.
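The simultaneous constant described above can be sketched numerically for a small design: enumerate the coefficient functional of every coefficient in every submodel, then take the 95% quantile of the simulated max-|t| statistic over all of them. The sketch below is illustrative only (known-sigma case, with an invented random design and simulation sizes), not the dissertation's implementation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 4
X = rng.standard_normal((n, p))

# Collect unit vectors l_{j,M}: for every nonempty submodel M and each
# coefficient j in M, the linear functional of y giving the estimate of
# beta_{j,M}. Rows of the pseudo-inverse of X_M are exactly these functionals.
units = []
for k in range(1, p + 1):
    for M in itertools.combinations(range(p), k):
        L = np.linalg.pinv(X[:, M])      # shape (|M|, n)
        for row in L:
            units.append(row / np.linalg.norm(row))
units = np.array(units)                   # p * 2^(p-1) = 32 vectors for p = 4

# Known-sigma sketch: simulate the max-|t| statistic over all submodel
# coefficients and take its 95% quantile as the simultaneous constant K.
Z = rng.standard_normal((20000, n))
max_abs_t = np.abs(Z @ units.T).max(axis=1)
K = np.quantile(max_abs_t, 0.95)

# K lies between the single-coefficient z-quantile (about 1.96) and the
# Scheffe bound sqrt(chi^2_{p,0.95}) (about 3.08 for p = 4), illustrating
# protection that is simultaneous yet tighter than full Scheffe protection.
print(round(K, 2))
```

All 32 unit vectors live in the 4-dimensional column space of X, which is why the maximum over them is bounded by the Scheffé supremum over that subspace.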
In the second chapter, we propose a different approach to valid post-selection inference, one that treats the predictors in the design matrix as random. Our methodology is based on two techniques: split samples and the bootstrap. Split-sample methodology divides the observations randomly into two parts: one for exploratory model building (the training set, or planning sample) and the other for confirmatory statistical inference (the holdout set, or analysis sample). We use the training sample only to seek a subset of predictors, and then perform estimation and inference on the holdout set. For inference after selection in linear models, the main advantage of this technique is, roughly speaking, that it separates the data used for exploratory analysis from the data used for confirmatory analysis, thereby removing the contaminating effect of selection on inference. We show that our procedure achieves asymptotically valid inference for any selection rule.
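The split-sample recipe can be illustrated with a toy simulation: select predictors on one random half of the data, then estimate and form intervals on the other half only. The data-generating process, the correlation-screening selection rule, and its 0.3 threshold are all invented for this sketch; the point is only that the confirmatory step never sees the data used for selection.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y depends on the first two of ten candidate predictors.
n, p = 200, 10
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.standard_normal(n)

# Split: one half for exploratory selection, the other for inference.
idx = rng.permutation(n)
train, hold = idx[: n // 2], idx[n // 2 :]

# Selection on the training half only (simple correlation screening here;
# any selection rule would do -- the holdout inference stays valid).
corr = np.abs([np.corrcoef(X[train, j], y[train])[0, 1] for j in range(p)])
selected = np.where(corr > 0.3)[0]

# Confirmatory OLS on the holdout half, using the selected columns.
Xh = np.column_stack([np.ones(len(hold)), X[hold][:, selected]])
yh = y[hold]
beta, *_ = np.linalg.lstsq(Xh, yh, rcond=None)
resid = yh - Xh @ beta
sigma2 = resid @ resid / (len(hold) - Xh.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xh.T @ Xh)))

# Normal-approximation 95% intervals; the selection step cannot bias them
# because it never touched the holdout observations.
for name, b, s in zip(["intercept"] + [f"x{k}" for k in selected], beta, se):
    print(f"{name}: {b:.2f} +/- {1.96 * s:.2f}")
```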
The third chapter applies the split-sample method to an observational study of the effect of obstetric unit closures in Philadelphia. The splitting was successful twice over: (i) it identified an interesting and moderately insensitive conclusion, and (ii) comparison of the planning and analysis samples shows that it avoided an exaggerated claim of insensitivity to unmeasured bias that could have arisen from focusing on the least sensitive of many findings. Under the assumption of no unmeasured confounding, we found strong evidence that obstetric unit closures caused birth injuries, and we showed this conclusion to be insensitive to bias from a moderate amount of unmeasured confounding.
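A claim of insensitivity to "a moderate amount of unmeasured confounding" is typically quantified, in Rosenbaum-style sensitivity analysis, by a parameter Γ bounding hidden bias. The abstract does not state the exact formulation used, so the following is a standard sketch rather than the dissertation's own notation:

```latex
% Two subjects i, j with identical measured covariates may differ in their
% odds \pi of exposure (here, exposure to a closure) by at most a factor
% \Gamma \ge 1:
\[
  \frac{1}{\Gamma} \;\le\;
  \frac{\pi_i \,(1 - \pi_j)}{\pi_j \,(1 - \pi_i)} \;\le\; \Gamma .
\]
% \Gamma = 1 recovers the no-unmeasured-confounding assumption; a finding is
% "insensitive" when it survives for values of \Gamma moderately above 1.
```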
Zhang, Kai, "Valid Post-Selection Inference" (2012). Publicly Accessible Penn Dissertations. 598.