Unified Framework for Post-Selection Inference

Arun Kumar Kuchibhotla, University of Pennsylvania


The classical inferential theory of mathematical statistics was developed on the premise that all the models to be fit, all the hypotheses to be tested and all the parameters to be inferred are fixed before the data are collected. Interestingly, and in fact more concerningly, this is not how statistics is practiced. In practice, the data are often explored (if not tortured) to find the “right” model to fit, the “right” hypothesis to test and so on. Quoting Tullock (2001, page 205): “As Ronald Coase says, ‘if you torture the data long enough it will confess’. The young researcher, convinced he knows the truth will make changes in his specifications and very likely produce significant results. In some cases this is correct; his original specification was wrong and his new one is right. Nevertheless, this procedure reduces the significance of the significance test.”
Once the data have been explored to find the hypothesis or model, the classical theory is (bluntly speaking) useless for inference and can in fact be very misleading. This thesis focuses on the problem of providing Valid Inference after Data Exploration (VIDE). Although a unified framework is provided for this goal, the framework is explained through the problem of inference with the ordinary least squares linear regression estimator when the data are explored to find the “right” subset of covariates to use in the regression model.
Valid post-selection inference has been a topic of research interest since at least the 1960s but has received increasing attention in recent times. The invalidity of classical inference in post-selection problems may be due not only to the selection but also to misspecification of the model. Misspecification is a very natural outcome of model selection, since the selected model cannot always be guaranteed to match the truth. If such a guarantee exists, then the post-selection problem does not require further study.
Most of the literature on valid post-selection inference has concentrated on the assumption of a true parametric model. In this thesis, valid post-selection inference is provided under no parametric assumptions. The simplest setting in this thesis is when the observations are independent and satisfy certain moment restrictions (with no further model or distributional assumptions). Extensions to various dependent settings are also given. Throughout, the total number of covariates available is allowed to grow with the sample size and can be almost exponential in the sample size.
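
The claim that classical inference can be misleading after covariate selection can be illustrated with a small simulation. The sketch below is not the framework of the thesis; it is a toy example under assumptions chosen here for illustration: Gaussian noise covariates, a response independent of all of them, selection of the single covariate most correlated with the response, and a normal approximation to the t critical value. The function name `rejection_rate` and its parameters are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def rejection_rate(explore, n=100, p=10, reps=2000, alpha=0.05, seed=0):
    """Fraction of replications in which the classical test of
    H0: slope = 0 rejects, when y is pure noise.

    explore=True  -> the covariate is chosen by looking at the data
    explore=False -> the covariate is fixed in advance (column 0)
    """
    rng = np.random.default_rng(seed)
    # Assumption: normal approximation to the t quantile (n is moderately large).
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # independent of every covariate
        Xc = X - X.mean(axis=0)     # centering absorbs the intercept
        yc = y - y.mean()
        if explore:
            # Data exploration: keep the covariate most correlated with y.
            corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
            j = int(np.argmax(np.abs(corr)))
        else:
            j = 0
        # Classical OLS inference for y ~ intercept + X[:, j],
        # carried out as if j had been fixed before seeing the data.
        x = Xc[:, j]
        beta = (x @ yc) / (x @ x)
        resid = yc - beta * x
        se = np.sqrt((resid @ resid) / (n - 2) / (x @ x))
        if abs(beta / se) > crit:
            rejections += 1
    return rejections / reps


fixed = rejection_rate(explore=False)
explored = rejection_rate(explore=True)
print(f"nominal level 0.05 | fixed covariate: {fixed:.3f} | selected covariate: {explored:.3f}")
```

With the covariate fixed in advance, the rejection rate should stay close to the nominal 5% level. After selection among ten noise covariates it should be several times larger; if the ten tests were exactly independent, the rate would be 1 - 0.95^10, or about 0.40.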

Subject Area

Statistics|Mathematics|Applied Mathematics

Recommended Citation

Kuchibhotla, Arun Kumar, "Unified Framework for Post-Selection Inference" (2020). Dissertations available from ProQuest. AAI27833690.