Date of Award

2020

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Statistics

First Advisor

Lawrence D. Brown

Second Advisor

Andreas Buja

Abstract

The classical inferential theory of mathematical statistics is built on the premise that all the models to fit, all the hypotheses to test, and all the parameters to draw inference for are fixed prior to the collection of data. Interestingly, and in fact more concerningly, this is not how statistics is practiced. The practice of statistics often explores (if not tortures) the data to find the "right" model to fit, the "right" hypothesis to test, and so on. Quoting Tullock (2001, page 205):

As Ronald Coase says, "if you torture the data long enough it will confess". The young researcher, convinced he knows the truth, will make changes in his specifications and very likely produce significant results. In some cases this is correct; his original specification was wrong and his new one is right. Nevertheless, this procedure reduces the significance of the significance test.

Once the data is explored to find the hypothesis or model, the classical theory is (bluntly speaking) useless for inference and can, in fact, be very misleading.

This thesis focuses on the problem of providing Valid Inference after Data Exploration (VIDE). Although a unified framework is provided for this goal, the framework is explained through the problem of inference with the ordinary least squares linear regression estimator when the data is explored to find the "right" subset of covariates to be used in the regression model.
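To make the failure mode concrete, the following is a minimal simulation sketch (an illustration of the problem, not a method from the thesis): under a global null in which the response is independent of every covariate, the covariate most correlated with the response is selected and a naive OLS t-test is applied as if that covariate had been chosen in advance. The sample size, number of candidate covariates, and max-correlation selection rule are arbitrary illustrative choices.

# Illustrative sketch only: naive inference after data-driven variable selection.
# The parameter values and the max-|correlation| selection rule are assumptions
# made for this example, not the procedure studied in the thesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p, reps, alpha = 100, 20, 2000, 0.05
rejections = 0

for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)  # global null: y is independent of all covariates

    # "Explore" the data: pick the covariate with the largest |correlation| with y.
    cors = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)]
    j_hat = int(np.argmax(cors))

    # Naive OLS t-test for the selected covariate, ignoring the selection step.
    result = stats.linregress(X[:, j_hat], y)
    rejections += result.pvalue < alpha

print(f"nominal level {alpha:.2f}, empirical rejection rate {rejections / reps:.2f}")
# The empirical rejection rate is typically several times the nominal 5% level,
# which is the sense in which exploration "reduces the significance of the
# significance test".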

Valid post-selection inference has been a topic of research interest at least since the 1960s, but it has received increasing attention in recent times. The invalidity of classical inference in post-selection problems may be due not only to the selection but also to misspecification of the model. Misspecification is a very natural outcome of model selection, since the selected model cannot always be guaranteed to match the truth. If such a guarantee exists, then the post-selection problem does not require further study. Most of the literature on valid post-selection inference has concentrated on the assumption of a true parametric model.

In this thesis, valid post-selection inference is provided under no parametric assumptions. The simplest setting considered is that of independent observations satisfying certain moment restrictions (and no further model or distributional assumptions). Extensions to various dependent settings are also given. Throughout, the total number of available covariates is allowed to grow with the sample size and can be almost exponential in the sample size.
