## Publicly Accessible Penn Dissertations

#### Title

Inference for Approximating Regression Models

#### Date of Award

2014

#### Degree Type

Dissertation

#### Degree Name

Doctor of Philosophy (PhD)

#### Graduate Group

Statistics

#### Abstract

The assumptions underlying the Ordinary Least Squares (OLS) model are regularly, and sometimes severely, violated. In consequence, inferential procedures presumed valid under OLS are invalidated in practice. We describe a framework that is robust to model violations, together with the modifications of the classical inferential procedures needed to preserve inferential validity. Because the covariates are assumed to be stochastically generated ("Random-X"), the sought-after coverage criterion becomes marginal rather than conditional. We focus on slopes, mean responses, and individual future observations. For slopes and mean responses, the targets of inference are redefined by means of least squares regression at the population level. The partial slopes defined by that population regression, rather than the slopes of an assumed linear model, become the population quantities of interest, and they can be estimated unbiasedly. Under this framework we estimate the Average Treatment Effect (ATE) in Randomized Controlled Trials (RCTs) and derive an estimator more efficient than one in common use: expressing the ATE as a slope coefficient in a population regression yields an immediate proof of unbiasedness. For the mean response, the target of coverage is the conditional value of the best least squares approximation to the response surface in the population, rather than the conditional value of y itself; a calibration via the pairs bootstrap can markedly improve such coverage. Moving to observations, we show that when attempting to cover future individual responses, a simple in-sample calibration technique that widens the empirical interval until it contains $(1-\alpha)\times100\%$ of the sample residuals is asymptotically valid, even in the face of gross model violations. OLS is startlingly robust to model departures when a future y needs to be covered, but nonlinearity, combined with a skewed X-distribution, can severely undermine coverage of the mean response.
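The in-sample calibration for future observations can be illustrated with a short sketch: fit OLS, then set the interval endpoints at the empirical $\alpha/2$ and $1-\alpha/2$ quantiles of the sample residuals, so the interval contains $(1-\alpha)\times100\%$ of them. The function name and interface below are illustrative, not taken from the dissertation.

```python
import numpy as np

def calibrated_prediction_interval(X, y, x_new, alpha=0.05):
    """Residual-quantile calibration (illustrative sketch):
    widen the interval around the OLS prediction until it contains
    (1 - alpha)*100% of the in-sample residuals."""
    X1 = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]   # OLS coefficients
    resid = y - X1 @ beta                          # in-sample residuals
    # empirical alpha/2 and 1 - alpha/2 residual quantiles
    lo, hi = np.quantile(resid, [alpha / 2, 1 - alpha / 2])
    pred = np.concatenate([[1.0], np.atleast_1d(x_new)]) @ beta
    return pred + lo, pred + hi
```

Because the endpoints come from the residuals themselves rather than from a normal-theory formula, the interval's validity does not rest on the usual OLS assumptions, which is the point of the calibration.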
Our ATE estimator dominates the common estimator, and the stronger the $R^2$ of the regression of a patient's response on covariates, treatment indicator, and interactions, the better our estimator's relative performance. By viewing a regression model as a semi-parametric approximation to a stochastic mechanism, rather than as its description, we can rest assured that a coverage guarantee is indeed a coverage guarantee.
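One reading of the interaction-adjusted estimator described above is the OLS regression of the response on the treatment indicator, mean-centered covariates, and their interactions, with the treatment coefficient serving as the ATE estimate. The sketch below assumes that reading; the function name and simulated setup are ours, not the dissertation's.

```python
import numpy as np

def interacted_ate(y, t, X):
    """ATE read off as the treatment coefficient in an OLS regression of y
    on treatment, mean-centered covariates, and treatment-covariate
    interactions (illustrative sketch)."""
    Xc = X - X.mean(axis=0)                            # center the covariates
    D = np.column_stack([np.ones(len(y)), t, Xc, t[:, None] * Xc])
    beta = np.linalg.lstsq(D, y, rcond=None)[0]
    return beta[1]                                     # coefficient on treatment
```

Centering the covariates before forming the interactions is what lets the treatment coefficient be interpreted directly as the average effect over the sample.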