Date of Award


Degree Type


Degree Name

Doctor of Philosophy (PhD)

Graduate Group

Applied Mathematics

First Advisor

Edgar Dobriban

Second Advisor

Robin Pemantle


We live in an age of big data. Analyzing modern data sets can be very difficult because they are usually massive, high-dimensional, and heterogeneous. Handling these features often plays a key role in modern statistical and machine learning research. This dissertation uses random matrix theory (RMT), a powerful mathematical tool, to study several important problems where the data is massive, high-dimensional, and sometimes heterogeneous.

The first chapter briefly introduces the basics of RMT and covers some of its classical applications to statistics and machine learning.

The second chapter studies distributed linear regression with ordinary least squares (OLS) estimators. Distributed statistical learning problems arise commonly when dealing with large datasets. In this setup, datasets are partitioned over machines, which compute locally and communicate short messages. Communication is often the bottleneck. We study one-step and iterative weighted parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine, send the results to a central server, and take a weighted average of the parameters. Optionally, we iterate, sending back the weighted average and doing local ridge regressions centered at it. How does this compare to doing linear regression on the full data? We study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We find the performance loss of one-step weighted averaging and also give results for iterative averaging. We also find that different problems are affected differently by the distributed framework.
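The one-step scheme described above can be illustrated with a minimal sketch. It assumes inverse-variance weights, the standard choice for combining independent unbiased estimators, where the variance of each local OLS estimator is proportional to the trace of the inverse of its local Gram matrix. The function names, the simulated Gaussian data, and the weighting rule are illustrative assumptions and are not taken from the dissertation.

```python
import numpy as np

def local_ols(X, y):
    """OLS fit on one machine's data."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def one_step_weighted_average(parts):
    """Combine local OLS fits with inverse-variance weights (an assumption)."""
    betas, inv_vars = [], []
    for X, y in parts:
        betas.append(local_ols(X, y))
        # Var(beta_i) is proportional to tr[(X_i^T X_i)^{-1}] under homoskedastic noise
        inv_vars.append(1.0 / np.trace(np.linalg.inv(X.T @ X)))
    weights = np.array(inv_vars)
    weights /= weights.sum()  # normalize so the average stays unbiased
    return weights @ np.array(betas), weights

# Hypothetical simulated data: n samples, p parameters, k machines
rng = np.random.default_rng(0)
n, p, k = 2000, 50, 4
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + rng.standard_normal(n)

# Data parallelism: split the rows across machines
parts = list(zip(np.array_split(X, k), np.array_split(y, k)))
beta_dist, weights = one_step_weighted_average(parts)
beta_full = local_ols(X, y)

err_dist = np.sum((beta_dist - beta_true) ** 2)
err_full = np.sum((beta_full - beta_true) ** 2)
print(f"squared error, distributed: {err_dist:.4f}, full data: {err_full:.4f}")
```

Each machine sends only a length-p vector and one scalar to the server, which is the communication pattern the chapter analyzes.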

The third chapter studies a fundamental problem in this area: how to do ridge regression in a distributed computing environment. Ridge regression is an extremely popular method for supervised learning and has several optimality properties, which makes it important to study. We study one-shot methods that construct weighted combinations of ridge regression estimators computed on each machine. By analyzing the mean squared error in a high-dimensional random-effects model where each predictor has a small effect, we discover several new phenomena. We also propose a new Weighted ONe-shot DistributEd Ridge regression (WONDER) algorithm. We test WONDER in simulation studies and on the Million Song Dataset. There it can save at least 100x in computation time while nearly preserving test accuracy.
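A one-shot weighted combination of local ridge estimators can be sketched as follows. This is not the WONDER algorithm: the combination weights here are fit by least squares on a small held-out set, a stand-in chosen only to make the example self-contained. The penalty scaling, the helper names, and the simulated random-effects data are all assumptions.

```python
import numpy as np

def local_ridge(X, y, lam):
    """Ridge fit on one machine; the penalty is scaled by the local sample size."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

def fit_combination_weights(betas, X_val, y_val):
    """Unconstrained least-squares weights on held-out data (a stand-in rule)."""
    preds = X_val @ betas.T  # one column of predictions per machine
    w, *_ = np.linalg.lstsq(preds, y_val, rcond=None)
    return w

def mse(beta, X, y):
    return np.mean((y - X @ beta) ** 2)

# Hypothetical random-effects data: many predictors, each with a small effect
rng = np.random.default_rng(1)
n, n_val, p, k, lam = 3000, 500, 200, 5, 0.5
beta_true = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)
X_val = rng.standard_normal((n_val, p))
y_val = X_val @ beta_true + rng.standard_normal(n_val)

# One round of communication: each machine sends its local ridge estimator
betas = np.array([
    local_ridge(Xi, yi, lam)
    for Xi, yi in zip(np.array_split(X, k), np.array_split(y, k))
])

w = fit_combination_weights(betas, X_val, y_val)
beta_oneshot = w @ betas           # weighted combination
beta_naive = betas.mean(axis=0)    # equal-weight average for comparison

print("weights:", np.round(w, 3), "sum:", round(float(w.sum()), 3))
print("held-out MSE, weighted:", mse(beta_oneshot, X_val, y_val))
print("held-out MSE, naive average:", mse(beta_naive, X_val, y_val))
```

The weights are left unconstrained, so they are free to compensate for the shrinkage that each local ridge fit introduces.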

The fourth chapter addresses another common issue with modern data sets: heterogeneity. Dimensionality reduction via PCA and factor analysis is an important tool of data analysis. A critical step is selecting the number of components. However, existing methods (such as the scree plot, likelihood ratio, and parallel analysis) do not have statistical guarantees in the increasingly common setting where the data are heterogeneous, so that each noise entry can have a different distribution. To address this problem, we propose the Signflip Parallel Analysis (Signflip PA) method: it compares data singular values to those of “empirical null” data generated by flipping the sign of each entry randomly with probability one-half. We show that Signflip PA consistently selects factors above the noise level in high-dimensional signal-plus-noise models (including spiked models and factor models) under heterogeneous settings, where classical parallel analysis is no longer effective. To do this, we leverage recent breakthroughs in random matrix theory, such as dimension-free operator norm bounds and large deviations for the top eigenvalues of nonhomogeneous matrices. We also illustrate that Signflip PA performs well in numerical simulations and on empirical data examples.
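The sign-flipping idea can be sketched in a few lines. The selection rule below, which counts the data singular values exceeding a quantile of the top singular value of the sign-flipped matrices, is an assumption made for illustration. The exact rule, number of trials, and quantile used in the dissertation may differ, and the function name and simulated data are hypothetical.

```python
import numpy as np

def signflip_pa(X, n_trials=20, quantile=1.0, seed=None):
    """Estimate the number of factors by comparing the singular values of X
    with the top singular value of randomly sign-flipped copies of X."""
    rng = np.random.default_rng(seed)
    data_sv = np.linalg.svd(X, compute_uv=False)  # sorted in descending order
    null_top = np.empty(n_trials)
    for t in range(n_trials):
        # Flip the sign of each entry independently with probability one-half
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_top[t] = np.linalg.norm(signs * X, ord=2)  # largest singular value
    threshold = np.quantile(null_top, quantile)
    n_factors = int(np.sum(data_sv > threshold))
    return n_factors, threshold

# Hypothetical signal-plus-noise data with heterogeneous noise
rng = np.random.default_rng(2)
n, p, rank = 300, 200, 3
U, _ = np.linalg.qr(rng.standard_normal((n, rank)))
V, _ = np.linalg.qr(rng.standard_normal((p, rank)))
signal = U @ np.diag([300.0, 250.0, 200.0]) @ V.T

# Each noise entry has its own standard deviation
noise_sd = rng.uniform(0.5, 1.5, size=(n, p))
noise = noise_sd * rng.standard_normal((n, p))

n_factors, threshold = signflip_pa(signal + noise, seed=3)
print("selected factors:", n_factors, "threshold:", round(float(threshold), 2))
```

Sign flips preserve the magnitude of every entry, so the null data keep the entrywise noise scale of the original matrix while the low-rank structure is destroyed.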
