Statistics Papers

The aim of statistical modeling is to empower effective decision making, and the unique contribution of the field is its ability to incorporate multiple levels of uncertainty in the framing of wise decisions. Over the last few years, the development of new computational tools and the unprecedented evolution of “big data” have propelled statistical modeling to new levels. Today statistical modeling and machine learning have reached a level of impact that no large organization can afford to ignore. The information landscape is changing as it has never changed before.

At Wharton, the Department of Statistics is proud to have had a leadership role in this development. It participates in a wide range of university consortia that spans the fields of computer science, neuroscience, medicine, public policy, and finance. Moreover, our faculty members have won singular international recognition for their contributions to many parts of statistical science including observational studies, statistical algorithms, game theory, high dimensional inference, information theory, nonparametric function estimation, model selection, time series analysis, machine learning, and probability theory.

Now showing 1 - 10 of 660

Degree Sequence of Random Permutation Graphs
(2017-01-01) Bhattacharya, Bhaswar B; Mukherjee, Sumit
In this paper, we study the asymptotics of the degree sequence of permutation graphs associated with a sequence of random permutations. The limiting finite-dimensional distributions of the degree proportions are established using results from graph and permutation limit theories. In particular, we show that for a uniform random permutation, the joint distribution of the degree proportions of the vertices labeled ⌈nr₁⌉, ⌈nr₂⌉, …, ⌈nrₛ⌉ in the associated permutation graph converges to independent random variables D(r₁), D(r₂), …, D(rₛ), where D(rᵢ) ∼ Unif(rᵢ, 1−rᵢ), for rᵢ ∈ [0,1] and i ∈ {1, 2, …, s}. Moreover, the degree proportion of the mid-vertex (the vertex labeled n/2) has a central limit theorem, and the minimum degree converges to a Rayleigh distribution after an appropriate scaling. Finally, the asymptotic finite-dimensional distributions of the permutation graph associated with a Mallows random permutation are determined, and interesting phase transitions are observed. Our results extend to other nonuniform measures on permutations as well.
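The degree proportions described in this abstract can be explored with a small simulation. The sketch below is not taken from the paper: the function names, the sample size, and the choice r = 0.2 are illustrative assumptions. It uses the standard definition of a permutation graph, in which vertices i < j are adjacent when the permutation inverts them, and prints the range of simulated degree proportions, which can be compared by eye with the interval (r, 1−r).

```python
import math
import random

def vertex_degree(perm, label):
    """Degree of the vertex with 1-based `label` in the permutation graph of `perm`.

    Vertices i < j are adjacent exactly when perm[i] > perm[j] (an inversion).
    """
    i = label - 1
    left = sum(1 for j in range(i) if perm[j] > perm[i])
    right = sum(1 for j in range(i + 1, len(perm)) if perm[j] < perm[i])
    return left + right

def degree_proportion(perm, r):
    """Degree divided by n for the vertex labeled ceil(n * r)."""
    n = len(perm)
    label = max(1, math.ceil(n * r))
    return vertex_degree(perm, label) / n

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    n, r = 200, 0.2         # illustrative values, not from the paper
    samples = []
    for _ in range(100):
        perm = list(range(1, n + 1))
        rng.shuffle(perm)   # uniform random permutation
        samples.append(degree_proportion(perm, r))
    print(min(samples), max(samples))
```

For finite n the simulated values only approximate the limiting distribution, so the printed range need not fall exactly inside (r, 1−r).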
Collision Times in Multicolor Urn Models and Sequential Graph Coloring With Applications to Discrete Logarithms
(2016-01-01) Bhattacharya, Bhaswar B
Consider an urn model where at each step one of q colors is sampled according to some probability distribution and a ball of that color is placed in an urn. The distribution of assigning balls to urns may depend on the color of the ball. Collisions occur when a ball is placed in an urn which already contains a ball of different color. Equivalently, this can be viewed as sequentially coloring a complete q-partite graph wherein a collision corresponds to the appearance of a monochromatic edge. Using a Poisson embedding technique, the limiting distribution of the first collision time is determined and the possible limits are explicitly described. Joint distribution of successive collision times and multi-fold collision times are also derived. The results can be used to obtain the limiting distributions of running times in various birthday problem based algorithms for solving the discrete logarithm problem, generalizing previous results which only consider expected running times. Asymptotic distributions of the time of appearance of a monochromatic edge are also obtained for other graphs.
Universal Limit Theorems in Graph Coloring Problems With Connections to Extremal Combinatorics
(2017-01-01) Bhattacharya, Bhaswar B; Diaconis, Persi; Mukherjee, Sumit
This paper proves limit theorems for the number of monochromatic edges in uniform random colorings of general random graphs. These can be seen as generalizations of the birthday problem (what is the chance that there are two friends with the same birthday?). It is shown that if the number of colors grows to infinity, the asymptotic distribution is either a Poisson mixture or a Normal depending solely on the limiting behavior of the ratio of the number of edges in the graph and the number of colors. This result holds for any graph sequence, deterministic or random. On the other hand, when the number of colors is fixed, a necessary and sufficient condition for asymptotic normality is determined. Finally, using some results from the emerging theory of dense graph limits, the asymptotic (nonnormal) distribution is characterized for any converging sequence of dense graphs. The proofs are based on moment calculations which relate to the results of Erdős and Alon on extremal subgraph counts. As a consequence, a simpler proof of a result of Alon, estimating the number of isomorphic copies of a cycle of given length in graphs with a fixed number of edges, is presented.
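The birthday-problem reading of this abstract can be made concrete with a short sketch. It is an illustration under stated assumptions, not the paper's method: the function names are hypothetical, and the complete graph on 23 vertices with 365 colors is simply the classical birthday setting. Under a uniform independent coloring, each edge is monochromatic with probability 1/c, so the expected number of monochromatic edges is the number of edges divided by the number of colors.

```python
import random

def monochromatic_edges(edges, coloring):
    """Count edges whose two endpoints received the same color."""
    return sum(1 for u, v in edges if coloring[u] == coloring[v])

def random_coloring(num_vertices, num_colors, rng):
    """Color each vertex independently and uniformly at random."""
    return [rng.randrange(num_colors) for _ in range(num_vertices)]

def complete_graph_edges(n):
    """All pairs (u, v) with u < v: every pair of people may share a birthday."""
    return [(u, v) for u in range(n) for v in range(u + 1, n)]

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    n, c = 23, 365          # 23 people, 365 possible birthdays
    edges = complete_graph_edges(n)
    trials = 2000
    hits = sum(
        monochromatic_edges(edges, random_coloring(n, c, rng)) > 0
        for _ in range(trials)
    )
    print("estimated P(some shared birthday):", hits / trials)
    print("expected monochromatic edges |E|/c:", len(edges) / c)
```

Replacing `complete_graph_edges` with the edge list of any other graph gives the general quantity studied in the abstract.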
Matrix Completion via Max-Norm Constrained Optimization
(2016-01-01) Cai, Tony; Zhou, Wen-Xin
Matrix completion has been well studied under the uniform sampling model and the trace-norm regularized methods perform well both theoretically and numerically in such a setting. However, the uniform sampling model is unrealistic for a range of applications and the standard trace-norm relaxation can behave very poorly when the underlying sampling scheme is non-uniform. In this paper we propose and analyze a max-norm constrained empirical risk minimization method for noisy matrix completion under a general sampling model. The optimal rate of convergence is established under the Frobenius norm loss in the context of approximately low-rank matrix reconstruction. It is shown that the max-norm constrained method is minimax rate-optimal and yields a unified and robust approximate recovery guarantee, with respect to the sampling distributions. The computational effectiveness of this method is also discussed, based on first-order algorithms for solving convex optimizations involving max-norm regularization.
The Bruss-Robertson Inequality: Elaborations, Extensions, and Applications
(2016-01-01) Steele, J. Michael
The Bruss-Robertson inequality gives a bound on the maximal number of elements of a random sample whose sum is less than a specified value, and the extension of that inequality which is given here neither requires the independence of the summands nor requires the equality of their marginal distributions. A review is also given of the applications of the Bruss-Robertson inequality, especially the applications to problems of combinatorial optimization such as the sequential knapsack problem and the sequential monotone subsequence selection problem.
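The quantity that the inequality bounds can be computed directly for a given sample. The sketch below is an illustration only: the function name is hypothetical, it assumes nonnegative values, and it treats the constraint as "sum at most s" where the abstract says "less than". For nonnegative values, taking the smallest elements first yields the largest possible count.

```python
def max_count_with_sum_at_most(values, s):
    """Largest number of elements of `values` whose sum does not exceed `s`.

    Assumes all values are nonnegative, so choosing the smallest ones
    first is optimal.
    """
    count, total = 0, 0
    for x in sorted(values):
        if total + x > s:
            break
        total += x
        count += 1
    return count

if __name__ == "__main__":
    print(max_count_with_sum_at_most([5, 1, 3, 2], 6))
```

This offline count is the benchmark against which sequential selection rules, such as those in the sequential knapsack problem, are usually compared.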
Potential Mechanisms for Cancer Resistance in Elephants and Comparative Cellular Response to DNA Damage in Humans
(2015-11-03) Abegglen, Lisa M; Caulin, Aleah Fox; Chan, Ashley; Lee, Kristy; Robinson, Rosann; Campbell, Michael S; Kiso, Wendy K; Schmitt, Dennis L; Waddell, Peter J; Bhaskara, Srividya; Jensen, Shane T; Maley, Carlo C; Schiffman, Joshua D
Importance: Evolutionary medicine may provide insights into human physiology and pathophysiology, including tumor biology. Objective: To identify mechanisms for cancer resistance in elephants and compare cellular response to DNA damage among elephants, healthy human controls, and cancer-prone patients with Li-Fraumeni syndrome (LFS). Design, Setting, and Participants: A comprehensive survey of necropsy data was performed across 36 mammalian species to validate cancer resistance in large and long-lived organisms, including elephants (n = 644). The African and Asian elephant genomes were analyzed for potential mechanisms of cancer resistance. Peripheral blood lymphocytes from elephants, healthy human controls, and patients with LFS were tested in vitro in the laboratory for DNA damage response. The study included African and Asian elephants (n = 8), patients with LFS (n = 10), and age-matched human controls (n = 11). Human samples were collected at the University of Utah between June 2014 and July 2015. Exposures: Ionizing radiation and doxorubicin. Main Outcomes and Measures: Cancer mortality across species was calculated and compared by body size and life span. The elephant genome was investigated for alterations in cancer-related genes. DNA repair and apoptosis were compared in elephant vs human peripheral blood lymphocytes. Results: Across mammals, cancer mortality did not increase with body size and/or maximum life span (eg, for rock hyrax, 1% [95% CI, 0%-5%]; African wild dog, 8% [95% CI, 0%-16%]; lion, 2% [95% CI, 0%-7%]). Despite their large body size and long life span, elephants remain cancer resistant, with an estimated cancer mortality of 4.81% (95% CI, 3.14%-6.49%), compared with humans, who have 11% to 25% cancer mortality. 
While humans have 1 copy (2 alleles) of TP53, African elephants have at least 20 copies (40 alleles), including 19 retrogenes (38 alleles) with evidence of transcriptional activity measured by reverse transcription polymerase chain reaction. In response to DNA damage, elephant lymphocytes underwent p53-mediated apoptosis at higher rates than human lymphocytes proportional to TP53 status (ionizing radiation exposure: patients with LFS, 2.71% [95% CI, 1.93%-3.48%] vs human controls, 7.17% [95% CI, 5.91%-8.44%] vs elephants, 14.64% [95% CI, 10.91%-18.37%]; P < .001; doxorubicin exposure: human controls, 8.10% [95% CI, 6.55%-9.66%] vs elephants, 24.77% [95% CI, 23.0%-26.53%]; P < .001). Conclusions and Relevance: Compared with other mammalian species, elephants appeared to have a lower-than-expected rate of cancer, potentially related to multiple copies of TP53. Compared with human cells, elephant cells demonstrated increased apoptotic response following DNA damage. These findings, if replicated, could represent an evolutionary-based approach for understanding mechanisms related to cancer suppression.
Efficient Empirical Bayes Prediction Under Check Loss Using Asymptotic Risk Estimates
(2016-01-01) Mukherjee, Gourab; Brown, Lawrence D; Rusmevichientong, Paat
We develop a novel Empirical Bayes methodology for prediction under check loss in high-dimensional Gaussian models. The check loss is a piecewise linear loss function having differential weights for measuring the amount of underestimation or overestimation. Prediction under it differs in fundamental aspects from estimation or prediction under weighted-quadratic losses. Because of the nature of this loss, our inferential target is a pre-chosen quantile of the predictive distribution rather than the mean of the predictive distribution. We develop a new method for constructing uniformly efficient asymptotic risk estimates which are then minimized to produce effective linear shrinkage predictive rules. In calculating the magnitude and direction of shrinkage, our proposed predictive rules incorporate the asymmetric nature of the loss function and are shown to be asymptotically optimal. Using numerical experiments we compare the performance of our method with traditional Empirical Bayes procedures and obtain encouraging results.
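A small sketch may help show why the target under check loss is a quantile rather than the mean. The weights `b` (per unit of underestimation) and `h` (per unit of overestimation) and all function names are assumptions introduced here for illustration, and the code is not the Empirical Bayes procedure of the paper. For a fixed sample, the total check loss is piecewise linear and convex in the forecast, so it is minimized at a sample point, which is an empirical b/(b+h) quantile.

```python
def check_loss(y, q, b, h):
    """Piecewise linear loss of forecasting q when the outcome is y.

    b weights underestimation (y above q); h weights overestimation.
    """
    return b * max(y - q, 0) + h * max(q - y, 0)

def empirical_risk(sample, q, b, h):
    """Total check loss of the forecast q over the sample."""
    return sum(check_loss(y, q, b, h) for y in sample)

def best_forecast(sample, b, h):
    """Sample point minimizing the empirical risk (first one if tied)."""
    candidates = sorted(set(sample))
    return min(candidates, key=lambda q: empirical_risk(sample, q, b, h))

if __name__ == "__main__":
    sample = list(range(1, 100))        # toy data: 1, 2, ..., 99
    print(best_forecast(sample, 3, 1))  # b/(b+h) = 0.75
    print(best_forecast(sample, 1, 1))  # symmetric weights give a median
```

Raising `b` relative to `h` pushes the best forecast toward the upper tail of the sample, which is the asymmetry that the shrinkage rules in the abstract are designed to account for.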
Beardwood-Halton-Hammersley Theorem for Stationary Ergodic Sequences: A Counterexample
(2016-01-01) Arlotto, Alessandro; Steele, J. Michael
We construct a stationary ergodic process X₁, X₂, … such that each Xₜ has the uniform distribution on the unit square and the length Lₙ of the shortest path through the points X₁, X₂, …, Xₙ is not asymptotic to a constant times the square root of n. In other words, we show that the Beardwood, Halton, and Hammersley theorem does not extend from the case of independent uniformly distributed random variables to the case of stationary ergodic sequences with uniform marginal distributions.
Sparse CCA: Adaptive Estimation and Computational Barriers
(2016-01-01) Gao, Chao; Ma, Zongming; Zhou, Harrison
Canonical correlation analysis (CCA) is a classical and important multivariate technique for exploring the relationship between two sets of variables. It has applications in many fields including genomics and imaging, to extract meaningful features as well as to use the features for subsequent analysis. This paper considers adaptive and computationally tractable estimation of leading sparse canonical directions when the ambient dimensions are high. Three intrinsically related problems are studied to fully address the topic. First, we establish the minimax rates of the problem under prediction loss. Separate minimax rates are obtained for canonical directions of each set of random variables under mild conditions. There is no structural assumption needed on the marginal covariance matrices as long as they are well conditioned. Second, we propose a computationally feasible two-stage estimation procedure, which consists of a convex programming based initialization stage and a group-Lasso based refinement stage, to attain the minimax rates under an additional sample size condition. Finally, we provide evidence that the additional sample size condition is essentially necessary for any randomized polynomial-time estimator to be consistent, assuming hardness of the Planted Clique detection problem. The computational lower bound is faithful to the Gaussian models used in the paper, which is achieved by a novel construction of the reduction scheme and an asymptotic equivalence theory for Gaussian discretization that is necessary for computational complexity to be well-defined. As a byproduct, we also obtain a computational lower bound for the sparse PCA problem under the Gaussian spiked covariance model. This bridges a gap in the sparse PCA literature.
Nonparametric Multi-Level Clustering of Human Epilepsy Seizures
(2016-01-01) Wulsin, Drausin F; Jensen, Shane T; Litt, Brian
Understanding neuronal activity in the human brain is an extremely difficult problem both in terms of measurement and statistical modeling. We address a particular research question in this area: the analysis of human intracranial electroencephalogram (iEEG) recordings of epileptic seizures from a collection of patients. In these data, each seizure of each patient is defined by the activities of many individual recording channels. The modeling of epileptic seizures is challenging due to the large amount of heterogeneity in iEEG signal between channels within a particular seizure, between seizures within an individual, and across individuals. We develop a new nonparametric hierarchical Bayesian model that simultaneously addresses these multiple levels of heterogeneity in our epilepsy data. Our approach, which we call a multi-level clustering hierarchical Dirichlet process (MLC-HDP), clusters over channel activities within a seizure, over seizures of a patient and over patients. We demonstrate the advantages of our methodology over alternative approaches in human EEG seizure data and show that its seizure clustering is close to manual clustering by a physician expert. We also address important clinical questions like “to which seizures of other patients is this seizure similar?”