Statistics Papers

The aim of statistical modeling is to empower effective decision making, and the unique contribution of the field is its ability to incorporate multiple levels of uncertainty in the framing of wise decisions. Over the last few years, the development of new computational tools and the unprecedented evolution of “big data” have propelled statistical modeling to new levels. Today statistical modeling and machine learning have reached a level of impact that no large organization can afford to ignore. The information landscape is changing as it has never changed before.

At Wharton, the Department of Statistics is proud to have had a leadership role in this development. It participates in a wide range of university consortia that span the fields of computer science, neuroscience, medicine, public policy, and finance. Moreover, our faculty members have won singular international recognition for their contributions to many parts of statistical science, including observational studies, statistical algorithms, game theory, high-dimensional inference, information theory, nonparametric function estimation, model selection, time series analysis, machine learning, and probability theory.

 

Search results

Now showing 1 - 10 of 480
  • Publication
    Bugs on a Budget: Distributed Sensing With Cost for Reporting and Non-Reporting
    (2010-10-01) Pozdnyakov, Vladimir; Steele, J Michael
    We consider a simple model of sequential decisions made by a fusion agent that receives binary-passive reports from distributed sensors. The main result is an explicit formula for the probability of making a decision before a fixed budget is exhausted. These results depend on the relationship between a special ruin problem for a “lazy random walk” and a traditional biased walk.
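    The paper's explicit formula is not reproduced here, but a minimal Monte Carlo sketch can illustrate the quantity being computed: the chance that a lazy random walk reaches a decision threshold before a fixed budget is spent. All names and values below (p_report, p_up, threshold, cost_per_step, budget) are hypothetical illustrations, not the paper's model.

```python
import random

def decision_before_budget(p_report=0.6, p_up=0.55, threshold=5,
                           cost_per_step=1.0, budget=100.0, trials=100_000):
    """Estimate the probability that a 'lazy' random walk reaches
    +/-threshold before the budget is exhausted.  The walk moves +1
    with probability p_report*p_up, -1 with probability
    p_report*(1-p_up), and stays put otherwise."""
    hits = 0
    for _ in range(trials):
        pos, spent = 0, 0.0
        while abs(pos) < threshold and spent < budget:
            spent += cost_per_step
            u = random.random()
            if u < p_report * p_up:
                pos += 1
            elif u < p_report:
                pos -= 1
            # else: a lazy step -- no sensor reports, the walk stays put
        if abs(pos) >= threshold:
            hits += 1
    return hits / trials

print(decision_before_budget())
```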
  • Publication
    The Projection Median of a Set of Points in R^d
    (2012-03-01) Basu, Riddhipratim; Bhattacharya, Bhaswar B; Talukdar, Tanmoy
    The projection median of a finite set of points in R^2 was introduced by Durocher and Kirkpatrick [Computational Geometry: Theory and Applications, Vol. 42 (5), 364–375, 2009]. They proved that the projection median in R^2 provides a better approximation of the 2-dimensional Euclidean median than the center of mass or the rectilinear median, while maintaining a fixed degree of stability. In this paper we study the projection median of a set of points in R^d for d ≥ 2. Using results from the theory of integration over topological groups, we show that the d-dimensional projection median provides a (d/π)B(d/2, 1/2)-approximation to the d-dimensional Euclidean median, where B(α, β) denotes the Beta function. We also show that the stability of the d-dimensional projection median is at least 1/((d/π)B(d/2, 1/2)), and that its breakdown point is 1/2. Based on the stability bound and the breakdown point, we compare the d-dimensional projection median with the rectilinear median and the center of mass as a candidate for approximating the d-dimensional Euclidean median. For the special case of d = 3, our results imply that the 3-dimensional projection median is a (3/2)-approximation of the 3-dimensional Euclidean median, which settles a conjecture posed by Durocher.
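    The approximation factor stated in the abstract is easy to evaluate numerically. A small sketch, assuming SciPy is available for the Beta function:

```python
from math import pi
from scipy.special import beta  # SciPy assumed installed

def approximation_factor(d):
    """The abstract's bound: the d-dimensional projection median is a
    (d/pi) * B(d/2, 1/2)-approximation of the Euclidean median."""
    return (d / pi) * beta(d / 2.0, 0.5)

for d in range(2, 7):
    c = approximation_factor(d)
    print(f"d={d}: factor={c:.4f}, stability bound={1 / c:.4f}")
# d=3 yields exactly 3/2, the value that settles Durocher's conjecture.
```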
  • Publication
    The Noisy Secretary Problem and Some Results on Extreme Concomitant Variables
    (2012-09-01) Krieger, Abba M; Samuel-Cahn, Ester
    The classical secretary problem for selecting the best item is studied when the actual values of the items are observed with noise. One of the main appeals of the secretary problem is that the optimal strategy is able to find the best observation with the nontrivial probability of about 0.37, even when the number of observations is arbitrarily large. The results are strikingly different when the qualities of the secretaries are observed with noise. If there is no noise, then the only information that is needed is whether an observation is the best among those already observed. Since observations are assumed to be i.i.d., this is distribution-free. In the case of noisy data, the results are no longer distribution-free. Furthermore, one needs to know the rank of the noisy observation among those already seen. Finally, the probability of finding the best secretary often goes to 0 as the number of observations, n, goes to infinity. The results depend heavily on the behavior of p_n, the probability that the observation that is best among the noisy observations is also best among the noiseless observations. Results involving optimal strategies when only noisy data are available are described, and examples are given to elucidate the results.
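    For contrast with the noisy case, the noise-free benchmark mentioned above (success probability about 0.37) can be checked by simulation. This is the standard skip-n/e rule, not the paper's noisy-data strategy:

```python
import math
import random

def secretary_success_rate(n=100, trials=100_000):
    """Classical rule: observe the first n/e candidates without
    choosing, then accept the first candidate better than everyone
    seen so far.  The success rate approaches 1/e ~ 0.368."""
    k = int(n / math.e)
    wins = 0
    for _ in range(trials):
        vals = [random.random() for _ in range(n)]
        best_seen = max(vals[:k])
        chosen = None
        for v in vals[k:]:
            if v > best_seen:
                chosen = v
                break
        if chosen is not None and chosen == max(vals):
            wins += 1
    return wins / trials

print(secretary_success_rate())  # close to 1/e ~ 0.368
```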
  • Publication
    A Bivariate Timing Model of Customer Acquisition and Retention
    (2008-01-01) Schweidel, David A; Fader, Peter S; Bradlow, Eric T
    Two widely recognized components, central to the calculation of customer value, are acquisition and retention propensities. However, while extant research has incorporated such components into different types of models, limited work has investigated the kinds of associations that may exist between them. In this research, we focus on the relationship between a prospective customer's time until acquisition of a particular service and the subsequent duration for which he retains it, and examine the implications of this relationship for the value of prospects and customers. To accomplish these tasks, we use a bivariate timing model to capture the relationship between acquisition and retention. Using a split-hazard model, we link the acquisition and retention processes in two distinct yet complementary ways. First, we use the Sarmanov family of bivariate distributions to allow for correlations in the observed acquisition and retention times within a customer; next, we allow for differences across customers using latent classes for the parameters that govern the two processes. We then demonstrate how the proposed methodology can be used to calculate the discounted expected value of a subscription based on the time of acquisition, and discuss possible applications of the modeling framework to problems such as customer targeting and resource allocation.
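    The final step the abstract mentions, valuing a subscription by discounting over the retention horizon, can be sketched in a few lines. This is a deliberately simplified discounted-expected-value calculation, not the paper's Sarmanov/split-hazard model; the survival probabilities, margin, and discount rate below are hypothetical:

```python
def discounted_expected_value(survival, margin=1.0, discount=0.1):
    """Simplified sketch: survival[t] is the probability the customer
    is still retained t periods after acquisition; each retained
    period earns `margin`, discounted at rate `discount`."""
    d = 1.0 / (1.0 + discount)
    return sum(margin * s * d ** t for t, s in enumerate(survival))

# Geometric retention with a hypothetical 20% per-period churn rate:
surv = [0.8 ** t for t in range(60)]
print(round(discounted_expected_value(surv), 3))
```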
  • Publication
    Reflections on the Occasion of the 100th Anniversary of the Monthly Labor Review
    (2016-01-01) Brown, Lawrence D; Lynch, Lisa M; Citro, Constance F
    It is an honor to comment on directions for the Monthly Labor Review (MLR) over its next 25 years. The MLR is the federal government's oldest continuous publication—first printed in 1915 and now published online by the Bureau of Labor Statistics (BLS), one of the nation's oldest statistical agencies, established in 1884. BLS embodies the standards articulated by the Committee on National Statistics (CNSTAT) in the fifth edition of its quadrennial volume Principles and Practices for a Federal Statistical Agency (National Research Council, 2013). P&P lays down four principles: that a statistical agency produce data relevant to policy issues, earn credibility with data users, earn the trust of data providers (e.g., households, businesses), and maintain independence from political and other undue external influence.
  • Publication
    Count Models Based on Weibull Interarrival Times
    (2008-01-01) McShane, Blake; Adrian, Moshe; Bradlow, Eric T; Fader, Peter S
    The widespread popularity and use of both the Poisson and the negative binomial models for count data arise, in part, from their derivation as the number of arrivals in a given time period assuming exponentially distributed interarrival times (without and with heterogeneity in the underlying base rates, respectively). However, with that clean theory come some limitations, including limited flexibility in the assumed underlying arrival rate distribution and the inability to model underdispersed counts (variance less than the mean). Although extant research has addressed some of these issues, there still remain numerous valuable extensions. In this research, we present a model that, due to computational intractability, was previously thought to be infeasible. In particular, we introduce here a generalized model for count data based upon an assumed Weibull interarrival process that nests the Poisson and negative binomial models as special cases. The computational intractability is overcome by deriving the Weibull count model using a polynomial expansion, which then allows for closed-form inference (integration term by term) when incorporating heterogeneity due to the conjugacy of the expansion and a commonly employed gamma distribution. In addition, we demonstrate that this new Weibull count model can (1) model both over- and underdispersed count data, (2) allow covariates to be introduced in a straightforward manner through the hazard function, and (3) be computed in standard software.
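    The dispersion claim is straightforward to see by simulation: counting Weibull-interarrival renewals over a fixed window yields variance below the mean for shape > 1 and above it for shape < 1, with shape = 1 recovering the Poisson case. A minimal sketch (simulation only, not the paper's polynomial-expansion derivation):

```python
import random

def weibull_count_stats(shape, scale=1.0, horizon=1.0, trials=50_000):
    """Simulate the number of arrivals in [0, horizon] when
    interarrival times are Weibull; return the sample mean and
    variance of the counts."""
    counts = []
    for _ in range(trials):
        t, n = 0.0, 0
        while True:
            t += random.weibullvariate(scale, shape)  # (alpha=scale, beta=shape)
            if t > horizon:
                break
            n += 1
        counts.append(n)
    m = sum(counts) / trials
    v = sum((c - m) ** 2 for c in counts) / trials
    return m, v

for shape in (0.5, 1.0, 2.0):  # over-, equi-, and underdispersed
    print(shape, weibull_count_stats(shape))
```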
  • Publication
    Testing Behavioral Hypotheses Using an Integrated Model of Grocery Store Shopping Path and Purchase Behavior
    (2009-10-01) Hui, Sam K; Bradlow, Eric T; Fader, Peter S
    We examine three sets of established behavioral hypotheses about consumers' in-store behavior using field data on grocery store shopping paths and purchases. Our results provide field evidence for the following empirical regularities. First, as consumers spend more time in the store, they become more purposeful—they are less likely to spend time on exploration and more likely to shop/buy. Second, consistent with “licensing” behavior, after purchasing virtue categories, consumers are more likely to shop at locations that carry vice categories. Third, the presence of other shoppers attracts consumers toward a store zone but reduces consumers' tendency to shop there.
  • Publication
    Path Data in Marketing: An Integrative Framework and Prospectus for Model Building
    (2009-01-01) Hui, Sam K; Fader, Peter S; Bradlow, Eric T
    Many data sets, from different and seemingly unrelated marketing domains, all involve paths—records of consumers' movements in a spatial configuration. Path data contain valuable information for marketing researchers because they describe how consumers interact with their environment and make dynamic choices. As data collection technologies improve and researchers continue to ask deeper questions about consumers' motivations and behaviors, path data sets will become more common and will play a more central role in marketing research. To guide future research in this area, we review the previous literature, propose a formal definition of a path (in a marketing context), and derive a unifying framework that allows us to classify different kinds of paths. We identify and discuss two primary dimensions (characteristics of the spatial configuration and the agent) as well as six underlying subdimensions. Based on this framework, we cover a range of important operational issues that should be taken into account as researchers begin to build formal models of path-related phenomena. We close with a brief look into the future of path-based models, and a call for researchers to address some of these emerging issues.
  • Publication
    Maximizing Voronoi Regions of a Set of Points Enclosed in a Circle with Applications to Facility Location
    (2010-12-01) Bhattacharya, Bhaswar B
    In this paper we introduce an optimization problem which involves maximization of the area of the Voronoi regions of a set of points placed inside a circle. Such optimization goals arise in facility location problems involving both mobile and stationary facilities. Let ψ be a circular path along which mobile service stations ply, and let S be a set of n stationary facilities (points) inside ψ. A demand point p is served from a mobile facility plying along ψ if the distance of p from the boundary of ψ is less than its distance from every member of S. On the other hand, the demand point p is served from a stationary facility p_i ∈ S if the distance of p from p_i is less than or equal to the distance of p from all other members of S and also from the boundary of ψ. The objective is to place the stationary facilities in S, inside ψ, such that the total area served by them is maximized. We consider a restricted version of this problem where the members of S are placed equidistantly from the center o of ψ. It is shown that the maximum area is obtained when the members of S lie on the vertices of a regular n-gon with its circumcenter at o. The distance of the members of S from o and the optimum area both increase with n, and in the limit approach the radius and the area of the circle ψ, respectively. We also consider another variation of this problem where a set of n points is placed inside ψ, and the task is to locate a new point q inside ψ such that the area of the Voronoi region of q is maximized. We give an exact solution of this problem when n = 1 and a (1 − ε)-approximation algorithm for the general case.
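    The service rule in the abstract translates directly into a Monte Carlo area estimate: a sampled demand point is served by S when its nearest stationary facility is at least as close as the boundary of ψ. The facility positions below are illustrative, not optimized:

```python
import math
import random

def served_area(points, radius=1.0, samples=200_000):
    """Estimate the total area served by stationary facilities
    `points` inside a circle of the given radius centered at the
    origin; the boundary 'wins' any point closer to it than to
    every stationary facility."""
    hits = 0
    for _ in range(samples):
        r = radius * math.sqrt(random.random())   # uniform in the disk
        a = random.uniform(0.0, 2.0 * math.pi)
        px, py = r * math.cos(a), r * math.sin(a)
        to_boundary = radius - r
        to_nearest = min(math.hypot(px - x, py - y) for x, y in points)
        if to_nearest <= to_boundary:
            hits += 1
    return math.pi * radius ** 2 * hits / samples

# Four facilities on a regular 4-gon at distance 0.5 from the center:
n, rho = 4, 0.5
pts = [(rho * math.cos(2 * math.pi * k / n),
        rho * math.sin(2 * math.pi * k / n)) for k in range(n)]
print(served_area(pts))
```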
  • Publication
    Fast Computation of Kernel Estimators
    (2010-01-01) Raykar, Vikas C; Duraiswami, Ramani; Zhao, Linda H
    The computational complexity of evaluating the kernel density estimate (or its derivatives) at m evaluation points given n sample points scales as O(nm), making it prohibitively expensive for large datasets. While approximate methods like binning can speed up the computation, they lack precise control over the accuracy of the approximation: there is no straightforward way to choose the binning parameters a priori in order to achieve a desired approximation error. We propose a novel, computationally efficient, ε-exact approximation algorithm for univariate Gaussian kernel-based density derivative estimation that reduces the computational complexity from O(nm) to linear O(n + m). The user can specify a desired accuracy ε, and the algorithm guarantees that the actual error between the approximation and the original kernel estimate will always be less than ε. We also apply our proposed fast algorithm to speed up automatic bandwidth selection procedures. We compare our method to the best available binning methods in terms of speed and accuracy. Our experimental results show that the proposed method is almost twice as fast as the best binning methods and around five orders of magnitude more accurate. The software for the proposed method is available online.
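    For scale, the O(nm) baseline the paper accelerates is the direct double loop below; this naive sketch is the benchmark being improved upon, not the paper's ε-exact algorithm:

```python
import math

def naive_gaussian_kde(samples, eval_points, bandwidth):
    """Direct O(n*m) evaluation of a Gaussian kernel density
    estimate -- the cost the paper reduces to O(n + m)."""
    n = len(samples)
    c = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [c * sum(math.exp(-0.5 * ((y - x) / bandwidth) ** 2)
                    for x in samples)
            for y in eval_points]

xs = [0.1, 0.4, 0.5, 0.9]
print(naive_gaussian_kde(xs, [0.0, 0.5, 1.0], bandwidth=0.2))
```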