Guha, Sudipto

Email Address
ORCID
Disciplines
Research Projects
Organizational Units
Position
Introduction
Research Interests

Search Results

Now showing 1 - 10 of 17
  • Publication
    Tight results for clustering and summarizing data streams
    (2009-01-09) Guha, Sudipto
    In this paper we investigate algorithms and lower bounds for summarization problems over a single pass data stream. In particular we focus on histogram construction and K-center clustering. We provide a simple framework that improves upon all previous algorithms on these problems in either the space bound, the approximation factor or the running time. The framework uses a notion of ``streamstrapping'' where summaries created for the initial prefixes of the data are used to develop better approximation algorithms. We also prove the first non-trivial lower bounds for these problems. We show that the stricter requirement that if an algorithm accurately approximates the error of every bucket or every cluster produced by it, then these upper bounds are almost the best possible. This property of accurate estimation is true of all known upper bounds on these problems.
  • Publication
    A Substrate for In-Network Sensor Data Integration
    (2008-08-24) Mihaylov, Svilen; Jacob, Marie; Ives, Zachary G; Guha, Sudipto
    With the ultimate goal of extending the data integration paradigm and query processing capabilities to ad hoc wireless networks, sensors, and stream systems, we consider how to support communication between sets of nodes performing distributed joins in sensor networks. We develop a communication model that enables in-network join at a variety of locations, and which facilitates coordination among nodes in order to make optimization decisions. While we defer a discussion of the optimizer to future work, we experimentally compare a variety of strategies, including at-base and in-network joins. Results show significant performance gains versus prior work, as well as opportunities for optimization.
  • Publication
    Revisiting the Direct Sum Theorem and Space Lower Bounds in Random Order Streams
    (2009-04-27) Guha, Sudipto; Huang, Zhiyi
    Estimating frequency moments and $L_p$ distances are well studied problems in the adversarial data stream model and tight space bounds are known for these two problems. There has been growing interest in revisiting these problems in the framework of random-order streams. The best space lower bound known for computing the $k^{th}$ frequency moment in random-order streams is $\Omega(n^{1-2.5/k})$ by Andoni et al., and it is conjectured that the real lower bound shall be $\Omega(n^{1-2/k})$. In this paper, we resolve this conjecture. In our approach, we revisit the direct sum theorem developed by Bar-Yossef et al. in a random-partition private messages model and provide a tight $\Omega(n^{1-2/k}/\ell)$ space lower bound for any $\ell$-pass algorithm that approximates the frequency moment in random-order stream model to a constant factor. Finally, we also introduce the notion of space-entropy tradeoffs in random order streams, as a means of studying intermediate models between adversarial and fully random order streams. We show an almost tight space-entropy tradeoff for $L_\infty$ distance and a non-trivial tradeoff for $L_p$ distances.
  • Publication
    Linear Programming in the Semi-streaming Model with Application to the Maximum Matching Problem
    (2012-01-29) Ahn, KookJin; Guha, Sudipto
    In this paper we study linear-programming based approaches to the maximum matching problem in the semi-streaming model. In this model edges are presented sequentially, possibly in an adversarial order, and we are only allowed to use a small space. The allowed space is near linear in the number of vertices (and sublinear in the number of edges) of the input graph. The semi-streaming model is relevant in the context of processing of very large graphs. In recent years, there have been several new and exciting results in the semi-streaming model. However broad techniques such as linear programming have not been adapted to this model. In this paper we present several techniques to adapt and optimize linear-programming based approaches in the semi-streaming model. We use the maximum matching problem as a foil to demonstrate the effectiveness of adapting such tools in this model. As a consequence we improve almost all previous results on the semi-streaming maximum matching problem. We also prove new results on interesting variants.
  • Publication
    Lower Bounds for Quantile Estimation in Random-Order and Multi-Pass Streaming
    (2007-08-26) Guha, Sudipto; McGregor, Andrew
    We present lower bounds on the space required to estimate the quantiles of a stream of numerical values. Quantile estimation is perhaps the most studied problem in the data stream model and it is relatively well understood in the basic single-pass data stream model in which the values are ordered adversarially. Natural extensions of this basic model include the random-order model in which the values are ordered randomly (e.g. [21,5,13,11,12]) and the multi-pass model in which an algorithm is permitted a limited number of passes over the stream (e.g. [6,7,1,19,2,6,7,19,2]). We present lower bounds that complement existing upper bounds [21,11] in both models. One consequence is an exponential separation between the random-order and adversarial-order models: using Ω(polylog n) space, exact selection requires Ω(log n) passes in the adversarial-order model while O(loglog n) passes are sufficient in the random-order model.
  • Publication
    Dynamic Join Optimization in Multi-Hop Wireless Sensor Networks
    (2010-01-01) Mihaylov, Svilen; Ives, Zachary G; Jacob, Marie; Guha, Sudipto
    To enable smart environments and self-tuning data centers, we are developing the Aspen system for integrating physical sensor data, as well as stream data coming from machine logical state, and database or Web data from the Internet. A key component of this system is a query processor optimized for limited-bandwidth, possibly battery-powered devices with multiple hop wireless radio communications. This query processor is given a portion of a data integration query, possibly including joins among sensors, to execute. Several recent papers have developed techniques for computing joins in sensors, but these techniques are static and are only appropriate for specific join selectivity ratios. We consider the problem of dynamic join optimization for sensor networks, developing solutions that employ cost modeling, as well as adaptive learning and self-tuning heuristics to choose the best algorithm under real and variable selectivity values. We focus on in-network join computation, but our architecture extends to other approaches (and we compare against these). We develop basic techniques assuming selectivities are uniform and known in advance, and optimization can be done on a pairwise basis; we then extend the work to handle joins between multiple pairs, when selectivities are not fully known. We experimentally validate our work at scale using standard datasets.
  • Publication
    SmartCIS: Integrating Digital and Physical Environments
    (2010-01-01) Liu, Mengmeng; Mihaylov, Svilen; Ives, Zachary G; Bao, Zhuowei; Loo, Boon Thau; Jacob, Marie; Guha, Sudipto
  • Publication
    The Steiner k-Cut Problem
    (2006-03-24) Chekuri, Chandra; Guha, Sudipto; Naor, Joseph
    We consider the Steiner k-cut problem which generalizes both the k-cut problem and the multiway cut problem. The Steiner k-cut problem is defined as follows. Given an edge-weighted undirected graph G = (V,E), a subset of vertices X ⊆ V called terminals, and an integer k ≤ |X|, the objective is to find a minimum weight set of edges whose removal results in k disconnected components, each of which contains at least one terminal. We give two approximation algorithms for the problem: a greedy (2 − 2/k )-approximation based on Gomory–Hu trees, and a (2 − 2/|X|)-approximation based on rounding a linear program. We use the insight from the rounding to develop an exact bidirected formulation for the global minimum cut problem (the k-cut problem with k = 2).
  • Publication
    A Substrate for In-Network Sensor Data Integration
    (2008-10-01) Mihaylov, Svilen R; Jacob, Marie; Ives, Zachary G; Guha, Sudipto
    With the ultimate goal of extending the data integration paradigm and query processing capabilities to ad hoc wireless networks, sensors, and stream systems, we consider how to support communication between sets of nodes performing distributed joins in sensor networks. We develop a communication model that enables in-network join at a variety of locations, and which facilitates coordination among nodes in order to make optimization decisions. While we defer a discussion of the optimizer to future work, we experimentally compare a variety of strategies, including at-base and in-network joins. Results show significant performance gains versus prior work, as well as opportunities for optimization.
  • Publication
    Machine Minimization for Scheduling Jobs with Interval Constraints
    (2004-10-17) Guha, Sudipto; Chuzhoy, Julia; Khanna, Sanjeev; Naor, Joseph
    The problem of scheduling jobs with interval constraints is a well-studied classical scheduling problem. The input to the problem is a collection of n jobs where each job has a set of intervals on which it can be scheduled. The goal is to minimize the total number of machines needed to schedule all jobs subject to these interval constraints. In the continuous version, the allowed intervals associated with a job form a continuous time segment, described by a release date and a deadline. In the discrete version of the problem, the set of allowed intervals for a job is given explicitly. So far, only an 0[(log n)/(log log n)]-approximation is known for either version of the problem, obtained by a randomized rounding of a natural linear programming relaxation of the problem. In fact, we show here that this analysis is tight for both versions of the problem by providing a matching lower bound on the integrality gap of the linear program. Moreover, even when all jobs can be scheduled on a single machine, the discrete case has recently been shown to be Ω(log log n)-hard to approximate. In this paper we provide improved approximation factors for the number of machines needed to schedule all jobs in the continuous version of the problem. Our main result is an O(1)-approximation algorithm when the optimal number of machines needed is bounded by a fixed constant. Thus, our results separate the approximability of the continuous and the discrete cases of the problem. For general instances, we strengthen the natural linear programming relaxation in a recursive manner by forbidding certain configurations which cannot arise in an integral feasible solution. This yields an O(OPT)-approximation, where OPT denotes the number of machines needed by an optimal solution. Combined with earlier results, our work implies an 0{√[(log n)/(log log n)]}-approximation for any value of OPT.