Combinatorial Optimization On Massive Datasets: Streaming, Distributed, And Massively Parallel Computation

Thumbnail Image
Degree type
Doctor of Philosophy (PhD)
Graduate group
Computer and Information Science
Combinatorial Optimization
Communication Complexity
Distributed Computing
Massively Parallel Computation
Computer Sciences
Grant number
Copyright date
Related resources

With the emergence of massive datasets across different application domains, there is a rapidly growing need to solve various optimization tasks over such datasets. This in turn raises the following fundamental question: How well can we solve a large-scale optimization problem on massive datasets in a resource-efficient manner? The focus of this thesis is on answering this question for various problems in different modern computational models for processing massive datasets, in particular, streaming, distributed, and massively parallel computation (such as MapReduce) models. The first part of this thesis is focused on graph optimization, in which we study several fundamental graph optimization problems including matching, vertex cover, and connectivity. The main results in this part include a tight bound on the space complexity of the matching problem in dynamic streams, a unifying framework for achieving algorithms that improve the state-of-the-art for matching and vertex cover problems in all the mentioned models, and the first massively parallel algorithm for connectivity on sparse graphs that improve upon the classical parallel PRAM algorithms for a large family of graphs. In the second part of the thesis, we consider submodular optimization and in particular two canonical examples of set cover and maximum coverage problems. We establish tight bounds on the space complexity of approximating set cover and maximum coverage in both single- and multi-pass streams, and build on these results to address the general problem of submodular maximization across all aforementioned models. In the last part, we consider resource constrained optimization settings which require computation over data which is not particularly large but still imposes restrictions of similar nature. Examples include optimization over data which can only be accessed with limited adaptivity (e.g. in crowdsourcing applications), or corresponds to private information of agents and is not available in an integrated form (e.g., in auction and mechanism design applications). We use the toolkit developed in the first two parts of the thesis to obtain several new results for such problems, significantly improving the state-of-the-art. A common theme in this thesis is a rigorous study of problems from both an algorithmic perspective and impossibility results. Rather than coming up with ad hoc solutions to problems at hand, the goal here is to develop general techniques for solving various large-scale optimization problems in a unified way in different models, which in turns requires further understanding of the limitations of known approaches for addressing these problems.

Sanjeev Khanna
Date of degree
Date Range for Data Collection (Start Date)
Date Range for Data Collection (End Date)
Digital Object Identifier
Series name and number
Volume number
Issue number
Publisher DOI
Journal Issue
Recommended citation