Processing data streams

Andrew McGregor, University of Pennsylvania


Data streams are ubiquitous. Examples include the network traffic flowing past a router, data generated by an SQL query, readings taken in a sensor network, and files read from an external memory device. The data-stream model abstracts the main algorithmic constraints when processing such data: sequential access to data, limited time to process each data item, and limited working memory. The challenge of designing efficient algorithms in this model has enjoyed significant attention over the last ten years. In this thesis we investigate two new directions of data-stream research: stochastic streams, where data streams are considered through a statistical or learning-theoretic lens, and graph streams, where highly structured data needs to be processed. Much of the work in this thesis is motivated by the following general questions.

Stream Ordering: Almost all prior work has considered streams that are assumed to be adversarially ordered. What happens if we relax this assumption?

Multiple Passes: Previous work has focused on the single-pass model. What trade-offs arise if algorithms are permitted multiple passes over the stream?

Space-Efficient Sampling: Rather than approximating functions of the empirical data in the stream, what can be learned about the source of the stream?

Sketchability: A fundamental problem in the data-stream model is to compare two streams. What notions of difference can be estimated in small space?

In the process of considering these questions, we present algorithms and lower bounds for a range of problems including estimating quantiles, various forms of entropy, information divergences, graph diameter and girth; learning histograms and piecewise-linear probability density functions; and constructing graph matchings.

Subject Area

Computer science

Recommended Citation

McGregor, Andrew, "Processing data streams" (2007). Dissertations available from ProQuest. AAI3271787.