Processing XML Streams with Deterministic Automata and Stream Indexes

Loading...
Thumbnail Image
Penn collection
Database Research Group (CIS)
Degree type
Discipline
Subject
Funder
Grant number
License
Copyright date
Distributor
Related resources
Author
Gupta, Ashish
Miklau, Gerome
Onizuka, Makoto
Suciu, Dan
Contributor
Abstract

We consider the problem of evaluating a large number of XPath expressions on a stream of XML packets. We contribute two novel techniques. The first is to use a single Deterministic Finite Automaton (DFA). The contribution here is to show that the DFA can be used effectively for this problem: in our experiments we achieve a constant throughput, independently of the number of XPath expressions. The major issue is the size of the DFA, which, in theory, can be exponential in the number of XPath expressions. We provide a series of theoretical results and experimental evaluations that show that the lazy DFA has a small number of states, for all practical purposes. These results are of general interest in XPath processing, beyond stream processing. The second technique is the Streaming IndeX (SIX), which consists of adding a small amount of binary data to each XML packet that allows the query processor to achieve significant speedups. As an application of these techniques we describe the XML Toolkit (XMLTK), a collection of command-line tools providing highly scalable XML data processing.

Advisor
Date Range for Data Collection (Start Date)
Date Range for Data Collection (End Date)
Digital Object Identifier
Series name and number
Publication date
2004-05-11
Journal title
Volume number
Issue number
Publisher
Publisher DOI
Journal Issue
Comments
Postprint version. Copyright ACM 2004. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Transactions on Database Systems (TODS), Volume 29, Issue 4, December 2004, pages 752-788. Publisher URL: http://doi.acm.org/10.1145/1042046.1042051
Recommended citation
Collection