Processing XML Streams with Deterministic Automata and Stream Indexes

Thumbnail Image
Penn collection
Database Research Group (CIS)
Degree type
Grant number
Copyright date
Related resources
Gupta, Ashish
Miklau, Gerome
Onizuka, Makoto
Suciu, Dan

We consider the problem of evaluating a large number of XPath expressions on a stream of XML packets. We contribute two novel techniques. The first is to use a single Deterministic Finite Automaton (DFA). The contribution here is to show that the DFA can be used effectively for this problem: in our experiments we achieve a constant throughput, independently of the number of XPath expressions. The major issue is the size of the DFA, which, in theory, can be exponential in the number of XPath expressions. We provide a series of theoretical results and experimental evaluations that show that the lazy DFA has a small number of states, for all practical purposes. These results are of general interest in XPath processing, beyond stream processing. The second technique is the Streaming IndeX (SIX), which consists of adding a small amount of binary data to each XML packet that allows the query processor to achieve significant speedups. As an application of these techniques we describe the XML Toolkit (XMLTK), a collection of command-line tools providing highly scalable XML data processing.

Date Range for Data Collection (Start Date)
Date Range for Data Collection (End Date)
Digital Object Identifier
Series name and number
Publication date
Journal title
Volume number
Issue number
Publisher DOI
Journal Issue
Postprint version. Copyright ACM 2004. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Transactions on Database Systems (TODS), Volume 29, Issue 4, December 2004, pages 752-788. Publisher URL:
Recommended citation