Storing, querying and updating XML

Koon Kau Byron Choi, University of Pennsylvania

Abstract

This dissertation focuses on four closely related practical problems of schema-directed XML publishing. First, there is an increasing need to store large XML documents produced by XML publishing middleware, yet a space-efficient solution for storing XML is lacking. Second, one of the major reasons for storing the published data in an XML store is that one may prefer to query or retrieve part of the document from the store instead of publishing the data from the relational databases again. We investigate a scalable method of querying large XML documents. Third, since the published XML is "cached" outside the relational databases, one needs to maintain it efficiently when the underlying databases are updated. Fourth, published XML can be considered an interface to the databases: one may want to update the underlying databases by updating the published XML. Despite substantial work on publishing relational data in XML format, no one, to our knowledge, has considered XML publishing with native storage or the appropriate incremental maintenance and update algorithms. To address the first and second problems, we propose a native approach, namely XML vectorization, to store XML documents. In brief, our storage can be roughly considered a persistent version of a recent XML compression technique. Furthermore, we derive our query evaluation algorithm on the store by incorporating the graph reduction technique, inspired by functional programming research, and the "lazy" I/O supported by the storage scheme. Through experiments, we verified that XML vectorization works effectively, in terms of space used and query performance, on a wide range of XML documents.
For the third problem, we derive two solutions: the first pushes incremental computations to the underlying relational databases and requires advanced database functionality; the second, bud-cut evaluation, comprises specific optimizations, including indexing and caching mechanisms, in the XML publishing middleware. Finally, for the last problem, we illustrate how updates on XML can be rewritten into a group of updates on relational views, which represent the published XML document. In addition, we propose a sound algorithm for translating relational view updates into updates on the underlying relational databases. We implemented these algorithms and showed that they perform well on a wide range of data sets. The algorithms have been tested on a real-life protein data warehouse provided by our collaborators from the European Bioinformatics Institute, Hinxton.
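
As a rough illustration of the idea behind XML vectorization mentioned in the abstract, the sketch below separates a small document into a structure-only skeleton and one vector of text values per root-to-leaf path. It is a minimal sketch under stated assumptions, not the dissertation's storage format: the function name `vectorize`, the nested-tuple skeleton, and the sample document are all hypothetical, and attributes and mixed content are ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def vectorize(xml_text):
    """Split a document into a structure-only skeleton and per-path text vectors.

    Hypothetical helper for illustration; ignores attributes and mixed content.
    """
    root = ET.fromstring(xml_text)
    vectors = defaultdict(list)  # path -> text values in document order

    def walk(elem, parent_path):
        path = parent_path + "/" + elem.tag
        text = (elem.text or "").strip()
        if text:
            vectors[path].append(text)
        # The skeleton keeps only tags and nesting; values live in the vectors.
        return (elem.tag, [walk(child, path) for child in elem])

    skeleton = walk(root, "")
    return skeleton, dict(vectors)


# Assumed sample document, not taken from the dissertation.
doc = (
    "<db>"
    "<book><title>A</title><year>2001</year></book>"
    "<book><title>B</title><year>2005</year></book>"
    "</db>"
)
skeleton, vectors = vectorize(doc)
```

In this sketch the two `book` subtrees of the skeleton are identical once their values have been moved into the vectors, which is the kind of structural repetition a compressed skeleton could share. A query that touches only titles would need to read only the `/db/book/title` vector.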

Subject Area

Computer science

Recommended Citation

Choi, Koon Kau Byron, "Storing, querying and updating XML" (2006). Dissertations available from ProQuest. AAI3225441.
https://repository.upenn.edu/dissertations/AAI3225441
