Green, Todd J
Publication: Update Exchange With Mappings and Provenance (2007-11-27)
Green, Todd J; Karvounarakis, Grigoris; Ives, Zachary G; Tannen, Val
We consider systems for data sharing among heterogeneous peers related by a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to ask queries over related data from other peers as well. To achieve this, every peer’s updates propagate along the mappings to the other peers. However, this update exchange is filtered by trust conditions (expressing what data and sources a peer judges to be authoritative), which may cause a peer to reject another’s updates. In order to support such filtering, updates carry provenance information. These systems target scientific data sharing applications, and their general principles and architecture have been described in prior work. In this paper we present methods for realizing such systems. Specifically, we extend techniques from data integration, data exchange, and incremental view maintenance to propagate updates along mappings; we integrate a novel model for tracking data provenance, such that curators may filter updates based on trust conditions over this provenance; we discuss strategies for implementing our techniques in conjunction with an RDBMS; and we experimentally demonstrate the viability of our techniques in the Orchestra prototype system. This technical report supersedes the version which appeared in VLDB 2007 and corrects certain technical claims regarding the semantics of our system (see errata in Sections 3.1 and 4.1.1).
Publication: Collaborative Data Sharing with Mappings and Provenance (2009-12-07)
Green, Todd J
A key challenge in science today involves integrating data from databases managed by different collaborating scientists. In this dissertation, we develop the foundations and applications of collaborative data sharing systems (CDSSs), which address this challenge.
A CDSS allows collaborators to define loose confederations of heterogeneous databases, relating them through schema mappings that establish how data should flow from one site to the next. In addition to simply propagating data along the mappings, it is critical to record data provenance (annotations describing where and how data originated) and to support policies allowing scientists to specify whose data they trust, and when. Since a large data sharing confederation is certain to evolve over time, the CDSS must also efficiently handle incremental changes to data, schemas, and mappings. We focus in this dissertation on the formal foundations of CDSSs, as well as practical issues of their implementation in a prototype CDSS called Orchestra. We propose a novel model of data provenance appropriate for CDSSs, based on a framework of semiring-annotated relations. This framework elegantly generalizes a number of other important database semantics involving annotated relations, including ranked results, prior provenance models, and probabilistic databases. We describe the design and implementation of the Orchestra prototype, which supports update propagation across schema mappings while maintaining data provenance and filtering data according to trust policies. We investigate fundamental questions of query containment and equivalence in the context of provenance information. We use the results of these investigations to develop novel approaches to efficiently propagating changes to data and mappings in a CDSS. Our approaches highlight unexpected connections between the two problems and with the problem of optimizing queries using materialized views. Finally, we show that semiring annotations also make sense for XML and nested relational data, paving the way towards a future extension of CDSS to these richer data models.
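As an illustration of the semiring-annotated relations mentioned in the abstract above, here is a minimal Python sketch. It is not taken from the dissertation or from Orchestra: the names `Semiring`, `compose`, `BOOL`, `NAT`, and `WHY`, and the example relations `R` and `S`, are hypothetical. It evaluates the query Q(a, c) :- R(a, b), S(b, c), multiplying annotations for joint use and adding them across alternative derivations.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any                        # annotation of an absent tuple
    one: Any                         # neutral element for joint use
    add: Callable[[Any, Any], Any]   # combines alternative derivations
    mul: Callable[[Any, Any], Any]   # combines jointly used tuples

# Set semantics: a tuple is either present or absent.
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
# Bag semantics: annotations are multiplicities.
NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
# A why-provenance-style semiring: sets of sets of source tuple ids.
WHY = Semiring(
    frozenset(),
    frozenset({frozenset()}),
    lambda a, b: a | b,
    lambda a, b: frozenset(x | y for x in a for y in b),
)

def compose(K, r, s):
    """Evaluate Q(a, c) :- R(a, b), S(b, c) over annotated relations.

    Relations are dicts mapping tuples to annotations drawn from K.
    """
    out = {}
    for (a, b), x in r.items():
        for (b2, c), y in s.items():
            if b == b2:
                prev = out.get((a, c), K.zero)
                out[(a, c)] = K.add(prev, K.mul(x, y))
    return out

# Hypothetical example data: two derivations of the output tuple (1, 4).
bag_result = compose(NAT, {(1, 2): 2, (1, 3): 1}, {(2, 4): 3, (3, 4): 5})
set_result = compose(BOOL, {(1, 2): True, (1, 3): True},
                     {(2, 4): True, (3, 4): True})

def tid(name):
    """Annotation for a source tuple with the given id."""
    return frozenset({frozenset({name})})

why_result = compose(WHY, {(1, 2): tid("p"), (1, 3): tid("q")},
                     {(2, 4): tid("r"), (3, 4): tid("s")})
```

The same `compose` function yields a multiplicity under `NAT`, a truth value under `BOOL`, and the sets of contributing source tuples under `WHY`; only the semiring changes, which is the sense in which one framework covers several semantics.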
Publication: Provenance Semirings (2007-06-11)
Green, Todd J; Karvounarakis, Grigoris; Tannen, Val
We show that relational algebra calculations for incomplete databases, probabilistic databases, bag semantics and why provenance are particular cases of the same general algorithms involving semirings. This further suggests a comprehensive provenance representation that uses semirings of polynomials. We extend these considerations to datalog and semirings of formal power series. We give algorithms for datalog provenance calculation as well as datalog evaluation for incomplete and probabilistic databases. Finally, we show that for some semirings containment of conjunctive queries is the same as for standard set semantics.
Publication: Processing XML Streams with Deterministic Automata (2003-01-01)
Green, Todd J; Miklau, Gerome; Onizuka, Makoto; Suciu, Dan
We consider the problem of evaluating a large number of XPath expressions on an XML stream. Our main contribution consists in showing that Deterministic Finite Automata (DFA) can be used effectively for this problem: in our experiments we achieve a throughput of about 5.4MB/s, independent of the number of XPath expressions (up to 1,000,000 in our tests). The major problem we face is that of the size of the DFA. Since the number of states grows exponentially with the number of XPath expressions, it was previously believed that DFAs cannot be used to process large sets of expressions. We make a theoretical analysis of the number of states in the DFA resulting from XPath expressions, and consider both the case when it is constructed eagerly, and when it is constructed lazily. Our analysis indicates that, when the automaton is constructed lazily, and under certain assumptions about the structure of the input XML data, the number of states in the lazy DFA is manageable. We also validate our findings experimentally, on both synthetic and real data sets.
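The lazy construction described in the abstract above can be sketched roughly as follows. This is a simplified Python illustration, not the authors' implementation: it assumes linear path expressions with only `/`, `//`, and `*` steps (no predicates), and the names `LazyDFA`, `run`, and the example paths and events are hypothetical. DFA states are sets of NFA states, and a transition is computed only the first time the stream actually requires it.

```python
import re

class LazyDFA:
    def __init__(self, paths):
        # Each path becomes a list of (axis, name) steps, e.g. "//a/b"
        # becomes [("//", "a"), ("/", "b")].
        self.steps = [re.findall(r"(//?)([^/]+)", p) for p in paths]
        # An NFA state is (path index, number of steps matched so far).
        self.start = frozenset((i, 0) for i in range(len(paths)))
        # Transition table, filled on demand: (dfa_state, tag) -> dfa_state.
        self.table = {}

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:
            nxt = set()
            for p, i in state:
                if i == len(self.steps[p]):
                    continue  # accepting NFA state, no outgoing moves
                axis, name = self.steps[p][i]
                if axis == "//":
                    nxt.add((p, i))       # descendant axis: may skip this tag
                if name == tag or name == "*":
                    nxt.add((p, i + 1))   # this step matches the tag
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def matches(self, state):
        """Indexes of the paths fully matched in this DFA state."""
        return sorted(p for p, i in state if i == len(self.steps[p]))

def run(dfa, events):
    """Process (kind, tag) events; return (tag, path index) for each match."""
    stack = [dfa.start]
    hits = []
    for kind, tag in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], tag))
            hits.extend((tag, p) for p in dfa.matches(stack[-1]))
        else:
            stack.pop()  # end tag: return to the parent's DFA state
    return hits

# Hypothetical example: three paths over the document <a><b/><d><c/></d></a>.
dfa = LazyDFA(["//a/b", "/a//c", "/a/b"])
events = [("start", "a"), ("start", "b"), ("end", "b"), ("start", "d"),
          ("start", "c"), ("end", "c"), ("end", "d"), ("end", "a")]
hits = run(dfa, events)
```

Only the transitions exercised by the input are ever materialized (four table entries in this example), which is the intuition behind building the automaton lazily instead of eagerly enumerating every subset of NFA states.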
Publication: Orchestra: Facilitating Collaborative Data Sharing (2007-06-11)
Green, Todd J; Karvounarakis, Grigoris; Taylor, Nicholas E; Biton, Olivier; Ives, Zachary G; Tannen, Val
One of the most elusive goals of structured data management has been sharing among large, heterogeneous populations: while data integration [4, 10] and exchange are gradually being adopted by corporations or small confederations, little progress has been made in integrating broader communities. Yet the need for large-scale sharing of heterogeneous data is increasing: most of the sciences, particularly biology and astronomy, have become data-driven as they have attempted to tackle larger questions. The field of bioinformatics, in particular, has seen a plethora of different databases emerge: each is focused on a related but subtly different collection of organisms (e.g., CryptoDB, TIGR, FlyNome), genes (GenBank, GeneDB), proteins (UniProt, RCSB Protein Databank), diseases (OMIM, GeneDis), and so on. Such communities have a pressing need to interlink their heterogeneous databases in order to facilitate scientific discovery.
Publication: Annotated XML: Queries and Provenance (2008-06-11)
Foster, John N; Green, Todd J; Tannen, Val
We present a formal framework for capturing the provenance of data appearing in XQuery views of XML. Building on previous work on relations and their (positive) query languages, we decorate unordered XML with annotations from commutative semirings and show that these annotations suffice for a large positive fragment of XQuery applied to this data. In addition to tracking provenance metadata, the framework can be used to represent and process XML with repetitions, incomplete XML, and probabilistic XML, and provides a basis for enforcing access control policies in security applications.
Each of these applications builds on our semantics for XQuery, which we present in several steps: we generalize the semantics of the Nested Relational Calculus (NRC) to handle semiring-annotated complex values, we extend it with a recursive type and structural recursion operator for trees, and we define a semantics for XQuery on annotated XML by translation into this calculus.
Publication: Processing XML Streams with Deterministic Automata and Stream Indexes (2004-05-11)
Green, Todd J; Gupta, Ashish; Miklau, Gerome; Onizuka, Makoto; Suciu, Dan
We consider the problem of evaluating a large number of XPath expressions on a stream of XML packets. We contribute two novel techniques. The first is to use a single Deterministic Finite Automaton (DFA). The contribution here is to show that the DFA can be used effectively for this problem: in our experiments we achieve a constant throughput, independently of the number of XPath expressions. The major issue is the size of the DFA, which, in theory, can be exponential in the number of XPath expressions. We provide a series of theoretical results and experimental evaluations that show that the lazy DFA has a small number of states, for all practical purposes. These results are of general interest in XPath processing, beyond stream processing. The second technique is the Streaming IndeX (SIX), which consists of adding a small amount of binary data to each XML packet that allows the query processor to achieve significant speedups. As an application of these techniques we describe the XML Toolkit (XMLTK), a collection of command-line tools providing highly scalable XML data processing.
Publication: Reconcilable Differences (2009-03-23)
Green, Todd J; Ives, Zachary G; Tannen, Val
Exact query reformulation using views in positive relational languages is well understood, and has a variety of applications in query optimization and data sharing.
Generalizations to larger fragments of the relational algebra (RA), specifically support for the difference operator, would increase the options available for query reformulation, and also apply to view adaptation (updating a materialized view in response to a modified view definition) and view maintenance. Unfortunately, most questions about queries become undecidable in the presence of difference/negation. We present a novel way of managing this difficulty via an excursion through a non-standard semantics, Z-relations, where tuples are annotated with positive or negative integers. We show that under Z-semantics RA queries have a normal form as a single difference of positive queries and this leads to the decidability of equivalence. In most real-world settings with difference, it is possible to convert the queries to this normal form. We give a sound and complete algorithm that explores all reformulations of an RA query (under Z-semantics) using a set of RA views, finitely bounding the search space with a simple and natural cost model. We investigate related complexity questions, and we also extend our results to queries with built-in predicates. Z-relations are interesting in their own right because they capture updates and data uniformly. However, our algorithm turns out to be sound and complete also for bag semantics, albeit necessarily only for a subclass of RA. This subclass turns out to be quite large and generously covers the applications of interest to us. We also show a subclass of RA where reformulation and evaluation under Z-semantics can be combined with duplicate elimination to obtain the answer under set semantics.
Publication: Containment of Conjunctive Queries on Annotated Relations (2009-03-23)
Green, Todd J
We study containment and equivalence of (unions of) conjunctive queries on relations annotated with elements of a commutative semiring.
Such relations and the semantics of positive relational queries on them were introduced in a recent paper as a generalization of set semantics, bag semantics, incomplete databases, and databases annotated with various kinds of provenance information. We obtain positive decidability results and complexity characterizations for databases with lineage, why-provenance, and provenance polynomial annotations, for both conjunctive queries and unions of conjunctive queries. At least one of these results is surprising given that provenance polynomial annotations seem “more expressive” than bag semantics and under the latter, containment of unions of conjunctive queries is known to be undecidable. The decision procedures rely on interesting variations on the notion of containment mappings. We also show that for any positive semiring (a very large class) and conjunctive queries without self-joins, equivalence is the same as isomorphism.
Publication: XMLTK: An XML Toolkit for Scalable XML Stream Processing (2002-10-03)
Avila-Campillo, Iliana; Green, Todd J; Gupta, Ashish; Onizuka, Makoto; Raven, Demian; Suciu, Dan
We describe a toolkit for highly scalable XML data processing, consisting of two components. The first is a collection of stand-alone XML tools, such as sorting, aggregation, nesting, and unnesting, that can be chained to express more complex restructurings. The second is a highly scalable XPath processor for XML streams that can be used to develop scalable solutions for XML stream applications. In this paper we discuss the tools, and some of the techniques we used to achieve high scalability. The toolkit is freely available as an open-source project.