Database Research Group (CIS)

Penn is home to one of the top database research groups in the US and has made many fundamental contributions to the field -- particularly in scientific data management, Web data management, and data provenance. Our research ranges from theory to systems and applications, and connects to other research areas within Penn's CIS Department (such as machine learning, programming languages, logic and computation, and approximation algorithms). Within the University, we also collaborate frequently with researchers in bioinformatics and genomics.

Department of Computer & Information Science

Orchestra: Facilitating Collaborative Data Sharing
(2007-06-11) Green, Todd J; Karvounarakis, Grigoris; Taylor, Nicholas E; Biton, Olivier; Ives, Zachary G; Tannen, Val
One of the most elusive goals of structured data management has been sharing among large, heterogeneous populations: while data integration [4, 10] and exchange [3] are gradually being adopted by corporations or small confederations, little progress has been made in integrating broader communities. Yet the need for large-scale sharing of heterogeneous data is increasing: most of the sciences, particularly biology and astronomy, have become data-driven as they have attempted to tackle larger questions. The field of bioinformatics, in particular, has seen a plethora of different databases emerge: each is focused on a related but subtly different collection of organisms (e.g., CryptoDB, TIGR, FlyNome), genes (GenBank, GeneDB), proteins (UniProt, RCSB Protein Databank), diseases (OMIM, GeneDis), and so on. Such communities have a pressing need to interlink their heterogeneous databases in order to facilitate scientific discovery.
BioGuideSRS: Querying Multiple Sources with a user-centric perspective
(2007-01-01) Cohen-Boulakia, Sarah; Biton, Olivier; Davidson, Susan B; Froidevaux, Christine
Summary: Biologists are frequently faced with the problem of integrating information from multiple heterogeneous sources with their own experimental data. Given the large number of public sources, it is difficult to choose which sources to integrate without assistance. When doing this manually, biologists differ in their preferences concerning the sources to be queried as well as in their strategies, i.e. the querying process they follow for navigating through the sources. In response to these findings, we have developed BioGuide to help scientists search for relevant data within external sources while taking their preferences and strategies into account. In this paper, we present BioGuideSRS, a user-friendly system which automatically retrieves instances of data by using BioGuide on top of the SRS system. BioGuideSRS is an applet that can be run from its web page on any system with Java 5.0. Availability: http://www.bioguide-project.net
Implementing Mapping Composition
(2006-09-01) Bernstein, Philip A; Green, Todd J; Melnik, Sergey; Nash, Alan
Mapping composition is a fundamental operation in metadata-driven applications. Given a mapping over schemas S1 and S2 and a mapping over schemas S2 and S3, the composition problem is to compute an equivalent mapping over S1 and S3. We describe a new composition algorithm that targets practical applications. It incorporates view unfolding. It eliminates as many S2 symbols as possible, even if not all can be eliminated. It covers constraints expressed using arbitrary monotone relational operators and, to a lesser extent, non-monotone operators. And it introduces the new technique of left composition. We describe our implementation, explain how to extend it to support user-defined operators, and present experimental results that validate its effectiveness.
Models for Incomplete and Probabilistic Information
(2006-03-01) Green, Todd J; Tannen, Val
Search and Result Presentation in Scientific Workflow Repositories
(2013-05-17) Davidson, Susan; Huang, Xiaocheng; Stoyanovich, Julia; Yuan, Xiaojie
We study the problem of searching a repository of complex hierarchical workflows whose component modules, both composite and atomic, have been annotated with keywords. Since keyword search does not use the graph structure of a workflow, we develop a model of workflows using context-free bag grammars. We then give efficient polynomial-time algorithms that, given a workflow and a keyword query, determine whether some execution of the workflow matches the query. Based on these algorithms we develop a search and ranking solution that efficiently retrieves the top-k grammars from a repository. Finally, we propose a novel result presentation method for grammars matching a keyword query, based on representative parse-trees. The effectiveness of our
Processing XML Streams with Deterministic Automata and Stream Indexes
(2004-05-11) Green, Todd J; Gupta, Ashish; Miklau, Gerome; Onizuka, Makoto; Suciu, Dan
We consider the problem of evaluating a large number of XPath expressions on a stream of XML packets. We contribute two novel techniques. The first is to use a single Deterministic Finite Automaton (DFA). The contribution here is to show that the DFA can be used effectively for this problem: in our experiments we achieve a constant throughput, independently of the number of XPath expressions. The major issue is the size of the DFA, which, in theory, can be exponential in the number of XPath expressions. We provide a series of theoretical results and experimental evaluations that show that the lazy DFA has a small number of states, for all practical purposes. These results are of general interest in XPath processing, beyond stream processing. The second technique is the Streaming IndeX (SIX), which consists of adding a small amount of binary data to each XML packet that allows the query processor to achieve significant speedups. As an application of these techniques we describe the XML Toolkit (XMLTK), a collection of command-line tools providing highly scalable XML data processing.
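To make the lazy-DFA idea in the abstract above concrete, here is a minimal sketch. It is not the XMLTK implementation: it assumes a simplified XPath subset (only `/` and `//` axes with name tests or `*`), a hypothetical event format of `("start", tag)` / `("end", tag)` pairs, and illustrative names (`parse`, `LazyDFA`, `run`). Each DFA state is a set of (query, position) pairs, and transitions are computed only when the stream first needs them.

```python
import re

def parse(path):
    """Split a simplified XPath such as '//a/b' into (axis, name) steps."""
    return re.findall(r"(//|/)([^/]+)", path)

class LazyDFA:
    """One automaton for many path queries; states are built on demand."""

    def __init__(self, paths):
        self.queries = [parse(p) for p in paths]
        # A DFA state is a frozenset of (query index, step position) pairs.
        self.start = frozenset((q, 0) for q in range(len(self.queries)))
        self.table = {}  # (state, tag) -> state, filled lazily

    def step(self, state, tag):
        key = (state, tag)
        if key not in self.table:
            nxt = set()
            for q, i in state:
                steps = self.queries[q]
                if i == len(steps):
                    continue  # accepting position: no outgoing moves
                axis, name = steps[i]
                if axis == "//":
                    nxt.add((q, i))  # descendant axis may skip this element
                if name in ("*", tag):
                    nxt.add((q, i + 1))  # name test matches: advance
            self.table[key] = frozenset(nxt)
        return self.table[key]

    def accepting(self, state):
        """Indexes of the queries matched by the current element."""
        return sorted(q for q, i in state if i == len(self.queries[q]))

def run(dfa, events):
    """Process start/end events, keeping one DFA state per open element."""
    stack = [dfa.start]
    matches = []
    for kind, tag in events:
        if kind == "start":
            stack.append(dfa.step(stack[-1], tag))
            matches.extend((tag, q) for q in dfa.accepting(stack[-1]))
        else:
            stack.pop()
    return matches

# Document <a><c><b/></c><b/></a> against queries 0: //a/b and 1: /a//b
events = [("start", "a"), ("start", "c"), ("start", "b"), ("end", "b"),
          ("end", "c"), ("start", "b"), ("end", "b"), ("end", "a")]
dfa = LazyDFA(["//a/b", "/a//b"])
result = run(dfa, events)  # [("b", 1), ("b", 0), ("b", 1)]
```

Per-event work is a single table lookup once a transition has been seen, regardless of how many queries are loaded; only the transitions the data actually exercises are ever materialized.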
Provenance in Collaborative Data Sharing
(2009-07-01) Karvounarakis, Grigoris
This dissertation focuses on recording, maintaining and exploiting provenance information in Collaborative Data Sharing Systems (CDSS). These are systems that support data sharing across loosely-coupled, heterogeneous collections of relational databases related by declarative schema mappings. A fundamental challenge in a CDSS is to support the capability of update exchange --- which publishes a participant's updates and then translates others' updates to the participant's local schema and imports them --- while tolerating disagreement between them and recording the provenance of exchanged data, i.e., information about the sources and mappings involved in their propagation. This provenance information can be useful during update exchange, e.g., to evaluate provenance-based trust policies. It can also be exploited after update exchange, to answer a variety of user queries, about the quality, uncertainty or authority of the data, for applications such as trust assessment, ranking for keyword search over databases, or query answering in probabilistic databases. To address these challenges, in this dissertation we develop a novel model of provenance graphs that is informative enough to satisfy the needs of CDSS users and captures the semantics of query answering on various forms of annotated relations. We extend techniques from data integration, data exchange, incremental view maintenance and view update to define the formal semantics of unidirectional and bidirectional update exchange. We develop algorithms to perform update exchange incrementally while maintaining provenance information. We present strategies for implementing our techniques over an RDBMS and experimentally demonstrate their viability in the Orchestra prototype system. 
We define ProQL, a query language for provenance graphs that can be used by CDSS users to combine data querying with provenance testing as well as to compute annotations for their data, based on their provenance, that are useful for a variety of applications. Finally, we develop a prototype implementation of ProQL over an RDBMS, along with indexing techniques to speed up provenance querying, and experimentally evaluate the performance of provenance querying and the benefits of our indexing techniques.
Containment of Conjunctive Queries on Annotated Relations
(2009-03-23) Green, Todd J
We study containment and equivalence of (unions of) conjunctive queries on relations annotated with elements of a commutative semiring. Such relations and the semantics of positive relational queries on them were introduced in a recent paper as a generalization of set semantics, bag semantics, incomplete databases, and databases annotated with various kinds of provenance information. We obtain positive decidability results and complexity characterizations for databases with lineage, why-provenance, and provenance polynomial annotations, for both conjunctive queries and unions of conjunctive queries. At least one of these results is surprising given that provenance polynomial annotations seem “more expressive” than bag semantics and under the latter, containment of unions of conjunctive queries is known to be undecidable. The decision procedures rely on interesting variations on the notion of containment mappings. We also show that for any positive semiring (a very large class) and conjunctive queries without self-joins, equivalence is the same as isomorphism.
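As an illustration of the semiring-annotated relations this abstract refers to, the sketch below evaluates one conjunctive query, Q(x, z) :- R(x, y), S(y, z), under two commutative semirings. It is an illustrative example rather than code from the paper; the names `Semiring`, `BAG`, `SET`, and `join_project` are assumptions. Joint use of tuples multiplies annotations, and alternative derivations of the same output tuple add them.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]

# Natural numbers with (+, *) give bag semantics; Booleans with (or, and) give set semantics.
BAG = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)
SET = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)

def join_project(r, s, k):
    """Evaluate Q(x, z) :- R(x, y), S(y, z) over K-relations.

    r and s map tuples to annotations drawn from the semiring k.
    """
    out = {}
    for (x, y), a in r.items():
        for (y2, z), b in s.items():
            if y == y2:
                key = (x, z)
                # Join multiplies annotations; projection sums alternatives.
                out[key] = k.add(out.get(key, k.zero), k.mul(a, b))
    return out

R = {("a", "b"): 2, ("a", "c"): 1}
S = {("b", "d"): 3, ("c", "d"): 1}
bag_answer = join_project(R, S, BAG)  # {("a", "d"): 7}, since 2*3 + 1*1 = 7

R_set = {t: True for t in R}
S_set = {t: True for t in S}
set_answer = join_project(R_set, S_set, SET)  # {("a", "d"): True}
```

The same query code runs unchanged under either semiring; only the annotation arithmetic differs, which is what lets containment questions be posed uniformly across these semantics.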
Reconcilable Differences
(2009-03-23) Green, Todd J; Ives, Zachary G; Tannen, Val
Exact query reformulation using views in positive relational languages is well understood, and has a variety of applications in query optimization and data sharing. Generalizations to larger fragments of the relational algebra (RA) --- specifically, support for the difference operator --- would increase the options available for query reformulation, and also apply to view adaptation (updating a materialized view in response to a modified view definition) and view maintenance. Unfortunately, most questions about queries become undecidable in the presence of difference/negation. We present a novel way of managing this difficulty via an excursion through a non-standard semantics, Z-relations, where tuples are annotated with positive or negative integers. We show that under Z-semantics RA queries have a normal form as a single difference of positive queries and this leads to the decidability of equivalence. In most real-world settings with difference, it is possible to convert the queries to this normal form. We give a sound and complete algorithm that explores all reformulations of an RA query (under Z-semantics) using a set of RA views, finitely bounding the search space with a simple and natural cost model. We investigate related complexity questions, and we also extend our results to queries with built-in predicates. Z-relations are interesting in their own right because they capture updates and data uniformly. However, our algorithm turns out to be sound and complete also for bag semantics, albeit necessarily only for a subclass of RA. This subclass turns out to be quite large and covers generously the applications of interest to us. We also show a subclass of RA where reformulation and evaluation under Z-semantics can be combined with duplicate elimination to obtain the answer under set semantics.
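The role of negative annotations can be seen in a small sketch. This is an illustration of the Z-relation idea described above, not the paper's algorithm; the function names and the dictionary representation (tuple mapped to a nonzero integer count) are assumptions. Under Z-semantics, difference is plain subtraction of counts, so it can always be undone by union, whereas bag difference truncates at zero and loses information.

```python
def z_union(r, s):
    """Union of Z-relations: add integer annotations, dropping zeros."""
    out = dict(r)
    for t, n in s.items():
        out[t] = out.get(t, 0) + n
    return {t: n for t, n in out.items() if n != 0}

def z_diff(r, s):
    """Difference of Z-relations: subtract annotations, which may go negative."""
    return z_union(r, {t: -n for t, n in s.items()})

def bag_diff(r, s):
    """Bag difference for comparison: counts are truncated at zero."""
    out = {t: n - s.get(t, 0) for t, n in r.items()}
    return {t: n for t, n in out.items() if n > 0}

R = {"a": 1}
S = {"a": 2, "b": 1}

z_result = z_diff(R, S)              # {"a": -1, "b": -1}
z_roundtrip = z_union(z_result, S)   # {"a": 1}, which equals R
bag_roundtrip = z_union(bag_diff(R, S), S)  # {"a": 2, "b": 1}, which differs from R
```

Because (R - S) + S = R always holds with integer annotations, algebraic rewrites involving difference behave like ordinary arithmetic, which is the property the normal-form result in the abstract builds on.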
Modeling and Analysis of Multi-hop Control Networks
(2009-04-13) Alur, Rajeev; Pappas, George James; D'Innocenzo, Alessandro; Weiss, Gera; Johansson, Karl H
We propose a mathematical framework, inspired by the WirelessHART specification, for modeling and analyzing multi-hop communication networks. The framework is designed for systems consisting of multiple control loops closed over a multi-hop communication network. We separate control, topology, routing, and scheduling and propose formal syntax and semantics for the dynamics of the composed system. The main technical contribution of the paper is an explicit translation of multi-hop control networks to switched systems. We describe a Mathematica notebook that automates the translation of multi-hop control networks to switched systems, and use this tool to show how techniques for analysis of switched systems can be used to address control and networking co-design challenges.
