Zachary Ives is an Assistant Professor at the University of Pennsylvania and an Associated Faculty Member of the Penn Center for Bioinformatics. He received his B.S. from Sonoma State University and his PhD from the University of Washington. His research interests include data integration, peer-to-peer models of data sharing, processing and security of heterogeneous sensor streams, and data exchange between autonomous systems. He is a recipient of the NSF CAREER award and a member of the DARPA Computer Science Study Panel.
Databases, data integration, peer-to-peer computing, sensor networks
Publication: Update Exchange With Mappings and Provenance (2007-11-27)
Green, Todd J; Karvounarakis, Grigoris; Ives, Zachary G; Tannen, Val
We consider systems for data sharing among heterogeneous peers related by a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to ask queries over related data from other peers as well. To achieve this, every peer’s updates propagate along the mappings to the other peers. However, this update exchange is filtered by trust conditions (expressing what data and sources a peer judges to be authoritative), which may cause a peer to reject another’s updates. In order to support such filtering, updates carry provenance information. These systems target scientific data sharing applications, and their general principles and architecture have been described in prior work. In this paper we present methods for realizing such systems. Specifically, we extend techniques from data integration, data exchange, and incremental view maintenance to propagate updates along mappings; we integrate a novel model for tracking data provenance, such that curators may filter updates based on trust conditions over this provenance; we discuss strategies for implementing our techniques in conjunction with an RDBMS; and we experimentally demonstrate the viability of our techniques in the Orchestra prototype system. This technical report supersedes the version which appeared in VLDB 2007 and corrects certain technical claims regarding the semantics of our system (see errata in Sections 3.1 and 4.1.1).
Publication: Sideways Information Passing for Push-Style Query Processing (2007-11-20)
Ives, Zachary G; Taylor, Nicholas E
In many modern data management settings, data is queried from a central node or nodes, but is stored at remote sources.
In such a setting it is common to perform "push-style" query processing, using multithreaded pipelined hash joins and bushy query plans to compute parts of the query in parallel; to avoid idling, the CPU can switch between them as delays are encountered. This works well for simple select-project-join queries, but increasingly, Web and integration applications require more complex queries with multiple joins and even nested subqueries. As we demonstrate in this paper, push-style execution of complex queries can be improved substantially via sideways information passing; push-style queries provide many opportunities for information passing that have not been studied in the past literature. We present adaptive information passing, a general runtime decisionmaking technique for reusing intermediate state from one query subresult to prune and reduce computation of other subresults. We develop two alternative schemes for performing adaptive information passing, which we study in several settings under a variety of workloads.
Publication: Recursive Computation of Regions and Connectivity in Networks (2008-10-31)
Taylor, Nicholas E; Zhou, Wenchao; Ives, Zachary G; Liu, Mengmeng; Loo, Boon Thau
In recent years, data management has begun to consider situations in which data access is closely tied to network routing and distributed acquisition: sensor networks, in which reachability and contiguous regions are of interest; declarative networking, in which shortest paths and reachability are key; distributed and peer-to-peer stream systems, in which we may monitor for associations among data at the distributed sources (e.g., transitive relationships). In each case, the fundamental operation is to maintain a view over dynamic network state; the view is frequently distributed, recursive and may contain aggregation, e.g., describing transitive connectivity, shortest paths, least costly paths, or region membership.
Surprisingly, solutions to this problem are often domain-specific, expensive to compute, and incomplete. In this paper, we recast the problem as one of incremental recursive view maintenance in the presence of distributed streams of updates to tuples: new stream data becomes insert operations and tuple expirations become deletions. We develop a set of techniques that maintain information about tuple derivability, a compact form of data provenance. We complement this with techniques to reduce communication: aggregate selections to prune irrelevant aggregation tuples, provenance-aware operators that can determine when tuples are no longer derivable and remove them from their state, and shipping operators that greatly reduce the tuple and provenance information being propagated while still maintaining correct answers. We validate our work in a distributed setting with sensor and network router queries, showing significant gains in bandwidth consumption without sacrificing performance.
Publication: Integrating Ontologies and Relational Data (2007-11-01)
Auer, Sören; Ives, Zachary G
In recent years, an increasing number of scientific and other domains have attempted to standardize their terminology and provide reasoning capabilities through ontologies, in order to facilitate data exchange. This has spurred research into Web-based languages, formalisms, and especially query systems based on ontologies. Yet we argue that DBMS techniques can be extended to provide many of the same capabilities, with benefits in scalability and performance. We present OWLDB, a lightweight and extensible approach for the integration of relational databases and description logic based ontologies. One of the key differences between relational databases and ontologies is the high degree of implicit information contained in ontologies.
OWLDB integrates the two schemes by codifying ontologies' implicit information using a set of sound and complete inference rules for SHOIN (the description logic behind OWL ontologies). These inference rules can be translated into queries on a relational DBMS instance, and the query results (representing inferences) can be added back to this database. Subsequently, database applications can make direct use of this inferred, previously implicit knowledge, e.g., in the annotation of biomedical databases. As our experimental comparison to a native description logic reasoner and a triple store shows, OWLDB provides significantly greater scalability and query capabilities, without sacrificing performance with respect to inference.
Publication: MOSAIC: Multiple Overlay Selection and Intelligent Composition (2007-10-24)
Loo, Boon Thau; Ives, Zachary G; Mao, Yun; Smith, Jonathan M
Today, the most effective mechanism for remedying shortcomings of the Internet, or augmenting it with new networking capabilities, is to develop and deploy a new overlay network. This leads to the problem of multiple networking infrastructures, each with independent advantages, and each developed in isolation. A greatly preferable solution is to have a single infrastructure under which new overlays can be developed, deployed, selected, and combined according to application and administrator needs. MOSAIC is an extensible infrastructure that enables not only the specification of new overlay networks, but also dynamic selection and composition of such overlays. MOSAIC provides declarative networking: it uses a unified declarative language (Mozlog) and runtime system to enable specification of new overlay networks, as well as their composition in both the control and data planes. Importantly, it permits dynamic compositions with both existing overlay networks and legacy applications.
This paper demonstrates the dynamic selection and composition capabilities of MOSAIC with a variety of declarative overlays: an indirection overlay that supports mobility (i3), a resilient overlay (RON), and a transport-layer proxy. Using a remarkably concise specification, MOSAIC provides the benefits of runtime composition to simultaneously deliver application-aware mobility, NAT traversal and reliability with low performance overhead, demonstrated with deployment and measurement on both a local cluster and the PlanetLab testbed.
Publication: A Substrate for In-Network Sensor Data Integration (2008-08-24)
Mihaylov, Svilen; Jacob, Marie; Ives, Zachary G; Guha, Sudipto
With the ultimate goal of extending the data integration paradigm and query processing capabilities to ad hoc wireless networks, sensors, and stream systems, we consider how to support communication between sets of nodes performing distributed joins in sensor networks. We develop a communication model that enables in-network join at a variety of locations, and which facilitates coordination among nodes in order to make optimization decisions. While we defer a discussion of the optimizer to future work, we experimentally compare a variety of strategies, including at-base and in-network joins. Results show significant performance gains versus prior work, as well as opportunities for optimization.
Publication: Interviewing During a Tight Job Market (2002-09-01)
Ives, Zachary G; Luo, Qiong
Various tips are discussed for PhD graduates interviewing for an academic position at a research university in Asia or North America. It is suggested that having the dissertation done before interviews provides considerable peace of mind. It is found that being practical about the job research package and keeping a close eye on applications increases confidence.
It is also observed that questions during the talk provide an opportunity to clarify and strengthen the talk, and to show this ability during the interview.
Publication: MOSAIC: Unified Platform for Dynamic Overlay Selection and Composition (2008-06-03)
Mao, Yun; Loo, Boon Thau; Ives, Zachary G; Smith, Jonathan M
MOSAIC constructs new overlay networks with desired characteristics by composing existing overlays with subsets of those attributes. Thus, MOSAIC overcomes the problem of multiple network infrastructures that are partial solutions, while preserving deployability. Composition of control and/or data planes is possible in the system. MOSAIC overlays are specified in Mozlog, a declarative language that specifies overlay properties without binding them to a particular implementation or underlying network. This paper focuses on the runtime aspects of MOSAIC: how it enables interoperability between different overlay networks and how it implements switching between different overlay compositions, permitting dynamic compositions with both existing overlay networks and legacy applications. The system is validated experimentally using declarative overlay compositions concisely specified in Mozlog: an indirection overlay that supports mobility (i3), a resilient overlay (RON), and scalable lookups (Chord), all of which are combined to provide new functionality. MOSAIC provides the benefits of runtime composition to simultaneously deliver application-aware mobility, NAT traversal and reliability with low performance overhead, demonstrated by measurements on both a local cluster and PlanetLab.
Publication: A Substrate for In-Network Sensor Data Integration (2008-10-01)
Mihaylov, Svilen R; Jacob, Marie; Ives, Zachary G; Guha, Sudipto
With the ultimate goal of extending the data integration paradigm and query processing capabilities to ad hoc wireless networks, sensors, and stream systems, we consider how to support communication between sets of nodes performing distributed joins in sensor networks. We develop a communication model that enables in-network join at a variety of locations, and which facilitates coordination among nodes in order to make optimization decisions. While we defer a discussion of the optimizer to future work, we experimentally compare a variety of strategies, including at-base and in-network joins. Results show significant performance gains versus prior work, as well as opportunities for optimization.
Publication: Crossing the Structure Chasm (2003-01-05)
Etzioni, Oren; Halevy, Alon; Doan, Anhai; Ives, Zachary G; Madhavan, Jayant; McDowell, Luke; Tatarinov, Igor
It has frequently been observed that most of the world’s data lies outside database systems. The reason is that database systems focus on structured data, leaving the unstructured realm to others. The world of unstructured data has several very appealing properties, such as ease of authoring, querying and data sharing. In contrast, authoring, querying and sharing structured data require significant effort, albeit with the benefit of rich query languages and exact answers. We argue that in order to broaden the use of data management tools, we need a concerted effort to cross this structure chasm, by importing the attractive properties of the unstructured world into the structured one.
As an initial effort in this direction, we introduce the REVERE System, which offers several mechanisms for crossing the structure chasm, and considers as its first application the chasm on the WWW. REVERE includes three innovations: (1) a data creation environment that entices people to structure data and enables them to do it rapidly; (2) a data sharing environment, based on a peer data management system, in which a web of data is created by establishing local mappings between schemas, and query answering is done over the transitive closure of these mappings; (3) a novel set of tools that are based on computing statistics over corpora of schemata and structured data. In a sense, we are trying to adapt the key techniques of the unstructured world, namely computing statistics over text corpora, into the world of structured data. We sketch how statistics computed over such corpora, which capture common term usage patterns, can be used to create tools for assisting in schema and mapping development. The initial application of REVERE focuses on creating a web of structured data from data that is usually stored in HTML web pages (e.g., personal information, course information, etc.).