Taylor, Nicholas E

A Distributed Storage and Query Subsystem for Collaborative Data Sharing
(2010-08-13) Taylor, Nicholas E
Cooperative management of data is a difficult challenge. In the absence of a central authority, there is often no single data format, and users may not even agree on what is true and what is not. The data is typically not static and will evolve over time, leading to issues of staleness and conflicting changes. Dedicated machines to run a management system may not be available, and furthermore the machines supplied by the users to run the system may be unreliable or only transiently available. A reliable system must be built over these machines, and should be self-configuring and self-tuning, to avoid placing an undue burden on end users that are unwilling or unable to manage it themselves. The Orchestra collaborative data sharing system responds to these challenges by providing a general approach for propagating updates between a heterogeneous collection of peer databases, which are connected by high-level rules that specify the correspondences between them. The system maintains these correspondences while enforcing trust conditions to filter the data from other databases, maintaining transactional atomicity, and respecting database integrity constraints. In this thesis, I detail my work on the semantics of transactional atomicity and dependency in this context, which lead to a general reconciliation algorithm; I also describe the prototype centralized and peer-to-peer implementations of Orchestra. I then develop a specialized reliable peer-to-peer storage and query processor that will enable the logging and computation needed to maintain an Orchestra instance to be distributed. I show ways to extend this system to recover from node failure, to perform load balancing to ensure even distribution of work, and to compensate for node heterogeneity and data skew.
Maintaining Recursive Views of Regions and Connectivity in Networks
(2010-08-01) Liu, Mengmeng; Taylor, Nicholas E; Zhou, Wenchao; Ives, Zachary G; Loo, Boon Thau
The data management community has recently begun to consider declarative network routing and distributed acquisition: e.g., sensor networks that execute queries about contiguous regions, declarative networks that maintain shortest paths, and distributed and peer-to-peer stream systems that detect transitive relationships among data at the distributed sources. In each case, the fundamental operation is to maintain a view over dynamic network state. This view is typically distributed, recursive, and may contain aggregation, e.g., describing shortest paths or least costly paths. Surprisingly, solutions to computing such views are often domain-specific, expensive, and incomplete. We recast the problem as incremental recursive view maintenance given distributed streams of updates to tuples: new stream data becomes insert operations and tuple expirations become deletions. We develop techniques to maintain compact information about tuple derivability or data provenance. We complement this with techniques to reduce communication: aggregate selections to prune irrelevant aggregation tuples, provenance-aware operators that determine when tuples are no longer derivable and remove them from the view, and shipping operators that reduce the information being propagated while still maintaining correct answers. We validate our work in a distributed setting with sensor and network router queries, showing significant gains in communication overhead without sacrificing performance.
Recursive Computation of Regions and Connectivity in Networks
(2008-10-31) Taylor, Nicholas E; Zhou, Wenchao; Ives, Zachary G; Liu, Mengmeng; Loo, Boon Thau
In recent years, data management has begun to consider situations in which data access is closely tied to network routing and distributed acquisition: sensor networks, in which reachability and contiguous regions are of interest; declarative networking, in which shortest paths and reachability are key; distributed and peer-to-peer stream systems, in which we may monitor for associations among data at the distributed sources (e.g., transitive relationships). In each case, the fundamental operation is to maintain a view over dynamic network state; the view is frequently distributed, recursive and may contain aggregation, e.g., describing transitive connectivity, shortest paths, least costly paths, or region membership. Surprisingly, solutions to this problem are often domain-specific, expensive to compute, and incomplete. In this paper, we recast the problem as one of incremental recursive view maintenance in the presence of distributed streams of updates to tuples: new stream data becomes insert operations and tuple expirations become deletions. We develop a set of techniques that maintain information about tuple derivability—a compact form of data provenance. We complement this with techniques to reduce communication: aggregate selections to prune irrelevant aggregation tuples, provenance-aware operators that can determine when tuples are no longer derivable and remove them from their state, and shipping operators that greatly reduce the tuple and provenance information being propagated while still maintaining correct answers. We validate our work in a distributed setting with sensor and network router queries, showing significant gains in bandwidth consumption without sacrificing performance.
Sideways Information Passing for Push-Style Query Processing
(2008-04-07) Ives, Zachary G; Taylor, Nicholas E
In many modern data management settings, data is queried from a central node or nodes, but is stored at remote sources. In such a setting it is common to perform "push-style" query processing, using multi-threaded pipelined hash joins and bushy query plans to compute parts of the query in parallel; to avoid idling, the CPU can switch between them as delays are encountered. This works well for simple select-project-join queries, but increasingly, Web and integration applications require more complex queries with multiple joins and even nested subqueries. As we demonstrate in this paper, push-style execution of complex queries can be improved substantially via sideways information passing; push-style queries provide many opportunities for information passing that have not been studied in the past literature. We present adaptive information passing, a general runtime decision-making technique for reusing intermediate state from one query subresult to prune and reduce computation of other subresults. We develop two alternative schemes for performing adaptive information passing, which we study in several settings under a variety of workloads.
Orchestra: Facilitating Collaborative Data Sharing
(2007-06-11) Green, Todd J; Karvounarakis, Grigoris; Taylor, Nicholas E; Biton, Olivier; Ives, Zachary G; Tannen, Val
One of the most elusive goals of structured data management has been sharing among large, heterogeneous populations: while data integration [4, 10] and exchange [3] are gradually being adopted by corporations or small confederations, little progress has been made in integrating broader communities. Yet the need for large-scale sharing of heterogeneous data is increasing: most of the sciences, particularly biology and astronomy, have become data-driven as they have attempted to tackle larger questions. The field of bioinformatics, in particular, has seen a plethora of different databases emerge: each is focused on a related but subtly different collection of organisms (e.g., CryptoDB, TIGR, FlyNome), genes (GenBank, GeneDB), proteins (UniProt, RCSB Protein Databank), diseases (OMIM, GeneDis), and so on. Such communities have a pressing need to interlink their heterogeneous databases in order to facilitate scientific discovery.
Sideways Information Passing for Push-Style Query Processing
(2007-11-20) Ives, Zachary G; Taylor, Nicholas E
In many modern data management settings, data is queried from a central node or nodes, but is stored at remote sources. In such a setting it is common to perform "push-style" query processing, using multithreaded pipelined hash joins and bushy query plans to compute parts of the query in parallel; to avoid idling, the CPU can switch between them as delays are encountered. This works well for simple select-project-join queries, but increasingly, Web and integration applications require more complex queries with multiple joins and even nested subqueries. As we demonstrate in this paper, push-style execution of complex queries can be improved substantially via sideways information passing; push-style queries provide many opportunities for information passing that have not been studied in the past literature. We present adaptive information passing, a general runtime decision-making technique for reusing intermediate state from one query subresult to prune and reduce computation of other subresults. We develop two alternative schemes for performing adaptive information passing, which we study in several settings under a variety of workloads.
Recursive Computation of Regions and Connectivity in Networks
(2009-03-29) Liu, Mengmeng; Taylor, Nicholas E; Zhou, Wenchao; Ives, Zachary G; Loo, Boon Thau
In recent years, the data management community has begun to consider situations in which data access is closely tied to network routing and distributed acquisition: examples include sensor networks that execute queries about reachable nodes or contiguous regions, declarative networks that maintain information about shortest paths and reachable endpoints, and distributed and peer-to-peer stream systems that detect associations (e.g., transitive relationships) among data at the distributed sources. In each case, the fundamental operation is to maintain a view over dynamic network state. This view is typically distributed, recursive, and may contain aggregation, e.g., describing transitive connectivity, shortest paths, least costly paths, or region membership. Surprisingly, solutions to computing such views are often domain-specific, expensive, and incomplete. In this paper, we recast the problem as one of incremental recursive view maintenance in the presence of distributed streams of updates to tuples: new stream data becomes insert operations and tuple expirations become deletions. We develop a set of techniques that maintain compact information about tuple derivability or data provenance. We complement this with techniques to reduce communication: aggregate selections to prune irrelevant aggregation tuples, provenance-aware operators that can determine when tuples are no longer derivable and remove them from their state, and shipping operators that greatly reduce the tuple and provenance information being propagated while still maintaining correct answers. We validate our work in a distributed setting with sensor and network router queries, showing significant gains in communication overhead without sacrificing performance.
Reconciling while Tolerating Disagreement in Collaborative Data Sharing
(2006-06-27) Taylor, Nicholas E; Ives, Zachary G
In many data sharing settings, such as within the biological and biomedical communities, global data consistency is not always attainable: different sites' data may be dirty, uncertain, or even controversial. Collaborators are willing to share their data, and in many cases they also want to selectively import data from others - but must occasionally diverge when they disagree about uncertain or controversial facts or values. For this reason, traditional data sharing and data integration approaches are not applicable, since they require a globally consistent data instance. Additionally, many of these approaches do not allow participants to make updates; if they do, concurrency control algorithms or inconsistency repair techniques must be used to ensure a consistent view of the data for all users. In this paper, we develop and present a fully decentralized model of collaborative data sharing, in which participants publish their data on an ad hoc basis and simultaneously reconcile updates with those published by others. Individual updates are associated with provenance information, and each participant accepts only updates with a sufficient authority ranking, meaning that each participant may have a different (though conceptually overlapping) data instance. We define a consistency semantics for database instances under this model of disagreement, present algorithms that perform reconciliation for distributed clusters of participants, and demonstrate their ability to handle typical update and conflict loads in settings involving the sharing of curated data.
