Provenance and Uncertainty

ROY, SUDEEPA

Provenance and Uncertainty

Files

ROY_upenngdas_0175C_10349.pdf (1.68 MB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Computer and Information Science

Subject

Algorithms
Databases
Information Extraction
Privacy
Probabilistic Databases
Provenance
Computer Sciences

Copyright date

2014-08-19T20:12:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/32328

View all metadata

Author

ROY, SUDEEPA

Abstract

Data provenance, a record of the origin and transformation of data, explains how output data is derived from input data. This dissertation focuses on exploring the connection between provenance and uncertainty in two main directions: (1) how a succinct representation of provenance can help infer uncertainty in the input or the output, and (2) how introducing uncertainty can facilitate publishing provenance information while hiding associated private information. A significant fraction of the data found in practice is imprecise, unreliable, and incomplete, and therefore uncertain. The level of uncertainty in the data must be measured and recorded in order to estimate the confidence in the results and find potential sources of error. In probabilistic databases, uncertainty in the input is recorded as a probability distribution, and the goal is to efficiently compute the induced probability distribution on the outputs. In general, this problem is computationally hard, and we seek to expand the class of inputs for which efficient evaluation is possible by exploiting provenance structure. In some scenarios, the output data is directly examined for errors and is labeled accordingly. We need to trace back the errors in the output to the input so that the input can be refined for future processing. Because of incomplete labeling of the output and complexity of the processes generating it, the sources of error may be uncertain. We formalize the problem of source refinement, and propose models and solutions using provenance that can handle incomplete labeling. We also evaluate our solutions empirically for an application of source refinement in information extraction. Data provenance is extensively used to help understand and debug scientific experiments that often involve proprietary and sensitive information. In this dissertation, we consider privacy of proprietary and commercial modules when they belong to a workflow and interact with other modules. We propose a model for module privacy that makes the exact functionality of the modules uncertain by selectively hiding provenance information. We also study the optimization problem of minimizing the information hidden while guaranteeing a desired level of privacy.

Advisor

Susan B. Davidson
Sanjeev Khanna

Date of degree

2012-01-01

Collection

Dissertations and Theses