Extending Provenance for Understanding Claims and Data Analyses

Degree type
Doctor of Philosophy (PhD)
Graduate group
Computer and Information Science
Discipline
Computer Sciences
Data Science
Subject
Claim provenance
Data lake search
Data provenance
Reasoning
Source inference
Copyright date
2022
Author
Zhang, Yi
Abstract

Every day we are bombarded with information, claims, and data, some of which may be controversial or, at least, opinionated. Yet we need to read the information, make decisions, and take action: for example, should we give our children COVID vaccine booster shots? It is usually difficult to evaluate the relevant claims and evidence (e.g., that COVID vaccine boosters are safe and effective for anyone 5 years or older) and to draw conclusions from them, because we may lack the context that the author of the claim had: the source and derivation of the claim (and its evidence), the data that was consulted when the claim was formulated, and the data that was omitted. Provenance has been proposed in data management systems to describe such contextual information, i.e., the life cycle of the data. This dissertation targets this much-needed contextual information: it extends provenance to support different domains, facilitate tracing, enable reasoning, and allow interpretation, and it proposes techniques to infer provenance. As a result, people who review the information can better understand its potential bias. The proposed techniques apply in two contexts, one oriented around understanding natural language claims and the other around evaluating data analytic conclusions. The two contexts can be combined to further support assessing the credibility of quantitative claims in natural language.
For natural language claims, we propose claim provenance to describe where a claim may come from and explain how it has been derived. We formalize this via a provenance graph and develop a computational framework to infer it, leveraging novel information extraction, text generation, and reasoning techniques. This graph provides provenance for understanding textual claims.
For a data analytics-driven report with tables or visualizations, the dissertation focuses on augmenting the report with alternative analysis options that were not disclosed in it, to help users assess whether the data or the data processing steps in the report were “cherry-picked” or representative. To achieve this, we build a search platform over data in a “data lake”, which finds relevant supplementary (joinable or unionable) data with their provenance, serving as potential alternative analysis options. These options provide context beyond data lineage to evaluate the robustness of data analytic conclusions.
Finally, for quantitative claims that are informed by data analyses, the dissertation argues that both kinds of contextual information described above are needed: users should not only be able to validate the claims against the source data, but also have a chance to explore relevant results derived from alternative data analysis options, so as to build a more comprehensive view. Therefore, we propose data provenance as a common building block for these two tasks, and propose to infer it via a “retrieval-with-reasoning” framework, considering information from candidate tables and estimating possibly omitted computational steps.
This dissertation ends with a vision of what user-facing provenance systems should look like, once augmented with the provenance information our techniques can provide. A prototype system is designed to answer veracity questions about claims in a familiar and tractable domain, reading scientific papers: the tool helps information reviewers look up data in tables that relate to claims in the paper. A preliminary study shows that, when exposed in a suitable way, provenance information can help reviewers answer veracity questions more efficiently. On the basis of this study, we offer preliminary recommendations for designers of future interfaces that expose context-rich provenance information to reviewers, and we highlight future opportunities for improving provenance inference techniques.
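To make the terms “joinable” and “unionable” from the abstract concrete, the following is a minimal Python sketch of how candidate tables in a data lake might be scored against a report's table. It is an illustration only, not the dissertation's search platform: the function names, the value-containment and column-name Jaccard scores, and the toy tables are all assumptions introduced here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical table representation: column name -> list of cell values.
Table = Dict[str, List[str]]


def containment(query_values: List[str], candidate_values: List[str]) -> float:
    """Fraction of distinct query values that also appear in the candidate column."""
    query_set = set(query_values)
    if not query_set:
        return 0.0
    return len(query_set & set(candidate_values)) / len(query_set)


def best_join_column(
    query: Table, query_col: str, candidate: Table
) -> Tuple[Optional[str], float]:
    """Return the candidate column with the highest value overlap, and its score."""
    scores = {
        col: containment(query[query_col], values)
        for col, values in candidate.items()
    }
    if not scores:
        return None, 0.0
    best = max(scores, key=lambda col: scores[col])
    return best, scores[best]


def schema_overlap(a: Table, b: Table) -> float:
    """Jaccard similarity of column names, a crude proxy for unionability."""
    cols_a, cols_b = set(a), set(b)
    all_cols = cols_a | cols_b
    return len(cols_a & cols_b) / len(all_cols) if all_cols else 0.0


# Toy data (assumed for illustration): a table used in a report and two lake tables.
report = {"state": ["PA", "NY", "NJ"], "cases": ["10", "20", "30"]}
lake_join = {"region": ["PA", "NY", "CA", "TX"], "vaccinated": ["5", "6", "7", "8"]}
lake_union = {"state": ["CA"], "cases": ["40"]}

join_col, join_score = best_join_column(report, "state", lake_join)
union_score = schema_overlap(report, lake_union)
```

In this sketch a high containment score suggests a column the report's table could be joined on (adding new attributes), while a high schema overlap suggests a table whose rows could be unioned with the report's table (adding new records). Either kind of match is a candidate for an alternative analysis that the report did not disclose.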

Advisor
Ives, Zachary G.
Roth, Dan
Date of degree
2022