A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows

Penn collection
Departmental Papers (CIS)
Subject
databases
workflows
bioinformatics
Author
Bowers, Shawn
McPhillips, Timothy
Ludascher, Bertram
Cohen, Shirley
Abstract

Integrated provenance support promises to be a chief advantage of scientific workflow systems over script-based alternatives. While it is often recognized that information gathered during scientific workflow execution can be used automatically to increase fault tolerance (via checkpointing) and to optimize performance (by reusing intermediate data products in future runs), it is perhaps more significant that provenance information may also be used by scientists to reproduce results from earlier runs, to explain unexpected results, and to prepare results for publication. Current workflow systems offer little or no direct support for these "scientist-oriented" queries of provenance information. Indeed, the use of advanced execution models in scientific workflows (e.g., process networks, which exhibit pipeline parallelism over streaming data), together with the failure to record certain fundamental events such as state resets of processes, can render existing provenance schemas useless for scientific applications of provenance. We develop a simple provenance model that is capable of supporting a wide range of scientific use cases, even for complex models of computation such as process networks. Our approach reduces these use cases to database queries over event logs and is capable of reconstructing complete data and invocation dependency graphs for a workflow run.
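
To make the idea of "database queries over event logs" concrete, the following is a minimal sketch and not the schema or algorithm from the paper. It assumes a hypothetical `event` table with one row per read, write, or state-reset event, and it assumes that a written token depends on every token the same process read since that process's most recent state reset. The table name, column names, and sample log are all illustrative.

```python
import sqlite3

# Hypothetical event log: (sequence number, process, event kind, data token).
# Process B resets its state between handling d1 and d3.
EVENTS = [
    (1, "A", "write", "d1"),
    (2, "B", "read", "d1"),
    (3, "B", "write", "d2"),
    (4, "B", "reset", None),
    (5, "A", "write", "d3"),
    (6, "B", "read", "d3"),
    (7, "B", "write", "d4"),
]


def dependencies(events):
    """Return (output_token, input_token) pairs derived from the event log.

    Assumed rule: an output depends on each input read earlier by the same
    process, unless a state reset of that process occurred in between.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE event ("
        "seq INTEGER PRIMARY KEY, actor TEXT, kind TEXT, token TEXT)"
    )
    conn.executemany("INSERT INTO event VALUES (?, ?, ?, ?)", events)
    rows = conn.execute(
        """
        SELECT w.token, r.token
        FROM event AS w
        JOIN event AS r
          ON r.actor = w.actor AND r.kind = 'read' AND r.seq < w.seq
        WHERE w.kind = 'write'
          AND NOT EXISTS (
              SELECT 1 FROM event AS x
              WHERE x.actor = w.actor AND x.kind = 'reset'
                AND x.seq > r.seq AND x.seq < w.seq)
        ORDER BY w.seq, r.seq
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for output_token, input_token in dependencies(EVENTS):
        print(f"{output_token} depends on {input_token}")
```

Under these assumptions, dropping the reset event from the log makes `d4` appear to depend on `d1` as well as `d3`, which illustrates why unrecorded state resets can make a provenance log misleading for pipelined executions.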

Date of presentation
2006-05-03
Conference name
Departmental Papers (CIS)
Conference dates
2023-05-17T00:20:35.000
Comments
Postprint version. Published in Lecture Notes in Computer Science, Volume 4145, Provenance and Annotation of Data, 2006, pages 133-147. Publisher URL: http://dx.doi.org/10.1007/11890850_15