Differencing Provenance in Scientific Workflows

Penn collection
Technical Reports (CIS)
Author
Cohen-Boulakia, Sarah
Eyal, Anat
Abstract

Scientific workflow management systems are increasingly providing the ability to manage and query the provenance of data products. However, the problem of differencing the provenance of two data products produced by executions of the same specification has not been adequately addressed. Although this problem is NP-hard for general workflow specifications, an analysis of real scientific (and business) workflows shows that their specifications can be captured as series-parallel graphs overlaid with well-nested forking and looping. For this natural restriction, we present efficient, polynomial-time algorithms for differencing executions of the same specification and thereby understanding the difference in the provenance of their data products. We then describe a prototype called PDiffView built around our differencing algorithm. Experimental results demonstrate the scalability of our approach using collected, real workflows and increasingly complex runs.

Publication date
2008-01-01
Comments
University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-08-04.