Departmental Papers (CIS)

Enabling Privacy in Provenance-Aware Workflow Systems

Susan B. Davidson, University of Pennsylvania
Sanjeev Khanna, University of Pennsylvania
Sudeepa Roy, University of Pennsylvania
Julia Stoyanovich, University of Pennsylvania
Val Tannen, University of Pennsylvania
Yi Chen, Arizona State University at the Tempe Campus
Tova Milo, Tel Aviv University

Document Type: Journal Article

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium, as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2011.

Abstract

A new paradigm for creating and correcting scientific analyses is emerging: that of provenance-aware workflow systems. In such systems, repositories of workflow specifications and of provenance graphs that represent their executions will be made available as part of scientific information sharing. This will allow users to search and query both workflow specifications and their provenance graphs: scientists who wish to perform new analyses may search workflow repositories to find specifications of interest to reuse or modify. They may also search provenance information to understand the meaning of a workflow or to debug a specification. Having found erroneous or suspect data, a user may then ask provenance queries to determine what downstream data might have been affected, or to understand how the process that created the data went wrong. With the increasing amount of available provenance information, there is a need to efficiently search and query scientific workflows and their executions.

However, workflow authors or owners may wish to keep some information in the repository confidential. For example, intermediate data within an execution may contain sensitive information, such as a social security number, a medical record, or financial information about an individual. Although users with the appropriate access level may be allowed to see such confidential data, making it available to all users, even for scientific purposes, is an unacceptable breach of privacy. Beyond data privacy, a module itself may be proprietary, and hiding its description may not be enough: users without the appropriate access level should not be able to infer its behavior even if they are allowed to see the module's inputs and outputs. Finally, details of how certain modules in the workflow are connected may be proprietary, so showing how data is passed between modules may reveal too much of the structure of the workflow. There is thus an inherent tradeoff between the utility of the information provided in response to a search/query and the privacy guarantees that authors/owners desire.

Scientific workflows are gaining widespread use in life sciences applications, a domain in which privacy concerns are particularly acute. We now illustrate three types of privacy using an example from this domain. Consider the personalized disease susceptibility workflow in Fig. 1. Information such as an individual's genetic makeup and family history of disorders, which this workflow takes as input, is highly sensitive and should not be revealed to an unauthorized user, placing stringent requirements on data privacy. Further, a workflow module may compare an individual's genetic makeup to profiles of other patients and controls. The manner in which such historical data is aggregated and the comparison is made is highly sensitive, pointing to the need for module privacy. Finally, the fact that disease susceptibility predictions are generated by "calibrating" an individual's profile against profiles of others may need to be hidden, requiring that the workflow structure be kept private.

As recently noted in [8], "You are better off designing in security and privacy ... from the start, rather than trying to add them later." We apply this principle by proposing that privacy guarantees be integrated into the design of the search and query engines that access provenance-aware workflow repositories. The alternative would be to create multiple repositories corresponding to different levels of access, which would lead to inconsistencies, inefficiency, and a lack of flexibility.

This paper focuses on privacy-preserving management of provenance-aware workflow systems. We consider the formalization of privacy concerns, as well as query processing in this context. Specifically, we address issues associated with keyword-based search as well as with querying such repositories for structural patterns. To give some background on provenance-aware workflow systems, we first describe the common model for workflow specifications and their executions (Sec. 2). We then enumerate privacy concerns (Sec. 3), consider their effect on query processing, and discuss the remaining challenges (Sec. 4).

 

Date Posted: 24 July 2012