Processing Data-Intensive Workflows in the Cloud

dc.contributor.author: Zhang, Zhuoyao
dc.date: 2023-05-24T09:51:52.487
dc.date.accessioned: 2023-05-24T10:01:56Z
dc.date.available: 2023-05-24T10:01:56Z
dc.date.issued: 2012-01-01
dc.date.submitted: 2012-04-20T12:05:21-07:00
dc.description.abstract: In recent years, large-scale data analysis has become critical to the success of modern enterprises. Meanwhile, with the emergence of cloud computing, companies are moving their data analytics tasks to the cloud, attracted by its flexible, on-demand resource usage and pay-as-you-go pricing model. MapReduce has been widely recognized as an important tool for performing large-scale data analysis in the cloud. It provides a simple, fault-tolerant framework for users to process data-intensive analytics tasks in parallel across different physical machines. In this report, we survey alternative implementations of MapReduce, contrasting batch-oriented and pipelined execution models, and study how these models impact response time, completion time, and robustness. Next, we present three optimization strategies for MapReduce-style workflows: (1) scan sharing across MapReduce programs, (2) workflow optimizations aimed at reducing intermediate data, and (3) scheduling policies that map workflow tasks to different machines in order to minimize completion times and monetary costs. We conclude with a brief comparison of these optimization strategies and discuss their pros and cons, as well as the performance implications of using more than one optimization strategy at a time. University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-12-08.
dc.description.comments: University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-12-08.
dc.identifier.uri: https://repository.upenn.edu/handle/20.500.14332/49811
dc.legacy.articleid: 2016
dc.legacy.fulltexturl: https://repository.upenn.edu/cgi/viewcontent.cgi?article=2016&context=cis_reports&unstamped=1
dc.source.issue: 970
dc.source.journal: Technical Reports (CIS)
dc.source.status: published
dc.title: Processing Data-Intensive Workflows in the Cloud
dc.type: Report
digcom.identifier: cis_reports/970
digcom.identifier.contextkey: 2785827
digcom.identifier.submissionpath: cis_reports/970
digcom.type: report
dspace.entity.type: Publication
relation.isAuthorOfPublication: a5880b9c-2a84-4259-8605-9ac8dd16f81e
relation.isAuthorOfPublication.latestForDiscovery: a5880b9c-2a84-4259-8605-9ac8dd16f81e
upenn.schoolDepartmentCenter: Technical Reports (CIS)
Files
Original bundle
Name: MS_CIS_12_08.pdf
Size: 914.36 KB
Format: Adobe Portable Document Format
Collection