Departmental Papers (CIS)

Date of this Version

11-2010

Document Type

Conference Paper

Comments

11th ACM/IFIP/USENIX International Middleware Conference (Middleware), Bangalore, India, Nov 2010

Abstract

This paper examines the feasibility of dynamic rescheduling techniques for effectively utilizing compute resources within a data center. Our work is motivated by practical concerns of Intel’s NetBatch system, an Internet-scale, data-center-based distributed computing platform developed by Intel Corporation for massively parallel chip simulations within the company. NetBatch has been operational for many years and is currently deployed live on tens of thousands of machines that are globally distributed across various data centers. We analyze job execution traces collected over a one-year period from tens of thousands of NetBatch machines in 20 different pools. Our analysis shows that NetBatch currently does not make full use of all its resources. Specifically, job completion time can be severely impacted by job suspension when higher-priority jobs preempt lower-priority jobs. We then develop dynamic job rescheduling strategies that adaptively restart jobs on available resources elsewhere, which better utilize system resources and improve completion times. Our trace-driven evaluation results show that dynamic rescheduling enables NetBatch to significantly reduce system waste and the completion time of suspended jobs.

Subject Area

CPS Real-Time

Publication Source

11th ACM/IFIP/USENIX International Middleware Conference Industrial Track

Start Page

4

Last Page

10

DOI

10.1145/1891719.1891720

Copyright/Permission Statement

© ACM 2010. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 11th ACM/IFIP/USENIX International Middleware Conference Industrial Track, http://dx.doi.org/10.1145/1891719.1891720.

Keywords

Distributed computing, Dynamic rescheduling, Cloud resource management, Trace-driven analysis, Intel NetBatch

Date Posted: 26 April 2011

This document has been peer reviewed.