SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery

Sorin, Daniel J; Martin, Milo; Hill, Mark D; Wood, David A

Files

01003568.pdf (281.02 KB)

Penn collection

Departmental Papers (CIS)

Permalink

https://repository.upenn.edu/handle/20.500.14332/6358

Author

Sorin, Daniel J

Martin, Milo

Hill, Mark D

Wood, David A

Abstract

We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution. We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.
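
The idea of keeping several checkpoints, validating them later, and rolling back to the newest validated one can be illustrated with a small software sketch. This is a toy model under stated assumptions, not the hardware mechanism of the paper: it assumes undo logging (saving the old value on the first write to an address in each checkpoint interval), covers a single memory only, and omits processor registers, coherence permissions, and logical-time coordination. The names `CheckpointLog`, `take_checkpoint`, `validate`, and `recover` are hypothetical.

```python
_UNSET = object()  # sentinel: the address held no value before the first write


class CheckpointLog:
    """Toy undo log. Checkpoint n is the memory state at the start of interval n."""

    def __init__(self):
        self.memory = {}         # address -> current value
        self.current = 0         # active checkpoint interval
        self.recovery_point = 0  # newest validated checkpoint
        self.logs = {0: {}}      # interval -> {address: value before first write}

    def write(self, addr, value):
        log = self.logs[self.current]
        if addr not in log:
            # Log the old value only once per interval (assumed policy).
            log[addr] = self.memory.get(addr, _UNSET)
        self.memory[addr] = value

    def take_checkpoint(self):
        """Start a new interval and return the number of the new checkpoint."""
        self.current += 1
        self.logs[self.current] = {}
        return self.current

    def validate(self, checkpoint):
        """Declare execution before `checkpoint` fault-free and free older logs."""
        if not self.recovery_point <= checkpoint <= self.current:
            raise ValueError("checkpoint out of range")
        for interval in range(self.recovery_point, checkpoint):
            del self.logs[interval]
        self.recovery_point = checkpoint

    def recover(self):
        """Undo every write made since the recovery point, newest interval first."""
        for interval in range(self.current, self.recovery_point - 1, -1):
            for addr, old in self.logs[interval].items():
                if old is _UNSET:
                    self.memory.pop(addr, None)
                else:
                    self.memory[addr] = old
        self.current = self.recovery_point
        self.logs = {self.current: {}}


mem = CheckpointLog()
mem.write("x", 1)
c1 = mem.take_checkpoint()
mem.write("x", 2)
mem.write("y", 5)
mem.take_checkpoint()
mem.write("x", 3)
mem.validate(c1)  # a slow fault detector confirms everything before c1
mem.recover()     # a fault is detected later: roll back to checkpoint c1
print(mem.memory)
```

In this sketch, validation happens after execution has already moved on to later intervals, which loosely mirrors the abstract's point that validation can overlap with subsequent execution rather than stall it.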

Date of presentation

2002-05-25

Conference name

29th Annual International Symposium on Computer Architecture, 2002

Conference dates

2023-05-17T00:30:00.000

Comments

Copyright 2002 IEEE. Reprinted from Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002, May 2002, pages 123-134.

This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of the University of Pennsylvania's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.

NOTE: At the time of publication, author Milo Martin was affiliated with the University of Wisconsin. Currently (March 2007), he is a faculty member in the Department of Computer and Information Science at the University of Pennsylvania.

Collection

Presentations