Fault Management in Distributed Systems

Zhou, Wenchao

Fault Management in Distributed Systems

Files

MS_CIS_10_03.pdf (765.34 KB)

Penn collection

Technical Reports (CIS)

Permalink

https://repository.upenn.edu/handle/20.500.14332/49808

View all metadata

Author

Zhou, Wenchao

Abstract

In the past decade, distributed systems have rapidly evolved, from simple client/server applications in local area networks, to Internet-scale peer-to-peer networks and large-scale cloud platforms deployed on tens of thousands of nodes across multiple administrative domains and geographical areas. Despite of the growing popularity and interests, designing and implementing distributed systems remains challenging, due to their ever- increasing scales and the complexity and unpredictability of the system executions. Fault management strengthens the robustness and security of distributed systems, by detecting malfunctions or violations of desired properties, diagnosing the root causes and maintaining verifiable evidences to demonstrate the diagnosis results. While its importance is well recognized, fault management in distributed systems, on the other hand, is notoriously difficult. To address the problem, various mechanisms and systems have been proposed in the past few years. In this report, we present a survey of these mechanisms and systems, and taxonomize them according to the techniques adopted and their application domains. Based on four representative systems (Pip, Friday, PeerReview and TrInc), we discuss various aspects of fault management, including fault detection, fault diagnosis and evidence generation. Their strength, limitation and application domains are evaluated and compared in detail.

Publication date

2010-01-05

Comments

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-10-03.

Collection

Reports