Extending Provenance for Deep Diagnosis of Distributed Systems
Diagnosing and repairing problems in complex distributed systems has always been challenging. A wide variety of problems can occur: routers can be misconfigured, nodes can be hacked, and the control software can have bugs. The complexity and scale of today’s distributed systems make diagnosis harder still. Provenance is an attractive way to diagnose faults in distributed systems, because it can track the causality from a symptom to a set of root causes. Prior work on network provenance has successfully applied provenance to distributed systems. However, existing approaches cannot explain problems beyond the presence of faulty events, and they offer limited help with finding repairs. In this dissertation, we extend provenance to handle diagnostic problems that require deeper investigation. We propose three extensions: negative provenance explains not just the presence but also the absence of events (such as missing packets); meta provenance can suggest repairs by tracking causality not only for data but also for code (such as bugs in control plane programs); and temporal provenance tracks causality at the temporal level and aims at diagnosing timing-related faults (such as slow requests). Compared to classical network provenance, our approach tracks richer causality at runtime and applies more sophisticated reasoning and post-processing. We apply these techniques to software-defined networking and the Border Gateway Protocol. Evaluations with real-world traffic and topologies show that our systems can diagnose and repair practical problems, and that both the runtime overhead and the query turnaround times are reasonable.
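As a rough illustration of the difference between classical provenance (explaining why an event is present) and negative provenance (explaining why an event is absent), consider the toy sketch below. It is not the dissertation's implementation: the rule table, the fact names, and both functions are hypothetical and exist only to make the idea concrete.

```python
# Hypothetical toy rules: each derived fact lists the facts it depends on.
# These names are illustrative assumptions, not taken from the dissertation.
RULES = {
    "packet_delivered": ["flow_entry_installed", "link_up"],
    "flow_entry_installed": ["controller_policy"],
}


def explain_presence(fact, base_facts):
    """Return the base facts that caused `fact`, or None if it cannot be derived."""
    if fact in base_facts:
        return {fact}
    if fact not in RULES:
        return None
    causes = set()
    for dep in RULES[fact]:
        sub = explain_presence(dep, base_facts)
        if sub is None:
            # One dependency is missing, so the fact is not present.
            return None
        causes |= sub
    return causes


def explain_absence(fact, base_facts):
    """Return the missing base facts that prevent `fact` from being derived."""
    if fact in base_facts:
        return set()
    if fact not in RULES:
        # A base fact that is simply not there is itself a root cause.
        return {fact}
    missing = set()
    for dep in RULES[fact]:
        missing |= explain_absence(dep, base_facts)
    return missing
```

With both `controller_policy` and `link_up` present, `explain_presence("packet_delivered", ...)` traces the delivery back to those two base facts. With only `link_up` present, the packet is never delivered, and `explain_absence` reports the missing `controller_policy` as the root cause of the missing packet.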
Wu, Yang, "Extending Provenance for Deep Diagnosis of Distributed Systems" (2017). Dissertations available from ProQuest. AAI10683194.