Reproducibility and Performance Optimizations for Unmodified Linux Programs
Subject
Operating systems
Program manipulation
Program tracing
Reproducibility
Abstract
The demands of modern computing continue to escalate each year. Consumers expect increased performance from systems that must also operate without failure. These goals are difficult to achieve for any single system, let alone for systems in general.
System calls are the means by which applications interact with the operating system (OS). Program tracing at the system call level is a powerful tool: it lets us abstract over many details of program execution and reduce a program to the system calls it performs. Working at the boundary between the kernel and the application gives a level of abstraction that is easier to reason about, yet general enough that any program can be viewed as a series of system calls. In this dissertation, we go a step beyond read-only tracing and examine program manipulation via system call interposition, its many applications, and how it allows us to build program-agnostic systems. We analyze ptrace, a built-in Linux mechanism for program tracing and manipulation, and describe its strengths and shortcomings. We also explain how we developed an asynchronous wrapper around the ptrace API that allowed us to circumvent both the programmability and the performance issues inherent to ptrace. We demonstrate ptrace's utility as a key component of two systems we created: DetTrace and ProcessCache.
DetTrace is a reproducible container abstraction for Linux implemented entirely in userspace. All computation that occurs inside a DetTrace container is a pure function of the initial file system state of the container. Reproducible containers can be used for a variety of purposes, including replication for fault tolerance, reproducible software builds, and reproducible data analytics. We use DetTrace to achieve, automatically, reproducibility for 12,130 Debian package builds, containing over 800 million lines of code, as well as for bioinformatics and machine learning workflows. We show that, while software in each of these domains is initially irreproducible, DetTrace brings reproducibility without requiring any hardware, OS, or application changes. DetTrace's performance is dictated by the frequency of system calls: I/O-intensive software builds have an average overhead of 3.49x, while compute-bound bioinformatics workflows incur an overhead of under 2%.
The ProcessCache system provides a generic facility for automatically memoizing the work of a broad class of multi-process Linux programs. ProcessCache caches results and transparently determines when cached results can be used and when re-execution is necessary. ProcessCache generalizes previous work on forward build systems, going beyond software builds to other multi-process programs such as shell scripts and bioinformatics workflows. ProcessCache supports unmodified Linux binaries, using the ptrace mechanism to trace system calls and determine program inputs. Our experiments show that ProcessCache can automatically provide incremental computation to existing programs, accelerating workloads by 1.06x to 65x.
We conclude with an in-depth analysis of future work directions for the two systems. We focus this section on ProcessCache because it comprises the bulk of this dissertation, but first we propose an addition to DetTrace that could alleviate the performance and correctness issues it suffers when handling threads. For ProcessCache, we examine potential avenues to improve its performance, space utilization, and correctness guarantees, and we also discuss why some previously proposed improvements are not viable solutions.
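The memoization idea behind ProcessCache can be illustrated with a minimal sketch. This is not ProcessCache's implementation: the abstract states that ProcessCache discovers program inputs automatically by tracing system calls with ptrace, whereas the sketch below assumes the caller declares the input files explicitly. The names `cache_key` and `CommandCache` are hypothetical and exist only for this illustration.

```python
import hashlib
import subprocess
import sys
from pathlib import Path


def cache_key(argv, input_paths):
    """Hash the command line together with the contents of its declared inputs."""
    h = hashlib.sha256()
    for arg in argv:
        h.update(arg.encode())
        h.update(b"\0")
    # Assumption: inputs are declared by the caller. ProcessCache instead
    # infers them from traced system calls.
    for path in sorted(str(p) for p in input_paths):
        h.update(path.encode())
        h.update(b"\0")
        h.update(hashlib.sha256(Path(path).read_bytes()).digest())
    return h.hexdigest()


class CommandCache:
    """In-memory memoization of a command's standard output (hypothetical helper)."""

    def __init__(self):
        self._results = {}
        self.executions = 0  # how many times a command was actually run

    def run(self, argv, input_paths):
        key = cache_key(argv, input_paths)
        if key not in self._results:
            # Cache miss: the command or one of its inputs changed, so re-execute.
            self.executions += 1
            completed = subprocess.run(argv, capture_output=True, check=True)
            self._results[key] = completed.stdout
        return self._results[key]
```

Calling `run` twice with the same command and unchanged input files executes the command once and serves the second call from the cache. Modifying any declared input changes the key and forces re-execution. A real system must also account for outputs other than standard output, such as files written by the process, which this sketch ignores.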