Resource Sharing for Machine Learning Serving

Degree type
PhD
Graduate group
Computer and Information Science
Discipline
Computer Sciences
Data Science
Subject
Machine Learning
Multiplexing
Resource Sharing
Scheduling
System
Copyright date
01/01/2025
Author
Ng, Kelvin
Abstract

The proliferation of machine learning has transformed numerous applications, leading to unprecedented demands on datacenter computation resources. While researchers have made significant strides in improving machine learning serving efficiency through various approaches, the rapid evolution of machine learning continues to pose new challenges. Traditional design principles, which focus on optimizing individual components such as computation kernels, memory usage, and collective communication, struggle to keep pace with increasingly integrated, irregular, and massive machine learning models. This dissertation proposes resource sharing as a fundamental design principle to address these emerging challenges. Whereas traditional designs emphasize the performance of individual components, we focus on the interaction among components and identify previously unexplored opportunities for resource sharing. To this end, we introduce novel approaches to computation resource multiplexing and common execution path sharing, and this dissertation presents our research along these two directions. First, we introduce Paella, a software-defined GPU scheduling framework that enables fine-grained control over scheduling to achieve efficient multiplexing of computation resources among different models. Second, we present a novel model serving system that optimizes the sharing of common execution paths among different inference pipelines through a dynamic model execution system and a data-driven placement optimization algorithm. Finally, we identify future research directions to advance the paradigm of resource sharing: (1) performance and security isolation for practical deployment, and (2) optimized computation-communication overlapping that improves memory locality to reduce stress on the memory system.

Advisor
Liu, Vincent
Date of degree
2025