Resource Sharing for Machine Learning Serving

Degree type
PhD
Graduate group
Computer and Information Science
Discipline
Computer Sciences
Data Science
Subject
Machine Learning
Multiplexing
Resource Sharing
Scheduling
System
Copyright date
01/01/2025
Author
Ng, Kelvin
Abstract

The proliferation of machine learning has transformed numerous applications, leading to unprecedented demands on datacenter computation resources. While researchers have made significant strides in improving machine learning serving efficiency through various approaches, the rapid evolution of machine learning continues to pose new challenges. Traditional design principles, which focus on optimizing individual components such as computation kernels, memory usage, and collective communication, struggle to keep pace with increasingly integrated, irregular, and massive machine learning models. This dissertation proposes resource sharing as a fundamental design principle to address these emerging challenges. Whereas traditional designs emphasize the performance of individual components, we focus on the interaction among components and identify previously unexplored opportunities for resource sharing. To this end, we introduce novel approaches to computation resource multiplexing and common execution path sharing, and this dissertation presents our research along these two directions. First, we introduce Paella, a software-defined GPU scheduling framework that enables fine-grained control over scheduling to achieve efficient multiplexing of computation resources among different models. Second, we present a novel model serving system that optimizes the sharing of common execution paths among different inference pipelines through a dynamic model execution system and a data-driven placement optimization algorithm. Finally, we identify future research directions to advance the paradigm of resource sharing: (1) performance and security isolation for practical deployment, and (2) optimized computation-communication overlapping that improves memory locality to reduce stress on the memory system.

Advisor
Liu, Vincent
Date of degree
2025