Static Analysis For Gpu Program Performance

Singhania, Nimit

Static Analysis For Gpu Program Performance

Files

Singhania_upenngdas_0175C_13477.pdf (817.78 KB)

Degree type

Doctor of Philosophy (PhD)

Graduate group

Computer and Information Science

Subject

Compiler Optimization
GPU Performance
Graphics Processing Units
Program Verification
Static Analysis
Computer Sciences

Copyright date

2019-04-02T20:18:00-07:00

Permalink

https://repository.upenn.edu/handle/20.500.14332/30182

View all metadata

Author

Singhania, Nimit

Abstract

GPUs have become popular due to their high computational power. Data scientists rely on GPUs to process loads of data being generated by their systems. From a humble beginning as a graphics accelerator for arcade games, they have become essential compute units in many important applications. The programming infrastructure for GPU programs is still rudimentary and the GPU programmer needs to understand the intricacies of GPU architecture, tune various execution parameters and optimize parts of the program using low-level primitives. GPU compilers are still far from the automation provided by CPU compilers where the programmer is often oblivious of the details of the underlying architecture. In this work, we present light-weight formal approaches to improve performance of general GPU programs. This enables our tools to be fast, correct and accessible to everyone. We present three works. First, we present a compile-time analysis to identify uncoalesced accesses in GPU programs. Uncoalesced accesses are a well-documented memory access pattern that leads to poor performance. Second, we present an analysis to verify block-size independence of GPU programs. Block-size is an execution parameter that must be tuned to optimally utilize GPU resources. We present a static analysis to verify block-size independence for synchronization-free GPU programs and ensure that modifying block-size does not break program functionality. Finally, we present a compile-time optimization to leverage cache reuse in GPU to improve performance of GPU programs. GPUs often abandon cache reuse-based performance improvement in favor of thread-level parallelism, where a large number of threads are executed to hide latency of memory and compute operations. We define a compile-time analysis to identify programs with significant intra-thread locality and little inter-thread locality, where cache resue is useful, and a transformation to modify block-size which indirectly influences the hardware thread-scheduler to improve cache utilization. We have implemented the above approaches in LLVM and evaluate them on various benchmarks. The uncoalesced access analysis identifies 111 accesses, the block-size independence analysis verifies 35 block-size independent kernels and the cache reuse optimization improves performance by an average 1.3x on two Nvidia GPUs. The approaches are fast and finish within few seconds for most programs.

Advisor

Rajeev Alur
Joseph Devietti

Date of degree

2018-01-01

Collection

Dissertations and Theses