Static Analysis For Gpu Program Performance
Graphics Processing Units
GPUs have become popular due to their high computational power. Data scientists rely on GPUs to process loads of data being generated by their systems. From a humble beginning as a graphics accelerator for arcade games, they have become essential compute units in many important applications. The programming infrastructure for GPU programs is still rudimentary and the GPU programmer needs to understand the intricacies of GPU architecture, tune various execution parameters and optimize parts of the program using low-level primitives. GPU compilers are still far from the automation provided by CPU compilers where the programmer is often oblivious of the details of the underlying architecture. In this work, we present light-weight formal approaches to improve performance of general GPU programs. This enables our tools to be fast, correct and accessible to everyone. We present three works. First, we present a compile-time analysis to identify uncoalesced accesses in GPU programs. Uncoalesced accesses are a well-documented memory access pattern that leads to poor performance. Second, we present an analysis to verify block-size independence of GPU programs. Block-size is an execution parameter that must be tuned to optimally utilize GPU resources. We present a static analysis to verify block-size independence for synchronization-free GPU programs and ensure that modifying block-size does not break program functionality. Finally, we present a compile-time optimization to leverage cache reuse in GPU to improve performance of GPU programs. GPUs often abandon cache reuse-based performance improvement in favor of thread-level parallelism, where a large number of threads are executed to hide latency of memory and compute operations. We define a compile-time analysis to identify programs with significant intra-thread locality and little inter-thread locality, where cache resue is useful, and a transformation to modify block-size which indirectly influences the hardware thread-scheduler to improve cache utilization. We have implemented the above approaches in LLVM and evaluate them on various benchmarks. The uncoalesced access analysis identifies 111 accesses, the block-size independence analysis verifies 35 block-size independent kernels and the cache reuse optimization improves performance by an average 1.3x on two Nvidia GPUs. The approaches are fast and finish within few seconds for most programs.