Performance portable GPU code generation for matrix multiplication

T Remmelg, T Lutz, M Steuwer, C Dubach - Proceedings of the 9th …, 2016 - dl.acm.org
Proceedings of the 9th Annual Workshop on General Purpose Processing using …, 2016
Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left for ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that need to be re-optimized for every new device.
Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications like matrix multiplication. We argue that what is needed is a way to describe applications at a high level without committing to particular implementations. To this end, in a previous paper we developed a functional data-parallel language that allows applications to be expressed in a device-neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device-specific forms, from which OpenCL code is generated.
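To make the idea of a semantics-preserving rewrite rule concrete, the following is a minimal sketch in Python rather than in the authors' language. It assumes a split-join style rule, map(f) = join ∘ map(map(f)) ∘ split(n), which is one common example of such a rule; the helper names (`split`, `join`, `map_flat`, `map_split_join`) are hypothetical and chosen only for illustration.

```python
def split(n, xs):
    """Chop xs into consecutive chunks of length n (the last may be shorter)."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]


def join(xss):
    """Flatten one level of nesting; undoes split."""
    return [x for xs in xss for x in xs]


def map_flat(f, xs):
    """Left-hand side of the rule: a single flat map."""
    return [f(x) for x in xs]


def map_split_join(f, xs, n):
    """Right-hand side of the rule: join . map(map(f)) . split(n).

    The nested form exposes two levels of parallelism (chunks and
    elements within a chunk) without changing the result.
    """
    return join([[f(x) for x in chunk] for chunk in split(n, xs)])


def square(x):
    return x * x


data = list(range(10))
flat_result = map_flat(square, data)
nested_result = map_split_join(square, data, 4)
```

Both forms compute the same list, so a compiler is free to pick whichever one maps better onto a given device's thread hierarchy.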
In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generates highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization such as tiling or register blocking in a composable way. Using an exploration strategy, our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized -- but provably correct -- implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia GPUs and even outperforms AMD's clBLAS library.
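As a rough illustration of what tiling does and why many variants can all be correct, here is a minimal Python sketch. It is not the generated OpenCL code or the authors' macro-rule; the function names and the small set of tile sizes are assumptions made for the example. Each tile size yields a differently structured loop nest that computes the same product as the naive version.

```python
def matmul_naive(A, B):
    """Reference product: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(A, B, tile):
    """Same product, computed tile by tile.

    The three outer loops walk over tile origins; the three inner loops
    stay within one tile, so each tile of A and B is reused while it is
    'hot' (on a GPU this would correspond to local memory or registers).
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]

reference = matmul_naive(A, B)
# A toy "exploration": one variant per candidate tile size (hypothetical choices).
variants = {tile: matmul_tiled(A, B, tile) for tile in (1, 2, 3)}
```

In a real exploration the variants would then be timed on the target device and the fastest one kept; here they are only checked against the reference.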