Shared memory multiplexing: A novel way to improve GPGPU throughput

Y Yang, P Xiang, M Mantor, N Rubin, H Zhou
Proceedings of the 21st international conference on Parallel architectures …, 2012 - dl.acm.org
On-chip shared memory (a.k.a. local data share) is a critical resource for many GPGPU applications. In current GPUs, shared memory is allocated when a thread block (also called a workgroup) is dispatched to a streaming multiprocessor (SM) and is released only when the thread block completes. As a result, the limited capacity of shared memory caps the number of thread blocks an SM can host, which limits the otherwise available thread-level parallelism (TLP). In this paper, we propose software and hardware approaches to multiplex the shared memory among multiple thread blocks.
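The capacity bottleneck described above can be made concrete with a small occupancy calculation. The following is a minimal sketch, not taken from the paper: the function name `resident_blocks`, the kernel footprint (128 threads and 8 KB of shared memory per block), and the per-SM limits (1536 threads, 8 blocks) are all illustrative assumptions chosen to resemble a Fermi-class SM.

```python
KB = 1024


def resident_blocks(smem_per_sm, smem_per_block, threads_per_block,
                    max_threads_per_sm, max_blocks_per_sm):
    """Upper bound on thread blocks that one SM can host at the same time.

    The SM can host only as many blocks as its tightest resource allows:
    the block-count limit, the thread limit, or the shared memory capacity.
    """
    limits = [max_blocks_per_sm, max_threads_per_sm // threads_per_block]
    if smem_per_block > 0:
        # Shared memory is reserved per block for the block's whole lifetime.
        limits.append(smem_per_sm // smem_per_block)
    return min(limits)


# Hypothetical kernel: 128 threads and 8 KB of shared memory per block.
# Assumed SM limits: 1536 resident threads and 8 resident blocks.
for smem_per_sm in (16 * KB, 48 * KB):
    blocks = resident_blocks(smem_per_sm, 8 * KB, 128, 1536, 8)
    print(f"{smem_per_sm // KB} KB shared memory: {blocks} resident blocks, "
          f"{blocks * 128} resident threads")

# The same kernel with no shared memory demand is bounded by other limits.
print("no shared memory demand:", resident_blocks(16 * KB, 0, 128, 1536, 8))
```

Under these assumed numbers, shared memory is the binding limit in both configurations, so the SM hosts fewer blocks than its thread and block limits would otherwise allow.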
Our approaches rest on the observation that current shared memory management is too conservative: it reserves shared memory for the entire lifetime of a thread block. If shared memory is allocated only when it is actually used and freed immediately afterwards, more thread blocks can be hosted in an SM without increasing the shared memory capacity. We propose three software approaches to enable shared memory multiplexing and implement them using a source-to-source compiler. The experimental results show that our software approaches effectively improve the throughput of many GPGPU applications on both NVIDIA GTX285 and GTX480 GPUs (an average of 1.44X on GTX285, 1.70X on GTX480 with 16 KB shared memory, and 1.26X on GTX480 with 48 KB shared memory). We also propose hardware support for shared memory multiplexing, which requires only minor changes to existing hardware and achieves significant performance improvements (an average of 1.53X) with very little change to GPGPU code.
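The allocate-on-use idea can be illustrated with a toy step-based model. This is a sketch under strong simplifying assumptions and is not the paper's compiler transformation or hardware design: the function `simulate` and all of its parameters are hypothetical, every resident block is assumed to advance one step per time unit, and each block is assumed to need shared memory only in its final steps. The numbers it produces are not comparable to the measured speedups quoted above.

```python
def simulate(num_blocks, no_smem_steps, smem_steps, smem_slots,
             max_resident, multiplex):
    """Return the time steps needed to finish all blocks in a toy SM model.

    Each block runs `no_smem_steps` steps without shared memory, then
    `smem_steps` steps that need one of `smem_slots` shared memory slots.
    multiplex=False: a slot is reserved at dispatch and held until the end.
    multiplex=True:  a slot is acquired only when the block first needs it.
    """
    if smem_slots < 1:
        raise ValueError("need at least one shared memory slot")
    total_steps = no_smem_steps + smem_steps
    pending = num_blocks
    resident = []            # each entry is [steps_done, holds_slot]
    free_slots = smem_slots
    time = 0
    while pending or resident:
        # Dispatch new blocks while the SM has room for them.
        while pending and len(resident) < max_resident:
            if multiplex:
                resident.append([0, False])
            elif free_slots > 0:
                free_slots -= 1
                resident.append([0, True])
            else:
                break        # conventional scheme: no slot, no dispatch
            pending -= 1
        # Advance every resident block by one step where possible.
        for block in resident:
            needs_slot = block[0] >= no_smem_steps
            if needs_slot and not block[1]:
                if free_slots == 0:
                    continue  # stall until another block frees a slot
                free_slots -= 1
                block[1] = True
            block[0] += 1
        # Retire finished blocks and release their slots.
        still_running = []
        for block in resident:
            if block[0] == total_steps:
                if block[1]:
                    free_slots += 1
            else:
                still_running.append(block)
        resident = still_running
        time += 1
    return time


# Hypothetical workload: 8 blocks, 3 steps without and 1 step with shared
# memory, 2 shared memory slots, and room for 8 resident blocks.
conventional = simulate(8, 3, 1, smem_slots=2, max_resident=8, multiplex=False)
multiplexed = simulate(8, 3, 1, smem_slots=2, max_resident=8, multiplex=True)
print("conventional:", conventional, "steps; multiplexed:", multiplexed, "steps")
```

In this toy setting the conventional scheme runs only two blocks at a time because each holds a slot throughout, while the multiplexed scheme keeps all eight blocks resident and serializes only the short phase that actually uses shared memory.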