Dissecting GPU memory hierarchy through microbenchmarking

X Mei, X Chu - IEEE Transactions on Parallel and Distributed …, 2016 - ieeexplore.ieee.org
Memory access efficiency is a key factor in fully utilizing the computational power of graphics
processing units (GPUs). However, many details of the GPU memory hierarchy are not …

A survey of cache bypassing techniques

S Mittal - Journal of Low Power Electronics and Applications, 2016 - mdpi.com
With increasing core-count, the cache demand of modern processors has also increased.
However, due to strict area/power budgets and presence of poor data-locality workloads …

A framework for memory oversubscription management in graphics processing units

C Li, R Ausavarungnirun, CJ Rossbach… - Proceedings of the …, 2019 - dl.acm.org
Modern discrete GPUs support unified memory and demand paging. Automatic
management of data movement between CPU memory and GPU memory dramatically …

MASK: Redesigning the GPU memory hierarchy to support multi-application concurrency

R Ausavarungnirun, V Miller, J Landgraf… - ACM SIGPLAN …, 2018 - dl.acm.org
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to
provide high instruction throughput and to efficiently hide long-latency stalls. The resulting …

GraphReduce: processing large-scale graphs on accelerator-based systems

D Sengupta, SL Song, K Agarwal… - Proceedings of the …, 2015 - dl.acm.org
Recent work on real-world graph analytics has sought to leverage the massive amount of
parallelism offered by GPU devices, but challenges remain due to the inherent irregularity of …

Locality-aware CTA clustering for modern GPUs

A Li, SL Song, W Liu, X Liu, A Kumar… - ACM SIGARCH …, 2017 - dl.acm.org
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern
GPUs is often awkward. The locality among global memory requests from different SMs …

The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs

N Vijaykumar, E Ebrahimi, K Hsieh… - 2018 ACM/IEEE 45th …, 2018 - ieeexplore.ieee.org
Exploiting data locality in GPUs is critical to making more efficient use of the existing caches
and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU …

Access pattern-aware cache management for improving data utilization in GPU

G Koo, Y Oh, WW Ro, M Annavaram - Proceedings of the 44th annual …, 2017 - dl.acm.org
Long latency of memory operation is a prominent performance bottleneck in graphics
processing units (GPUs). The small data cache that must be shared across dozens of warps …

Adaptive and transparent cache bypassing for GPUs

A Li, GJ van den Braak, A Kumar… - Proceedings of the …, 2015 - dl.acm.org
In the last decade, GPUs have emerged to be widely adopted for general-purpose
applications. To capture on-chip locality for these applications, modern GPUs have …

Exploiting inter-warp heterogeneity to improve GPGPU performance

R Ausavarungnirun, S Ghose, O Kayiran… - 2015 International …, 2015 - ieeexplore.ieee.org
In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory
instruction, this can lead to memory divergence: the memory requests for some threads are …