A survey of cache bypassing techniques
S Mittal - Journal of Low Power Electronics and Applications, 2016 - mdpi.com
With increasing core-count, the cache demand of modern processors has also increased.
However, due to strict area/power budgets and presence of poor data-locality workloads …
Locality-driven dynamic GPU cache bypassing
This paper presents novel cache optimizations for massively parallel, throughput-oriented
architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing …
MASK: Redesigning the GPU memory hierarchy to support multi-application concurrency
R Ausavarungnirun, V Miller, J Landgraf… - ACM SIGPLAN …, 2018 - dl.acm.org
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to
provide high instruction throughput and to efficiently hide long-latency stalls. The resulting …
Locality-aware CTA clustering for modern GPUs
Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern
GPUs is often awkward. The locality among global memory requests from different SMs …
The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs
Exploiting data locality in GPUs is critical to making more efficient use of the existing caches
and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU …
GNNMark: A benchmark suite to characterize graph neural network training on GPUs
Graph Neural Networks (GNNs) have emerged as a promising class of Machine Learning
algorithms to train on non-Euclidean data. GNNs are widely used in recommender systems …
Access pattern-aware cache management for improving data utilization in GPU
Long latency of memory operation is a prominent performance bottleneck in graphics
processing units (GPUs). The small data cache that must be shared across dozens of warps …
Hardware/software cooperative caching for hybrid DRAM/NVM memory architectures
Non-Volatile Memory (NVM) has recently emerged for its nonvolatility, high density and
energy efficiency. Hybrid memory systems composed of DRAM and NVM have the best of …
A coordinated tiling and batching framework for efficient GEMM on GPUs
General matrix multiplication (GEMM) plays a paramount role in a broad range of domains
such as deep learning, scientific computing, and image processing. The primary …
Exploiting inter-warp heterogeneity to improve GPGPU performance
In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory
instruction, this can lead to memory divergence: the memory requests for some threads are …