A survey of cache bypassing techniques

S Mittal - Journal of Low Power Electronics and Applications, 2016 - mdpi.com
With increasing core counts, the cache demand of modern processors has also increased.
However, due to strict area/power budgets and the presence of poor data-locality workloads …
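For a concrete flavor of software-directed bypassing (an illustrative sketch, not a technique taken from the survey itself): CUDA exposes cache-hint load intrinsics such as __ldcg(), which fetches through L2 only and skips the L1 data cache, so a kernel can route low-locality traffic around L1. The kernel name and the streaming_in/reused_in buffers below are hypothetical, and __ldcg()/__ldg() assume compute capability 5.0+ with a recent CUDA toolkit.

```cuda
// Minimal sketch of software-directed L1 bypassing in CUDA.
// Assumption: streaming_in has low reuse, reused_in (>= 256 floats) is hot.
__global__ void bypass_demo(const float* __restrict__ streaming_in,
                            const float* __restrict__ reused_in,
                            float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Streaming data with little reuse: load via L2 only (ld.global.cg),
    // so it does not evict L1 lines that other warps may still need.
    float s = __ldcg(&streaming_in[i]);

    // Frequently reused data: load through the read-only data cache path.
    float r = __ldg(&reused_in[i % 256]);

    out[i] = s + r;
}
```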

Locality-driven dynamic GPU cache bypassing

C Li, SL Song, H Dai, A Sidelnik, SKS Hari… - Proceedings of the 29th …, 2015 - dl.acm.org
This paper presents novel cache optimizations for massively parallel, throughput-oriented
architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing …

MASK: Redesigning the GPU memory hierarchy to support multi-application concurrency

R Ausavarungnirun, V Miller, J Landgraf… - ACM SIGPLAN …, 2018 - dl.acm.org
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to
provide high instruction throughput and to efficiently hide long-latency stalls. The resulting …

Locality-aware CTA clustering for modern GPUs

A Li, SL Song, W Liu, X Liu, A Kumar… - ACM SIGARCH …, 2017 - dl.acm.org
Caches are designed to exploit locality; however, the role of on-chip L1 data caches on modern
GPUs is often awkward. The locality among global memory requests from different SMs …

The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs

N Vijaykumar, E Ebrahimi, K Hsieh… - 2018 ACM/IEEE 45th …, 2018 - ieeexplore.ieee.org
Exploiting data locality in GPUs is critical to making more efficient use of the existing caches
and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU …

GNNMark: A benchmark suite to characterize graph neural network training on GPUs

T Baruah, K Shivdikar, S Dong, Y Sun… - … Analysis of Systems …, 2021 - ieeexplore.ieee.org
Graph Neural Networks (GNNs) have emerged as a promising class of Machine Learning
algorithms to train on non-Euclidean data. GNNs are widely used in recommender systems …

Access pattern-aware cache management for improving data utilization in GPU

G Koo, Y Oh, WW Ro, M Annavaram - Proceedings of the 44th annual …, 2017 - dl.acm.org
The long latency of memory operations is a prominent performance bottleneck in graphics
processing units (GPUs). The small data cache that must be shared across dozens of warps …

Hardware/software cooperative caching for hybrid DRAM/NVM memory architectures

H Liu, Y Chen, X Liao, H Jin, B He, L Zheng… - Proceedings of the …, 2017 - dl.acm.org
Non-Volatile Memory (NVM) has recently emerged thanks to its non-volatility, high density, and
energy efficiency. Hybrid memory systems composed of DRAM and NVM have the best of …

A coordinated tiling and batching framework for efficient GEMM on GPUs

X Li, Y Liang, S Yan, L Jia, Y Li - Proceedings of the 24th symposium on …, 2019 - dl.acm.org
General matrix multiplication (GEMM) plays a paramount role in a broad range of domains
such as deep learning, scientific computing, and image processing. The primary …
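As background for the tiling idea named in the title, here is a generic shared-memory tiled GEMM sketch, not the paper's coordinated tiling-and-batching framework. The TILE size, kernel name, and the assumption that N is a multiple of TILE are illustrative choices.

```cuda
// Generic shared-memory tiled GEMM sketch: C = A * B, row-major,
// square matrices of size N, with N assumed to be a multiple of TILE.
#define TILE 16

__global__ void tiled_gemm(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread stages one element of the current A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply the two tiles out of shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Tiling amortizes global-memory traffic: each element of A and B is loaded into shared memory once per tile and then reused TILE times by the threads of the block.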

Exploiting inter-warp heterogeneity to improve GPGPU performance

R Ausavarungnirun, S Ghose, O Kayiran… - 2015 International …, 2015 - ieeexplore.ieee.org
In a GPU, all threads within a warp execute the same instruction in lockstep. For a memory
instruction, this can lead to memory divergence: the memory requests for some threads are …
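To make the divergence notion concrete (an illustration under simple assumptions, not the paper's mechanism): when the addresses a warp's 32 threads access fall into a few cache lines the load coalesces into few memory transactions, whereas a strided pattern scatters the requests across many lines, so some threads' data returns much later than others'. The kernel names and the stride parameter below are hypothetical.

```cuda
// Contrast of coalesced vs. divergent memory access patterns.
// Both kernels read one float per thread; only the indexing differs.
__global__ void coalesced_load(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // consecutive threads -> consecutive addresses
}

__global__ void divergent_load(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * stride) % n;      // strided indexing spreads a warp's requests
    if (i < n)                     // over many cache lines / transactions
        out[i] = in[j];
}
```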