Beyond the socket: NUMA-aware GPUs

Y Sun, T Baruah, SA Mojumder, S Dong… - Proceedings of the 46th …, 2019 - dl.acm.org

The rapidly growing popularity and scale of data-parallel workloads demand a
corresponding increase in raw computational power of Graphics Processing Units (GPUs) …

Cited by 99 · Related articles · All 7 versions

Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems

V Young, A Jaleel, E Bolotin, E Ebrahimi… - 2018 51st Annual …, 2018 - ieeexplore.ieee.org

Historically, improvement in GPU performance has been tightly coupled with transistor
scaling. As Moore's Law slows down, performance of single GPUs may ultimately plateau …

Cited by 69 · Related articles · All 4 versions

The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs

N Vijaykumar, E Ebrahimi, K Hsieh… - 2018 ACM/IEEE 45th …, 2018 - ieeexplore.ieee.org

Exploiting data locality in GPUs is critical to making more efficient use of the existing caches
and the NUMA-based memory hierarchy expected in future GPUs. While modern GPU …

Cited by 74 · Related articles · All 8 versions

Need for speed: Experiences building a trustworthy system-level GPU simulator

O Villa, D Lustig, Z Yan, E Bolotin, Y Fu… - … Symposium on High …, 2021 - ieeexplore.ieee.org

The demands of high-performance computing (HPC) and machine learning (ML) workloads
have resulted in the rapid architectural evolution of GPUs over the last decade. The growing …

Cited by 33 · Related articles · All 2 versions

Griffin: Hardware-software support for efficient page migration in multi-GPU systems

T Baruah, Y Sun, AT Dinçer… - … Symposium on High …, 2020 - ieeexplore.ieee.org

As transistor scaling becomes increasingly more difficult to achieve, scaling the core count
on a single GPU chip has also become extremely challenging. As the volume of data to …

Cited by 45 · Related articles · All 4 versions

Wire-aware architecture and dataflow for CNN accelerators

S Gudaparthi, S Narayanan… - Proceedings of the …, 2019 - dl.acm.org

In spite of several recent advancements, data movement in modern CNN accelerators
remains a significant bottleneck. Architectures like Eyeriss implement large scratchpads …

Cited by 44 · Related articles · All 5 versions

SAC: Sharing-aware caching in multi-chip GPUs

S Zhang, M Naderan-Tahan, M Jahre… - Proceedings of the 50th …, 2023 - dl.acm.org

Bandwidth non-uniformity in multi-chip GPUs poses a major design challenge for its
last-level cache (LLC) architecture. Whereas a memory-side LLC caches data from the local …

Cited by 7 · Related articles · All 4 versions

Architecting waferscale processors - a GPU case study

S Pal, D Petrisko, M Tomei, P Gupta… - … Symposium on High …, 2019 - ieeexplore.ieee.org

Increasing communication overheads are already threatening computer system scaling. One
approach to dramatically reduce communication overheads is waferscale processing …

Cited by 50 · Related articles · All 10 versions

Buddy compression: Enabling larger memory for deep learning and HPC workloads on GPUs

E Choukse, MB Sullivan, M O'Connor… - 2020 ACM/IEEE 47th …, 2020 - ieeexplore.ieee.org

GPUs accelerate high-throughput applications, which require orders-of-magnitude higher
memory bandwidth than traditional CPU-only systems. However, the capacity of such high …

Cited by 46 · Related articles · All 9 versions

Negative perceptions about the applicability of source-to-source compilers in HPC: A literature review

R Milewicz, P Pirkelbauer, P Soundararajan… - … Computing: ISC High …, 2021 - Springer

A source-to-source compiler is a type of translator that accepts the source code of a program
written in a programming language as its input and produces an equivalent source code in …

Cited by 8 · Related articles · All 6 versions