DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks
Data movement between the CPU and main memory is a first-order obstacle against improving
performance, scalability, and energy efficiency in modern systems. Computer systems …
Griffin: Hardware-software support for efficient page migration in multi-GPU systems
As transistor scaling becomes increasingly more difficult to achieve, scaling the core count
on a single GPU chip has also become extremely challenging. As the volume of data to …
Locality-centric data and threadblock management for massive GPUs
Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip
will not be practical due to slowing growth in transistor density, low chip yields, and …
Spy in the GPU-box: Covert and side channel attacks on multi-GPU systems
The deep learning revolution has been enabled in large part by GPUs, and more recently
accelerators, which make it possible to carry out computationally demanding training and …
Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs
With the advancement of processor packaging technology and the looming end of Moore's
law, multi-chip-module (MCM) GPUs become a promising architecture to continue the …
Charon: Specialized near-memory processing architecture for clearing dead objects in memory
Garbage collection (GC) is a standard feature for high productivity programming, saving a
programmer from many nasty memory-related bugs. However, these productivity benefits …
LocalityGuru: A PTX analyzer for extracting thread block-level locality in GPGPUs
Exploiting data locality in GPGPUs is critical for efficiently using the smaller data caches and
handling the memory bottleneck problem. This paper proposes a thread block-centric …
CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …
Designing virtual memory system of MCM GPUs
B Pratheek, N Jawalkar, A Basu - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org
Multi-Chip Module (MCM) designs have emerged as a key technique to scale up a GPU's
compute capabilities in the face of slowing transistor technology. However, the …
Salus: Efficient Security Support for CXL-Expanded GPU Memory
GPUs have become indispensable accelerators for many data-intensive applications such
as scientific workloads, deep learning models, and graph analytics; these applications share …