DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks

GF Oliveira, J Gómez-Luna, L Orosa, S Ghose… - IEEE …, 2021 - ieeexplore.ieee.org
Data movement between the CPU and main memory is a first-order obstacle against improving
performance, scalability, and energy efficiency in modern systems. Computer systems …

Griffin: Hardware-software support for efficient page migration in multi-GPU systems

T Baruah, Y Sun, AT Dinçer… - … Symposium on High …, 2020 - ieeexplore.ieee.org
As transistor scaling becomes increasingly more difficult to achieve, scaling the core count
on a single GPU chip has also become extremely challenging. As the volume of data to …

Locality-centric data and threadblock management for massive GPUs

M Khairy, V Nikiforov, D Nellans… - 2020 53rd Annual IEEE …, 2020 - ieeexplore.ieee.org
Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip
will not be practical due to slowing growth in transistor density, low chip yields, and …

Spy in the GPU-box: Covert and side channel attacks on multi-GPU systems

SB Dutta, H Naghibijouybari, A Gupta… - Proceedings of the 50th …, 2023 - dl.acm.org
The deep learning revolution has been enabled in large part by GPUs, and more recently
accelerators, which make it possible to carry out computationally demanding training and …

Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs

Y Feng, S Na, H Kim, H Jeon - 2024 ACM/IEEE 51st Annual …, 2024 - ieeexplore.ieee.org
With the advancement of processor packaging technology and the looming end of Moore's
law, multi-chip-module (MCM) GPUs become a promising architecture to continue the …

Charon: Specialized near-memory processing architecture for clearing dead objects in memory

J Jang, J Heo, Y Lee, J Won, S Kim, SJ Jung… - Proceedings of the …, 2019 - dl.acm.org
Garbage collection (GC) is a standard feature for high productivity programming, saving a
programmer from many nasty memory-related bugs. However, these productivity benefits …

LocalityGuru: A PTX analyzer for extracting thread block-level locality in GPGPUs

D Tripathy, A Abdolrashidi, Q Fan… - … and Storage (NAS), 2021 - ieeexplore.ieee.org
Exploiting data locality in GPGPUs is critical for efficiently using the smaller data caches and
handling the memory bottleneck problem. This paper proposes a thread block-centric …

CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization

P Dalmia, RS Kumar, MD Sinclair - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …

Designing virtual memory system of MCM GPUs

B Pratheek, N Jawalkar, A Basu - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org
Multi-Chip Module (MCM) designs have emerged as a key technique to scale up a GPU's
compute capabilities in the face of slowing transistor technology. However, the …

Salus: Efficient Security Support for CXL-Expanded GPU Memory

R Abdullah, H Lee, H Zhou… - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
GPUs have become indispensable accelerators for many data-intensive applications such
as scientific workloads, deep learning models, and graph analytics; these applications share …