PAVER: Locality graph-based thread block scheduling for GPUs

D Tripathy, A Abdolrashidi, LN Bhuyan, L Zhou… - ACM Transactions on …, 2021 - dl.acm.org
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache
sizes per thread, leading to serious cache contention problems such as thrashing. Hence …

BlockMaestro: Enabling programmer-transparent task-based execution in GPU systems

AA Abdolrashidi, HA Esfeden… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
As modern GPU workloads grow in size and complexity, there is an ever-increasing demand
for GPU computational power. Emerging workloads contain hundreds or thousands of GPU …

LocalityGuru: A PTX analyzer for extracting thread block-level locality in GPGPUs

D Tripathy, A Abdolrashidi, Q Fan… - … and Storage (NAS), 2021 - ieeexplore.ieee.org
Exploiting data locality in GPGPUs is critical for efficiently using the smaller data caches and
handling the memory bottleneck problem. This paper proposes a thread block-centric …

CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization

P Dalmia, RS Kumar, MD Sinclair - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …

Locality-aware CTA scheduling for gaming applications

A Ukarande, S Patidar, R Rangan - ACM Transactions on Architecture …, 2021 - dl.acm.org
The compute work rasterizer or the GigaThread Engine of a modern NVIDIA GPU focuses on
maximizing compute work occupancy across all streaming multiprocessors in a GPU while …

WASP: Warp scheduling to mimic prefetching in graphics workloads

D Joseph, JL Aragón, JM Parcerisa… - arXiv preprint arXiv …, 2024 - arxiv.org
Contemporary GPUs are designed to handle long-latency operations effectively; however,
challenges such as core occupancy (number of warps in a core) and pipeline width can …

Nearest data processing in GPU

H Bitalebi, F Safaei, M Ebrahimi - Sustainable Computing: Informatics and …, 2024 - Elsevier
The memory wall is known as one of the most critical bottlenecks in processors, rooted in
long memory access delays. With the advent of emerging memory-intensive applications …

Criticality-aware priority to accelerate GPU memory access

H Bitalebi, F Safaei - The Journal of Supercomputing, 2023 - Springer
The graphics processing unit (GPU) concept, combined with the CUDA and OpenCL programming
models, offers new opportunities to reduce the latency and power consumption of throughput …

DTexL: Decoupled raster pipeline for texture locality

D Joseph, JL Aragón, JM Parcerisa… - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org
Contemporary GPU architectures have multiple shader cores and a scheduler that
distributes work (threads) among them, focusing on load balancing. These load balancing …

DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores

R Fan, W Wang, X Chu - Proceedings of the 29th ACM International …, 2024 - dl.acm.org
Sparse Matrix-Matrix Multiplication (SpMM) is a building-block operation in scientific
computing and machine learning applications. Recent advancements in hardware, notably …