PAVER: Locality graph-based thread block scheduling for GPUs
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache
sizes per thread, leading to serious cache contention problems such as thrashing. Hence …
BlockMaestro: Enabling programmer-transparent task-based execution in GPU systems
AA Abdolrashidi, HA Esfeden… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
As modern GPU workloads grow in size and complexity, there is an ever-increasing demand
for GPU computational power. Emerging workloads contain hundreds or thousands of GPU …
LocalityGuru: A PTX analyzer for extracting thread block-level locality in GPGPUs
Exploiting data locality in GPGPUs is critical for efficiently using the smaller data caches and
handling the memory bottleneck problem. This paper proposes a thread block-centric …
CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …
Locality-aware CTA scheduling for gaming applications
A Ukarande, S Patidar, R Rangan - ACM Transactions on Architecture …, 2021 - dl.acm.org
The compute work rasterizer or the GigaThread Engine of a modern NVIDIA GPU focuses on
maximizing compute work occupancy across all streaming multiprocessors in a GPU while …
WASP: Warp scheduling to mimic prefetching in graphics workloads
Contemporary GPUs are designed to handle long-latency operations effectively; however,
challenges such as core occupancy (number of warps in a core) and pipeline width can …
Nearest data processing in GPU
The memory wall is known as one of the most critical bottlenecks in processors, rooted in the
long memory access delay. With the advent of emerging memory-intensive applications …
Criticality-aware priority to accelerate GPU memory access
H Bitalebi, F Safaei - The Journal of Supercomputing, 2023 - Springer
The graphics processing unit (GPU) concept, combined with the CUDA and OpenCL programming
models, offers new opportunities to reduce latency and power consumption of throughput …
DTexL: Decoupled raster pipeline for texture locality
Contemporary GPU architectures have multiple shader cores and a scheduler that
distributes work (threads) among them, focusing on load balancing. These load balancing …
DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores
Sparse Matrix-Matrix Multiplication (SpMM) is a building-block operation in scientific
computing and machine learning applications. Recent advancements in hardware, notably …