PAVER: Locality graph-based thread block scheduling for GPUs
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache
sizes per thread, leading to serious cache contention problems such as thrashing. Hence …
BlockMaestro: Enabling programmer-transparent task-based execution in GPU systems
AA Abdolrashidi, HA Esfeden… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
As modern GPU workloads grow in size and complexity, there is an ever-increasing demand
for GPU computational power. Emerging workloads contain hundreds or thousands of GPU …
LocalityGuru: A PTX analyzer for extracting thread block-level locality in GPGPUs
Exploiting data locality in GPGPUs is critical for efficiently using the smaller data caches and
handling the memory bottleneck problem. This paper proposes a thread block-centric …
CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization
Chiplets are transforming computer system designs, allowing system designers to combine
heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic …
Locality-aware CTA scheduling for gaming applications
A Ukarande, S Patidar, R Rangan - ACM Transactions on Architecture …, 2021 - dl.acm.org
The compute work rasterizer or the GigaThread Engine of a modern NVIDIA GPU focuses on
maximizing compute work occupancy across all streaming multiprocessors in a GPU while …
WASP: Warp scheduling to mimic prefetching in graphics workloads
Contemporary GPUs are designed to handle long-latency operations effectively; however,
challenges such as core occupancy (number of warps in a core) and pipeline width can …
Nearest data processing in GPU
The memory wall is known as one of the most critical bottlenecks in processors, rooted in the
long memory access delay. With the advent of emerging memory-intensive applications …
Criticality-aware priority to accelerate GPU memory access
H Bitalebi, F Safaei - The Journal of Supercomputing, 2023 - Springer
The graphics processing unit (GPU) concept, combined with the CUDA and OpenCL programming
models, offers new opportunities to reduce latency and power consumption of throughput …
DTexL: Decoupled raster pipeline for texture locality
Contemporary GPU architectures have multiple shader cores and a scheduler that
distributes work (threads) among them, focusing on load balancing. These load balancing …
DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores
Sparse Matrix-Matrix Multiplication (SpMM) is a building-block operation in scientific
computing and machine learning applications. Recent advancements in hardware, notably …