Locality-driven dynamic GPU cache bypassing

C Li, SL Song, H Dai, A Sidelnik, SKS Hari… - Proceedings of the 29th …, 2015 - dl.acm.org
This paper presents novel cache optimizations for massively parallel, throughput-oriented
architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing …
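
A minimal sketch of the kind of bypassing decision this line of work studies, using the standard CUDA `__ldcg()` intrinsic; the `high_reuse` flag is a hypothetical stand-in, not the paper's actual locality heuristic.

```cuda
// Hypothetical illustration: stream low-reuse data around the L1 D-cache
// with __ldcg() (cache at L2 and below, not L1), while high-reuse data
// takes the default load path that may allocate in L1.
__global__ void scale(const float* __restrict__ in, float* out,
                      int n, bool high_reuse)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = high_reuse ? in[i]            // default load, may allocate in L1
                         : __ldcg(&in[i]);  // cache-global: skip L1, cache at L2 only
    out[i] = 2.0f * v;
}
```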

Zorua: A holistic approach to resource virtualization in GPUs

N Vijaykumar, K Hsieh, G Pekhimenko… - 2016 49th Annual …, 2016 - ieeexplore.ieee.org
This paper introduces a new resource virtualization framework, Zorua, that decouples the
programmer-specified resource usage of a GPU application from the actual allocation in the …
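
For context, a kernel's resource footprint is normally fixed by what the programmer declares at compile/launch time. The sketch below is generic CUDA, not Zorua's interface: it shows the kind of static specification (register budget via `__launch_bounds__`, shared-memory size chosen by the host) that Zorua proposes to decouple from the physical allocation.

```cuda
#include <cuda_runtime.h>

// Generic CUDA, not Zorua's interface: the register budget implied by
// __launch_bounds__ and the shared-memory size passed at launch are fixed,
// programmer-specified allocations that the hardware honors as declared.
__global__ void __launch_bounds__(256, 4)   // <=256 threads/block, target 4 blocks/SM
sum_tiles(const float* in, float* out, int n)
{
    extern __shared__ float tile[];          // size set by the host at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int t = 0; t < blockDim.x; ++t) s += tile[t];
        out[blockIdx.x] = s;
    }
}

// Host side: shared-memory bytes are declared up front, per block.
// sum_tiles<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```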

Warp-level divergence in GPUs: Characterization, impact, and mitigation

P Xiang, Y Yang, H Zhou - 2014 IEEE 20th International …, 2014 - ieeexplore.ieee.org
High throughput architectures rely on high thread-level parallelism (TLP) to hide execution
latencies. In state-of-the-art graphics processing units (GPUs), threads are organized in a grid of …
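
As a generic illustration (my example, not the paper's benchmark) of divergence at warp granularity: in the kernel below, warps within the same thread block receive very different amounts of work, yet block-granularity resources are only released when the slowest warp finishes.

```cuda
// Illustrative only: warps in the same block do unequal work, so early
// finishers sit idle while block-level resources (registers, shared memory,
// warp slots) stay reserved until the last warp completes.
__global__ void unbalanced(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warp_in_block = threadIdx.x / warpSize;
    // Work grows with the warp index: warp 0 does 1 round, warp 7 does 8.
    for (int r = 0; r <= warp_in_block; ++r)
        data[i] = data[i] * 1.0001f + 1.0f;
}
```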

CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications

Y Yang, H Zhou - ACM SIGPLAN Notices, 2014 - dl.acm.org
Parallel programs consist of a series of code sections with different thread-level parallelism
(TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU …
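
A minimal sketch (mine, not CUDA-NP's generated code) of the pattern the paper targets: each GPU thread contains an inner loop whose iterations are themselves independent, i.e. nested TLP that a plain kernel serializes inside one thread.

```cuda
// Pattern targeted by nested-parallelism frameworks: the outer loop is
// already mapped to GPU threads, but each thread still runs an inner,
// independent loop serially. CUDA-NP-style approaches spread such inner
// iterations over helper threads; here they simply run sequentially.
__global__ void row_sums(const float* mat, float* out, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float s = 0.0f;
    for (int c = 0; c < cols; ++c)        // nested, independent work per thread
        s += mat[row * cols + c];
    out[row] = s;
}
```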

Pagoda: Fine-grained GPU resource virtualization for narrow tasks

TT Yeh, A Sabne, P Sakdhnagool, R Eigenmann… - ACM SIGPLAN …, 2017 - dl.acm.org
Massively multithreaded GPUs achieve high throughput by running thousands of threads in
parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching …
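
A sketch of the baseline problem (generic CUDA streams, not Pagoda's runtime API): launching many narrow kernels, each with only a couple of warps of work, leaves most of the GPU idle and pays a per-launch overhead.

```cuda
#include <cuda_runtime.h>

__global__ void narrow_task(float* buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// Baseline that task-virtualization runtimes such as Pagoda improve on:
// thousands of tiny launches, each occupying a sliver of the GPU.
void run_narrow_tasks(float** bufs, int num_tasks, int task_size)
{
    const int kStreams = 16;               // illustrative stream count
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int t = 0; t < num_tasks; ++t)
        narrow_task<<<1, 64, 0, streams[t % kStreams]>>>(bufs[t], task_size);

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```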

Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit

MK Yoon, K Kim, S Lee, WW Ro… - ACM SIGARCH Computer …, 2016 - dl.acm.org
Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive
amount of processing resources. However, thread concurrency in GPUs can be diminished …
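
To make the scheduling limit concrete, the host-side query below uses the standard CUDA occupancy API (unrelated to the Virtual Thread hardware itself) to report how many blocks of a kernel can be resident per SM given its register and shared-memory footprint.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void heavy_kernel(float* x) { x[threadIdx.x] *= 2.0f; }

// Standard occupancy query, shown only to illustrate how per-block resource
// usage caps the number of concurrently scheduled blocks per SM.
int main()
{
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, heavy_kernel, /*blockSize=*/256, /*dynamicSmem=*/0);
    printf("resident blocks per SM at 256 threads/block: %d\n", blocks_per_sm);
    return 0;
}
```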

NURA: A framework for supporting non-uniform resource accesses in GPUs

S Darabi, N Mahani, H Baxishi… - Proceedings of the …, 2022 - dl.acm.org
Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize
GPU resources, is still challenging. Some pieces of prior work (e.g., spatial multitasking) have …

AEML: An acceleration engine for multi-GPU load-balancing in distributed heterogeneous environment

Z Tang, L Du, X Zhang, L Yang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
With the rapid growth of computation requirements in big data and artificial intelligence,
CPU-GPU heterogeneous clusters can provide more powerful computing capacity …
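
For reference, a static even split across GPUs (plain CUDA, not AEML's engine) is the naive baseline such load-balancing frameworks improve on when devices run at unequal speeds.

```cuda
#include <cuda_runtime.h>

__global__ void work(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];
}

// Naive static partition across GPUs; assumes d_chunk[g] already holds each
// device's share. A load balancer would instead size chunks by device speed.
void static_split(float** d_chunk, int total, int num_gpus)
{
    int per_gpu = (total + num_gpus - 1) / num_gpus;
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);
        int n = (g == num_gpus - 1) ? total - g * per_gpu : per_gpu;
        work<<<(n + 255) / 256, 256>>>(d_chunk[g], n);
    }
    for (int g = 0; g < num_gpus; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
}
```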

Enabling efficient preemption for SIMT architectures with lightweight context switching

Z Lin, L Nyland, H Zhou - SC'16: Proceedings of the …, 2016 - ieeexplore.ieee.org
Context switching is a key technique enabling preemption and time-multiplexing for CPUs.
However, for single-instruction multiple-thread (SIMT) processors such as high-end graphics …

Warp-consolidation: A novel execution model for GPUs

A Li, W Liu, L Wang, K Barker, SL Song - Proceedings of the 2018 …, 2018 - dl.acm.org
With the unprecedented growth of compute capability and memory bandwidth on modern
GPUs, parallel communication and synchronization soon become a …
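
A small sketch of the warp-centric style the paper builds on (generic warp primitives, not the paper's full execution model): when work is organized around a single warp, block-wide barriers and shared memory can often be replaced by register shuffles.

```cuda
// Generic warp-level reduction using register shuffles instead of shared
// memory and __syncthreads(); this is the flavor of warp-centric execution
// that warp-consolidation exploits, not the paper's exact transformation.
__global__ void warp_reduce(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;       // inactive lanes contribute zero
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);   // intra-warp exchange
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, v);                               // one partial sum per warp
}
```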