Locality-driven dynamic GPU cache bypassing

C Li, SL Song, H Dai, A Sidelnik, SKS Hari… - Proceedings of the 29th …, 2015 - dl.acm.org
This paper presents novel cache optimizations for massively parallel, throughput-oriented
architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing …
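
A minimal sketch of the kind of bypassing decision this line of work studies, using the standard CUDA `__ldcg()` intrinsic; the `high_reuse` flag is a hypothetical stand-in, not the paper's actual locality heuristic.

```cuda
// Hypothetical illustration: stream low-reuse data around the L1 D-cache
// with __ldcg() (cache at L2 and below, not L1), while high-reuse data
// takes the default load path that may allocate in L1.
__global__ void scale(const float* __restrict__ in, float* out,
                      int n, bool high_reuse)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = high_reuse ? in[i]            // default load, may allocate in L1
                         : __ldcg(&in[i]);  // cache-global: skip L1, cache at L2 only
    out[i] = 2.0f * v;
}
```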

Zorua: A holistic approach to resource virtualization in GPUs

N Vijaykumar, K Hsieh, G Pekhimenko… - 2016 49th Annual …, 2016 - ieeexplore.ieee.org
This paper introduces a new resource virtualization framework, Zorua, that decouples the
programmer-specified resource usage of a GPU application from the actual allocation in the …
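
For context, a kernel's resource footprint is normally fixed by what the programmer declares at compile/launch time. The sketch below is generic CUDA, not Zorua's interface: it shows the kind of static specification (register budget via `__launch_bounds__`, shared-memory size chosen by the host) that Zorua proposes to decouple from the physical allocation.

```cuda
#include <cuda_runtime.h>

// Generic CUDA, not Zorua's interface: the register budget implied by
// __launch_bounds__ and the shared-memory size passed at launch are fixed,
// programmer-specified allocations that the hardware honors as declared.
__global__ void __launch_bounds__(256, 4)   // <=256 threads/block, target 4 blocks/SM
sum_tiles(const float* in, float* out, int n)
{
    extern __shared__ float tile[];          // size set by the host at launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int t = 0; t < blockDim.x; ++t) s += tile[t];
        out[blockIdx.x] = s;
    }
}

// Host side: shared-memory bytes are declared up front, per block.
// sum_tiles<<<blocks, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```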

Warp-level divergence in GPUs: Characterization, impact, and mitigation

P Xiang, Y Yang, H Zhou - 2014 IEEE 20th International …, 2014 - ieeexplore.ieee.org
High throughput architectures rely on high thread-level parallelism (TLP) to hide execution
latencies. In state-of-the-art graphics processing units (GPUs), threads are organized in a grid of …
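
As a generic illustration (my example, not the paper's benchmark) of divergence at warp granularity: in the kernel below, warps within the same thread block receive very different amounts of work, yet block-granularity resources are only released when the slowest warp finishes.

```cuda
// Illustrative only: warps in the same block do unequal work, so early
// finishers sit idle while block-level resources (registers, shared memory,
// warp slots) stay reserved until the last warp completes.
__global__ void unbalanced(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warp_in_block = threadIdx.x / warpSize;
    // Work grows with the warp index: warp 0 does 1 round, warp 7 does 8.
    for (int r = 0; r <= warp_in_block; ++r)
        data[i] = data[i] * 1.0001f + 1.0f;
}
```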

CUDA-NP: Realizing nested thread-level parallelism in GPGPU applications

Y Yang, H Zhou - ACM SIGPLAN Notices, 2014 - dl.acm.org
Parallel programs consist of a series of code sections with different thread-level parallelism
(TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU …
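
A minimal sketch (mine, not CUDA-NP's generated code) of the pattern the paper targets: each GPU thread contains an inner loop whose iterations are themselves independent, i.e. nested TLP that a plain kernel serializes inside one thread.

```cuda
// Pattern targeted by nested-parallelism frameworks: the outer loop is
// already mapped to GPU threads, but each thread still runs an inner,
// independent loop serially. CUDA-NP-style approaches spread such inner
// iterations over helper threads; here they simply run sequentially.
__global__ void row_sums(const float* mat, float* out, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float s = 0.0f;
    for (int c = 0; c < cols; ++c)        // nested, independent work per thread
        s += mat[row * cols + c];
    out[row] = s;
}
```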

Pagoda: Fine-grained GPU resource virtualization for narrow tasks

TT Yeh, A Sabne, P Sakdhnagool, R Eigenmann… - ACM SIGPLAN …, 2017 - dl.acm.org
Massively multithreaded GPUs achieve high throughput by running thousands of threads in
parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching …
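
A sketch of the baseline problem (generic CUDA streams, not Pagoda's runtime API): launching many narrow kernels, each with only a couple of warps of work, leaves most of the GPU idle and pays a per-launch overhead.

```cuda
#include <cuda_runtime.h>

__global__ void narrow_task(float* buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// Baseline that task-virtualization runtimes such as Pagoda improve on:
// thousands of tiny launches, each occupying a sliver of the GPU.
void run_narrow_tasks(float** bufs, int num_tasks, int task_size)
{
    const int kStreams = 16;               // illustrative stream count
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int t = 0; t < num_tasks; ++t)
        narrow_task<<<1, 64, 0, streams[t % kStreams]>>>(bufs[t], task_size);

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```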

Virtual thread: Maximizing thread-level parallelism beyond GPU scheduling limit

MK Yoon, K Kim, S Lee, WW Ro… - ACM SIGARCH Computer …, 2016 - dl.acm.org
Modern GPUs require tens of thousands of concurrent threads to fully utilize the massive
amount of processing resources. However, thread concurrency in GPUs can be diminished …
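
To make the scheduling limit concrete, the host-side query below uses the standard CUDA occupancy API (unrelated to the Virtual Thread hardware itself) to report how many blocks of a kernel can be resident per SM given its register and shared-memory footprint.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void heavy_kernel(float* x) { x[threadIdx.x] *= 2.0f; }

// Standard occupancy query, shown only to illustrate how per-block resource
// usage caps the number of concurrently scheduled blocks per SM.
int main()
{
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, heavy_kernel, /*blockSize=*/256, /*dynamicSmem=*/0);
    printf("resident blocks per SM at 256 threads/block: %d\n", blocks_per_sm);
    return 0;
}
```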

NURA: A framework for supporting non-uniform resource accesses in GPUs

S Darabi, N Mahani, H Baxishi… - Proceedings of the …, 2022 - dl.acm.org
Multi-application execution in Graphics Processing Units (GPUs), a promising way to utilize
GPU resources, is still challenging. Some pieces of prior work (e.g., spatial multitasking) have …

AEML: An acceleration engine for multi-GPU load-balancing in distributed heterogeneous environment

Z Tang, L Du, X Zhang, L Yang… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
With the rapid growth of computation requirements in big data and artificial intelligence,
CPU-GPU heterogeneous clusters can provide more powerful computing capacity …
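
For reference, a static even split across GPUs (plain CUDA, not AEML's engine) is the naive baseline such load-balancing frameworks improve on when devices run at unequal speeds.

```cuda
#include <cuda_runtime.h>

__global__ void work(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];
}

// Naive static partition across GPUs; assumes d_chunk[g] already holds each
// device's share. A load balancer would instead size chunks by device speed.
void static_split(float** d_chunk, int total, int num_gpus)
{
    int per_gpu = (total + num_gpus - 1) / num_gpus;
    for (int g = 0; g < num_gpus; ++g) {
        cudaSetDevice(g);
        int n = (g == num_gpus - 1) ? total - g * per_gpu : per_gpu;
        work<<<(n + 255) / 256, 256>>>(d_chunk[g], n);
    }
    for (int g = 0; g < num_gpus; ++g) { cudaSetDevice(g); cudaDeviceSynchronize(); }
}
```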

Enabling efficient preemption for SIMT architectures with lightweight context switching

Z Lin, L Nyland, H Zhou - SC'16: Proceedings of the …, 2016 - ieeexplore.ieee.org
Context switching is a key technique enabling preemption and time-multiplexing for CPUs.
However, for single-instruction multiple-thread (SIMT) processors such as high-end graphics …

Warp-consolidation: A novel execution model for GPUs

A Li, W Liu, L Wang, K Barker, SL Song - Proceedings of the 2018 …, 2018 - dl.acm.org
With the unprecedented growth of compute capability and memory bandwidth on modern
GPUs, parallel communication and synchronization soon become a …
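
A small sketch of the warp-centric style the paper builds on (generic warp primitives, not the paper's full execution model): when work is organized around a single warp, block-wide barriers and shared memory can often be replaced by register shuffles.

```cuda
// Generic warp-level reduction using register shuffles instead of shared
// memory and __syncthreads(); this is the flavor of warp-centric execution
// that warp-consolidation exploits, not the paper's exact transformation.
__global__ void warp_reduce(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;       // inactive lanes contribute zero
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);   // intra-warp exchange
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, v);                               // one partial sum per warp
}
```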