A modern primer on processing in memory

O Mutlu, S Ghose, J Gómez-Luna… - … computing: from devices …, 2022 - Springer
Modern computing systems are overwhelmingly designed to move data to computation. This
design choice goes directly against at least three key trends in computing that cause …

Processing data where it makes sense: Enabling in-memory computation

O Mutlu, S Ghose, J Gómez-Luna… - Microprocessors and …, 2019 - Elsevier
Today's systems are overwhelmingly designed to move data to computation. This design
choice goes directly against at least three key trends in systems that cause performance …

DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks

GF Oliveira, J Gómez-Luna, L Orosa, S Ghose… - IEEE …, 2021 - ieeexplore.ieee.org
Data movement between the CPU and main memory is a first-order obstacle to improving performance, scalability, and energy efficiency in modern systems. Computer systems …

Figaro: Improving system performance via fine-grained in-dram data relocation and caching

Y Wang, L Orosa, X Peng, Y Guo… - 2020 53rd Annual …, 2020 - ieeexplore.ieee.org
Main memory, composed of DRAM, is a performance bottleneck for many applications, due
to the high DRAM access latency. In-DRAM caches work to mitigate this latency by …

Neither more nor less: Optimizing thread-level parallelism for GPGPUs

O Kayıran, A Jog, MT Kandemir… - Proceedings of the 22nd …, 2013 - ieeexplore.ieee.org
General-purpose graphics processing units (GPGPUs) are at their best in accelerating
computation by exploiting abundant thread-level parallelism (TLP) offered by many classes …

Research problems and opportunities in memory systems

O Mutlu, L Subramanian - Supercomputing frontiers and …, 2014 - superfri.susu.ru
The memory system is a fundamental performance and energy bottleneck in almost all
computing systems. Recent system design, application, and technology trends that require …

Mosaic: a GPU memory manager with application-transparent support for multiple page sizes

R Ausavarungnirun, J Landgraf, V Miller… - Proceedings of the 50th …, 2017 - dl.acm.org
Contemporary discrete GPUs support rich memory management features such as virtual
memory and demand paging. These features simplify GPU programming by providing a …

Improving GPGPU resource utilization through alternative thread block scheduling

M Lee, S Song, J Moon, J Kim, W Seo… - 2014 IEEE 20th …, 2014 - ieeexplore.ieee.org
High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. Thousands of threads are assigned to each core in …

Divergence-aware warp scheduling

TG Rogers, M O'Connor, TM Aamodt - … of the 46th Annual IEEE/ACM …, 2013 - dl.acm.org
This paper uses hardware thread scheduling to improve the performance and energy
efficiency of divergent applications on GPUs. We propose Divergence-Aware Warp …

Coordinated static and dynamic cache bypassing for GPUs

X Xie, Y Liang, Y Wang, G Sun… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
The massively parallel architecture enables graphics processing units (GPUs) to boost
performance for a wide range of applications. Initially, GPUs only employ scratchpad …