OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

SM Habib, S Ries, M Muhlhauser - 2010 7th International …, 2010 - ieeexplore.ieee.org

Cloud Computing is an emerging computing paradigm. It shares massively scalable, elastic
resources (eg, data, calculations, and services) transparently among the users over a …

被引用次数：188 相关文章所有 10 个版本

[PDF] arxiv.org

A modern primer on processing in memory

O Mutlu, S Ghose, J Gómez-Luna… - … computing: from devices …, 2022 - Springer

Modern computing systems are overwhelmingly designed to move data to computation. This
design choice goes directly against at least three key trends in computing that cause …

被引用次数：201 相关文章所有 6 个版本

[PDF] arxiv.org

Processing data where it makes sense: Enabling in-memory computation

O Mutlu, S Ghose, J Gómez-Luna… - Microprocessors and …, 2019 - Elsevier

Today's systems are overwhelmingly designed to move data to computation. This design
choice goes directly against at least three key trends in systems that cause performance …

被引用次数：270 相关文章所有 9 个版本

[PDF] illinois.edu

Transparent offloading and mapping (TOM) enabling programmer-transparent near-data processing in GPU systems

K Hsieh, E Ebrahimi, G Kim, N Chatterjee… - ACM SIGARCH …, 2016 - dl.acm.org

Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-
chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to …

被引用次数：311 相关文章所有 23 个版本

[PDF] acm.org

Scheduling techniques for GPU architectures with processing-in-memory capabilities

A Pattnaik, X Tang, A Jog, O Kayiran… - Proceedings of the …, 2016 - dl.acm.org

Processing data in or near memory (PIM), as opposed to in conventional computational units
in a processor, can greatly alleviate the performance and energy penalties of data transfers …

被引用次数：230 相关文章所有 15 个版本

[PDF] arxiv.org

Figaro: Improving system performance via fine-grained in-dram data relocation and caching

Y Wang, L Orosa, X Peng, Y Guo… - 2020 53rd Annual …, 2020 - ieeexplore.ieee.org

Main memory, composed of DRAM, is a performance bottleneck for many applications, due
to the high DRAM access latency. In-DRAM caches work to mitigate this latency by …

被引用次数：87 相关文章所有 22 个版本

[PDF] uth.gr

Neither more nor less: Optimizing thread-level parallelism for GPGPUs

O Kayıran, A Jog, MT Kandemir… - Proceedings of the 22nd …, 2013 - ieeexplore.ieee.org

General-purpose graphics processing units (GPG-PUs) are at their best in accelerating
computation by exploiting abundant thread-level parallelism (TLP) offered by many classes …

被引用次数：333 相关文章所有 16 个版本

[PDF] susu.ru

[PDF][PDF] Research problems and opportunities in memory systems

O Mutlu, L Subramanian - Supercomputing frontiers and …, 2014 - superfri.susu.ru

The memory system is a fundamental performance and energy bottleneck in almost all
computing systems. Recent system design, application, and technology trends that require …

被引用次数：241 相关文章所有 22 个版本

[PDF] acm.org

Mosaic: a GPU memory manager with application-transparent support for multiple page sizes

R Ausavarungnirun, J Landgraf, V Miller… - Proceedings of the 50th …, 2017 - dl.acm.org

Contemporary discrete GPUs support rich memory management features such as virtual
memory and demand paging. These features simplify GPU programming by providing a …

被引用次数：151 相关文章所有 26 个版本

[PDF] semanticscholar.org

Improving GPGPU resource utilization through alternative thread block scheduling

M Lee, S Song, J Moon, J Kim, W Seo… - 2014 IEEE 20th …, 2014 - ieeexplore.ieee.org

High performance in GPGPU workloads is obtained by maximizing parallelism and fully
utilizing the available resources. The thousands of threads are assigned to each core in …

被引用次数：219 相关文章所有 8 个版本