A survey of recent prefetching techniques for processor caches
S Mittal - ACM Computing Surveys (CSUR), 2016 - dl.acm.org
As the trends of process scaling make memory systems an even more crucial bottleneck, the
importance of latency hiding techniques such as prefetching grows further. However, naively …
DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks
Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems …
Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design
Irregular workloads are typically bottlenecked by the memory system. These workloads often
use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to …
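To make the data-indirect pattern concrete, here is a minimal C sketch of a CSR sparse matrix-vector multiply (identifiers such as spmv_csr, row_ptr, and col_idx are illustrative, not taken from the Prodigy paper): the address of each x[col_idx[j]] load depends on a value that must itself be loaded first, which is why such accesses defeat stride-based prefetchers.

```c
#include <stddef.h>

/* Illustrative CSR sparse matrix-vector multiply (y = A*x).
 * The load x[col_idx[j]] is "data-indirect": its address depends on
 * the value col_idx[j], which must itself be loaded first, so simple
 * stride prefetchers cannot predict it. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr,   /* n_rows + 1 entries */
              const size_t *col_idx,   /* one entry per nonzero */
              const double *vals,      /* one entry per nonzero */
              const double *x,
              double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double acc = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            acc += vals[j] * x[col_idx[j]];   /* indirect access */
        y[i] = acc;
    }
}
```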
Towards high performance paged memory for GPUs
Despite industrial investment in both on-die GPUs and next generation interconnects, the
highest performing parallel accelerators shipping today continue to be discrete GPUs …
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors
CK Luk - Proceedings of the 28th annual international …, 2001 - dl.acm.org
Hardly predictable data addresses in many irregular applications have rendered prefetching
ineffective. In many cases, the only accurate way to predict these addresses is to directly …
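As a rough software sketch of the pre-execution idea (not the SMT mechanism described in the paper; pre_execute and sum_list are hypothetical names), a helper thread can run a stripped-down copy of a pointer-chasing loop ahead of the main thread, touching each node purely to warm the cache:

```c
#include <pthread.h>
#include <stddef.h>

struct node { long payload; struct node *next; };

/* Helper thread: a stripped-down copy of the traversal that only
 * chases pointers and issues prefetches, so the main thread's loads
 * are more likely to hit in cache. Purely illustrative. */
static void *pre_execute(void *arg)
{
    for (const struct node *n = arg; n != NULL; n = n->next)
        __builtin_prefetch(n->next, 0 /* read */, 1 /* low locality */);
    return NULL;
}

long sum_list(const struct node *head)
{
    pthread_t helper;
    pthread_create(&helper, NULL, pre_execute, (void *)head);

    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->payload;                 /* main computation */

    pthread_join(helper, NULL);
    return sum;
}
```

The sketch assumes the list is read-only during traversal, so the helper and main threads can walk it concurrently without synchronization.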
Accelerating dependent cache misses with an enhanced memory controller
On-chip contention increases memory access latency for multicore processors. We identify
that this additional latency has a substantial effect on performance for an important class of …
Domino temporal data prefetcher
M Bakhshalipour, P Lotfi-Kamran… - … Symposium on High …, 2018 - ieeexplore.ieee.org
Big-data server applications frequently encounter data misses, and hence, lose significant
performance potential. One way to reduce the number of data misses or their effect is data …
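Temporal prefetchers such as Domino belong to the correlation-based family: they record the sequence of miss addresses and, when a previously seen address recurs, prefetch the address that followed it last time. The toy model below is a deliberate simplification (a small direct-mapped correlation table, not Domino's actual organization) meant only to sketch that idea in C:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy correlation table: for each miss address (hashed into a small
 * direct-mapped table), remember the miss address that followed it.
 * On the next occurrence of that address, prefetch the recorded
 * successor. A simplification for illustration only. */
#define TABLE_SIZE 4096

struct corr_entry { uintptr_t tag; uintptr_t next_addr; };

static struct corr_entry table[TABLE_SIZE];
static uintptr_t last_miss;   /* most recent miss address */

static size_t slot(uintptr_t addr) { return (addr >> 6) % TABLE_SIZE; }

/* Called on every cache miss with the missing address. */
void on_cache_miss(uintptr_t addr)
{
    /* Learn: the previous miss is now known to be followed by addr. */
    struct corr_entry *prev = &table[slot(last_miss)];
    prev->tag = last_miss;
    prev->next_addr = addr;

    /* Predict: if addr was seen before, prefetch its recorded successor. */
    struct corr_entry *cur = &table[slot(addr)];
    if (cur->tag == addr && cur->next_addr != 0)
        __builtin_prefetch((const void *)cur->next_addr, 0, 1);

    last_miss = addr;
}
```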
Continuous runahead: Transparent hardware acceleration for memory intensive workloads
Runahead execution pre-executes the application's own code to generate new cache
misses. This pre-execution results in prefetch requests that are overwhelmingly accurate …
Analysis and optimization of the memory hierarchy for graph processing workloads
Graph processing is an important analysis technique for a wide range of big data
applications. The ability to explicitly represent relationships between entities gives graph …
Dynamic speculative precomputation
JD Collins, DM Tullsen, H Wang… - Proceedings. 34th ACM …, 2001 - ieeexplore.ieee.org
A large number of memory accesses in memory-bound applications are irregular, such as
pointer dereferences, and can be effectively targeted by thread-based prefetching …