R3-DLA (reduce, reuse, recycle): A more efficient approach to decoupled look-ahead architectures

S Pruett, Y Patt - MICRO-54: 54th Annual IEEE/ACM International …, 2021 - dl.acm.org

High performance microprocessors require high levels of instruction supply. Branch
prediction has been the most important driver of this for nearly 30 years. Unfortunately …

Precise runahead execution

A Naithani, J Feliu, A Adileh… - 2020 IEEE International …, 2020 - ieeexplore.ieee.org

Runahead execution improves processor performance by accurately prefetching
long-latency memory accesses. When a long-latency load causes the instruction window to fill up …

GraphAttack: Optimizing data supply for graph applications on in-order multicore architectures

A Manocha, T Sorensen, E Tureci, O Matthews… - ACM Transactions on …, 2021 - dl.acm.org

Graph structures are a natural representation of important and pervasive data. While graph
applications have significant parallelism, their characteristic pointer indirect loads to …

A prefetch control strategy based on improved hill-climbing method in asymmetric multi-core architecture

J Fang, Y Xu, H Kong, M Cai - The Journal of Supercomputing, 2023 - Springer

Cache prefetching is a traditional way to reduce memory access latency. In multi-core
systems, aggressive prefetching may harm the system. In the past, prefetching throttling …

Slipstream processors revisited: Exploiting branch sets

V Srinivasan, RBR Chowdhury… - 2020 ACM/IEEE 47th …, 2020 - ieeexplore.ieee.org

Delinquent branches and loads remain key performance limiters in some applications. One
approach to mitigate them is pre-execution. Broadly, there are two classes of pre-execution …

Timely, Efficient, and Accurate Branch Precomputation

A Deshmukh, LC Cai, YN Patt - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org

Out-of-order cores rely on high-accuracy branch predictors to supply useful instructions to
the processor backend. However, there remains a large fraction of mispredictions caused by …

Bootstrapping: Using SMT hardware to improve single-thread performance

S Kondguli, M Huang - Proceedings of the Twenty-Fourth International …, 2019 - dl.acm.org

Single-thread performance improvement remains a central design goal for general purpose
processors. Microarchitectural designs for the core have reached a plateau over the past …

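[PDF] Cache prefetching algorithm based on hybrid pattern learning of instruction streams

Wang Yuqing, Yang Qiusong, Li Mingshu - Acta Electronica Sinica, 2023 - ejournal.org.cn

A recent research focus in cache prefetching algorithms is the use of prediction techniques
based on pattern recognition, such as Lookahead, to infer the addresses of memory access
requests. On the one hand, such algorithms have difficulty learning the dependent cache misses in memory access behavior …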

[BOOK] Optimizing Data Supply and Memory Management for Graph Applications in Post-Moore Hardware-Software Systems

A Manocha - 2023 - search.proquest.com

Graph structures naturally and efficiently capture relationships between entities, such as
individuals in a social network, pages in the World Wide Web, and amino acids in protein …

[BOOK] Slipstream Processors Revisited: Exploiting Branch Sets

V Srinivasan - 2019 - search.proquest.com

Delinquent branches (frequently mispredict) and loads (frequently miss) remain key IPC
bottlenecks in some applications. One approach to reduce their effect is pre-execution via …
