A survey of recent prefetching techniques for processor caches
S Mittal - ACM Computing Surveys (CSUR), 2016 - dl.acm.org
As the trends of process scaling make memory systems an even more crucial bottleneck, the
importance of latency hiding techniques such as prefetching grows further. However, naively …
DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks
Data movement between the CPU and main memory is a first-order obstacle against improving performance, scalability, and energy efficiency in modern systems. Computer systems …
Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design
Irregular workloads are typically bottlenecked by the memory system. These workloads often
use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to …
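To make the data-indirect pattern concrete, here is a minimal C sketch of a CSR sparse matrix-vector multiply (identifiers such as spmv_csr, row_ptr, and col_idx are illustrative, not taken from the Prodigy paper): the address of each x[col_idx[j]] load depends on a value that must itself be loaded first, which is why such accesses defeat stride-based prefetchers.

```c
#include <stddef.h>

/* Illustrative CSR sparse matrix-vector multiply (y = A*x).
 * The load x[col_idx[j]] is "data-indirect": its address depends on
 * the value col_idx[j], which must itself be loaded first, so simple
 * stride prefetchers cannot predict it. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr,   /* n_rows + 1 entries */
              const size_t *col_idx,   /* one entry per nonzero */
              const double *vals,      /* one entry per nonzero */
              const double *x,
              double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double acc = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            acc += vals[j] * x[col_idx[j]];   /* indirect access */
        y[i] = acc;
    }
}
```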
Towards high performance paged memory for GPUs
Despite industrial investment in both on-die GPUs and next generation interconnects, the
highest performing parallel accelerators shipping today continue to be discrete GPUs …
Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors
CK Luk - Proceedings of the 28th annual international …, 2001 - dl.acm.org
Hardly predictable data addresses in many irregular applications have rendered prefetching
ineffective. In many cases, the only accurate way to predict these addresses is to directly …
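As a rough software sketch of the pre-execution idea (not the SMT mechanism described in the paper; pre_execute and sum_list are hypothetical names), a helper thread can run a stripped-down copy of a pointer-chasing loop ahead of the main thread, touching each node purely to warm the cache:

```c
#include <pthread.h>
#include <stddef.h>

struct node { long payload; struct node *next; };

/* Helper thread: a stripped-down copy of the traversal that only
 * chases pointers and issues prefetches, so the main thread's loads
 * are more likely to hit in cache. Purely illustrative. */
static void *pre_execute(void *arg)
{
    for (const struct node *n = arg; n != NULL; n = n->next)
        __builtin_prefetch(n->next, 0 /* read */, 1 /* low locality */);
    return NULL;
}

long sum_list(const struct node *head)
{
    pthread_t helper;
    pthread_create(&helper, NULL, pre_execute, (void *)head);

    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->payload;                 /* main computation */

    pthread_join(helper, NULL);
    return sum;
}
```

The sketch assumes the list is read-only during traversal, so the helper and main threads can walk it concurrently without synchronization.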
Accelerating dependent cache misses with an enhanced memory controller
On-chip contention increases memory access latency for multicore processors. We identify
that this additional latency has a substantial effect on performance for an important class of …
Domino temporal data prefetcher
M Bakhshalipour, P Lotfi-Kamran… - … Symposium on High …, 2018 - ieeexplore.ieee.org
Big-data server applications frequently encounter data misses, and hence, lose significant
performance potential. One way to reduce the number of data misses or their effect is data …
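Temporal prefetchers such as Domino belong to the correlation-based family: they record the sequence of miss addresses and, when a previously seen address recurs, prefetch the address that followed it last time. The toy model below is a deliberate simplification (a small direct-mapped correlation table, not Domino's actual organization) meant only to sketch that idea in C:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy correlation table: for each miss address (hashed into a small
 * direct-mapped table), remember the miss address that followed it.
 * On the next occurrence of that address, prefetch the recorded
 * successor. A simplification for illustration only. */
#define TABLE_SIZE 4096

struct corr_entry { uintptr_t tag; uintptr_t next_addr; };

static struct corr_entry table[TABLE_SIZE];
static uintptr_t last_miss;   /* most recent miss address */

static size_t slot(uintptr_t addr) { return (addr >> 6) % TABLE_SIZE; }

/* Called on every cache miss with the missing address. */
void on_cache_miss(uintptr_t addr)
{
    /* Learn: the previous miss is now known to be followed by addr. */
    struct corr_entry *prev = &table[slot(last_miss)];
    prev->tag = last_miss;
    prev->next_addr = addr;

    /* Predict: if addr was seen before, prefetch its recorded successor. */
    struct corr_entry *cur = &table[slot(addr)];
    if (cur->tag == addr && cur->next_addr != 0)
        __builtin_prefetch((const void *)cur->next_addr, 0, 1);

    last_miss = addr;
}
```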
Continuous runahead: Transparent hardware acceleration for memory intensive workloads
Runahead execution pre-executes the application's own code to generate new cache
misses. This pre-execution results in prefetch requests that are overwhelmingly accurate …
Analysis and optimization of the memory hierarchy for graph processing workloads
Graph processing is an important analysis technique for a wide range of big data
applications. The ability to explicitly represent relationships between entities gives graph …
Dynamic speculative precomputation
JD Collins, DM Tullsen, H Wang… - Proceedings. 34th ACM …, 2001 - ieeexplore.ieee.org
A large number of memory accesses in memory-bound applications are irregular, such as
pointer dereferences, and can be effectively targeted by thread-based prefetching …