A case for MLP-aware cache replacement

MK Qureshi, DN Lynch, O Mutlu, YN Patt - ACM SIGARCH Computer …, 2006 - dl.acm.org
Performance loss due to long-latency memory accesses can be reduced by servicing
multiple memory accesses concurrently. The notion of generating and servicing long-latency …

Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design

N Talati, K May, A Behroozi, Y Yang… - … Symposium on High …, 2021 - ieeexplore.ieee.org
Irregular workloads are typically bottlenecked by the memory system. These workloads often
use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to …

Continuous runahead: Transparent hardware acceleration for memory intensive workloads

M Hashemi, O Mutlu, YN Patt - 2016 49th Annual IEEE/ACM …, 2016 - ieeexplore.ieee.org
Runahead execution pre-executes the application's own code to generate new cache
misses. This pre-execution results in prefetch requests that are overwhelmingly accurate …

CPU-assisted GPGPU on fused CPU-GPU architectures

Y Yang, P Xiang, M Mantor… - … Symposium on High …, 2012 - ieeexplore.ieee.org
This paper presents a novel approach to utilize the CPU resource to facilitate the execution
of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures …

Taxonomy of data prefetching for multicore processors

S Byna, Y Chen, XH Sun - Journal of Computer Science and Technology, 2009 - Springer
Data prefetching is an effective data access latency hiding technique to mask the CPU stall
caused by cache misses and to bridge the performance gap between processor and …

DeSC: Decoupled supply-compute communication management for heterogeneous architectures

TJ Ham, JL Aragón, M Martonosi - Proceedings of the 48th International …, 2015 - dl.acm.org
Today's computers employ significant heterogeneity to meet performance targets at
manageable power. In adopting increased compute specialization, however, the relative …

Extending multicore architectures to exploit hybrid parallelism in single-thread applications

H Zhong, SA Lieberman… - 2007 IEEE 13th …, 2007 - ieeexplore.ieee.org
Chip multiprocessors with multiple simpler cores are gaining popularity because they have
the potential to drive future performance gains without exacerbating the problems of power …

Multivariate resource performance forecasting in the network weather service

M Swany, R Wolski - SC'02: Proceedings of the 2002 ACM …, 2002 - ieeexplore.ieee.org
This paper describes a new technique in the Network Weather Service for producing
multivariate forecasts. The new technique uses the NWS's univariate forecasters and empirically …

Paceline: Improving single-thread performance in nanoscale CMPs through core overclocking

B Greskamp, J Torrellas - 16th International Conference on …, 2007 - ieeexplore.ieee.org
Under current worst-case design practices, manufacturers specify conservative values for
processor frequencies in order to guarantee correctness. To recover some of the lost …

Efficient runahead execution: Power-efficient memory latency tolerance

O Mutlu, H Kim, YN Patt - IEEE Micro, 2006 - ieeexplore.ieee.org
Today's high-performance processors face main-memory latencies on the order of hundreds
of processor clock cycles. As a result, even the most aggressive processors spend a …