A case for MLP-aware cache replacement
MK Qureshi, DN Lynch, O Mutlu, YN Patt - ACM SIGARCH Computer …, 2006 - dl.acm.org
Performance loss due to long-latency memory accesses can be reduced by servicing
multiple memory accesses concurrently. The notion of generating and servicing long-latency …
Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design
Irregular workloads are typically bottlenecked by the memory system. These workloads often
use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to …
Continuous runahead: Transparent hardware acceleration for memory intensive workloads
Runahead execution pre-executes the application's own code to generate new cache
misses. This pre-execution results in prefetch requests that are overwhelmingly accurate …
CPU-assisted GPGPU on fused CPU-GPU architectures
Y Yang, P Xiang, M Mantor… - … Symposium on High …, 2012 - ieeexplore.ieee.org
This paper presents a novel approach to utilize the CPU resource to facilitate the execution
of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures …
Taxonomy of data prefetching for multicore processors
Data prefetching is an effective data access latency hiding technique to mask the CPU stall
caused by cache misses and to bridge the performance gap between processor and …
DeSC: Decoupled supply-compute communication management for heterogeneous architectures
Today's computers employ significant heterogeneity to meet performance targets at
manageable power. In adopting increased compute specialization, however, the relative …
Extending multicore architectures to exploit hybrid parallelism in single-thread applications
H Zhong, SA Lieberman… - 2007 IEEE 13th …, 2007 - ieeexplore.ieee.org
Chip multiprocessors with multiple simpler cores are gaining popularity because they have
the potential to drive future performance gains without exacerbating the problems of power …
Multivariate resource performance forecasting in the network weather service
M Swany, R Wolski - SC'02: Proceedings of the 2002 ACM …, 2002 - ieeexplore.ieee.org
This paper describes a new technique in the Network Weather Service for producing
multivariate forecasts. The new technique uses the NWS's univariate forecasters and empirically …
Paceline: Improving single-thread performance in nanoscale CMPs through core overclocking
B Greskamp, J Torrellas - 16th International Conference on …, 2007 - ieeexplore.ieee.org
Under current worst-case design practices, manufacturers specify conservative values for
processor frequencies in order to guarantee correctness. To recover some of the lost …
Efficient runahead execution: Power-efficient memory latency tolerance
Today's high-performance processors face main-memory latencies on the order of hundreds
of processor clock cycles. As a result, even the most aggressive processors spend a …