Dual-core execution: Building a highly scalable single-thread instruction window

MK Qureshi, DN Lynch, O Mutlu, YN Patt - ACM SIGARCH Computer …, 2006 - dl.acm.org

Performance loss due to long-latency memory accesses can be reduced by servicing
multiple memory accesses concurrently. The notion of generating and servicing long-latency …

Cited by 427 Related articles All 15 versions

Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design

N Talati, K May, A Behroozi, Y Yang… - … Symposium on High …, 2021 - ieeexplore.ieee.org

Irregular workloads are typically bottlenecked by the memory system. These workloads often
use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to …

Cited by 69 Related articles All 9 versions

Continuous runahead: Transparent hardware acceleration for memory intensive workloads

M Hashemi, O Mutlu, YN Patt - 2016 49th Annual IEEE/ACM …, 2016 - ieeexplore.ieee.org

Runahead execution pre-executes the application's own code to generate new cache
misses. This pre-execution results in prefetch requests that are overwhelmingly accurate …

Cited by 127 Related articles All 14 versions

CPU-assisted GPGPU on fused CPU-GPU architectures

Y Yang, P Xiang, M Mantor… - … Symposium on High …, 2012 - ieeexplore.ieee.org

This paper presents a novel approach to utilize the CPU resource to facilitate the execution
of GPGPU programs on fused CPU-GPU architectures. In our model of fused architectures …

Cited by 134 Related articles All 10 versions

Taxonomy of data prefetching for multicore processors

S Byna, Y Chen, XH Sun - Journal of Computer Science and Technology, 2009 - Springer

Data prefetching is an effective data access latency hiding technique to mask the CPU stall
caused by cache misses and to bridge the performance gap between processor and …

Cited by 51 Related articles All 21 versions

DeSC: Decoupled supply-compute communication management for heterogeneous architectures

TJ Ham, JL Aragón, M Martonosi - Proceedings of the 48th International …, 2015 - dl.acm.org

Today's computers employ significant heterogeneity to meet performance targets at
manageable power. In adopting increased compute specialization, however, the relative …

Cited by 77 Related articles All 15 versions

Extending multicore architectures to exploit hybrid parallelism in single-thread applications

H Zhong, SA Lieberman… - 2007 IEEE 13th …, 2007 - ieeexplore.ieee.org

Chip multiprocessors with multiple simpler cores are gaining popularity because they have
the potential to drive future performance gains without exacerbating the problems of power …

Cited by 143 Related articles All 14 versions

Multivariate resource performance forecasting in the network weather service

M Swany, R Wolski - SC'02: Proceedings of the 2002 ACM …, 2002 - ieeexplore.ieee.org

This paper describes a new technique in the Network Weather Service for producing
multivariate forecasts. The new technique uses the NWS's univariate forecasters and empirically …

Cited by 136 Related articles All 13 versions

Paceline: Improving single-thread performance in nanoscale CMPs through core overclocking

B Greskamp, J Torrellas - 16th International Conference on …, 2007 - ieeexplore.ieee.org

Under current worst-case design practices, manufacturers specify conservative values for
processor frequencies in order to guarantee correctness. To recover some of the lost …

Cited by 121 Related articles All 14 versions

Efficient runahead execution: Power-efficient memory latency tolerance

O Mutlu, H Kim, YN Patt - IEEE Micro, 2006 - ieeexplore.ieee.org

Today's high-performance processors face main-memory latencies on the order of hundreds
of processor clock cycles. As a result, even the most aggressive processors spend a …

Cited by 94 Related articles All 12 versions