A case for core-assisted bottleneck acceleration in GPUs: enabling flexible data compression with assist warps

N Vijaykumar, G Pekhimenko, A Jog… - ACM SIGARCH …, 2015 - dl.acm.org
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent
execution of thousands of threads. Unfortunately, different bottlenecks during execution and …

Inter-core prefetching for multicore processors using migrating helper threads

M Kamruzzaman, S Swanson, DM Tullsen - Proceedings of the sixteenth …, 2011 - dl.acm.org
Multicore processors have become ubiquitous in today's systems, but exploiting the
parallelism they offer remains difficult, especially for legacy applications and applications with …

Morpheus: Extending the last level cache capacity in GPU systems using idle GPU core resources

S Darabi, M Sadrosadati, N Akbarzadeh… - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) are widely-used accelerators for data-parallel
applications. In many GPU applications, GPU memory bandwidth bottlenecks performance …

Towards more efficient execution: A decoupled access-execute approach

K Koukos, D Black-Schaffer, V Spiliopoulos… - Proceedings of the 27th …, 2013 - dl.acm.org
The end of Dennard scaling is expected to shrink the range of DVFS in future nodes, limiting
the energy savings of this technique. This paper evaluates how much we can increase the …

Highly concurrent latency-tolerant register files for GPUs

M Sadrosadati, A Mirhosseini, A Hajiabadi… - ACM Transactions on …, 2021 - dl.acm.org
Graphics Processing Units (GPUs) employ large register files to accommodate all active
threads and accelerate context switching. Unfortunately, register files are a scalability …

RLoad: Reputation-based load-balancing network selection strategy for heterogeneous wireless environments

T Bi, R Trestian, GM Muntean - 2013 21st IEEE International …, 2013 - ieeexplore.ieee.org
In the current telecommunication environment, network operators are trying to cope with a
significant increase in data traffic by adopting different solutions to expand their network …

Accelerating sequential applications on CMPs using core spilling

J Cong, G Han, A Jagannathan… - … on Parallel and …, 2007 - ieeexplore.ieee.org
Chip multiprocessors (CMPs) provide a scalable means of exploiting thread-level
parallelism for multitasking or multithreaded applications. However, single-threaded …

Architectural support for thread communications in multi-core processors

S Varoglu, S Jenks - Parallel Computing, 2011 - Elsevier
In the ongoing quest for greater computational power, efficiently exploiting parallelism is of
paramount importance. Architectural trends have shifted from improving single-threaded …

Future execution: A prefetching mechanism that uses multiple cores to speed up single threads

I Ganusov, M Burtscher - ACM Transactions on Architecture and Code …, 2006 - dl.acm.org
This paper describes future execution (FE), a simple hardware-only technique to accelerate
individual program threads running on multicore microprocessors. Our approach uses …

The case for domain-specialized branch predictors for graph-processing

A Samara, J Tuck - IEEE Computer Architecture Letters, 2020 - ieeexplore.ieee.org
Branch prediction is believed by many to be a solved problem, with state-of-the-art
predictors achieving near-perfect prediction for many programs. In this article, we conduct a …