Improving GPGPU concurrency with elastic kernels

S Pai, MJ Thazhuthaveetil… - ACM SIGARCH Computer …, 2013 - dl.acm.org
Each new generation of GPUs vastly increases the resources available to GPGPU
programs. GPU programming models (like CUDA) were designed to scale to use these …

Habanero-Java: the new adventures of old X10

V Cavé, J Zhao, J Shirako, V Sarkar - Proceedings of the 9th …, 2011 - dl.acm.org
In this paper, we present the Habanero-Java (HJ) language developed at Rice University as
an extension to the original Java-based definition of the X10 language. HJ includes a …

Software challenges in extreme scale systems

V Sarkar, W Harrod, AE Snavely - Journal of Physics: Conference …, 2009 - iopscience.iop.org
Computer systems anticipated in the 2015–2020 timeframe are referred to as Extreme Scale
because they will be built using massive multi-core processors with 100's of cores per chip …

Exascale software study: Software challenges in extreme scale systems

S Amarasinghe, D Campbell, W Carlson… - DARPA IPTO, Air Force …, 2009 - Citeseer
Extreme Scale processors containing hundreds or even thousands of cores will challenge
current operating system (OS) practices. Many of the fundamental assumptions that underlie …

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

JA Stratton, V Grover, J Marathe, B Aarts… - Proceedings of the 8th …, 2010 - dl.acm.org
In this paper we describe techniques for compiling fine-grained SPMD-threaded programs,
expressed in programming models such as OpenCL or CUDA, to multicore execution …

Intermediate representations for explicitly parallel programs

A Susungi, C Tadonki - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
While compilers generally support parallel programming languages and APIs, their internal
program representations are mostly designed from the sequential programs standpoint …

COX: Exposing CUDA warp-level functions to CPUs

R Han, J Lee, J Sim, H Kim - ACM Transactions on Architecture and …, 2022 - dl.acm.org
As CUDA becomes the de facto programming language among data parallel applications
such as high-performance computing or machine learning applications, running CUDA on …

Architectural support for task dependence management with flexible software scheduling

E Castillo, L Alvarez, M Moreto, M Casas… - … Symposium on High …, 2018 - ieeexplore.ieee.org
The growing complexity of multi-core architectures has motivated a wide range of software
mechanisms to improve the orchestration of parallel executions. Task parallelism has …

Optimizing computation-communication overlap in asynchronous task-based programs

E Castillo, N Jain, M Casas, M Moreto… - Proceedings of the …, 2019 - dl.acm.org
Asynchronous task-based programming models are gaining popularity to address the
programmability and performance challenges in high performance computing. One of the …

Lazy scheduling: A runtime adaptive scheduler for declarative parallelism

A Tzannes, GC Caragea, U Vishkin… - ACM Transactions on …, 2014 - dl.acm.org
Lazy scheduling is a runtime scheduler for task-parallel codes that effectively coarsens
parallelism on load conditions in order to significantly reduce its overheads compared to …