Improving GPGPU concurrency with elastic kernels
S Pai, MJ Thazhuthaveetil… - ACM SIGARCH Computer …, 2013 - dl.acm.org
Each new generation of GPUs vastly increases the resources available to GPGPU
programs. GPU programming models (like CUDA) were designed to scale to use these …
Habanero-Java: the new adventures of old X10
In this paper, we present the Habanero-Java (HJ) language developed at Rice University as
an extension to the original Java-based definition of the X10 language. HJ includes a …
Software challenges in extreme scale systems
V Sarkar, W Harrod, AE Snavely - Journal of Physics: Conference …, 2009 - iopscience.iop.org
Computer systems anticipated in the 2015–2020 timeframe are referred to as Extreme Scale
because they will be built using massive multi-core processors with 100's of cores per chip …
Exascale software study: Software challenges in extreme scale systems
S Amarasinghe, D Campbell, W Carlson… - DARPA IPTO, Air Force …, 2009 - Citeseer
Extreme Scale processors containing hundreds or even thousands of cores will challenge
current operating system (OS) practices. Many of the fundamental assumptions that underlie …
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
JA Stratton, V Grover, J Marathe, B Aarts… - Proceedings of the 8th …, 2010 - dl.acm.org
In this paper we describe techniques for compiling fine-grained SPMD-threaded programs,
expressed in programming models such as OpenCL or CUDA, to multicore execution …
Intermediate representations for explicitly parallel programs
A Susungi, C Tadonki - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
While compilers generally support parallel programming languages and APIs, their internal
program representations are mostly designed from the sequential programs standpoint …
COX: Exposing CUDA warp-level functions to CPUs
As CUDA becomes the de facto programming language among data parallel applications
such as high-performance computing or machine learning applications, running CUDA on …
Architectural support for task dependence management with flexible software scheduling
The growing complexity of multi-core architectures has motivated a wide range of software
mechanisms to improve the orchestration of parallel executions. Task parallelism has …
Optimizing computation-communication overlap in asynchronous task-based programs
Asynchronous task-based programming models are gaining popularity to address the
programmability and performance challenges in high performance computing. One of the …
Lazy scheduling: A runtime adaptive scheduler for declarative parallelism
Lazy scheduling is a runtime scheduler for task-parallel codes that effectively coarsens
parallelism on load conditions in order to significantly reduce its overheads compared to …