Chunking parallel loops in the presence of synchronization

S Pai, MJ Thazhuthaveetil… - ACM SIGARCH Computer …, 2013 - dl.acm.org

Each new generation of GPUs vastly increases the resources available to GPGPU
programs. GPU programming models (like CUDA) were designed to scale to use these …

Cited by 291 · Related articles · All 10 versions

[PDF] academia.edu

Habanero-Java: the new adventures of old X10

V Cavé, J Zhao, J Shirako, V Sarkar - Proceedings of the 9th …, 2011 - dl.acm.org

In this paper, we present the Habanero-Java (HJ) language developed at Rice University as
an extension to the original Java-based definition of the X10 language. HJ includes a …

Cited by 321 · Related articles · All 8 versions

[PDF] iop.org

Software challenges in extreme scale systems

V Sarkar, W Harrod, AE Snavely - Journal of Physics: Conference …, 2009 - iopscience.iop.org

Computer systems anticipated in the 2015–2020 timeframe are referred to as Extreme Scale
because they will be built using massive multi-core processors with 100's of cores per chip …

Cited by 198 · Related articles · All 11 versions

[PDF] psu.edu

[PDF] Exascale software study: Software challenges in extreme scale systems

S Amarasinghe, D Campbell, W Carlson… - DARPA IPTO, Air Force …, 2009 - Citeseer

Extreme Scale processors containing hundreds or even thousands of cores will challenge
current operating system (OS) practices. Many of the fundamental assumptions that underlie …

Cited by 148 · Related articles · All 3 versions

[PDF] illinois.edu

Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

JA Stratton, V Grover, J Marathe, B Aarts… - Proceedings of the 8th …, 2010 - dl.acm.org

In this paper we describe techniques for compiling fine-grained SPMD-threaded programs,
expressed in programming models such as OpenCL or CUDA, to multicore execution …

Cited by 122 · Related articles · All 9 versions

Intermediate representations for explicitly parallel programs

A Susungi, C Tadonki - ACM Computing Surveys (CSUR), 2021 - dl.acm.org

While compilers generally support parallel programming languages and APIs, their internal
program representations are mostly designed from the sequential programs standpoint …

Cited by 8 · Related articles · All 3 versions

[PDF] acm.org

COX: Exposing CUDA warp-level functions to CPUs

R Han, J Lee, J Sim, H Kim - ACM Transactions on Architecture and …, 2022 - dl.acm.org

As CUDA becomes the de facto programming language among data parallel applications
such as high-performance computing or machine learning applications, running CUDA on …

Cited by 9 · Related articles · All 2 versions

[PDF] upc.edu

Architectural support for task dependence management with flexible software scheduling

E Castillo, L Alvarez, M Moreto, M Casas… - … Symposium on High …, 2018 - ieeexplore.ieee.org

The growing complexity of multi-core architectures has motivated a wide range of software
mechanisms to improve the orchestration of parallel executions. Task parallelism has …

Cited by 34 · Related articles · All 5 versions

[PDF] osti.gov

Optimizing computation-communication overlap in asynchronous task-based programs

E Castillo, N Jain, M Casas, M Moreto… - Proceedings of the …, 2019 - dl.acm.org

Asynchronous task-based programming models are gaining popularity to address the
programmability and performance challenges in high performance computing. One of the …

Cited by 24 · Related articles · All 7 versions

[PDF] acm.org

Lazy scheduling: A runtime adaptive scheduler for declarative parallelism

A Tzannes, GC Caragea, U Vishkin… - ACM Transactions on …, 2014 - dl.acm.org

Lazy scheduling is a runtime scheduler for task-parallel codes that effectively coarsens
parallelism on load conditions in order to significantly reduce its overheads compared to …

Cited by 40 · Related articles · All 7 versions