P-cloth: interactive complex cloth simulation on multi-GPU systems using dynamic matrix assembly and pipelined implicit integrators

C Li, M Tang, R Tong, M Cai, J Zhao… - ACM Transactions on …, 2020 - dl.acm.org
We present a novel parallel algorithm for cloth simulation that exploits multiple GPUs for fast
computation and the handling of very high resolution meshes. To accelerate implicit …

Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

M Sourouri, SB Baden, X Cai - International Journal of Parallel …, 2017 - Springer
We present a new compiler framework for truly heterogeneous 3D stencil computation on
GPU clusters. Our framework consists of a simple directive-based programming model and a …

Towards fine-grained dynamic tuning of HPC applications on modern multi-core architectures

M Sourouri, EB Raknes, N Reissmann… - Proceedings of the …, 2017 - dl.acm.org
There is a consensus that exascale systems should operate within a power envelope of
20MW. Consequently, energy conservation is still considered as the most crucial constraint if …

CUDAMicroBench: Microbenchmarks to Assist CUDA Performance Programming

X Yi, D Stokes, Y Yan, C Liao - 2021 IEEE International Parallel …, 2021 - ieeexplore.ieee.org
Programming to achieve high performance for NVIDIA GPUs using CUDA has been known
to be challenging. A GPU has hundreds or thousands of cores that a program must exhibit …

A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture

X Yi - arXiv preprint arXiv:2409.10661, 2024 - arxiv.org
Parallel computing is a standard approach to achieving high-performance computing (HPC).
Three commonly used methods to implement parallel computing include: 1) applying …

Heterogeneous CPU-GPU execution of stencil applications

B Siklosi, IZ Reguly… - 2018 IEEE/ACM …, 2018 - ieeexplore.ieee.org
Heterogeneous computer architectures are now ubiquitous in high performance computing;
the top 7 supercomputers are all built with CPUs and accelerators. Portability across …

An efficient GPU implementation and scaling for higher-order 3D stencils

O Anjum, M Almasri, SG de Gonzalo, W Hwu - Information Sciences, 2022 - Elsevier
Stencil computation patterns are the backbone of many scientific and engineering
simulations. The stencil computation is known to be constrained by its high demand of …

Node-aware stencil communication for heterogeneous supercomputers

C Pearson, M Hidayetoğlu, M Almasri… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
High-performance distributed computing systems increasingly feature nodes that have
multiple CPU sockets and multiple GPUs. The communication bandwidth between these …

On Generating Out-Of-Core GPU Code for Multi-Dimensional Array Operations

P van Beurden, SB Scholz - Proceedings of the 34th Symposium on …, 2022 - dl.acm.org
This paper presents the first results of our experiments for generating CUDA code that
streams array operations over the elements of its array arguments from high-level …

EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs

M de Castro, I Santamaria-Valenzuela, Y Torres… - The Journal of …, 2023 - Springer
Iterative stencil computations are widely used in numerical simulations. They present a high
degree of parallelism, high locality and mostly-coalesced memory access patterns …