P-cloth: interactive complex cloth simulation on multi-GPU systems using dynamic matrix assembly and pipelined implicit integrators
We present a novel parallel algorithm for cloth simulation that exploits multiple GPUs for fast
computation and the handling of very high resolution meshes. To accelerate implicit …
Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers
We present a new compiler framework for truly heterogeneous 3D stencil computation on
GPU clusters. Our framework consists of a simple directive-based programming model and a …
Towards fine-grained dynamic tuning of HPC applications on modern multi-core architectures
There is a consensus that exascale systems should operate within a power envelope of
20MW. Consequently, energy conservation is still considered as the most crucial constraint if …
CUDAMicroBench: Microbenchmarks to Assist CUDA Performance Programming
Programming to achieve high performance for NVIDIA GPUs using CUDA has been known
to be challenging. A GPU has hundreds or thousands of cores that a program must exhibit …
A Study of Performance Programming of CPU, GPU accelerated Computers and SIMD Architecture
X Yi - arXiv preprint arXiv:2409.10661, 2024 - arxiv.org
Parallel computing is a standard approach to achieving high-performance computing (HPC).
Three commonly used methods to implement parallel computing include: 1) applying …
Heterogeneous CPU-GPU execution of stencil applications
B Siklosi, IZ Reguly… - 2018 IEEE/ACM …, 2018 - ieeexplore.ieee.org
Heterogeneous computer architectures are now ubiquitous in high performance computing;
the top 7 supercomputers are all built with CPUs and accelerators. Portability across …
An efficient GPU implementation and scaling for higher-order 3D stencils
Stencil computation patterns are the backbone of many scientific and engineering
simulations. The stencil computation is known to be constrained by its high demand of …
Node-aware stencil communication for heterogeneous supercomputers
High-performance distributed computing systems increasingly feature nodes that have
multiple CPU sockets and multiple GPUs. The communication bandwidth between these …
On Generating Out-Of-Core GPU Code for Multi-Dimensional Array Operations
P van Beurden, SB Scholz - Proceedings of the 34th Symposium on …, 2022 - dl.acm.org
This paper presents the first results of our experiments for generating CUDA code that
streams array operations over the elements of its array arguments from high-level …
EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs
Iterative stencil computations are widely used in numerical simulations. They present a high
degree of parallelism, high locality and mostly-coalesced memory access patterns …