Understanding data movement in tightly coupled heterogeneous systems: A case study with the Grace Hopper superchip
Heterogeneous supercomputers have become the standard in HPC. GPUs in particular
have dominated the accelerator landscape, offering unprecedented performance in parallel …
Revisiting Temporal Blocking Stencil Optimizations
Iterative stencils are used widely across the spectrum of High Performance Computing
(HPC) applications. Many efforts have been put into optimizing stencil GPU kernels, given …
Retargeting and Respecializing GPU Workloads for Performance Portability
In order to come close to peak performance, accelerators like GPUs require significant
architecture-specific tuning that understands the availability of shared memory, parallelism …
ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores
The Tensor Core Unit (TCU) is increasingly integrated into modern high-performance processors
to enhance matrix multiplication performance. However, constrained to its over-specification …
GPU-Centric Communication Schemes: When CPUs Take a Back Seat
I Ismayilov - 2023 - parcorelab.ku.edu.tr
In recent years, GPUs have become the leading accelerator in modern high-performance
systems such that much of HPC computational capability has concentrated in clusters of …
Accelerating an overhead-sensitive atmospheric model on GPUs using asynchronous execution and kernel fusion
K Yamazaki - conferences.computer.org
Methods to mitigate the kernel launch overhead, one of the drawbacks of GPUs, were
implemented in an overhead-sensitive atmospheric model using OpenACC and CUDA and …