Anatomy of high-performance GEMM with online fault tolerance on GPUs
General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as
machine learning and scientific computing since an efficient GEMM implementation is …
FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs
Basic Linear Algebra Subprograms (BLAS) serve as a foundational library for scientific
computing and machine learning. In this article, we present a new BLAS implementation, FT …
Kernel fusion in atomistic spin dynamics simulations on Nvidia GPUs using tensor core
In atomistic spin dynamics simulations, the time cost of constructing the space- and
time-displaced pair correlation function in real space increases quadratically as the number of …
[BOOK][B] Architectural-Aware Performance Optimization: From the Foundational Math Library to Cutting-Edge Applications
Y Zhai - 2023 - search.proquest.com
Efficient performance is essential for deploying a system in the real world. This thesis
presents techniques for optimizing performance with an awareness of architecture for …
Machine Learning and High-Performance Computing in Numerical Simulation of Quantum Many-Body Systems
H Chen - 2024 - search.proquest.com
This thesis focuses on the development, application, and high-performance implementation
of numerical methods for simulating quantum many-body systems. In the first chapter, I …