Anatomy of high-performance GEMM with online fault tolerance on GPUs

S Wu, Y Zhai, J Liu, J Huang, Z Jian, B Wong… - Proceedings of the 37th …, 2023 - dl.acm.org
General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as
machine learning and scientific computing since an efficient GEMM implementation is …

FT-BLAS: A Fault Tolerant High Performance BLAS Implementation on x86 CPUs

Y Zhai, E Giem, K Zhao, J Liu, J Huang… - … on Parallel and …, 2023 - ieeexplore.ieee.org
Basic Linear Algebra Subprograms (BLAS) serve as a foundational library for scientific
computing and machine learning. In this article, we present a new BLAS implementation, FT …

Kernel fusion in atomistic spin dynamics simulations on Nvidia GPUs using tensor core

H Chen, S Chen, JJ Turner, A Feiguin - Journal of Computational Science, 2024 - Elsevier
In atomistic spin dynamics simulations, the time cost of constructing the space- and
time-displaced pair correlation function in real space increases quadratically as the number of …

[BOOK][B] Architectural-Aware Performance Optimization: From the Foundational Math Library to Cutting-Edge Applications

Y Zhai - 2023 - search.proquest.com
Efficient performance is essential for deploying a system in the real world. This thesis
presents techniques for optimizing performance with an awareness of architecture for …

Machine Learning and High-Performance Computing in Numerical Simulation of Quantum Many-Body Systems

H Chen - 2024 - search.proquest.com
This thesis focuses on the development, application, and high-performance implementation
of numerical methods for simulating quantum many-body systems. In the first chapter, I …