NVIDIA GPUs scalability to solve multiple (batch) tridiagonal systems implementation of cuthomasb...

Ten lessons we have learned in the new "sparseland": A short handbook for sparse neural network researchers

S Liu, Z Wang - arXiv preprint arXiv:2302.02596, 2023 - arxiv.org

This article does not propose any novel algorithm or new hardware for sparsity. Instead, it
aims to serve the "common good" for the increasingly prosperous Sparse Neural Network …

Cited by 20 · Related articles · All 2 versions

LaRIS: targeting portability and productivity for lapack codes on extreme heterogeneous systems by using iris

MAH Monil, NR Miniskar, FY Liu… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org

In keeping with the trend of heterogeneity in high-performance computing, hardware
manufacturers and vendors are developing new architectures and associated software …

Cited by 12 · Related articles · All 6 versions

cuThomasBatch and cuThomasVBatch, CUDA routines to compute batch of tridiagonal systems on NVIDIA GPUs

P Valero‐Lara, I Martínez‐Pérez… - Concurrency and …, 2018 - Wiley Online Library

The solving of tridiagonal systems is one of the most computationally expensive parts in
many applications, so that multiple studies have explored the use of NVIDIA GPUs to …

Cited by 33 · Related articles · All 4 versions

MatRIS: multi-level math library abstraction for heterogeneity and performance portability using IRIS runtime

MAH Monil, NR Miniskar, K Teranishi… - Proceedings of the SC' …, 2023 - dl.acm.org

Vendor libraries are tuned for a specific architecture and are not portable to others.
Moreover, they lack support for heterogeneity and multi-device orchestration, which is …

Cited by 8 · Related articles

Quantifying Overheads in Charm++ and HPX Using Task Bench

N Wu, I Gonidelis, S Liu, Z Fink, N Gupta… - … Conference on Parallel …, 2022 - Springer

Asynchronous Many-Task (AMT) runtime systems take advantage of multi-core
architectures with light-weight threads, asynchronous executions, and smart scheduling. In …

Cited by 7 · Related articles · All 7 versions

sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)

P Valero-Lara, S Catalán, X Martorell, T Usui… - Journal of Parallel and …, 2020 - Elsevier

In this work we have implemented a novel Linear Algebra Library on top of the task-based
runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features; weak …

Cited by 17 · Related articles · All 4 versions

MemHC: an optimized GPU memory management framework for accelerating many-body correlation

Q Wang, Z Peng, B Ren, J Chen… - ACM Transactions on …, 2022 - dl.acm.org

The many-body correlation function is a fundamental computation kernel in modern physics
computing applications, e.g., Hadron Contractions in Lattice quantum chromodynamics …

Cited by 6 · Related articles · All 5 versions

Variable batched DGEMM

P Valero-Lara, I Martínez-Pérez… - 2018 26th Euromicro …, 2018 - ieeexplore.ieee.org

Many scientific applications are in need to solve a high number of small-size independent
problems. These individual problems do not provide enough parallelism and then, these …

Cited by 21 · Related articles · All 5 versions

A fast solver for large tridiagonal systems on multi-core processors (lass library)

P Valero-Lara, D Andrade, R Sirvent, J Labarta… - IEEE …, 2019 - ieeexplore.ieee.org

Many problems of industrial and scientific interest require the solving of tridiagonal linear
systems. This paper presents several implementations for the parallel solving of large …

Cited by 15 · Related articles · All 5 versions

MPI+OpenMP tasking scalability for multi-morphology simulations of the human brain

P Valero-Lara, R Sirvent, AJ Peña, J Labarta - Parallel Computing, 2019 - Elsevier

The simulation of the behavior of the human brain is one of the most ambitious challenges
today with a non-end of important applications. We can find many different initiatives in the …

Cited by 14 · Related articles · All 7 versions