Ten lessons we have learned in the new "sparseland": A short handbook for sparse neural network researchers

S Liu, Z Wang - arXiv preprint arXiv:2302.02596, 2023 - arxiv.org
This article does not propose any novel algorithm or new hardware for sparsity. Instead, it
aims to serve the "common good" for the increasingly prosperous Sparse Neural Network …

LaRIS: targeting portability and productivity for LAPACK codes on extreme heterogeneous systems by using IRIS

MAH Monil, NR Miniskar, FY Liu… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org
In keeping with the trend of heterogeneity in high-performance computing, hardware
manufacturers and vendors are developing new architectures and associated software …

cuThomasBatch and cuThomasVBatch, CUDA routines to compute batch of tridiagonal systems on NVIDIA GPUs

P Valero‐Lara, I Martínez‐Pérez… - Concurrency and …, 2018 - Wiley Online Library
The solving of tridiagonal systems is one of the most computationally expensive parts in
many applications, so that multiple studies have explored the use of NVIDIA GPUs to …

MatRIS: multi-level math library abstraction for heterogeneity and performance portability using IRIS runtime

MAH Monil, NR Miniskar, K Teranishi… - Proceedings of the SC' …, 2023 - dl.acm.org
Vendor libraries are tuned for a specific architecture and are not portable to others.
Moreover, they lack support for heterogeneity and multi-device orchestration, which is …

Quantifying Overheads in Charm++ and HPX Using Task Bench

N Wu, I Gonidelis, S Liu, Z Fink, N Gupta… - … Conference on Parallel …, 2022 - Springer
Asynchronous Many-Task (AMT) runtime systems take advantage of multi-core
architectures with light-weight threads, asynchronous executions, and smart scheduling. In …

sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)

P Valero-Lara, S Catalán, X Martorell, T Usui… - Journal of Parallel and …, 2020 - Elsevier
In this work we have implemented a novel Linear Algebra Library on top of the task-based
runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features; weak …

MemHC: an optimized GPU memory management framework for accelerating many-body correlation

Q Wang, Z Peng, B Ren, J Chen… - ACM Transactions on …, 2022 - dl.acm.org
The many-body correlation function is a fundamental computation kernel in modern physics
computing applications, e.g., Hadron Contractions in Lattice quantum chromodynamics …

Variable batched DGEMM

P Valero-Lara, I Martínez-Pérez… - 2018 26th Euromicro …, 2018 - ieeexplore.ieee.org
Many scientific applications are in need to solve a high number of small-size independent
problems. These individual problems do not provide enough parallelism and then, these …

A fast solver for large tridiagonal systems on multi-core processors (LASs Library)

P Valero-Lara, D Andrade, R Sirvent, J Labarta… - IEEE …, 2019 - ieeexplore.ieee.org
Many problems of industrial and scientific interest require the solving of tridiagonal linear
systems. This paper presents several implementations for the parallel solving of large …

MPI+OpenMP tasking scalability for multi-morphology simulations of the human brain

P Valero-Lara, R Sirvent, AJ Peña, J Labarta - Parallel Computing, 2019 - Elsevier
The simulation of the behavior of the human brain is one of the most ambitious challenges
today with a non-end of important applications. We can find many different initiatives in the …