A case for malleable thread-level linear algebra libraries: The LU factorization with partial...

X Xuan, Q Liao - … Conference on Image and Graphics (ICIG …, 2007 - ieeexplore.ieee.org

Automated MRI (Magnetic Resonance Imaging) brain tumor segmentation is a difficult task
due to the variance and complexity of tumors. In this paper, a statistical structure analysis …

Cited by 108 · 6 versions

[PDF] osti.gov

LaRIS: targeting portability and productivity for LAPACK codes on extreme heterogeneous systems by using IRIS

MAH Monil, NR Miniskar, FY Liu… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org

In keeping with the trend of heterogeneity in high-performance computing, hardware
manufacturers and vendors are developing new architectures and associated software …

Cited by 12 · 6 versions

Adaptive parallel applications: from shared memory architectures to fog computing (2002–2022)

G Galante, R da Rosa Righi - Cluster Computing, 2022 - Springer

The evolution of parallel architectures points to dynamic environments where the number of
available resources or configurations may vary during the execution of applications. This …

Cited by 8 · 3 versions

[PDF] upc.edu

sLASs: A fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs Library)

P Valero-Lara, S Catalán, X Martorell, T Usui… - Journal of Parallel and …, 2020 - Elsevier

In this work we have implemented a novel Linear Algebra Library on top of the task-based
runtime OmpSs-2. We have used some of the most advanced OmpSs-2 features; weak …

Cited by 17 · 4 versions

[PDF] arxiv.org

Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems

Y Xia, GMJ Barca - 2024 IEEE International Parallel and …, 2024 - ieeexplore.ieee.org

BLAS Level 3 operations are essential for scientific computing, but finding the optimal
number of threads for multi-threaded implementations on modern multi-core systems is …

Cited by 1 · 4 versions

[PDF] sagepub.com

Experiences with nested parallelism in task-parallel applications using malleable BLAS on multicore processors

R Rodríguez-Sánchez, A Castelló… - … Journal of High …, 2024 - journals.sagepub.com

Malleability is defined as the ability to vary the degree of parallelism at runtime, and is
regarded as a means to improve core occupation on state-of-the-art multicore processors …

Cited by 1 · 3 versions

A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication

Y Xia, M De La Pierre, AS Barnard… - 2023 IEEE …, 2023 - ieeexplore.ieee.org

The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific
computing. Single-thread GEMM implementations are well-optimised with techniques like …

Cited by 3 · 2 versions

[PDF] unlp.edu.ar

Towards a malleable TensorFlow implementation

LA Libutti, FD Igual, L Piñuel, L De Giusti… - Conference on Cloud …, 2020 - Springer

The TensorFlow framework was designed since its inception to provide multi-thread
capabilities, extended with hardware accelerator support to leverage the potential of modern …

Cited by 6 · 6 versions

Static versus dynamic task scheduling of the LU factorization on ARM big.LITTLE architectures

S Catalán, R Rodríguez-Sánchez… - 2017 IEEE …, 2017 - ieeexplore.ieee.org

We investigate several parallel algorithmic variants of the LU factorization with partial
pivoting (LUpp) that trade off the exploitation of increasing levels of task-parallelism in …

Cited by 9 · 3 versions

[PDF] arxiv.org

Programming parallel dense matrix factorizations with look-ahead and OpenMP

S Catalán, A Castelló, FD Igual… - Cluster …, 2020 - Springer

We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms,
using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts …

Cited by 7 · 12 versions