Xkaapi: A runtime system for data-flow task programming on heterogeneous architectures
Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and
accelerators, like GPUs. Programming such nodes is typically based on a combination of …
A survey of recent developments in parallel implementations of Gaussian elimination
Gaussian elimination is a canonical linear algebra procedure for solving linear systems of
equations. In the last few years, the algorithm has received a lot of attention in an attempt to …
Tools for power-energy modelling and analysis of parallel scientific applications
Understanding power usage in parallel workloads is crucial to develop the energy-aware
software that will run in future Exascale systems. In this paper, we contribute towards this …
Implementing multifrontal sparse solvers for multicore architectures with sequential task flow runtime systems
To face the advent of multicore processors and the ever increasing complexity of hardware
architectures, programming models based on DAG parallelism regained popularity in the …
Parallel programming models for dense linear algebra on heterogeneous systems
We present a review of the current best practices in parallel programming models for dense
linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand …
Porting the PLASMA numerical library to the OpenMP standard
PLASMA is a numerical library intended as a successor to LAPACK for solving problems in
dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for …
Improving performance of GMRES by reducing communication and pipelining global collectives
We compare the performance of pipelined and s-step GMRES, respectively referred to as
l-GMRES and s-GMRES, on distributed multicore CPUs. Compared to standard GMRES, s …
libKOMP, an Efficient OpenMP Runtime System for Both Fork-Join and Data Flow Paradigms
F Broquedis, T Gautier, V Danjean - … on OpenMP, IWOMP 2012, Rome, Italy …, 2012 - Springer
To efficiently exploit high performance computing platforms, applications currently have to
express more and more finer-grain parallelism. The OpenMP standard allows programmers …
Multifrontal QR factorization for multicore architectures over runtime systems
To face the advent of multicore processors and the ever increasing complexity of hardware
architectures, programming models based on DAG parallelism regained popularity in the …
A lightweight OpenMP4 run-time for embedded systems
RE Vargas, S Royuela, MA Serrano… - 2016 21st Asia and …, 2016 - ieeexplore.ieee.org
OpenMP is increasingly being adopted by current many-core embedded processors to
exploit their parallel computation capabilities. Unfortunately, current run-time …