Xkaapi: A runtime system for data-flow task programming on heterogeneous architectures

T Gautier, JVF Lima, N Maillard… - 2013 IEEE 27th …, 2013 - ieeexplore.ieee.org
Most recent HPC platforms have heterogeneous nodes composed of multi-core CPUs and
accelerators, like GPUs. Programming such nodes is typically based on a combination of …

A survey of recent developments in parallel implementations of Gaussian elimination

S Donfack, J Dongarra, M Faverge… - Concurrency and …, 2015 - Wiley Online Library
Gaussian elimination is a canonical linear algebra procedure for solving linear systems of
equations. In the last few years, the algorithm has received a lot of attention in an attempt to …

Tools for power-energy modelling and analysis of parallel scientific applications

P Alonso, RM Badia, J Labarta… - 2012 41st …, 2012 - ieeexplore.ieee.org
Understanding power usage in parallel workloads is crucial to develop the energy-aware
software that will run in future Exascale systems. In this paper, we contribute towards this …

Implementing multifrontal sparse solvers for multicore architectures with sequential task flow runtime systems

E Agullo, A Buttari, A Guermouche… - ACM Transactions on …, 2016 - dl.acm.org
To face the advent of multicore processors and the ever increasing complexity of hardware
architectures, programming models based on DAG parallelism regained popularity in the …

Parallel programming models for dense linear algebra on heterogeneous systems

J Dongarra, M Abalenkovs, A Abdelfattah… - Supercomputing …, 2015 - superfri.susu.ru
We present a review of the current best practices in parallel programming models for dense
linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand …

Porting the PLASMA numerical library to the OpenMP standard

A YarKhan, J Kurzak, P Luszczek… - International Journal of …, 2017 - Springer
PLASMA is a numerical library intended as a successor to LAPACK for solving problems in
dense linear algebra on multicore processors. PLASMA relies on the QUARK scheduler for …

Improving performance of GMRES by reducing communication and pipelining global collectives

I Yamazaki, M Hoemmen, P Luszczek… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
We compare the performance of pipelined and s-step GMRES, respectively referred to as
l-GMRES and s-GMRES, on distributed multicore CPUs. Compared to standard GMRES, s …

libKOMP, an Efficient OpenMP Runtime System for Both Fork-Join and Data Flow Paradigms

F Broquedis, T Gautier, V Danjean - … on OpenMP, IWOMP 2012, Rome, Italy …, 2012 - Springer
To efficiently exploit high performance computing platforms, applications currently have to
express increasingly fine-grained parallelism. The OpenMP standard allows programmers …

Multifrontal QR factorization for multicore architectures over runtime systems

E Agullo, A Buttari, A Guermouche, F Lopez - Euro-Par 2013 Parallel …, 2013 - Springer
To face the advent of multicore processors and the ever increasing complexity of hardware
architectures, programming models based on DAG parallelism regained popularity in the …

A lightweight OpenMP4 run-time for embedded systems

RE Vargas, S Royuela, MA Serrano… - 2016 21st Asia and …, 2016 - ieeexplore.ieee.org
OpenMP is increasingly being adopted by current many-core embedded processors to
exploit their parallel computation capabilities. Unfortunately, current run-time …