Online algorithm-based fault tolerance for Cholesky decomposition on heterogeneous systems with GPUs

J Chen, X Liang, Z Chen - 2016 IEEE International Parallel and …, 2016 - ieeexplore.ieee.org
Extensive research has been done on developing and optimizing algorithm-based fault
tolerance (ABFT) schemes for systolic arrays and general purpose microprocessors …

Transparently resilient task parallelism for Chapel

K Panagiotopoulou, HW Loidl - 2016 IEEE International Parallel …, 2016 - ieeexplore.ieee.org
Hardware failure in High-Performance Computing systems is the norm. Failure data,
collected over a nine-year period across 22 large-scale systems of up to few thousands of …