Online algorithm-based fault tolerance for Cholesky decomposition on heterogeneous systems with GPUs
Extensive researches have been done on developing and optimizing algorithm-based fault
tolerance (ABFT) schemes for systolic arrays and general purpose microprocessors …
Transparently resilient task parallelism for Chapel
K Panagiotopoulou, HW Loidl - 2016 IEEE International Parallel …, 2016 - ieeexplore.ieee.org
Hardware failure in High-Performance Computing systems is the norm. Failure data,
collected over a nine year period across 22 large-scale systems of up to few thousands of …