APOGEE: Adaptive prefetching on GPUs for energy efficiency

A Sethia, G Dasika, M Samadi… - Proceedings of the 22nd …, 2013 - ieeexplore.ieee.org
Modern graphics processing units (GPUs) combine large amounts of parallel hardware with
fast context switching among thousands of active threads to achieve high performance …

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

A method to represent multiple-output switching functions by using multi-valued decision diagrams

T Sasao, JT Butler - … of 26th IEEE International Symposium on …, 1996 - ieeexplore.ieee.org
Multiple-output switching functions can be simulated by multiple-valued decision diagrams
(MDDs) at a significant reduction in computation time. We analyze the following approaches to …

Partial redundancy in HPC systems with non-uniform node reliabilities

Z Hussain, T Znati, R Melhem - SC18: International Conference …, 2018 - ieeexplore.ieee.org
We study the usefulness of partial redundancy in HPC message passing systems where
individual node failure distributions are not identical. Prior research works on fault tolerance …

Replication is more efficient than you think

A Benoit, T Herault, VL Fèvre, Y Robert - Proceedings of the International …, 2019 - dl.acm.org
This paper revisits replication coupled with checkpointing for fail-stop errors. Replication
enables the application to survive many fail-stop errors, thereby allowing for longer …

Fault tolerance on large scale systems using adaptive process replication

C George, S Vadhiyar - IEEE Transactions on Computers, 2014 - ieeexplore.ieee.org
Exascale systems of the future are predicted to have mean time between failures (MTBF) of
less than one hour. At such low MTBFs, employing periodic checkpointing alone will result in …

RedThreads: An interface for application-level fault detection/correction through adaptive redundant multithreading

S Hukerikar, K Teranishi, PC Diniz… - International Journal of …, 2018 - Springer
In the presence of accelerated fault rates, which are projected to be the norm on future
exascale systems, it will become increasingly difficult for high-performance computing (HPC) …

Opportunistic application-level fault detection through adaptive redundant multithreading

S Hukerikar, PC Diniz, RF Lucas… - … Conference on High …, 2014 - ieeexplore.ieee.org
As the scale and complexity of future High Performance Computing systems continues to
grow, the rising frequency of faults and errors and their impact on HPC applications will …

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

A Benoit, A Cavelan, F Cappello, P Raghavan… - Journal of Parallel and …, 2018 - Elsevier
This paper provides a model and an analytical study of replication as a technique to cope
with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale …

Adaptive and power-aware resilience for extreme-scale computing

X Cui, T Znati, R Melhem - … and Big Data Computing, Internet of …, 2016 - ieeexplore.ieee.org
With concerted efforts from researchers in hardware, software, algorithm, resource
management, HPC is moving towards extreme-scale, featuring a computing capability of …