Does partial replication pay off?

A Sethia, G Dasika, M Samadi… - Proceedings of the 22nd …, 2013 - ieeexplore.ieee.org

Modern graphics processing units (GPUs) combine large amounts of parallel hardware with
fast context switching among thousands of active threads to achieve high performance …

Cited by 94 Related articles All 11 versions

Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com

This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Cited by 9 Related articles All 22 versions

A method to represent multiple-output switching functions by using multi-valued decision diagrams

T Sasao, JT Butler - … of 26th IEEE International Symposium on …, 1996 - ieeexplore.ieee.org

Multiple-output switching functions can be simulated by multiple-valued decision diagrams
(MDDs) at a significant reduction in computation time. We analyze the following approaches to …

Cited by 96 Related articles All 15 versions

Partial redundancy in HPC systems with non-uniform node reliabilities

Z Hussain, T Znati, R Melhem - SC18: International Conference …, 2018 - ieeexplore.ieee.org

We study the usefulness of partial redundancy in HPC message passing systems where
individual node failure distributions are not identical. Prior research works on fault tolerance …

Cited by 23 Related articles All 7 versions

Replication is more efficient than you think

A Benoit, T Herault, VL Fèvre, Y Robert - Proceedings of the International …, 2019 - dl.acm.org

This paper revisits replication coupled with checkpointing for fail-stop errors. Replication
enables the application to survive many fail-stop errors, thereby allowing for longer …

Cited by 18 Related articles All 17 versions

Fault tolerance on large scale systems using adaptive process replication

C George, S Vadhiyar - IEEE Transactions on Computers, 2014 - ieeexplore.ieee.org

Exascale systems of the future are predicted to have mean time between failures (MTBF) of
less than one hour. At such low MTBFs, employing periodic checkpointing alone will result in …

Cited by 28 Related articles All 9 versions

Redthreads: An interface for application-level fault detection/correction through adaptive redundant multithreading

S Hukerikar, K Teranishi, PC Diniz… - International Journal of …, 2018 - Springer

In the presence of accelerated fault rates, which are projected to be the norm on future
exascale systems, it will become increasingly difficult for high-performance computing (HPC) …

Cited by 21 Related articles All 10 versions

Opportunistic application-level fault detection through adaptive redundant multithreading

S Hukerikar, PC Diniz, RF Lucas… - … Conference on High …, 2014 - ieeexplore.ieee.org

As the scale and complexity of future High Performance Computing systems continues to
grow, the rising frequency of faults and errors and their impact on HPC applications will …

Cited by 27 Related articles All 2 versions

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

A Benoit, A Cavelan, F Cappello, P Raghavan… - Journal of Parallel and …, 2018 - Elsevier

This paper provides a model and an analytical study of replication as a technique to cope
with silent errors, as well as a mixture of both silent and fail-stop errors on large-scale …

Cited by 19 Related articles All 16 versions

Adaptive and power-aware resilience for extreme-scale computing

X Cui, T Znati, R Melhem - … and Big Data Computing, Internet of …, 2016 - ieeexplore.ieee.org

With concerted efforts from researchers in hardware, software, algorithm, resource
management, HPC is moving towards extreme-scale, featuring a computing capability of …

Cited by 21 Related articles All 12 versions