Combining partial redundancy and checkpointing for HPC

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org
Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

A shoulder surfing resistant graphical authentication system

HM Sun, ST Chen, JH Yeh… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
Authentication based on passwords is used largely in applications for computer security and
privacy. However, human actions such as choosing bad passwords and inputting passwords …

Reliability-aware approach: An incremental checkpoint/restart model in hpc environments

N Naksinehaboon, Y Liu… - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org
For full checkpoint on a large-scale HPC system, huge memory contexts must potentially be
transferred through the network and saved in a reliable storage. As such, the time taken to …

Optimizing checkpoint‐based fault‐tolerance in distributed stream processing systems: Theory to practice

S Jayasekara, S Karunasekera… - Software: Practice and …, 2022 - Wiley Online Library
Fault‐tolerance is an essential part of a stream processing system that guarantees data
analysis could continue even after failures. State‐of‐the‐art distributed stream processing …

Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems

L Wan, Q Cao, F Wang, S Oral - Journal of Parallel and Distributed …, 2017 - Elsevier
Non-volatile devices, such as SSDs, will be an integral part of the deepening storage
hierarchy on large-scale HPC systems. These devices can be on the compute nodes as part …

Toward a general theory of optimal checkpoint placement

O Subasi, G Kestor… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
Checkpoint/restart has been widely used to cope with fail-stop errors. The checkpointing
frequency is most often optimized by assuming an exponential failure distribution. However …

Exploring non-volatility of non-volatile memory for high performance computing under failures

J Ren, K Wu, D Li - 2020 IEEE International Conference on …, 2020 - ieeexplore.ieee.org
Hardware failures and faults often result in application crash in HPC. The emergence of
non-volatile memory (NVM) provides a solution to address this problem. Leveraging the …

Understanding practical tradeoffs in HPC checkpoint-scheduling policies

N El-Sayed, B Schroeder - IEEE Transactions on Dependable …, 2016 - ieeexplore.ieee.org
As the scale of High-Performance Computing (HPC) clusters continues to grow, their
increasing failure rates and energy consumption levels are emerging as serious design …

Checkpoint/restart in practice: When 'simple is better'

N El-Sayed, B Schroeder - 2014 IEEE International Conference …, 2014 - ieeexplore.ieee.org
Efficient use of high-performance computing (HPC) installations critically relies on effective
methods for fault tolerance. The most commonly used method is checkpoint/restart, where …

An optimal checkpointing model with online OCI adjustment for stream processing applications

Y Zhuang, X Wei, H Li, Y Wang… - 2018 27th International …, 2018 - ieeexplore.ieee.org
Checkpoint-based fault tolerant method has been widely used to enhance the reliability of
Distributed Stream Processing Engines (DSPEs), but a checkpointing process usually …