Combining partial redundancy and checkpointing for HPC
Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …
A shoulder surfing resistant graphical authentication system
Authentication based on passwords is used largely in applications for computer security and
privacy. However, human actions such as choosing bad passwords and inputting passwords …
Reliability-aware approach: An incremental checkpoint/restart model in hpc environments
N Naksinehaboon, Y Liu… - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org
For full checkpoint on a large-scale HPC system, huge memory contexts must potentially be
transferred through the network and saved in a reliable storage. As such, the time taken to …
Optimizing checkpoint‐based fault‐tolerance in distributed stream processing systems: Theory to practice
S Jayasekara, S Karunasekera… - Software: Practice and …, 2022 - Wiley Online Library
Fault‐tolerance is an essential part of a stream processing system that guarantees data
analysis could continue even after failures. State‐of‐the‐art distributed stream processing …
Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems
Non-volatile devices, such as SSDs, will be an integral part of the deepening storage
hierarchy on large-scale HPC systems. These devices can be on the compute nodes as part …
Toward a general theory of optimal checkpoint placement
Checkpoint/restart has been widely used to cope with fail-stop errors. The checkpointing
frequency is most often optimized by assuming an exponential failure distribution. However …
Exploring non-volatility of non-volatile memory for high performance computing under failures
Hardware failures and faults often result in application crash in HPC. The emergence of non-
volatile memory (NVM) provides a solution to address this problem. Leveraging the …
Understanding practical tradeoffs in HPC checkpoint-scheduling policies
N El-Sayed, B Schroeder - IEEE Transactions on Dependable …, 2016 - ieeexplore.ieee.org
As the scale of High-Performance Computing (HPC) clusters continues to grow, their
increasing failure rates and energy consumption levels are emerging as serious design …
Checkpoint/restart in practice: When 'simple is better'
N El-Sayed, B Schroeder - 2014 IEEE International Conference …, 2014 - ieeexplore.ieee.org
Efficient use of high-performance computing (HPC) installations critically relies on effective
methods for fault tolerance. The most commonly used method is checkpoint/restart, where …
An optimal checkpointing model with online OCI adjustment for stream processing applications
Y Zhuang, X Wei, H Li, Y Wang… - 2018 27th International …, 2018 - ieeexplore.ieee.org
Checkpoint-based fault tolerant method has been widely used to enhance the reliability of
Distributed Stream Processing Engines (DSPEs), but a checkpointing process usually …