A reliability-aware approach for an optimal checkpoint/restart model in HPC environments

J Elliott, K Kharbas, D Fiala, F Mueller… - 2012 IEEE 32nd …, 2012 - ieeexplore.ieee.org

Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15
floating point operations per second) and exascale systems are projected within seven …

Cited by 206 Related articles All 20 versions

A shoulder surfing resistant graphical authentication system

HM Sun, ST Chen, JH Yeh… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org

Authentication based on passwords is used largely in applications for computer security and
privacy. However, human actions such as choosing bad passwords and inputting passwords …

Cited by 143 Related articles All 11 versions

Reliability-aware approach: An incremental checkpoint/restart model in HPC environments

N Naksinehaboon, Y Liu… - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org

For full checkpoint on a large-scale HPC system, huge memory contexts must potentially be
transferred through the network and saved in a reliable storage. As such, the time taken to …

Cited by 106 Related articles All 6 versions

Optimizing checkpoint‐based fault‐tolerance in distributed stream processing systems: Theory to practice

S Jayasekara, S Karunasekera… - Software: Practice and …, 2022 - Wiley Online Library

Fault‐tolerance is an essential part of a stream processing system that guarantees data
analysis could continue even after failures. State‐of‐the‐art distributed stream processing …

Cited by 10 Related articles All 2 versions

Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems

L Wan, Q Cao, F Wang, S Oral - Journal of Parallel and Distributed …, 2017 - Elsevier

Non-volatile devices, such as SSDs, will be an integral part of the deepening storage
hierarchy on large-scale HPC systems. These devices can be on the compute nodes as part …

Cited by 30 Related articles All 7 versions

Toward a general theory of optimal checkpoint placement

O Subasi, G Kestor… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org

Checkpoint/restart has been widely used to cope with fail-stop errors. The checkpointing
frequency is most often optimized by assuming an exponential failure distribution. However …

Cited by 24 Related articles All 2 versions

Exploring non-volatility of non-volatile memory for high performance computing under failures

J Ren, K Wu, D Li - 2020 IEEE International Conference on …, 2020 - ieeexplore.ieee.org

Hardware failures and faults often result in application crash in HPC. The emergence of
non-volatile memory (NVM) provides a solution to address this problem. Leveraging the …

Cited by 14 Related articles All 3 versions

Understanding practical tradeoffs in HPC checkpoint-scheduling policies

N El-Sayed, B Schroeder - IEEE Transactions on Dependable …, 2016 - ieeexplore.ieee.org

As the scale of High-Performance Computing (HPC) clusters continues to grow, their
increasing failure rates and energy consumption levels are emerging as serious design …

Cited by 23 Related articles All 2 versions

Checkpoint/restart in practice: When 'simple is better'

N El-Sayed, B Schroeder - 2014 IEEE International Conference …, 2014 - ieeexplore.ieee.org

Efficient use of high-performance computing (HPC) installations critically relies on effective
methods for fault tolerance. The most commonly used method is checkpoint/restart, where …

Cited by 27 Related articles All 5 versions

An optimal checkpointing model with online OCI adjustment for stream processing applications

Y Zhuang, X Wei, H Li, Y Wang… - 2018 27th International …, 2018 - ieeexplore.ieee.org

Checkpoint-based fault tolerant method has been widely used to enhance the reliability of
Distributed Stream Processing Engines (DSPEs), but a checkpointing process usually …

Cited by 18 Related articles All 2 versions