Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads...

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier

Abstract The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Cited by 1 - Related articles - All 8 versions

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org

Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

Cited by 149 - Related articles - All 9 versions

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org

Increase in graphics hardware performance and improvements in programmability has
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …

Cited by 203 - Related articles - All 9 versions

CheckFreq: Frequent, Fine-Grained DNN Checkpointing

J Mohan, A Phanishayee, V Chidambaram - 19th USENIX Conference …, 2021 - usenix.org

Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task.
During training, the model performs computation at the GPU to learn weights, repeatedly …

Cited by 88 - Related articles - All 6 versions

Desh: deep learning for system health prediction of lead times to failure in HPC

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org

Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

Cited by 117 - Related articles - All 4 versions

Job characteristics on large-scale systems: long-term analysis, quantification, and implications

T Patel, Z Liu, R Kettimuthu, P Rich… - … conference for high …, 2020 - ieeexplore.ieee.org

HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …

Cited by 68 - Related articles - All 4 versions

Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer

L Bautista-Gomez, F Zyulkyarov… - SC'16: Proceedings …, 2016 - ieeexplore.ieee.org

Supercomputers offer new opportunities for scientific computing as they grow in size.
However, their growth also poses new challenges. Resilience has been recognized as one …

Cited by 109 - Related articles - All 8 versions

A large-scale study of soft-errors on GPUs in the field

B Nie, D Tiwari, S Gupta, E Smirni… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org

Parallelism provided by the GPU architecture has enabled domain scientists to simulate
physical phenomena at a much faster rate and finer granularity than what was previously …

Cited by 109 - Related articles - All 7 versions

Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge Leadership Computing Facility

D Tiwari, S Gupta, G Gallarno, J Rogers… - Proceedings of the …, 2015 - dl.acm.org

The high computational capability of graphics processing units (GPUs) is enabling and
driving the scientific discovery process at large-scale. The world's second fastest …

Cited by 101 - Related articles - All 6 versions

An analysis of the current status and countermeasures of bike-sharing in the background of Internet

X Gao, S Zhao, S Yibo - 2018 International Conference on …, 2018 - ieeexplore.ieee.org

With the continuous rapid growth of China's overall economy, the role of urban transport in
social and economic development has become increasingly significant. However, it also …

Cited by 18 - Related articles - All 3 versions