A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …
What can we learn from four years of data center hardware failures?
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
Increase in graphics hardware performance and improvements in programmability has
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …
CheckFreq: Frequent, Fine-Grained DNN Checkpointing
Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task.
During training, the model performs computation at the GPU to learn weights, repeatedly …
Desh: deep learning for system health prediction of lead times to failure in HPC
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …
Job characteristics on large-scale systems: long-term analysis, quantification, and implications
HPC workload analysis and resource consumption characteristics are the key to driving
better operation practices, system procurement decisions, and designing effective resource …
Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer
L Bautista-Gomez, F Zyulkyarov… - SC'16: Proceedings …, 2016 - ieeexplore.ieee.org
Supercomputers offer new opportunities for scientific computing as they grow in size.
However, their growth also poses new challenges. Resilience has been recognized as one …
A large-scale study of soft-errors on GPUs in the field
Parallelism provided by the GPU architecture has enabled domain scientists to simulate
physical phenomena at a much faster rate and finer granularity than what was previously …
Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge Leadership Computing Facility
D Tiwari, S Gupta, G Gallarno, J Rogers… - Proceedings of the …, 2015 - dl.acm.org
The high computational capability of graphics processing units (GPUs) is enabling and
driving the scientific discovery process at large-scale. The world's second fastest …
An analysis of the current status and countermeasures of bike-sharing in the background of Internet
X Gao, S Zhao, S Yibo - 2018 International Conference on …, 2018 - ieeexplore.ieee.org
With the continuous rapid growth of China's overall economy, the role of urban transport in
social and economic development has become increasingly significant. However, it also …