Improving log-based field failure data analysis of multi-node computing systems

S He, J Zhu, P He, MR Lyu - 2016 IEEE 27th international …, 2016 - ieeexplore.ieee.org

Anomaly detection plays an important role in management of modern large-scale distributed
systems. Logs, which record system runtime information, are widely used for anomaly …

Cited by 701 · Related articles · All 9 versions

On security in publish/subscribe services: A survey

C Esposito, M Ciampi - IEEE Communications Surveys & …, 2014 - ieeexplore.ieee.org

Publish/subscribe services have encountered considerable success in the building of
modern large-scale mission-critical systems. Such systems are characterized by several non …

Cited by 85 · Related articles · All 4 versions

Lessons learned from the analysis of system failures at petascale: The case of blue waters

C Di Martino, Z Kalbarczyk, RK Iyer… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org

This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …

Cited by 278 · Related articles · All 5 versions

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org

Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

Cited by 184 · Related articles · All 12 versions

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org

Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

Cited by 148 · Related articles · All 9 versions

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org

Increase in graphics hardware performance and improvements in programmability has
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …

Cited by 203 · Related articles · All 9 versions

Desh: deep learning for system health prediction of lead times to failure in hpc

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org

Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

Cited by 116 · Related articles · All 4 versions

Fault prediction under the microscope: A closer look into HPC systems

A Gainaru, F Cappello, M Snir… - SC'12: Proceedings of …, 2012 - ieeexplore.ieee.org

A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently current research is focusing on …

Cited by 179 · Related articles · All 14 versions

Event logs for the analysis of software failures: A rule-based approach

M Cinque, D Cotroneo… - IEEE Transactions on …, 2012 - ieeexplore.ieee.org

Event logs have been widely used over the last three decades to analyze the failure
behavior of a variety of systems. Nevertheless, the implementation of the logging …

Cited by 168 · Related articles · All 9 versions

A large-scale study of soft-errors on GPUs in the field

B Nie, D Tiwari, S Gupta, E Smirni… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org

Parallelism provided by the GPU architecture has enabled domain scientists to simulate
physical phenomena at a much faster rate and finer granularity than what was previously …

Cited by 109 · Related articles · All 7 versions