Experience report: System log analysis for anomaly detection

S He, J Zhu, P He, MR Lyu - 2016 IEEE 27th international …, 2016 - ieeexplore.ieee.org
Anomaly detection plays an important role in management of modern large-scale distributed
systems. Logs, which record system runtime information, are widely used for anomaly …

On security in publish/subscribe services: A survey

C Esposito, M Ciampi - IEEE Communications Surveys & …, 2014 - ieeexplore.ieee.org
Publish/subscribe services have encountered considerable success in the building of
modern large-scale mission-critical systems. Such systems are characterized by several non …

Lessons learned from the analysis of system failures at petascale: The case of Blue Waters

C Di Martino, Z Kalbarczyk, RK Iyer… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …

Failures in large scale systems: long-term measurement, analysis, and implications

S Gupta, T Patel, C Engelmann, D Tiwari - Proceedings of the …, 2017 - dl.acm.org
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

Understanding GPU errors on large-scale HPC systems and the implications for system design and operation

D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
Increase in graphics hardware performance and improvements in programmability have
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …

Desh: deep learning for system health prediction of lead times to failure in HPC

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

Fault prediction under the microscope: A closer look into HPC systems

A Gainaru, F Cappello, M Snir… - SC'12: Proceedings of …, 2012 - ieeexplore.ieee.org
A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently current research is focusing on …

Event logs for the analysis of software failures: A rule-based approach

M Cinque, D Cotroneo… - IEEE Transactions on …, 2012 - ieeexplore.ieee.org
Event logs have been widely used over the last three decades to analyze the failure
behavior of a variety of systems. Nevertheless, the implementation of the logging …

A large-scale study of soft-errors on GPUs in the field

B Nie, D Tiwari, S Gupta, E Smirni… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
Parallelism provided by the GPU architecture has enabled domain scientists to simulate
physical phenomena at a much faster rate and finer granularity than what was previously …