Experience report: System log analysis for anomaly detection
Anomaly detection plays an important role in management of modern large-scale distributed
systems. Logs, which record system runtime information, are widely used for anomaly …
On security in publish/subscribe services: A survey
C Esposito, M Ciampi - IEEE Communications Surveys & …, 2014 - ieeexplore.ieee.org
Publish/subscribe services have encountered considerable success in the building of
modern large-scale mission-critical systems. Such systems are characterized by several non …
Lessons learned from the analysis of system failures at petascale: The case of blue waters
C Di Martino, Z Kalbarczyk, RK Iyer… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid
(CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis …
Failures in large scale systems: long-term measurement, analysis, and implications
Resilience is one of the key challenges in maintaining high efficiency of future extreme scale
supercomputers. Researchers and system practitioners rely on field-data studies to …
What can we learn from four years of data center hardware failures?
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
D Tiwari, S Gupta, J Rogers, D Maxwell… - 2015 IEEE 21st …, 2015 - ieeexplore.ieee.org
Increase in graphics hardware performance and improvements in programmability has
enabled GPUs to evolve from a graphics-specific accelerator to a general-purpose …
Desh: deep learning for system health prediction of lead times to failure in HPC
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …
Fault prediction under the microscope: A closer look into HPC systems
A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently current research is focusing on …
Event logs for the analysis of software failures: A rule-based approach
M Cinque, D Cotroneo… - IEEE Transactions on …, 2012 - ieeexplore.ieee.org
Event logs have been widely used over the last three decades to analyze the failure
behavior of a variety of systems. Nevertheless, the implementation of the logging …
A large-scale study of soft-errors on GPUs in the field
Parallelism provided by the GPU architecture has enabled domain scientists to simulate
physical phenomena at a much faster rate and finer granularity than what was previously …