MELA: A visual analytics tool for studying multifidelity HPC system logs

FNU Shilpika, B Lusch, M Emani… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org
2019 IEEE/ACM Industry/University Joint International Workshop on …, 2019 - ieeexplore.ieee.org
To maintain a robust and reliable supercomputing hardware system, there is a critical need to understand various system events, including failures occurring in the system. Toward this goal, we analyze various system logs, such as error logs, job logs, and environment logs, from the Argonne Leadership Computing Facility's (ALCF) Theta Cray XC40 supercomputer. These log data incorporate multiple subsystem and component measurements at various fidelity levels and temporal resolutions, forming a very diverse and massive dataset. To effectively identify the patterns that characterize system behavior and faults over time, we have developed a visual analytics tool, MELA, that helps glean insights from these log data.