Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

A taxonomy of grid monitoring systems

S Zanikolas, R Sakellariou - Future Generation Computer Systems, 2005 - Elsevier
Monitoring is the act of collecting information concerning the characteristics and status of
resources of interest. Monitoring grid resources is a lively research area given the …

Stack trace analysis for large scale debugging

DC Arnold, DH Ahn, BR De Supinski… - 2007 IEEE …, 2007 - ieeexplore.ieee.org
We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale
applications. STAT can reduce problem exploration spaces from thousands of processes to …

Boosting verification by automatic tuning of decision procedures

F Hutter, D Babic, HH Hoos… - Formal Methods in …, 2007 - ieeexplore.ieee.org
Parameterized heuristics abound in computer aided design and verification, and manual
tuning of the respective parameters is difficult and time-consuming. Very recent results from …

ScalaTrace: Scalable compression and replay of communication traces for high-performance computing

M Noeth, P Ratn, F Mueller, M Schulz… - Journal of Parallel and …, 2009 - Elsevier
Characterizing the communication behavior of large-scale applications is a difficult and
costly task due to code/system complexity and long execution times. While many tools to …

Modeling the impact of checkpoints on next-generation systems

RA Oldfield, S Arunagiri, PJ Teller… - … IEEE Conference on …, 2007 - ieeexplore.ieee.org
The next generation of capability-class, massively parallel processing (MPP) systems is
expected to have hundreds of thousands of processors. For application-driven, periodic …

Proactive fault tolerance using preemptive migration

C Engelmann, GR Vallee, T Naughton… - 2009 17th Euromicro …, 2009 - ieeexplore.ieee.org
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents
compute node failures from impacting running parallel applications by preemptively …

Flux: A next-generation resource management framework for large HPC centers

DH Ahn, J Garlick, M Grondona, D Lipari… - 2014 43rd …, 2014 - ieeexplore.ieee.org
Resource and job management software is crucial to High Performance Computing (HPC)
for efficient application execution. However, current systems and approaches can no longer …

Open|SpeedShop: An open source infrastructure for parallel performance analysis

M Schulz, J Galarowicz, D Maghrak… - Scientific …, 2008 - content.iospress.com
Over the last decades a large number of performance tools has been developed to analyze
and optimize high performance applications. Their acceptance by end users, however, has …

A review of supercomputer performance monitoring systems

KS Stefanov, S Pawar, A Ranjan… - Supercomputing …, 2021 - superfri.susu.ru
High Performance Computing is now one of the emerging fields in computer
science and its applications. Top HPC facilities, supercomputers, offer great opportunities in …