Addressing failures in exascale computing
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …
A taxonomy of grid monitoring systems
S Zanikolas, R Sakellariou - Future Generation Computer Systems, 2005 - Elsevier
Monitoring is the act of collecting information concerning the characteristics and status of
resources of interest. Monitoring grid resources is a lively research area given the …
Stack trace analysis for large scale debugging
We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale
applications. STAT can reduce problem exploration spaces from thousands of processes to …
Boosting verification by automatic tuning of decision procedures
Parameterized heuristics abound in computer aided design and verification, and manual
tuning of the respective parameters is difficult and time-consuming. Very recent results from …
ScalaTrace: Scalable compression and replay of communication traces for high-performance computing
Characterizing the communication behavior of large-scale applications is a difficult and
costly task due to code/system complexity and long execution times. While many tools to …
Modeling the impact of checkpoints on next-generation systems
RA Oldfield, S Arunagiri, PJ Teller… - … IEEE Conference on …, 2007 - ieeexplore.ieee.org
The next generation of capability-class, massively parallel processing (MPP) systems is
expected to have hundreds of thousands of processors. For application-driven, periodic …
Proactive fault tolerance using preemptive migration
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents
compute node failures from impacting running parallel applications by preemptively …
Flux: A next-generation resource management framework for large HPC centers
DH Ahn, J Garlick, M Grondona, D Lipari… - 2014 43rd …, 2014 - ieeexplore.ieee.org
Resource and job management software is crucial to High Performance Computing (HPC)
for efficient application execution. However, current systems and approaches can no longer …
Open|SpeedShop: An open source infrastructure for parallel performance analysis
M Schulz, J Galarowicz, D Maghrak… - Scientific …, 2008 - content.iospress.com
Over the last decades a large number of performance tools has been developed to analyze
and optimize high performance applications. Their acceptance by end users, however, has …
A review of supercomputer performance monitoring systems
KS Stefanov, S Pawar, A Ranjan… - Supercomputing …, 2021 - superfri.susu.ru
High Performance Computing is now one of the emerging fields in computer
science and its applications. Top HPC facilities, supercomputers, offer great opportunities in …