- Academic resource search

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com

We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

Cited by 528 · Related articles · All 20 versions

[PDF] man.ac.uk

A taxonomy of grid monitoring systems

S Zanikolas, R Sakellariou - Future Generation Computer Systems, 2005 - Elsevier

Monitoring is the act of collecting information concerning the characteristics and status of
resources of interest. Monitoring grid resources is a lively research area given the …

Cited by 371 · Related articles · All 8 versions

[PDF] wisconsin.edu

Stack trace analysis for large scale debugging

DC Arnold, DH Ahn, BR De Supinski… - 2007 IEEE …, 2007 - ieeexplore.ieee.org

We present the Stack Trace Analysis Tool (STAT) to aid in debugging extreme-scale
applications. STAT can reduce problem exploration spaces from thousands of processes to …

Cited by 202 · Related articles · All 22 versions

[PDF] psu.edu

Boosting verification by automatic tuning of decision procedures

F Hutter, D Babic, HH Hoos… - Formal Methods in …, 2007 - ieeexplore.ieee.org

Parameterized heuristics abound in computer aided design and verification, and manual
tuning of the respective parameters is difficult and time-consuming. Very recent results from …

Cited by 194 · Related articles · All 18 versions

[PDF] osti.gov

ScalaTrace: Scalable compression and replay of communication traces for high-performance computing

M Noeth, P Ratn, F Mueller, M Schulz… - Journal of Parallel and …, 2009 - Elsevier

Characterizing the communication behavior of large-scale applications is a difficult and
costly task due to code/system complexity and long execution times. While many tools to …

Cited by 176 · Related articles · All 10 versions

[PDF] osti.gov

Modeling the impact of checkpoints on next-generation systems

RA Oldfield, S Arunagiri, PJ Teller… - … IEEE Conference on …, 2007 - ieeexplore.ieee.org

The next generation of capability-class, massively parallel processing (MPP) systems is
expected to have hundreds of thousands of processors. For application-driven, periodic …

Cited by 184 · Related articles · All 15 versions

[PDF] researchgate.net

Proactive fault tolerance using preemptive migration

C Engelmann, GR Vallee, T Naughton… - 2009 17th Euromicro …, 2009 - ieeexplore.ieee.org

Proactive fault tolerance (FT) in high-performance computing is a concept that prevents
compute node failures from impacting running parallel applications by preemptively …

Cited by 153 · Related articles · All 17 versions

[PDF] flux-framework.org

Flux: A next-generation resource management framework for large HPC centers

DH Ahn, J Garlick, M Grondona, D Lipari… - 2014 43rd …, 2014 - ieeexplore.ieee.org

Resource and job management software is crucial to High Performance Computing (HPC)
for efficient application execution. However, current systems and approaches can no longer …

Cited by 93 · Related articles · All 7 versions

[PDF] wiley.com

Open|SpeedShop: An open source infrastructure for parallel performance analysis

M Schulz, J Galarowicz, D Maghrak… - Scientific …, 2008 - content.iospress.com

Over the last decades a large number of performance tools has been developed to analyze
and optimize high performance applications. Their acceptance by end users, however, has …

Cited by 138 · Related articles · All 10 versions

[PDF] susu.ru

A review of supercomputer performance monitoring systems

KS Stefanov, S Pawar, A Ranjan… - Supercomputing …, 2021 - superfri.susu.ru

High Performance Computing is now one of the emerging fields in computer
science and its applications. Top HPC facilities, supercomputers, offer great opportunities in …

Cited by 9 · Related articles · All 4 versions