| Title | Authors | Venue | Cited by | Year |
| --- | --- | --- | --- | --- |
| Proactive fault tolerance for HPC with Xen virtualization | AB Nagarajan, F Mueller, C Engelmann, SL Scott | Proceedings of the 21st annual international conference on Supercomputing, 23-32, 2007 | 527 | 2007 |
| Addressing failures in exascale computing | M Snir, RW Wisniewski, JA Abraham, SV Adve, S Bagchi, P Balaji, J Belak, ... | The International Journal of High Performance Computing Applications 28 (2 …, 2014 | 521 | 2014 |
| Detection and correction of silent data corruption for large-scale high-performance computing | D Fiala, F Mueller, C Engelmann, R Riesen, K Ferreira, R Brightwell | SC'12: Proceedings of the International Conference on High Performance …, 2012 | 385 | 2012 |
| Proactive process-level live migration in HPC environments | C Wang, F Mueller, C Engelmann, SL Scott | SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 1-12, 2008 | 250 | 2008 |
| Combining partial redundancy and checkpointing for HPC | J Elliott, K Kharbas, D Fiala, F Mueller, K Ferreira, C Engelmann | 2012 IEEE 32nd International Conference on Distributed Computing Systems …, 2012 | 202 | 2012 |
| Failures in large scale systems: Long-term measurement, analysis, and implications | S Gupta, T Patel, C Engelmann, D Tiwari | Proceedings of the International Conference for High Performance Computing …, 2017 | 169 | 2017 |
| Proactive fault tolerance using preemptive migration | C Engelmann, GR Vallee, T Naughton, SL Scott | 2009 17th Euromicro International Conference on Parallel, Distributed and …, 2009 | 152 | 2009 |
| Functional partitioning to optimize end-to-end performance on many-core architectures | M Li, SS Vazhkudai, AR Butt, F Meng, X Ma, Y Kim, C Engelmann, ... | SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High …, 2010 | 118 | 2010 |
| A job pause service under LAM/MPI+BLCR for transparent fault tolerance | C Wang, F Mueller, C Engelmann, SL Scott | 2007 IEEE International Parallel and Distributed Processing Symposium, 1-10, 2007 | 115 | 2007 |
| The case for modular redundancy in large-scale high performance computing systems | C Engelmann, HH Ong, SL Scott | Proceedings of the 8th IASTED international conference on parallel and …, 2009 | 110 | 2009 |
| NVMalloc: Exposing an aggregate SSD store as a memory partition in extreme-scale machines | C Wang, SS Vazhkudai, X Ma, F Meng, Y Kim, C Engelmann | 2012 IEEE 26th International Parallel and Distributed Processing Symposium …, 2012 | 100 | 2012 |
| A framework for proactive fault tolerance | G Vallee, K Charoenpornwattana, C Engelmann, A Tikotekar, ... | 2008 Third International Conference on Availability, Reliability and …, 2008 | 93 | 2008 |
| High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development | N DeBardeleben, J Laros, JT Daly, SL Scott, C Engelmann, B Harrod | Whitepaper, Dec, 2009 | 88 | 2009 |
| System-level virtualization for high performance computing | G Vallee, T Naughton, C Engelmann, H Ong, SL Scott | 16th Euromicro Conference on Parallel, Distributed and Network-Based …, 2008 | 86 | 2008 |
| Machine learning models for GPU error prediction in a large scale HPC system | B Nie, J Xue, S Gupta, T Patel, C Engelmann, E Smirni, D Tiwari | 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems …, 2018 | 82 | 2018 |
| Super-scalable algorithms for computing on 100,000 processors | C Engelmann, A Geist | International Conference on Computational Science, 313-321, 2005 | 82 | 2005 |
| Hybrid checkpointing for MPI jobs in HPC environments | C Wang, F Mueller, C Engelmann, SL Scott | 2010 IEEE 16th International Conference on Parallel and Distributed Systems …, 2010 | 80 | 2010 |
| Redundant execution of HPC applications with MR-MPI | C Engelmann, S Böhm | Proceedings of the 10th IASTED International Conference on Parallel and …, 2011 | 79 | 2011 |
| Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale | C Engelmann | Future Generation Computer Systems 30, 59-65, 2014 | 70 | 2014 |
| xSim: The extreme-scale simulator | S Böhm, C Engelmann | 2011 International Conference on High Performance Computing & Simulation …, 2011 | 67 | 2011 |