Argobots: A lightweight low-level threading and tasking framework S Seo, A Amer, P Balaji, C Bordage, G Bosilca, A Brooks, P Carns, ... IEEE Transactions on Parallel and Distributed Systems 29 (3), 512-526, 2017 | 147 | 2017 |
ACR: Automatic checkpoint/restart for soft and hard error protection X Ni, E Meneses, N Jain, LV Kalé Proceedings of the international conference on high performance computing …, 2013 | 112 | 2013 |
Periodic hierarchical load balancing for large supercomputers G Zheng, A Bhatele, E Meneses, LV Kale The International Journal of High Performance Computing Applications 25 (4 …, 2011 | 111 | 2011 |
Hierarchical load balancing for Charm++ applications on large supercomputers G Zheng, E Meneses, A Bhatele, LV Kale 2010 39th International Conference on Parallel Processing Workshops, 436-444, 2010 | 97 | 2010 |
Team-based message logging: Preliminary results E Meneses, CL Mendes, LV Kalé 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid …, 2010 | 63 | 2010 |
Assessing energy efficiency of fault tolerance protocols for HPC systems E Meneses, O Sarood, LV Kalé 2012 IEEE 24th International Symposium on Computer Architecture and High …, 2012 | 60 | 2012 |
On the use of cluster-based partial message logging to improve fault tolerance for MPI HPC applications T Ropars, A Guermouche, B Uçar, E Meneses, LV Kalé, F Cappello Euro-Par 2011 Parallel Processing: 17th International Conference, Euro-Par …, 2011 | 60 | 2011 |
Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm X Ni, E Meneses, LV Kalé 2012 IEEE International Conference on Cluster Computing, 364-372, 2012 | 57 | 2012 |
Using migratable objects to enhance fault tolerance schemes in supercomputers E Meneses, X Ni, G Zheng, CL Mendes, LV Kale IEEE Transactions on Parallel and Distributed Systems 26 (7), 2061-2074, 2014 | 49 | 2014 |
A 'cool' way of improving the reliability of HPC machines O Sarood, E Meneses, LV Kale Proceedings of the International Conference on High Performance Computing …, 2013 | 42 | 2013 |
Energy profile of rollback-recovery strategies in high performance computing E Meneses, O Sarood, LV Kalé Parallel Computing 40 (9), 536-547, 2014 | 41 | 2014 |
Communication and topology-aware load balancing in Charm++ with TreeMatch E Jeannot, E Meneses, G Mercier, F Tessier, G Zheng 2013 IEEE International Conference on Cluster Computing (CLUSTER), 1-8, 2013 | 33 | 2013 |
Power, reliability, and performance: One system to rule them all B Acun, A Langer, E Meneses, H Menon, O Sarood, E Totoni, LV Kalé Computer 49 (10), 30-37, 2016 | 32 | 2016 |
Evaluation of simple causal message logging for large-scale fault tolerant HPC systems E Meneses, G Bronevetsky, LV Kale 2011 IEEE International Symposium on Parallel and Distributed Processing …, 2011 | 30 | 2011 |
A study of checkpointing in large scale training of deep neural networks E Rojas, AN Kahira, E Meneses, LB Gomez, RM Badia arXiv preprint arXiv:2012.00825, 2020 | 29 | 2020 |
Scalable replay with partial-order dependencies for message-logging fault tolerance J Lifflander, E Meneses, H Menon, P Miller, S Krishnamoorthy, LV Kalé 2014 IEEE International Conference on Cluster Computing (CLUSTER), 19-28, 2014 | 25 | 2014 |
A message-logging protocol for multicore systems E Meneses, X Ni, LV Kalé IEEE/IFIP International Conference on Dependable Systems and Networks …, 2012 | 25 | 2012 |
Analyzing the interplay of failures and workload on a leadership-class supercomputer E Meneses, X Ni, T Jones, D Maxwell computing 2 (3), 4, 2015 | 23 | 2015 |
Dynamic load balance for optimized message logging in fault tolerant HPC applications E Meneses, LV Kalé, G Bronevetsky 2011 IEEE International Conference on Cluster Computing, 281-289, 2011 | 18 | 2011 |
Analyzing a five-year failure record of a leadership-class supercomputer E Rojas, E Meneses, T Jones, D Maxwell 2019 31st International Symposium on Computer Architecture and High …, 2019 | 17 | 2019 |