Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

RowPress: Amplifying read disturbance in modern DRAM chips

H Luo, A Olgun, AG Yağlıkçı, YC Tuğrul… - Proceedings of the 50th …, 2023 - dl.acm.org
Memory isolation is critical for system reliability, security, and safety. Unfortunately, read
disturbance can break memory isolation in modern DRAM chips. For example, RowHammer …

Silent data corruptions at scale

HD Dixit, S Pendharkar, M Beadon, C Mason… - arXiv preprint arXiv …, 2021 - arxiv.org
Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure
services. SDCs are not captured by error reporting mechanisms within a Central Processing …

Understanding silent data corruptions in a large production CPU population

S Wang, G Zhang, J Wei, Y Wang, J Wu… - Proceedings of the 29th …, 2023 - dl.acm.org
Silent Data Corruption (SDC) in processors can lead to various application-level issues,
such as incorrect calculations and even data loss. Since traditional techniques are not …

Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory

Y Luo, S Govindan, B Sharma… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
Memory devices represent a key component of datacenter total cost of ownership (TCO),
and techniques used to reduce errors that occur on these devices increase this cost. Existing …

Exploring automatic, online failure recovery for scientific applications at extreme scales

M Gamell, DS Katz, H Kolla, J Chen… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org
Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …

SOTER: Guarding Black-box Inference for General Neural Networks at the Edge

T Shen, J Qi, J Jiang, X Wang, S Wen, X Chen… - 2022 USENIX Annual …, 2022 - usenix.org
The prosperity of AI and edge computing has pushed more and more well-trained DNN
models to be deployed on third-party edge devices to compose mission-critical applications …

Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency

R Venkatagiri, A Mahmoud, SKS Hari… - 2016 49th Annual …, 2016 - ieeexplore.ieee.org
Approximate computing environments trade off computational accuracy for improvements in
performance, energy, and resiliency cost. For widespread adoption of approximate …

Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to file-system faults

A Ganesan, R Alagappan, AC Arpaci-Dusseau… - ACM Transactions on …, 2017 - dl.acm.org
We analyze how modern distributed storage systems behave in the presence of file-system
faults such as data corruption and read and write errors. We characterize eight popular …