Toward exascale resilience: 2014 update
F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …
The landscape of exascale research: A data-driven literature analysis
S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …
RowPress: Amplifying read disturbance in modern DRAM chips
Memory isolation is critical for system reliability, security, and safety. Unfortunately, read
disturbance can break memory isolation in modern DRAM chips. For example, RowHammer …
Silent data corruptions at scale
Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure
services. SDCs are not captured by error reporting mechanisms within a Central Processing …
Understanding silent data corruptions in a large production CPU population
Silent Data Corruption (SDC) in processors can lead to various application-level issues,
such as incorrect calculations and even data loss. Since traditional techniques are not …
Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory
Memory devices represent a key component of datacenter total cost of ownership (TCO),
and techniques used to reduce errors that occur on these devices increase this cost. Existing …
Exploring automatic, online failure recovery for scientific applications at extreme scales
Application resilience is a key challenge that must be addressed in order to realize the
exascale vision. Process/node failures, an important class of failures, are typically handled …
SOTER: Guarding Black-box Inference for General Neural Networks at the Edge
The prosperity of AI and edge computing has pushed more and more well-trained DNN
models to be deployed on third-party edge devices to compose mission-critical applications …
Approxilyzer: Towards a systematic framework for instruction-level approximate computing and its application to hardware resiliency
Approximate computing environments trade off computational accuracy for improvements in
performance, energy, and resiliency cost. For widespread adoption of approximate …
Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to file-system faults
We analyze how modern distributed storage systems behave in the presence of file-system
faults such as data corruption and read and write errors. We characterize eight popular …