Micro-Architectural features as soft-error markers in embedded safety-critical systems: preliminary study

D Kasap, A Carpegna, A Savino… - 2023 IEEE European …, 2023 - ieeexplore.ieee.org
Radiation-induced soft errors are one of the most challenging issues in Safety Critical Real-
Time Embedded System (SACRES) reliability, usually handled using different flavors of …

Reproducibility, Replicability, and Repeatability: A survey of reproducible research with a focus on high performance computing

BA Antunes, DRC Hill - arXiv preprint arXiv:2402.07530, 2024 - arxiv.org
Reproducibility is widely acknowledged as a fundamental principle in scientific research.
Currently, the scientific community grapples with numerous challenges associated with …

Vulnerability analysis of instructions for SDC-causing error detection

J Gu, W Zheng, Y Zhuang, Q Zhang - IEEE Access, 2019 - ieeexplore.ieee.org
Due to the centralization of communication in the management of data generated by diverse
Internet of Things (IoT) devices, there is a lack of reliability when data is being transferred and …

Multi-bit data flow error detection method based on SDC vulnerability analysis

Z Yan, Y Zhuang, W Zheng, J Gu - ACM Transactions on Embedded …, 2023 - dl.acm.org
One of the most difficult data flow errors to detect caused by single-event upsets in space
radiation is the Silent Data Corruption (SDC). To solve the problem of multi-bit upsets …

Response of HPC hardware to neutron radiation at the dawn of exascale

A Bustos, AJ Rubio-Montero, R Méndez… - The Journal of …, 2023 - Springer
Every computation presents a small chance that an unexpected phenomenon ruins or
modifies its output. Computers are prone to errors that, although may be very unlikely, are …

Towards end-to-end SDC detection for HPC applications equipped with lossy compression

S Li, S Di, K Zhao, X Liang, Z Chen… - … Conference on Cluster …, 2020 - ieeexplore.ieee.org
Data reduction techniques have been widely demanded and used by large-scale high
performance computing (HPC) applications because of vast volumes of data to be produced …

Anomaly detection in scientific datasets using sparse representation

A Moon, M Kim, J Chen, SW Son - Proceedings of the First Workshop on …, 2023 - dl.acm.org
As the size and complexity of high-performance computing (HPC) systems keep growing,
scientists' ability to trust the data produced is paramount due to potential data corruption for …

Exploring the effects of silent data corruption in distributed deep learning training

E Rojas, D Pérez, E Meneses - 2022 IEEE 34th International …, 2022 - ieeexplore.ieee.org
The profound impact of recent developments in artificial intelligence is unquestionable. The
applications of deep learning models are everywhere, from advanced natural language …

A characterization of soft-error sensitivity in data-parallel and model-parallel distributed deep learning

E Rojas, D Pérez, E Meneses - Journal of Parallel and Distributed …, 2024 - Elsevier
The latest advances in artificial intelligence deep learning models are unprecedented. A
wide spectrum of application areas is now thriving thanks to available massive training …

Silent data corruption estimation and mitigation without fault injection

M Yakhchi, M Fazeli, SA Asghari - IEEE Canadian Journal of …, 2022 - ieeexplore.ieee.org
Silent data corruptions (SDCs) have always been regarded as a serious effect of radiation-
induced faults. Traditional solutions based on redundancies are very expensive in terms of …