A hitchhiker's guide on distributed training of deep neural networks

KS Chahal, MS Grover, K Dey, RR Shah - Journal of Parallel and …, 2020 - Elsevier
Deep learning has led to tremendous advancements in the field of Artificial Intelligence. One
caveat, however, is the substantial amount of compute needed to train these deep learning …

A study of checkpointing in large scale training of deep neural networks

E Rojas, AN Kahira, E Meneses, LB Gomez… - arXiv preprint arXiv …, 2020 - arxiv.org
Deep learning (DL) applications are increasingly being deployed on HPC systems, to
leverage the massive parallelism and computing power of those systems for DL model …

TRANSOM: An efficient fault-tolerant system for training LLMs

B Wu, L Xia, Q Li, K Li, X Chen, Y Guo, T Xiang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) represented by ChatGPT have achieved profound
applications and breakthroughs in various fields. This demonstrates that LLMs with …

Deep learning reproducibility and explainable AI (XAI)

AM Leventi-Peetz, T Östreich - arXiv preprint arXiv:2202.11452, 2022 - arxiv.org
The nondeterminism of Deep Learning (DL) training algorithms and its influence on the
explainability of neural network (NN) models are investigated in this work with the help of …

Scaling distributed deep learning workloads beyond the memory capacity with KARMA

M Wahib, H Zhang, TT Nguyen, A Drozd… - … Conference for High …, 2020 - ieeexplore.ieee.org
The dedicated memory of hardware accelerators can be insufficient to store all weights
and/or intermediate states of large deep learning models. Although model parallelism is a …

FfDL: A flexible multi-tenant deep learning platform

KR Jayaram, V Muthusamy, P Dube… - Proceedings of the 20th …, 2019 - dl.acm.org
Deep learning (DL) is becoming increasingly popular in several application domains and
has enabled several new application features involving computer vision, speech recognition …

Edge Intelligence with Distributed Processing of DNNs: A Survey.

S Tang, M Cui, L Qi, X Xu - CMES-Computer Modeling in …, 2023 - search.ebscohost.com
With the rapid development of deep learning, the sizes of data sets and deep neural network
(DNN) models are also booming. As a result, the intolerably long time for model training …

Fault tolerance in distributed systems using deep learning approaches

B Assiri, A Sheneamer - PloS one, 2025 - journals.plos.org
Recently, distributed systems have become the backbone of technological development. They
serve as the foundation for emerging technologies such as blockchain, the internet of …

[HTML][HTML] Model and system robustness in distributed CNN inference at the edge

X Guo, Q Jiang, AD Pimentel, T Stefanov - Integration, 2025 - Elsevier
Prevalent large CNN models pose a significant challenge in terms of computing resources
for resource-constrained devices at the Edge. Distributing the computations and coefficients …

LiveTune: Dynamic Parameter Tuning for Training Deep Neural Networks

SZ Shabgahi, N Sheybani, A Tabrizi… - arXiv preprint arXiv …, 2023 - arxiv.org
Traditional machine learning training is a static process that lacks real-time adaptability of
hyperparameters. Popular tuning solutions during runtime involve checkpoints and …