A hitchhiker's guide on distributed training of deep neural networks

KS Chahal, MS Grover, K Dey, RR Shah - Journal of Parallel and …, 2020 - Elsevier
Deep learning has led to tremendous advancements in the field of Artificial Intelligence. One
caveat, however, is the substantial amount of compute needed to train these deep learning …

A study of checkpointing in large scale training of deep neural networks

E Rojas, AN Kahira, E Meneses, LB Gomez… - arXiv preprint arXiv …, 2020 - arxiv.org
Deep learning (DL) applications are increasingly being deployed on HPC systems, to
leverage the massive parallelism and computing power of those systems for DL model …

TRANSOM: An efficient fault-tolerant system for training LLMs

B Wu, L Xia, Q Li, K Li, X Chen, Y Guo, T Xiang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) represented by ChatGPT have achieved profound
applications and breakthroughs in various fields. This demonstrates that LLMs with …

Deep learning reproducibility and explainable AI (XAI)

AM Leventi-Peetz, T Östreich - arXiv preprint arXiv:2202.11452, 2022 - arxiv.org
The nondeterminism of Deep Learning (DL) training algorithms and its influence on the
explainability of neural network (NN) models are investigated in this work with the help of …

Scaling distributed deep learning workloads beyond the memory capacity with KARMA

M Wahib, H Zhang, TT Nguyen, A Drozd… - … Conference for High …, 2020 - ieeexplore.ieee.org
The dedicated memory of hardware accelerators can be insufficient to store all weights
and/or intermediate states of large deep learning models. Although model parallelism is a …

FfDL: A flexible multi-tenant deep learning platform

KR Jayaram, V Muthusamy, P Dube… - Proceedings of the 20th …, 2019 - dl.acm.org
Deep learning (DL) is becoming increasingly popular in several application domains and
has enabled several new application features involving computer vision, speech recognition …

Edge Intelligence with Distributed Processing of DNNs: A Survey.

S Tang, M Cui, L Qi, X Xu - CMES-Computer Modeling in …, 2023 - search.ebscohost.com
With the rapid development of deep learning, the sizes of data sets and deep neural network
(DNN) models are also booming. As a result, the intolerably long time for model training …

Fault tolerance in distributed systems using deep learning approaches

B Assiri, A Sheneamer - PloS one, 2025 - journals.plos.org
Recently, distributed systems have become the backbone of technological development. They
serve as the foundation for emerging technologies such as blockchain, the internet of …

[HTML][HTML] Model and system robustness in distributed CNN inference at the edge

X Guo, Q Jiang, AD Pimentel, T Stefanov - Integration, 2025 - Elsevier
Prevalent large CNN models pose a significant challenge in terms of computing resources
for resource-constrained devices at the Edge. Distributing the computations and coefficients …

LiveTune: Dynamic Parameter Tuning for Training Deep Neural Networks

SZ Shabgahi, N Sheybani, A Tabrizi… - arXiv preprint arXiv …, 2023 - arxiv.org
Traditional machine learning training is a static process that lacks real-time adaptability of
hyperparameters. Popular tuning solutions during runtime involve checkpoints and …