A hitchhiker's guide on distributed training of deep neural networks
Deep learning has led to tremendous advancements in the field of Artificial Intelligence. One
caveat, however, is the substantial amount of compute needed to train these deep learning …
A study of checkpointing in large scale training of deep neural networks
Deep learning (DL) applications are increasingly being deployed on HPC systems, to
leverage the massive parallelism and computing power of those systems for DL model …
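As a generic illustration of the checkpointing idea this entry refers to (a minimal sketch, not the system the paper evaluates), periodic saving and restoring of training state in PyTorch could look roughly like this; model, optimizer, step, and the file path are assumed placeholders:

import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist everything needed to resume: weights, optimizer state, progress.
    torch.save({
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path="checkpoint.pt"):
    # Restore training state after a failure or preemption and return the
    # step at which to resume.
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    return ckpt["step"]

In large-scale runs, such save calls are typically issued every N steps or minutes, trading checkpoint overhead against the amount of work lost on a failure.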
TRANSOM: An efficient fault-tolerant system for training LLMs
Large language models (LLMs), represented by ChatGPT, have achieved profound breakthroughs and applications in various fields. This demonstrates that LLMs with …
Deep learning reproducibility and explainable AI (XAI)
AM Leventi-Peetz, T Östreich - arXiv preprint arXiv:2202.11452, 2022 - arxiv.org
The nondeterminism of Deep Learning (DL) training algorithms and its influence on the
explainability of neural network (NN) models are investigated in this work with the help of …
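To make the nondeterminism this entry discusses concrete, a common mitigation (a sketch of standard practice, not the authors' experimental setup) is to pin random seeds and request deterministic kernels, for example in PyTorch:

import random
import numpy as np
import torch

def set_deterministic(seed=0):
    # Fix the common sources of randomness in a single-process run.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Raise an error if an op without a deterministic implementation is used.
    torch.use_deterministic_algorithms(True)

Even with these settings, full reproducibility can still depend on hardware, library versions, and parallel execution order.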
Scaling distributed deep learning workloads beyond the memory capacity with KARMA
The dedicated memory of hardware accelerators can be insufficient to store all weights
and/or intermediate states of large deep learning models. Although model parallelism is a …
FfDL: A flexible multi-tenant deep learning platform
Deep learning (DL) is becoming increasingly popular in several application domains and
has enabled several new application features involving computer vision, speech recognition …
Edge Intelligence with Distributed Processing of DNNs: A Survey.
S Tang, M Cui, L Qi, X Xu - CMES-Computer Modeling in …, 2023 - search.ebscohost.com
With the rapid development of deep learning, the sizes of data sets and deep neural network
(DNN) models are also booming. As a result, the intolerably long time for model training …
Fault tolerance in distributed systems using deep learning approaches
B Assiri, A Sheneamer - PloS one, 2025 - journals.plos.org
Recently, distributed systems have become the backbone of technological development. They
serve as the foundation for emerging technologies such as blockchain, the internet of …
Model and system robustness in distributed CNN inference at the edge
Prevalent large CNN models pose a significant challenge in terms of computing resources
for resource-constrained devices at the Edge. Distributing the computations and coefficients …
LiveTune: Dynamic Parameter Tuning for Training Deep Neural Networks
Traditional machine learning training is a static process that lacks real-time adaptability of
hyperparameters. Popular tuning solutions during runtime involve checkpoints and …
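As a loose sketch of the kind of runtime hyperparameter adjustment this entry contrasts with checkpoint-and-restart (illustrative only, not the LiveTune implementation), a training loop could re-read a value such as the learning rate from a side channel; the file lr.txt here is a hypothetical example:

import os

def maybe_update_lr(optimizer, path="lr.txt"):
    # Re-read the learning rate from disk and apply it to all parameter groups,
    # so an operator can edit the file while training keeps running.
    if os.path.exists(path):
        with open(path) as f:
            new_lr = float(f.read().strip())
        for group in optimizer.param_groups:
            group["lr"] = new_lr

Calling maybe_update_lr(optimizer) every few steps lets a change take effect immediately, without stopping the job or reloading a checkpoint.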