A survey of resource-efficient LLM and multimodal foundation models
Large foundation models, including large language models (LLMs), vision transformers
(ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine …
Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances
Deep neural networks (DNNs) are becoming progressively large and costly to train. This
paper aims to reduce DNN training costs by leveraging preemptible instances on modern …
Unicron: Economizing self-healing LLM training at scale
Training large-scale language models is increasingly critical in various domains, but it is
hindered by frequent failures, leading to significant time and economic costs. Current failure …
Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures
T Gupta, S Krishnan, R Kumar, A Vijeev… - Proceedings of the …, 2024 - dl.acm.org
Deep Learning training jobs process large amounts of training data using many GPU
devices, often running for weeks or months. When hardware or software failures happen …
SlipStream: Adapting Pipelines for Distributed Training of Large DNNs Amid Failures
Training large Deep Neural Network (DNN) models requires thousands of GPUs for days or
weeks at a time. At these scales, failures are frequent and can have a big impact on training …
Token-wise Influential Training Data Retrieval for Large Language Models
Given a Large Language Model (LLM) generation, how can we identify which training data
led to this generation? In this paper, we proposed RapidIn, a scalable framework adapting to …
TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections
Deep learning (DL) jobs use multi-dimensional parallelism, i.e., they combine data, model,
and pipeline parallelism, to use large GPU clusters efficiently. This couples jobs tightly to a …
PARALLELGPUOS: A Concurrent OS-level GPU Checkpoint and Restore System using Validated Speculation
Checkpointing (C) and restoring (R) are key components for GPU tasks. POS is an OS-level
GPU C/R system: It can transparently checkpoint or restore processes that use the GPU …
DeepVM: Integrating Spot and On-Demand VMs for Cost-Efficient Deep Learning Clusters in the Cloud
Distributed Deep Learning (DDL), as a paradigm, dictates the use of GPU-based clusters as
the optimal infrastructure for training large-scale Deep Neural Networks (DNNs). However …
Building efficient and practical machine learning systems
Q Hu - 2023 - dr.ntu.edu.sg
With the widespread adoption of deep learning (DL) applications in recent years, training DL
models has become increasingly prevalent. Nevertheless, training these models is typically …