Analyzing and mitigating data stalls in DNN training

J Mohan, A Phanishayee, A Raniwala… - arXiv preprint arXiv …, 2020 - arxiv.org
Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While
prior research has explored many different ways of reducing DNN training time, the impact of …
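
Data stalls refer to accelerator time lost waiting on the input pipeline (fetch and pre-processing). As a hedged illustration of the common mitigation, overlapping input preparation with training compute, rather than of this paper's specific analysis, here is a minimal PyTorch sketch; RandomImages is a hypothetical stand-in dataset:

```python
# Minimal sketch (not this paper's system): hiding data-stall time by
# preparing input batches in parallel worker processes so the GPU is
# not left idle waiting on fetch/pre-processing.
import torch
from torch.utils.data import DataLoader, Dataset

class RandomImages(Dataset):  # hypothetical stand-in dataset
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        # CPU-side decode/augmentation work; serialized, this stalls the GPU
        return torch.randn(3, 224, 224), idx % 10

loader = DataLoader(
    RandomImages(),
    batch_size=64,
    num_workers=4,      # prefetch batches in parallel processes
    pin_memory=True,    # enables faster asynchronous host-to-device copies
    prefetch_factor=2,  # batches each worker keeps ready in advance
)

if __name__ == "__main__":  # needed for multiprocessing workers on spawn platforms
    for images, labels in loader:
        pass  # the training step would run here, overlapped with prefetching
```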

Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product

M Zhao, N Agarwal, A Basant, B Gedik, S Pan… - Proceedings of the 49th …, 2022 - dl.acm.org
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSAs) are used to train increasingly complex deep learning models. These clusters rely on a …

Galvatron: Efficient transformer training over multiple gpus using automatic parallelism

X Miao, Y Wang, Y Jiang, C Shi, X Nie, H Zhang… - arXiv preprint arXiv …, 2022 - arxiv.org
Transformer models have achieved state-of-the-art performance across various application domains and have gradually become the foundation of advanced large deep learning …

VolcanoML: speeding up end-to-end AutoML via scalable search space decomposition

Y Li, Y Shen, W Zhang, C Zhang, B Cui - The VLDB Journal, 2023 - Springer
End-to-end AutoML has attracted intensive interest from both academia and industry; it automatically searches for ML pipelines in a space induced by feature engineering …

Sliceline: Fast, linear-algebra-based slice finding for ml model debugging

S Sagadeeva, M Boehm - … of the 2021 international conference on …, 2021 - dl.acm.org
Slice finding---recent work on debugging machine learning (ML) models---aims to find the
top-K data slices (e.g., conjunctions of predicates such as gender=female and degree=PhD) …
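
To make the notion of a slice concrete, here is a toy sketch, not SliceLine's linear-algebra-based enumeration: it scores two-predicate conjunctive slices by how much per-example model error they concentrate relative to the overall average. The DataFrame columns and the size-weighted scoring heuristic are illustrative assumptions.

```python
# Toy illustration of slice finding (not SliceLine's algorithm): rank
# conjunctions of predicates by how much model error they concentrate.
import itertools
import pandas as pd

# Hypothetical evaluation table: one row per example, with categorical
# features and a per-example error from some trained model.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "degree": ["PhD", "BSc", "PhD", "PhD", "BSc"],
    "error":  [0.9, 0.1, 0.8, 0.2, 0.3],
})

avg_error = df["error"].mean()
slices = []
for col1, col2 in itertools.combinations(["gender", "degree"], 2):
    for v1, v2 in itertools.product(df[col1].unique(), df[col2].unique()):
        mask = (df[col1] == v1) & (df[col2] == v2)
        if mask.sum() == 0:
            continue
        slice_error = df.loc[mask, "error"].mean()
        # crude score: error lift over the average, weighted by slice size
        score = (slice_error - avg_error) * mask.sum()
        slices.append(((f"{col1}={v1}", f"{col2}={v2}"), score))

top_k = sorted(slices, key=lambda s: s[1], reverse=True)[:3]
print(top_k)  # here, gender=female AND degree=PhD scores highest
```

SliceLine's contribution is doing this enumeration and pruning at scale via sparse linear algebra rather than the brute-force loop shown here.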

Autofreeze: Automatically freezing model blocks to accelerate fine-tuning

Y Liu, S Agarwal, S Venkataraman - arXiv preprint arXiv:2102.01386, 2021 - arxiv.org
With the rapid adoption of machine learning (ML), a number of domains now use the
approach of fine-tuning models that were pre-trained on a large corpus of data. However …
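
The mechanism behind block freezing is simple to show, though AutoFreeze's contribution is deciding automatically which blocks to freeze and when. A minimal PyTorch sketch, with a hypothetical stand-in backbone and a freeze_first_n helper introduced here for illustration:

```python
# Sketch of the underlying mechanism (manual freezing; AutoFreeze's
# contribution is choosing which blocks to freeze, and when, automatically).
import torch
import torch.nn as nn

model = nn.Sequential(   # hypothetical stand-in for a pre-trained backbone
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),  # task head, always trained
)

def freeze_first_n(model: nn.Sequential, n: int) -> None:
    """Disable gradient computation for the first n submodules."""
    for block in list(model.children())[:n]:
        for p in block.parameters():
            p.requires_grad = False

freeze_first_n(model, 2)  # freeze the first Linear+ReLU pair

# Hand only the still-trainable parameters to the optimizer.
trainable = (p for p in model.parameters() if p.requires_grad)
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```

Skipping gradient computation for frozen blocks is what yields the fine-tuning speedup, since their backward passes and optimizer updates are avoided.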

Distributed deep learning on data systems: a comparative analysis of approaches

Y Zhang, F Mcquillan, N Jayaram, N Kak… - Proceedings of the …, 2021 - par.nsf.gov
Deep learning (DL) is growing in popularity for many data analytics applications, including
among enterprises. Large business-critical datasets in such settings typically reside in …

Cerebro: A layered data platform for scalable deep learning

A Kumar, S Nakandala, Y Zhang, S Li… - … Annual Conference on …, 2021 - par.nsf.gov
Deep learning (DL) is gaining popularity across many domains thanks to tools such as
TensorFlow and easier access to GPUs. But building large-scale DL applications is still too …

Hyper-tune: Towards efficient hyper-parameter tuning at scale

Y Li, Y Shen, H Jiang, W Zhang, J Li, J Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
The ever-growing demand and complexity of machine learning are putting pressure on
hyper-parameter tuning systems: while the evaluation cost of models continues to increase …

Comet: a novel memory-efficient deep learning training framework by using error-bounded lossy compression

S Jin, C Zhang, X Jiang, Y Feng, H Guan, G Li… - arXiv preprint arXiv …, 2021 - arxiv.org
Training wide and deep neural networks (DNNs) requires large amounts of storage resources
such as memory because the intermediate activation data must be saved in memory …
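
As a hedged illustration of what "error-bounded lossy compression" of activations means, not of COMET's actual compressor, the sketch below uses uniform quantization that guarantees a pointwise absolute error bound on reconstruction:

```python
# Toy error-bounded lossy compressor (not COMET's scheme): uniform
# quantization guaranteeing |x - decompress(compress(x))| <= error_bound
# pointwise, so activations can be stored compactly between the forward
# and backward passes.
import numpy as np

def compress(x: np.ndarray, error_bound: float) -> np.ndarray:
    # Round to the nearest integer multiple of 2*error_bound; nearest-
    # multiple rounding keeps the reconstruction error within the bound.
    return np.round(x / (2 * error_bound)).astype(np.int16)

def decompress(codes: np.ndarray, error_bound: float) -> np.ndarray:
    return codes.astype(np.float32) * (2 * error_bound)

activations = np.random.randn(4, 1024).astype(np.float32)
codes = compress(activations, error_bound=1e-2)
restored = decompress(codes, error_bound=1e-2)
assert np.max(np.abs(activations - restored)) <= 1e-2 + 1e-12  # bound holds
```

A real compressor would follow quantization with entropy coding for much higher ratios; the point here is only the pointwise error guarantee that makes reusing the restored activations in the backward pass tolerable.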