Analyzing and mitigating data stalls in DNN training
Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While
prior research has explored many different ways of reducing DNN training time, the impact of …
Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators
(DSA) are used to train increasingly-complex deep learning models. These clusters rely on a …
Galvatron: Efficient transformer training over multiple gpus using automatic parallelism
Transformer models have achieved state-of-the-art performance in various application domains
and are gradually becoming the foundation of advanced large deep learning …
VolcanoML: speeding up end-to-end AutoML via scalable search space decomposition
End-to-end AutoML has attracted intense interest from both academia and industry; it
automatically searches for ML pipelines in a space induced by feature engineering …
Sliceline: Fast, linear-algebra-based slice finding for ml model debugging
S Sagadeeva, M Boehm - … of the 2021 international conference on …, 2021 - dl.acm.org
Slice finding---a recent work on debugging machine learning (ML) models---aims to find the
top-K data slices (e.g., conjunctions of predicates such as gender=female and degree=PhD) …
Autofreeze: Automatically freezing model blocks to accelerate fine-tuning
With the rapid adoption of machine learning (ML), a number of domains now use the
approach of fine-tuning models that were pre-trained on a large corpus of data. However …
Distributed deep learning on data systems: a comparative analysis of approaches
Deep learning (DL) is growing in popularity for many data analytics applications, including
among enterprises. Large business-critical datasets in such settings typically reside in …
Cerebro: A layered data platform for scalable deep learning
Deep learning (DL) is gaining popularity across many domains thanks to tools such as
TensorFlow and easier access to GPUs. But building large-scale DL applications is still too …
Hyper-tune: Towards efficient hyper-parameter tuning at scale
The ever-growing demand and complexity of machine learning are putting pressure on
hyper-parameter tuning systems: while the evaluation cost of models continues to increase …
Comet: a novel memory-efficient deep learning training framework by using error-bounded lossy compression
Training wide and deep neural networks (DNNs) requires large amounts of storage resources,
such as memory, because the intermediate activation data must be saved in memory …