FastFlow: Accelerating deep learning model training with smart offloading of input data pipeline

T Um, B Oh, B Seo, M Kweun, G Kim… - Proceedings of the VLDB …, 2023 - dl.acm.org
When training a deep learning (DL) model, input data are pre-processed on CPUs and
transformed into tensors, which are then fed into GPUs for gradient computations of model …

A roadmap for big model

S Yuan, H Zhao, S Zhao, J Leng, Y Liang… - arXiv preprint arXiv …, 2022 - arxiv.org
With the rapid development of deep learning, training Big Models (BMs) for multiple
downstream tasks becomes a popular paradigm. Researchers have achieved various …

Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement

D Graur, O Mraz, M Li, S Pourghannad… - 2024 USENIX Annual …, 2024 - usenix.org
Input data preprocessing is a common bottleneck in machine learning (ML) jobs that can
significantly increase training time and cost as expensive GPUs or TPUs idle waiting for …

Accelerating CPU-based distributed DNN training on modern HPC clusters using BlueField-2 DPUs

A Jain, N Alnaasan, A Shafi… - … IEEE Symposium on …, 2021 - ieeexplore.ieee.org
The Deep Learning (DL) training process consists of multiple phases—data augmentation,
training, and validation of the trained model. Traditionally, these phases are executed either …

Optimizing distributed DNN training using CPUs and BlueField-2 DPUs

A Jain, N Alnaasan, A Shafi, H Subramoni… - IEEE Micro, 2021 - ieeexplore.ieee.org
The deep learning (DL) training process consists of multiple phases—data augmentation,
training, and validation of the trained model. Traditionally, these phases are executed either …

A high-performance dataflow-centric optimization framework for deep learning inference on the edge

R Zhang, H Jiang, J Geng, F Tian, Y Ma… - Journal of Systems …, 2024 - Elsevier
Edge computing has been emerging as a popular scenario for model inference. However,
the inference performance on edge devices (e.g., multi-core DSP, FPGA, etc.) suffers from …

PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training

H Wang, L Wang, H Xu, Y Wang, Y Li… - Proceedings of the 29th …, 2024 - dl.acm.org
With the rapid up-scaling of transformer-based large language models (LLM), training these
models is becoming increasingly demanding on novel parallel training techniques. Tensor …

Deep Neural Network Augmented Wireless Channel Estimation for Preamble-based OFDM PHY on Zynq System on Chip

AK Gizzini, S Shrey, SJ Darak, S Saurabh… - arXiv preprint arXiv …, 2022 - arxiv.org
Reliable and fast channel estimation is crucial for next-generation wireless networks
supporting a wide range of vehicular and low-latency services. Recently, deep learning (DL) …

TensorCV: Accelerating Inference-Adjacent Computation Using Tensor Processors

D Ha, WW Ro, HW Tseng - 2023 IEEE/ACM International …, 2023 - ieeexplore.ieee.org
The advancements in AI/ML accelerators have made the core AI/ML computation relatively
insignificant in application pipelines. For example, inferencing only accounts for 3% of the …

Accurate Deep Learning Inference Latency Prediction over Dynamic Running Mobile Devices

J Fan, J Hou, X Li - … on Mobility, Sensing and Networking (MSN), 2023 - ieeexplore.ieee.org
With the increasing number of deep learning applications, the optimization of deep learning
model performance has become a central focus of research. One of the critical indicators is …