Fastflow: Accelerating deep learning model training with smart offloading of input data pipeline
When training a deep learning (DL) model, input data are pre-processed on CPUs and
transformed into tensors, which are then fed into GPUs for gradient computations of model …
Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement
D Graur, O Mraz, M Li, S Pourghannad… - 2024 USENIX Annual …, 2024 - usenix.org
Input data preprocessing is a common bottleneck in machine learning (ML) jobs that can
significantly increase training time and cost as expensive GPUs or TPUs idle waiting for …
Accelerating CPU-based distributed DNN training on modern HPC clusters using bluefield-2 DPUs
The Deep Learning (DL) training process consists of multiple phases—data augmentation,
training, and validation of the trained model. Traditionally, these phases are executed either …
Optimizing distributed dnn training using cpus and bluefield-2 dpus
The deep learning (DL) training process consists of multiple phases—data augmentation,
training, and validation of the trained model. Traditionally, these phases are executed either …
A high-performance dataflow-centric optimization framework for deep learning inference on the edge
R Zhang, H Jiang, J Geng, F Tian, Y Ma… - Journal of Systems …, 2024 - Elsevier
Edge computing has been emerging as a popular scenario for model inference. However,
the inference performance on edge devices (e.g., Multi-Core DSP, FPGA, etc.) suffers from …
PrimePar: Efficient Spatial-temporal Tensor Partitioning for Large Transformer Model Training
H Wang, L Wang, H Xu, Y Wang, Y Li… - Proceedings of the 29th …, 2024 - dl.acm.org
With the rapid up-scaling of transformer-based large language models (LLM), training these
models is becoming increasingly demanding on novel parallel training techniques. Tensor …
Deep Neural Network Augmented Wireless Channel Estimation for Preamble-based OFDM PHY on Zynq System on Chip
Reliable and fast channel estimation is crucial for next-generation wireless networks
supporting a wide range of vehicular and low-latency services. Recently, deep learning (DL) …
TensorCV: Accelerating Inference-Adjacent Computation Using Tensor Processors
The advancements in AI/ML accelerators have made the core AI/ML computation relatively
insignificant in application pipelines. For example, inferencing only accounts for 3% of the …
Accurate Deep Learning Inference Latency Prediction over Dynamic Running Mobile Devices
J Fan, J Hou, X Li - … on Mobility, Sensing and Networking (MSN), 2023 - ieeexplore.ieee.org
With the increasing number of deep learning applications, the optimization of deep learning
model performance has become a central focus of research. One of the critical indicators is …