High-level synthesis hardware design for FPGA-based accelerators: Models, methodologies, and frameworks
Hardware accelerators based on field programmable gate array (FPGA) and system on chip
(SoC) devices have gained attention in recent years. One of the main reasons is that these …
Distributed and deep vertical federated learning with big data
In recent years, data have typically been distributed across multiple organizations, while data
security has become increasingly important. Federated learning (FL), which enables multiple parties …
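The mechanism FL frameworks share is easy to sketch: clients train locally on private data shards and a server aggregates the resulting weights. Below is a minimal FedAvg-style sketch in NumPy; the quadratic toy objective and the `client_update`/`federated_round` names are illustrative assumptions, not the protocol of the system described above.

```python
import numpy as np

def client_update(w, X, y, lr=0.1, local_steps=5):
    """One client's local SGD on a toy least-squares objective."""
    w = w.copy()
    for _ in range(local_steps):
        w -= lr * X.T @ (X @ w - y) / len(y)  # grad of (1/2n)||Xw - y||^2
    return w

def federated_round(global_w, clients):
    """Server averages locally trained weights, weighted by shard size (FedAvg)."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    updates = np.stack([client_update(global_w, X, y) for X, y in clients])
    return (sizes[:, None] * updates).sum(axis=0) / sizes.sum()

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []                        # each party keeps its raw data local
for _ in range(4):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
print(w)                            # approaches true_w without pooling raw data
```

Only model weights cross organization boundaries here, which is the basic privacy argument for FL.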
DLB: a dynamic load balance strategy for distributed training of deep neural networks
Synchronous strategies with data parallelism are widely used in the distributed training of
deep neural networks (DNNs), largely owing to their ease of implementation yet promising …
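For context, the synchronous data-parallel step such strategies build on looks like the sketch below; the toy linear model and uniform shards are assumptions. The implicit barrier at the gradient average is exactly where stragglers hurt, which is what a dynamic load-balance strategy like DLB targets.

```python
import numpy as np

def worker_grad(w, shard):
    """Each worker computes a gradient on its own data shard (toy linear model)."""
    X, y = shard
    return X.T @ (X @ w - y) / len(y)

def sync_step(w, shards, lr=0.1):
    """Synchronous data parallelism: all workers' gradients are averaged
    (the all-reduce), then one global update is applied. The step can only
    finish when the slowest worker finishes."""
    grads = [worker_grad(w, s) for s in shards]  # parallel in a real system
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(1)
true_w = np.array([1.0, 3.0])
shards = []
for _ in range(4):
    X = rng.normal(size=(64, 2))
    shards.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(100):
    w = sync_step(w, shards)
print(w)                            # converges toward true_w
```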
Aladdin: Asymmetric centralized training for distributed deep learning
To speed up the training of massive deep neural network (DNN) models, distributed training
has been widely studied. In general, centralized training, a type of distributed training …
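Centralized training here refers to a parameter-server architecture, where workers pull the current weights from a central node and push gradients back. A minimal synchronous sketch follows; the `ParameterServer` class and its `pull`/`push` methods are illustrative, not Aladdin's asymmetric protocol.

```python
import numpy as np

class ParameterServer:
    """Central node: holds the global weights; workers pull them and push gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()

    def push(self, grads):
        # Synchronous variant: apply the average of all workers' gradients at once.
        self.w -= self.lr * np.mean(grads, axis=0)

def worker_grad(w, shard):
    X, y = shard
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(2)
true_w = np.array([0.5, -2.0])
shards = []
for _ in range(4):
    X = rng.normal(size=(32, 2))
    shards.append((X, X @ true_w))

ps = ParameterServer(dim=2)
for _ in range(100):
    w = ps.pull()                                  # workers fetch current weights
    ps.push([worker_grad(w, s) for s in shards])   # ...and push gradients back
print(ps.w)                                        # converges toward true_w
```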
Virtualflow: Decoupling deep learning models from the underlying hardware
We propose VirtualFlow, a system leveraging a novel abstraction called virtual node
processing to decouple the model from the hardware. In each step of training or inference …
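One way to read the virtual-node idea: fix the batch semantics to a constant number of virtual nodes and map them onto however many physical devices exist, accumulating gradients so the update is hardware-independent. The sketch below is an assumption-laden illustration of that reading, not VirtualFlow's implementation.

```python
import numpy as np

def grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

def step_with_virtual_nodes(w, batch, num_virtual, num_devices, lr=0.1):
    """The global batch is always split into `num_virtual` micro-batches
    (virtual nodes). Each physical device processes its share of virtual
    nodes sequentially and accumulates gradients, so the resulting update
    does not depend on `num_devices` (illustrative sketch)."""
    X, y = batch
    micro_X = np.array_split(X, num_virtual)
    micro_y = np.array_split(y, num_virtual)
    acc = np.zeros_like(w)
    for dev in range(num_devices):                      # "parallel" devices
        for v in range(dev, num_virtual, num_devices):  # this device's virtual nodes
            acc += grad(w, micro_X[v], micro_y[v])
    return w - lr * acc / num_virtual

rng = np.random.default_rng(3)
true_w = np.array([1.0, 1.0])
X = rng.normal(size=(240, 2))
y = X @ true_w
w2 = np.zeros(2)
w8 = np.zeros(2)
for _ in range(50):
    w2 = step_with_virtual_nodes(w2, (X, y), num_virtual=8, num_devices=2)
    w8 = step_with_virtual_nodes(w8, (X, y), num_virtual=8, num_devices=8)
print(np.allclose(w2, w8))          # same trajectory on 2 or 8 "devices"
```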
FLSGD: free local SGD with parallel synchronization
Synchronous parameter algorithms with data parallelism have been successfully used to
accelerate the distributed training of deep neural networks (DNNs). However, a prevalent …
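The baseline FLSGD builds on is local SGD: each worker takes several unsynchronized steps, then all workers average their weights. A textbook sketch follows, with the sync period `H` and plain weight averaging as assumptions rather than FLSGD's "free" scheme.

```python
import numpy as np

def local_sgd(shards, H=8, rounds=30, lr=0.05):
    """Local SGD: H unsynchronized steps per worker, then a weight average.
    Communicates once per round instead of once per step, at the cost of
    some divergence between workers within a round."""
    dim = shards[0][0].shape[1]
    workers = [np.zeros(dim) for _ in shards]
    for _ in range(rounds):
        for w, (X, y) in zip(workers, shards):
            for _ in range(H):                  # local, unsynchronized steps
                w -= lr * X.T @ (X @ w - y) / len(y)
        avg = np.mean(workers, axis=0)          # periodic synchronization
        workers = [avg.copy() for _ in workers]
    return workers[0]

rng = np.random.default_rng(4)
true_w = np.array([2.0, 2.0])
shards = []
for _ in range(4):
    X = rng.normal(size=(40, 2))
    shards.append((X, X @ true_w))
print(local_sgd(shards))            # converges toward true_w
```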
SHAT: A Novel Asynchronous Training Algorithm That Provides Fast Model Convergence in Distributed Deep Learning
Y Ko, SW Kim - Applied Sciences, 2021
The recent unprecedented success of deep learning (DL) in various fields is underpinned by its
use of large-scale data and models. Training a large-scale deep neural network (DNN) …
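The asynchronous baseline that such algorithms improve on is simple to sketch: the server applies each gradient as soon as it arrives, so an update may be computed from stale weights. Below is a toy simulation of that staleness; `max_delay` and the sampling scheme are illustrative, not SHAT's design.

```python
import numpy as np

def async_sgd(shards, steps=400, max_delay=3, lr=0.05, seed=5):
    """Asynchronous SGD: each gradient may be computed at weights that are
    up to `max_delay` versions old, trading consistency for the removal of
    the synchronization barrier."""
    rng = np.random.default_rng(seed)
    dim = shards[0][0].shape[1]
    history = [np.zeros(dim)]                  # all past weight versions
    for _ in range(steps):
        X, y = shards[rng.integers(len(shards))]
        delay = rng.integers(min(max_delay, len(history)))
        stale = history[-1 - delay]            # gradient at a stale snapshot
        grad = X.T @ (X @ stale - y) / len(y)
        history.append(history[-1] - lr * grad)
    return history[-1]

rng = np.random.default_rng(6)
true_w = np.array([1.0, -1.0])
shards = []
for _ in range(4):
    X = rng.normal(size=(40, 2))
    shards.append((X, X @ true_w))
print(async_sgd(shards))            # converges despite stale gradients
```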
Shuffle Private Decentralized Convex Optimization
L Zhang, H Zhang - IEEE Transactions on Information …, 2024
In this paper, we consider the distributed local stochastic gradient descent (SGD) algorithm,
which parallelizes multiple devices in the setting of stochastic convex optimization (SCO). The …
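For reference, the distributed local SGD iteration analyzed in such SCO settings takes the standard form below, with M devices, step size \eta, and synchronization period H; the notation is assumed, not copied from the paper.

```latex
% Between synchronization points, each device m \in \{1,\dots,M\} runs local SGD:
x^{(m)}_{t+1} = x^{(m)}_t - \eta\,\nabla f\!\left(x^{(m)}_t;\,\xi^{(m)}_t\right)
% Every H steps the iterates are averaged and broadcast back:
\bar{x}_{t+1} = \frac{1}{M}\sum_{m=1}^{M} x^{(m)}_{t+1},
\qquad x^{(m)}_{t+1} \leftarrow \bar{x}_{t+1}
\quad \text{whenever } (t+1) \bmod H = 0
```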
LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster
F Yao, Z Zhang, Z Ji, B Liu, H Gao - The Journal of Supercomputing, 2024
As the cost of deep learning training increases, using heterogeneous GPU clusters is a
reasonable way to scale cluster resources to support distributed deep learning (DDL) tasks …
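The core of load-balanced batching is to give each GPU a share of the global batch proportional to its measured throughput, so per-step times roughly equalize across a heterogeneous cluster. A minimal sketch; the throughput numbers and rounding rule are illustrative, not LBB's actual formulation.

```python
def balanced_batch_sizes(global_batch, throughputs):
    """Split a global batch across heterogeneous GPUs in proportion to their
    measured samples/sec, so faster cards get more work and all workers
    finish a step at about the same time."""
    total = sum(throughputs)
    sizes = [int(global_batch * t / total) for t in throughputs]
    sizes[0] += global_batch - sum(sizes)  # push rounding leftovers to one worker
    return sizes

# e.g. one fast GPU and three slower cards (made-up throughputs, samples/sec)
print(balanced_batch_sizes(512, [900, 300, 300, 300]))  # -> [257, 85, 85, 85]
```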
ZipLine: an optimized algorithm for the elastic bulk synchronous parallel model
The bulk synchronous parallel (BSP) model is a celebrated synchronization scheme for
general-purpose parallel computing that has been successfully employed for the distributed training of …
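A BSP computation alternates local compute, communication, and a global barrier, and no worker enters superstep s+1 before all workers finish superstep s. A minimal thread-based sketch in which the `Barrier` is the whole model; the doubling and min reduction are arbitrary placeholder work.

```python
import threading

def bsp_worker(wid, data, results, barrier, supersteps=3):
    """One BSP worker: compute locally, publish, then hit the global barrier."""
    value = data[wid]
    for _ in range(supersteps):
        value = value * 2            # local computation phase
        results[wid] = value         # "communication": publish to shared memory
        barrier.wait()               # barrier 1: all results are published
        value = min(results)         # read phase (toy global reduction)
        barrier.wait()               # barrier 2: all reads done before next writes

data = [3, 5, 7, 9]
results = list(data)
barrier = threading.Barrier(len(data))
threads = [threading.Thread(target=bsp_worker, args=(i, data, results, barrier))
           for i in range(len(data))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)                       # all workers agree: [24, 24, 24, 24]
```

The two barriers per superstep are the standard way to keep the read phase of one superstep from racing with the write phase of the next; elastic variants like ZipLine relax exactly this rigid global barrier.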