High-level synthesis hardware design for FPGA-based accelerators: Models, methodologies, and frameworks

RS Molina, V Gil-Costa, ML Crespo, G Ramponi - IEEE Access, 2022 - ieeexplore.ieee.org
Hardware accelerators based on field-programmable gate array (FPGA) and system-on-chip
(SoC) devices have gained attention in recent years. One of the main reasons is that these …

Distributed and deep vertical federated learning with big data

J Liu, X Zhou, L Mo, S Ji, Y Liao, Z Li… - Concurrency and …, 2023 - Wiley Online Library
In recent years, data are typically distributed across multiple organizations, while data security
is becoming increasingly important. Federated learning (FL), which enables multiple parties …
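
A minimal sketch of the vertical federated learning setting the abstract alludes to: two parties hold different feature columns of the same samples and only exchange partial scores and a shared error signal. The party names, model (logistic regression), and exchange pattern are illustrative assumptions, not the paper's protocol.

    import numpy as np

    # Illustrative vertical FL sketch: party A and party B hold disjoint feature
    # columns of the same samples; each trains only its own weights.
    rng = np.random.default_rng(0)
    n = 256
    X_a = rng.normal(size=(n, 3))      # features held by party A
    X_b = rng.normal(size=(n, 2))      # features held by party B
    y = rng.integers(0, 2, size=n)     # labels, assumed to sit with party A

    w_a, w_b, lr = np.zeros(3), np.zeros(2), 0.1

    for step in range(100):
        # Each party computes a partial logit on its own features.
        z = X_a @ w_a + X_b @ w_b
        p = 1.0 / (1.0 + np.exp(-z))   # joint prediction
        g = (p - y) / n                # shared error signal
        # Each party updates only its own weights from the shared error.
        w_a -= lr * X_a.T @ g
        w_b -= lr * X_b.T @ g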

DLB: a dynamic load balance strategy for distributed training of deep neural networks

Q Ye, Y Zhou, M Shi, Y Sun, J Lv - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Synchronous strategies with data parallelism are widely utilized in distributed training of
Deep Neural Networks (DNNs), largely owing to their easy implementation yet promising …
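
A minimal sketch of the synchronous data-parallel pattern this entry builds on: every worker computes a gradient on its own shard, the gradients are averaged (the all-reduce step), and all replicas apply the identical update. Worker count, model, and data are placeholders, simulated in one process.

    import numpy as np

    rng = np.random.default_rng(1)
    num_workers = 4
    X = rng.normal(size=(400, 5))
    y = X @ rng.normal(size=5) + 0.01 * rng.normal(size=400)
    shards = np.array_split(np.arange(400), num_workers)

    w, lr = np.zeros(5), 0.05
    for step in range(200):
        grads = []
        for s in shards:                            # one gradient per worker
            grads.append(2 * X[s].T @ (X[s] @ w - y[s]) / len(s))
        g = np.mean(grads, axis=0)                  # all-reduce (average) step
        w -= lr * g                                 # same update on every replica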

Aladdin: Asymmetric centralized training for distributed deep learning

Y Ko, K Choi, H Jei, D Lee, SW Kim - Proceedings of the 30th ACM …, 2021 - dl.acm.org
To speed up the training of massive deep neural network (DNN) models, distributed training
has been widely studied. In general, centralized training, a type of distributed training …
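
A sketch of the centralized ("parameter server") baseline that this line of work starts from: workers pull the current parameters, compute gradients on local shards, and push them to a single server that applies the updates. This is the generic pattern only, simulated sequentially; Aladdin's asymmetric scheme is not reproduced here.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 4))
    y = X @ rng.normal(size=4)
    shards = np.array_split(np.arange(300), 3)

    server_w, lr = np.zeros(4), 0.05        # parameters live on the server
    for step in range(150):
        for s in shards:                    # each worker in turn
            w_local = server_w.copy()       # pull current parameters
            g = 2 * X[s].T @ (X[s] @ w_local - y[s]) / len(s)
            server_w -= lr * g              # push gradient; server applies it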

VirtualFlow: Decoupling deep learning models from the underlying hardware

A Or, H Zhang, MJ Freedman - Proceedings of Machine …, 2022 - proceedings.mlsys.org
We propose VirtualFlow, a system leveraging a novel abstraction called virtual node
processing to decouple the model from the hardware. In each step of training or inference …
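
A hedged reconstruction of the idea behind virtual node processing (not VirtualFlow's API): a fixed number of virtual nodes defines the model's effective batch, and however many physical devices exist, each one processes several virtual nodes sequentially and accumulates gradients, so training behavior does not depend on the hardware count.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(240, 4))
    y = X @ rng.normal(size=4)

    virtual_nodes = 8                       # fixed by the model configuration
    physical_devices = 2                    # could change without changing results
    vn_shards = np.array_split(np.arange(240), virtual_nodes)

    w, lr = np.zeros(4), 0.05
    for step in range(100):
        g_sum = np.zeros(4)
        for vn, s in enumerate(vn_shards):
            device = vn % physical_devices  # which device would run this virtual
                                            # node (unused in this one-process sketch)
            g_sum += 2 * X[s].T @ (X[s] @ w - y[s]) / len(s)
        w -= lr * g_sum / virtual_nodes     # identical update for any device count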

FLSGD: free local SGD with parallel synchronization

Q Ye, Y Zhou, M Shi, J Lv - The Journal of Supercomputing, 2022 - Springer
Synchronous parameter-update algorithms with data parallelism have been successfully used to
accelerate the distributed training of deep neural networks (DNNs). However, a prevalent …
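
A minimal sketch of local SGD with periodic synchronization, the general pattern behind this entry: each worker takes H local steps on its own shard, then the parameters (rather than per-step gradients) are averaged. H, the model, and the data are placeholders, not FLSGD's settings.

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.normal(size=(400, 5))
    y = X @ rng.normal(size=5)
    shards = np.array_split(np.arange(400), 4)

    w_global = np.zeros(5)
    lr, H = 0.05, 8                         # H local steps between syncs

    for rnd in range(30):
        local_models = []
        for s in shards:
            w = w_global.copy()
            for _ in range(H):              # local updates, no communication
                w -= lr * 2 * X[s].T @ (X[s] @ w - y[s]) / len(s)
            local_models.append(w)
        w_global = np.mean(local_models, axis=0)   # synchronize by averaging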

SHAT: A Novel Asynchronous Training Algorithm That Provides Fast Model Convergence in Distributed Deep Learning

Y Ko, SW Kim - Applied Sciences, 2021 - mdpi.com
The recent unprecedented success of deep learning (DL) in various fields is underpinned by its
use of large-scale data and models. Training a large-scale deep neural network (DNN) …
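
A generic sketch of asynchronous training, the setting SHAT addresses: workers compute gradients against possibly stale copies of the parameters and apply them without waiting for each other. Staleness is simulated with a snapshot taken a few updates ago; this illustrates the pattern, not SHAT's update rule.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(300, 4))
    y = X @ rng.normal(size=4)
    shards = np.array_split(np.arange(300), 3)

    w = np.zeros(4)
    lr, staleness = 0.02, 2
    history = [w.copy()]

    for step in range(300):
        s = shards[step % len(shards)]
        w_stale = history[max(0, len(history) - 1 - staleness)]   # stale read
        g = 2 * X[s].T @ (X[s] @ w_stale - y[s]) / len(s)
        w = w - lr * g                      # applied without any barrier
        history.append(w.copy())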

Shuffle Private Decentralized Convex Optimization

L Zhang, H Zhang - IEEE Transactions on Information …, 2024 - ieeexplore.ieee.org
In this paper, we consider the distributed local stochastic gradient descent (SGD) algorithm
by parallelizing multiple devices in the setting of stochastic convex optimization (SCO). The …
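
A generic sketch of parallel local SGD in the shuffle model of privacy, the setting this abstract describes: each device runs a few local steps, perturbs its parameter update with Gaussian noise, and a shuffler permutes the updates before the server averages them. The noise scale, clipping-free update, and protocol details are illustrative assumptions, not the paper's mechanism or guarantees.

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.normal(size=(400, 5))
    y = X @ rng.normal(size=5)
    shards = np.array_split(np.arange(400), 8)

    w_global = np.zeros(5)
    lr, H, sigma = 0.05, 4, 0.01            # sigma: illustrative noise scale

    for rnd in range(30):
        updates = []
        for s in shards:                    # each device, in parallel in practice
            w = w_global.copy()
            for _ in range(H):              # local SGD steps, no communication
                w -= lr * 2 * X[s].T @ (X[s] @ w - y[s]) / len(s)
            updates.append(w - w_global + sigma * rng.normal(size=5))  # noisy delta
        rng.shuffle(updates)                # shuffler hides which device sent what
        w_global = w_global + np.mean(updates, axis=0)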

LBB: load-balanced batching for efficient distributed learning on heterogeneous GPU cluster

F Yao, Z Zhang, Z Ji, B Liu, H Gao - The Journal of Supercomputing, 2024 - Springer
As the cost of deep learning training increases, using heterogeneous GPU clusters is a
reasonable way to scale cluster resources to support distributed deep learning (DDL) tasks …
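
A sketch of the load-balanced batching idea: per-GPU batch sizes are set proportional to measured throughput so that all workers finish a step at roughly the same time. The throughput numbers are made up and LBB's actual assignment model is not reproduced here.

    def balanced_batch_sizes(global_batch, throughputs):
        """Split a global batch in proportion to each device's samples/sec."""
        total = sum(throughputs)
        sizes = [int(round(global_batch * t / total)) for t in throughputs]
        sizes[-1] += global_batch - sum(sizes)   # fix rounding so sizes sum exactly
        return sizes

    # e.g. one fast GPU and two slower cards sharing a 512-sample global batch
    print(balanced_batch_sizes(512, [900.0, 450.0, 300.0]))   # -> [279, 140, 93]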

ZipLine: an optimized algorithm for the elastic bulk synchronous parallel model

X Zhao, M Papagelis, A An, BX Chen, J Liu, Y Hu - Machine Learning, 2021 - Springer
The bulk synchronous parallel (BSP) is a celebrated synchronization model for general-
purpose parallel computing that has successfully been employed for distributed training of …
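
A minimal sketch of the plain bulk synchronous parallel pattern for reference: every worker computes locally, then all block on a barrier before the next superstep. ZipLine relaxes this into an elastic BSP; only the standard model is shown, with made-up local work.

    import threading
    import numpy as np

    num_workers = 4
    barrier = threading.Barrier(num_workers)
    partial = np.zeros(num_workers)

    def worker(rank, supersteps=3):
        rng = np.random.default_rng(rank)
        for step in range(supersteps):
            partial[rank] = rng.random()    # local computation phase
            barrier.wait()                  # synchronization: wait for all workers
            if rank == 0:
                print(f"superstep {step}: mean = {partial.mean():.3f}")
            barrier.wait()                  # ensure the read finishes before next write

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()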