A model and survey of distributed data-intensive systems

A Margara, G Cugola, N Felicioni, S Cilloni - ACM Computing Surveys, 2023 - dl.acm.org
Data is a precious resource in today's society, and it is generated at an unprecedented and
constantly growing pace. The need to store, analyze, and make data promptly available to a …

On the acceleration of deep learning model parallelism with staleness

A Xu, Z Huo, H Huang - … of the IEEE/CVF Conference on …, 2020 - openaccess.thecvf.com
Training deep convolutional neural networks for computer vision problems is slow and
inefficient, especially when the models are large and distributed across multiple devices. The …
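
The snippet cuts off before the method, but the general idea behind tolerating staleness is that a worker applies gradients computed against a parameter copy that is a few steps old, trading freshness for overlap of computation and communication. A minimal NumPy sketch of that generic idea follows; the delay bound and update rule are illustrative assumptions, not this paper's algorithm:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
delay = 2            # illustrative staleness bound, not from the paper
lr = 0.1
w = np.zeros(3)      # toy parameter vector
in_flight = deque()  # gradients still "traveling" between devices

for step in range(10):
    # In a pipelined/model-parallel setting, this gradient would be computed
    # on a copy of w that is up to `delay` steps old.
    g = rng.standard_normal(3)
    in_flight.append(g)
    if len(in_flight) > delay:
        w -= lr * in_flight.popleft()  # apply the stale gradient
```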

Improving resource utilization by timely fine-grained scheduling

T Jin, Z Cai, B Li, C Zheng, G Jiang… - Proceedings of the …, 2020 - dl.acm.org
A monotask is a unit of work that uses only a single type of resource (e.g., CPU, network, disk
I/O). While monotasks were primarily introduced as a means to reason about job performance …
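
To make the concept concrete, here is a hypothetical sketch of decomposing a job into single-resource units that a scheduler could queue per resource; the `Monotask` class and queue layout are assumptions for illustration, not this paper's API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Monotask:
    resource: str           # "cpu", "disk", or "network"
    run: Callable[[], Any]  # the single-resource unit of work

# A toy job split into monotasks, so a scheduler can place each unit on the
# queue for its one resource type instead of treating the job as opaque.
job = [
    Monotask("disk", lambda: b"raw bytes read from storage"),
    Monotask("cpu", lambda: sum(range(1_000))),
    Monotask("network", lambda: "bytes shuffled to a peer"),
]

queues: dict[str, list[Monotask]] = {"cpu": [], "disk": [], "network": []}
for task in job:
    queues[task.resource].append(task)

results = [task.run() for q in queues.values() for task in q]
```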

Serverless end game: Disaggregation enabling transparency

P Garcia Lopez, A Slominski, B Metzler… - Proceedings of the 2nd …, 2024 - dl.acm.org
For many years, the distributed systems community has struggled to smooth the transition
from local to remote computing. Transparency means concealing the complexities of …

Adrias: Interference-aware memory orchestration for disaggregated cloud infrastructures

D Masouros, C Pinto, M Gazzetti… - … Symposium on High …, 2023 - ieeexplore.ieee.org
Workload co-location has become the de facto approach for hosting applications in Cloud
environments, leading, however, to interference and fragmentation in shared resources of …

Step-ahead error feedback for distributed training with compressed gradient

A Xu, Z Huo, H Huang - Proceedings of the AAAI Conference on …, 2021 - ojs.aaai.org
Although distributed machine learning methods can speed up the training of large deep
neural networks, the communication cost has become a non-negligible bottleneck to …
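
For context, classic error feedback keeps the part of the gradient that a compressor drops and re-injects it before the next compression, so no information is permanently lost; the paper's step-ahead variant builds on this baseline. A minimal sketch of the baseline with a top-k sparsifier (the compressor choice and sizes are illustrative, not the paper's method):

```python
import numpy as np

def topk(x, k):
    """Keep the k largest-magnitude entries and zero the rest (a common compressor)."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
residual = np.zeros(8)           # locally accumulated compression error
for step in range(5):
    grad = rng.standard_normal(8)
    corrected = grad + residual  # error feedback: re-inject what was dropped
    sent = topk(corrected, k=2)  # what the worker actually communicates
    residual = corrected - sent  # carry the remainder to the next step
```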

Hierarchical training: Scaling deep recommendation models on large CPU clusters

Y Huang, X Wei, X Wang, J Yang, BY Su… - Proceedings of the 27th …, 2021 - dl.acm.org
Neural network based recommendation models are widely used to power many internet-
scale applications including product recommendation and feed ranking. As the models …

Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESet

Y Li, G Deng, C Bai, J Yang, G Wang, H Zhang… - Proceedings of the …, 2023 - dl.acm.org
Video streaming applications (VSAs) are increasingly being deployed on large-scale edge
platforms, which have the potential to significantly improve the quality of service (QoS) and …

Cross Model Parallelism for Faster Bidirectional Training of Large Convolutional Neural Networks

A Xu, Y Bai - Joint European Conference on Machine Learning and …, 2023 - Springer
Large convolutional neural networks (CNNs) have been successful in data mining tasks, but
these large-scale models are hard to train. Model parallelism (MP) places a large CNN to …

Distributed Adaptive Optimization with Divisible Communication

A Xu, Y Bai - Joint European Conference on Machine Learning and …, 2023 - Springer
Synchronous distributed training can scale the training of deep neural networks on
large-scale data, so it has been widely adopted in large-scale applications. Because it often …