A model and survey of distributed data-intensive systems
Data is a precious resource in today's society, and it is generated at an unprecedented and
constantly growing pace. The need to store, analyze, and make data promptly available to a …
On the acceleration of deep learning model parallelism with staleness
Training a deep convolutional neural network for computer vision problems is slow and
inefficient, especially when the model is large and distributed across multiple devices. The …
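The snippet stops short of the paper's method, but the idea the title points at can be sketched: a pipeline stage applies a one-step-stale gradient instead of blocking on the downstream backward pass, so stages run concurrently. A minimal toy sketch; the two-stage linear model and all names are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

# Toy sketch of staleness in model parallelism: stage 1 updates with the
# gradient from the PREVIOUS step, so it never waits for stage 2.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 1))
stale_g1 = np.zeros_like(w1)   # gradient buffer from the previous step
lr = 0.01

for step in range(100):
    x = rng.normal(size=(8, 4))
    y = x.sum(axis=1, keepdims=True)       # toy regression target

    h = x @ w1                             # stage 1 forward
    pred = h @ w2                          # stage 2 forward
    err = pred - y

    g2 = h.T @ err / len(x)                # fresh gradient for stage 2
    g1 = (x.T @ (err @ w2.T)) / len(x)     # would normally stall stage 1

    w2 -= lr * g2
    w1 -= lr * stale_g1                    # apply last step's gradient
    stale_g1 = g1                          # becomes stale next step
```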
Improving resource utilization by timely fine-grained scheduling
A monotask is a unit of work that uses only a single type of resource (e.g., CPU, network, disk
I/O). While monotasks were primarily introduced as a means to reason about job performance …
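The monotask concept itself is concrete enough to illustrate: decompose a job into single-resource units and drain each resource type from its own queue, so CPU, disk, and network stay independently busy. A minimal sketch with hypothetical per-resource worker threads, not any particular system's implementation:

```python
import queue, threading

# One queue and one worker per resource type; each task touches
# exactly one resource, which is the defining property of a monotask.
queues = {r: queue.Queue() for r in ("cpu", "disk", "network")}

def worker(resource):
    while True:
        task = queues[resource].get()
        if task is None:            # sentinel: shut this worker down
            break
        task()                      # runs work for this resource only

threads = [threading.Thread(target=worker, args=(r,)) for r in queues]
for t in threads:
    t.start()

# A "job" decomposed into three monotasks, one per resource type.
queues["disk"].put(lambda: print("read input block"))
queues["cpu"].put(lambda: print("deserialize and compute"))
queues["network"].put(lambda: print("shuffle output to peers"))

for r in queues:
    queues[r].put(None)             # stop all workers
for t in threads:
    t.join()
```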
Serverless end game: Disaggregation enabling transparency
For many years, the distributed systems community has struggled to smooth the transition
from local to remote computing. Transparency means concealing the complexities of …
Adrias: Interference-aware memory orchestration for disaggregated cloud infrastructures
D Masouros, C Pinto, M Gazzetti… - … Symposium on High …, 2023 - ieeexplore.ieee.org
Workload co-location has become the de facto approach for hosting applications in Cloud
environments, leading, however, to interference and fragmentation in shared resources of …
Step-ahead error feedback for distributed training with compressed gradient
Although distributed machine learning methods can speed up the training of large deep
neural networks, communication cost has become a non-negligible bottleneck to …
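The truncated abstract refers to gradient compression; the classic error-feedback mechanism that "step-ahead" builds on can be sketched as follows. This is the standard baseline with top-k sparsification under toy assumptions; the paper's step-ahead correction itself is not reproduced here:

```python
import numpy as np

# Classic error feedback: the error dropped by compression is carried
# over and folded into the next step's gradient before compressing again.
def topk_compress(g, k):
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]   # keep the k largest-magnitude entries
    out[idx] = g[idx]
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=100)
residual = np.zeros_like(w)            # accumulated compression error
lr = 0.1

for step in range(50):
    grad = 2 * w                       # toy objective: minimize ||w||^2
    corrected = grad + residual        # fold the past error back in
    sent = topk_compress(corrected, k=10)   # what actually gets communicated
    residual = corrected - sent        # remember what was dropped
    w -= lr * sent
```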
Hierarchical training: Scaling deep recommendation models on large CPU clusters
Neural network-based recommendation models are widely used to power many internet-
scale applications including product recommendation and feed ranking. As the models …
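The snippet is cut off, but scaling recommendation models on CPU clusters generally hinges on partitioning the large embedding tables across machines. A hash-sharded lookup is one common pattern; whether hierarchical training shards exactly this way is an assumption, and all names below are illustrative:

```python
import numpy as np

# Assumption-laden sketch: embedding tables dominate recommendation
# models, so rows are hash-sharded across workers and each lookup is
# routed to the shard that owns the id.
NUM_WORKERS = 4
DIM = 16
rng = np.random.default_rng(0)
shards = [dict() for _ in range(NUM_WORKERS)]    # one table shard per worker

def lookup(item_id):
    shard = shards[hash(item_id) % NUM_WORKERS]  # route by hash of the id
    if item_id not in shard:
        shard[item_id] = rng.normal(size=DIM)    # lazily initialize the row
    return shard[item_id]

batch = ["user:17", "item:42", "item:7"]
vectors = np.stack([lookup(x) for x in batch])   # (3, 16) dense features
```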
Demystifying the QoS and QoE of Edge-hosted Video Streaming Applications in the Wild with SNESet
Y Li, G Deng, C Bai, J Yang, G Wang, H Zhang… - Proceedings of the …, 2023 - dl.acm.org
Video streaming applications (VSAs) are increasingly being deployed on large-scale edge
platforms, which have the potential to significantly improve the quality of service (QoS) and …
Cross Model Parallelism for Faster Bidirectional Training of Large Convolutional Neural Networks
A Xu, Y Bai - Joint European Conference on Machine Learning and …, 2023 - Springer
Large convolutional neural networks (CNNs) have been successful in data mining tasks, but
it is hard to train these large-scale models. Model parallelism (MP) places a large CNN to …
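The elided text presumably continues "places a large CNN" across multiple devices; generic model parallelism of that kind can be sketched in a few lines. This is a plain two-device split, not the paper's cross/bidirectional scheme, and the layer sizes are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# Plain model parallelism: the CNN's front half lives on one device,
# the back half on another, and the activation hops the boundary.
dev0 = "cuda:0" if torch.cuda.is_available() else "cpu"
dev1 = "cuda:1" if torch.cuda.device_count() > 1 else dev0

class SplitCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.front = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                                   nn.ReLU()).to(dev0)
        self.back = nn.Sequential(nn.Flatten(),
                                  nn.Linear(8 * 32 * 32, 10)).to(dev1)

    def forward(self, x):
        h = self.front(x.to(dev0))
        return self.back(h.to(dev1))   # move activations across devices

model = SplitCNN()
logits = model(torch.randn(2, 3, 32, 32))   # shape (2, 10)
```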
Distributed Adaptive Optimization with Divisible Communication
A Xu, Y Bai - Joint European Conference on Machine Learning and …, 2023 - Springer
Synchronous distributed training can scale the training of deep neural networks on large-
scale data and has thus been widely adopted in large-scale applications. Because it often …
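The communication the abstract alludes to is the per-step gradient all-reduce of synchronous data parallelism, sketched below with workers simulated in-process; the averaging line stands in for the all-reduce whose cost schemes like divisible communication aim to restructure:

```python
import numpy as np

# One synchronous data-parallel step: every worker computes a local
# gradient, gradients are averaged (the blocking all-reduce), and all
# replicas apply the identical update.
rng = np.random.default_rng(0)
NUM_WORKERS, lr = 4, 0.1
w = rng.normal(size=10)                      # replicated model parameters

for step in range(20):
    local_grads = []
    for _ in range(NUM_WORKERS):
        noise = rng.normal(size=10) * 0.01   # stands in for each data shard
        local_grads.append(2 * w + noise)    # toy gradient of ||w||^2
    avg = np.mean(local_grads, axis=0)       # blocking all-reduce (average)
    w -= lr * avg                            # same update on every replica
```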