Accelerating distributed deep learning training with compression assisted allgather and reduce-scatter communication

Q Zhou, Q Anthony, L Xu, A Shafi… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out
data-parallel training of Deep Learning (DL) models. It shards the model parameters …
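The sharding pattern behind FSDP can be illustrated with a minimal plain-Python sketch, assuming three ranks and a flat parameter list (the function names `shard_params`, `allgather`, and `reduce_scatter` are illustrative stand-ins, not this paper's implementation): each rank stores only its shard, an allgather rebuilds the full parameter set when needed, and a reduce-scatter sums gradients and hands each rank back only its shard.

```python
# Illustrative sketch of FSDP-style parameter sharding (not the paper's code).
# Each "rank" keeps one contiguous shard; collectives move data between ranks.

def shard_params(params, world_size):
    """Split a flat parameter list into world_size contiguous shards."""
    n = -(-len(params) // world_size)  # ceiling division: per-rank shard size
    return [params[r * n:(r + 1) * n] for r in range(world_size)]

def allgather(shards):
    """Reconstruct the full parameter list from all ranks' shards."""
    return [p for shard in shards for p in shard]

def reduce_scatter(grads_per_rank, world_size):
    """Sum gradients element-wise across ranks, then shard the result."""
    total = [sum(g) for g in zip(*grads_per_rank)]
    return shard_params(total, world_size)

params = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
shards = shard_params(params, world_size=3)  # each rank stores 2 parameters
full = allgather(shards)                     # full copy materialized on demand
```

Compression-assisted variants, as the title suggests, would compress the shards before the allgather/reduce-scatter traffic rather than sending them raw.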

A multivariate and quantitative model for predicting cross-application interference in virtual environments

MM Alves, LM de Assumpção Drummond - Journal of Systems and …, 2017 - Elsevier
Cross-application interference can drastically affect performance of HPC applications
executed in clouds. The problem is caused by concurrent access of co-located applications …

Arrangements for communicating data in a computing system using multiple processors

JG Gonzalez, SA Fonseca, RC Nunez - US Patent 10,305,980, 2019 - Google Patents
Data employed in computations amongst multiple processors in a computing system is
processed so that fewer bits than a full representation of the data need to be communicated …

Accelerating broadcast communication with GPU compression for deep learning workloads

Q Zhou, Q Anthony, A Shafi… - 2022 IEEE 29th …, 2022 - ieeexplore.ieee.org
With the rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on
multiple GPU nodes to run distributed training. Large message communication of GPU data …

Reducing communication costs in collective I/O in multi-core cluster systems with non-exclusive scheduling

K Cha, S Maeng - The Journal of Supercomputing, 2012 - Springer
As the number of nodes in high performance computing (HPC) systems increases, collective
I/O becomes an important issue and I/O aggregators are the key factors in improving the …

Arrangements for storing more data in faster memory when using a hierarchical memory structure

JG Gonzalez, SA Fonseca, RC Nunez - US Patent 10,114,554, 2018 - Google Patents
Data employed in computations is processed so that during computations more of the data
can be fit into or maintained in a smaller but higher speed memory than an original source of …

Optimizations to enhance sustainability of MPI applications

J Carretero, J Garcia-Blas, DE Singh, F Isaila… - Proceedings of the 21st …, 2014 - dl.acm.org
Ultrascale computing systems are likely to reach speeds of two or three orders of magnitude
greater than today's computing systems. However, to achieve this level of performance, we …

Using DCT-based approximate communication to improve MPI performance in parallel clusters

Q Fan, DJ Lilja, SS Sapatnekar - 2019 IEEE 38th International …, 2019 - ieeexplore.ieee.org
Communication overheads in distributed systems constitute a large fraction of the total
execution time, and limit the scalability of applications running on these systems. We …
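The core idea of DCT-based approximate communication, transforming a message and sending only its low-frequency coefficients, can be sketched in plain Python (the naive DCT-II below and the `keep` parameter are illustrative assumptions, not the authors' scheme):

```python
import math

def dct(x):
    """Naive 1-D DCT-II of a real sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i in range(n)) for k in range(n)]

def idct(c):
    """Scaled DCT-III inverse, so idct(dct(x)) recovers x."""
    n = len(c)
    return [(c[0] / 2 + sum(c[k] * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                            for k in range(1, n))) * 2 / n for i in range(n)]

def compress(x, keep):
    """Keep only the first `keep` coefficients -- fewer values on the wire."""
    return dct(x)[:keep]

def decompress(coeffs, n):
    """Zero-pad the truncated coefficients back to length n and invert."""
    return idct(coeffs + [0.0] * (n - len(coeffs)))

msg = [1.0, 1.1, 1.2, 1.3, 1.2, 1.1, 1.0, 0.9]
approx = decompress(compress(msg, keep=4), len(msg))  # ~half the traffic
```

For smooth data the discarded high-frequency coefficients carry little energy, which is why the approximation error stays small relative to the bandwidth saved.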

A model of checkpoint behavior for applications that have I/O

B León, S Méndez, D Franco, D Rexachs… - The Journal of …, 2022 - Springer
Due to the increasing size and complexity of computer systems, reducing the overhead of fault
tolerance techniques has become important in recent years. One technique in fault tolerance …

A methodology for designing energy-aware systems for computational science

PC Cañizares, A Núñez, M Núñez, JJ Pardo - Procedia Computer Science, 2015 - Elsevier
Energy consumption is currently one of the main issues in large distributed systems. More
specifically, the efficient management of energy without losing performance has become a …