Accelerating distributed deep learning training with compression assisted allgather and reduce-scatter communication
Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out
data-parallel training of Deep Learning (DL) models. It shards the model parameters …
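To make the idea concrete, here is a minimal, hypothetical sketch (not the paper's implementation) of compressing a parameter shard to fp16 before an all-gather using PyTorch's torch.distributed; it assumes a process group has already been initialized.

```python
# Hypothetical sketch of compression-assisted all-gather; not the paper's
# implementation. Assumes torch.distributed is initialized (e.g., via NCCL).
import torch
import torch.distributed as dist

def compressed_all_gather(shard: torch.Tensor) -> torch.Tensor:
    """All-gather an fp32 shard while sending fp16 on the wire."""
    world = dist.get_world_size()
    small = shard.to(torch.float16)              # compress: half the bytes
    gathered = [torch.empty_like(small) for _ in range(world)]
    dist.all_gather(gathered, small)
    return torch.cat([t.float() for t in gathered])  # decompress for compute
```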
A multivariate and quantitative model for predicting cross-application interference in virtual environments
MM Alves, LM de Assumpção Drummond - Journal of Systems and …, 2017 - Elsevier
Cross-application interference can drastically affect the performance of HPC applications
executed in clouds. The problem is caused by concurrent access of co-located applications …
Arrangements for communicating data in a computing system using multiple processors
JG Gonzalez, SA Fonseca, RC Nunez - US Patent 10,305,980, 2019 - Google Patents
Data employed in computations amongst multiple processors in a computing system is
processed so that fewer bits than a full representation of the data need to be communicated …
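As a hedged illustration of the general idea only (the patent's actual encoding is not reproduced here), the sketch below quantizes fp32 values to 8-bit integers plus a scale factor, so roughly a quarter of the bits cross the interconnect.

```python
# Illustrative 8-bit quantization; the scale/rounding scheme here is an
# assumption, not the encoding described in the patent.
import numpy as np

def quantize(x: np.ndarray) -> tuple[np.ndarray, float]:
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)  # avoid div-by-zero
    return np.round(x / scale).astype(np.int8), scale   # 8 bits vs. 32

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, s = quantize(x)
x_hat = dequantize(q, s)   # lossy reconstruction after communication
```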
Accelerating broadcast communication with gpu compression for deep learning workloads
With the rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on
multiple GPU nodes to run distributed training. Large message communication of GPU data …
Reducing communication costs in collective I/O in multi-core cluster systems with non-exclusive scheduling
K Cha, S Maeng - The Journal of Supercomputing, 2012 - Springer
As the number of nodes in high performance computing (HPC) systems increases, collective
I/O becomes an important issue and I/O aggregators are the key factors in improving the …
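For context, a minimal mpi4py collective-write example is shown below; it illustrates the collective I/O path that aggregators optimize, not the non-exclusive scheduling scheme the paper proposes.

```python
# Minimal collective-write example with mpi4py (run under mpiexec).
# It shows the collective I/O path only; the paper's aggregator
# scheduling is not reproduced here.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(1024, rank, dtype=np.float64)   # each rank's slice
fh = MPI.File.Open(comm, "out.bin", MPI.MODE_CREATE | MPI.MODE_WRONLY)
# Write_at_all is collective: the MPI-IO layer can funnel data through
# a few aggregator ranks instead of issuing one file op per process.
fh.Write_at_all(rank * local.nbytes, local)
fh.Close()
```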
Arrangements for storing more data in faster memory when using a hierarchical memory structure
JG Gonzalez, SA Fonseca, RC Nunez - US Patent 10,114,554, 2018 - Google Patents
Data employed in computations is processed so that during computations more of the data
can be fit into or maintained in a smaller but higher-speed memory than an original source of …
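A loose sketch of the idea follows, using lossless compression so more data fits in a small fast tier; the codec and array contents are assumptions, not the patent's mechanism.

```python
# Loose sketch: keep data compressed so more of it fits in a small, fast
# memory tier; zlib and the payload here are assumptions, not the patent's.
import zlib
import numpy as np

x = np.zeros(1_000_000, dtype=np.float32)          # compressible payload
blob = zlib.compress(x.tobytes())                  # store the small form
print(f"{len(blob)} bytes compressed vs {x.nbytes} uncompressed")

# Decompress on demand when the data is needed for computation.
y = np.frombuffer(zlib.decompress(blob), dtype=np.float32)
assert np.array_equal(x, y)                        # lossless round trip
```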
Optimizations to enhance sustainability of MPI applications
Ultrascale computing systems are likely to reach speeds two or three orders of magnitude
greater than today's computing systems. However, to achieve this level of performance, we …
Using DCT-based approximate communication to improve MPI performance in parallel clusters
Communication overheads in distributed systems constitute a large fraction of the total
execution time, and limit the scalability of applications running on these systems. We …
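The sketch below shows the generic DCT-truncation idea: keep only low-frequency coefficients before sending and reconstruct on receipt. The keep fraction and function names are assumptions, not the paper's parameters.

```python
# Generic DCT-truncation sketch; the 25% keep fraction is an assumption,
# not the paper's setting.
import numpy as np
from scipy.fft import dct, idct

def compress(x: np.ndarray, keep: float = 0.25) -> np.ndarray:
    coeffs = dct(x, norm="ortho")
    k = max(1, int(len(coeffs) * keep))
    return coeffs[:k]                  # send only low-frequency terms

def decompress(coeffs: np.ndarray, n: int) -> np.ndarray:
    full = np.zeros(n)
    full[: len(coeffs)] = coeffs
    return idct(full, norm="ortho")    # approximate reconstruction

x = np.sin(np.linspace(0, 8 * np.pi, 1024))
x_hat = decompress(compress(x), len(x))
print(float(np.max(np.abs(x - x_hat))))   # small approximation error
```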
A model of checkpoint behavior for applications that have I/O
Due to the increasing size and complexity of computer systems, reducing the overhead of fault
tolerance techniques has become important in recent years. One technique in fault tolerance …
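As background only (not necessarily the model this paper proposes), the classic first-order estimate for the checkpoint interval is Young's formula, sketched below.

```python
# Young's classic first-order checkpoint-interval estimate, shown only as
# background; it is not necessarily the model proposed in this paper.
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Optimal time between checkpoints ~= sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: 60 s to write a checkpoint, one-day mean time between failures.
print(young_interval(60.0, 24 * 3600.0) / 60.0, "minutes between checkpoints")
```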
A methodology for designing energy-aware systems for computational science
Energy consumption is currently one of the main issues in large distributed systems. More
specifically, the efficient management of energy without losing performance has become a …