Accelerating distributed deep learning training with compression assisted allgather and reduce-scatter communication
Fully Sharded Data Parallel (FSDP) technology achieves higher performance by scaling out
data-parallel training of Deep Learning (DL) models. It shards the model parameters …
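To make the idea concrete, here is a minimal, hypothetical sketch (not the paper's implementation) of compressing a parameter shard to fp16 before an all-gather using PyTorch's torch.distributed; it assumes a process group has already been initialized.

```python
# Hypothetical sketch of compression-assisted all-gather; not the paper's
# implementation. Assumes torch.distributed is initialized (e.g., via NCCL).
import torch
import torch.distributed as dist

def compressed_all_gather(shard: torch.Tensor) -> torch.Tensor:
    """All-gather an fp32 shard while sending fp16 on the wire."""
    world = dist.get_world_size()
    small = shard.to(torch.float16)              # compress: half the bytes
    gathered = [torch.empty_like(small) for _ in range(world)]
    dist.all_gather(gathered, small)
    return torch.cat([t.float() for t in gathered])  # decompress for compute
```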
A multivariate and quantitative model for predicting cross-application interference in virtual environments
MM Alves, LM de Assumpção Drummond - Journal of Systems and …, 2017 - Elsevier
Cross-application interference can drastically affect the performance of HPC applications
executed in clouds. The problem is caused by concurrent access of co-located applications …
Arrangements for communicating data in a computing system using multiple processors
JG Gonzalez, SA Fonseca, RC Nunez - US Patent 10,305,980, 2019 - Google Patents
Data employed in computations amongst multiple processors in a computing system is
processed so that fewer bits than a full representation of the data need to be communicated …
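As a hedged illustration of the general idea only (the patent's actual encoding is not reproduced here), the sketch below quantizes fp32 values to 8-bit integers plus a scale factor, so roughly a quarter of the bits cross the interconnect.

```python
# Illustrative 8-bit quantization; the scale/rounding scheme here is an
# assumption, not the encoding described in the patent.
import numpy as np

def quantize(x: np.ndarray) -> tuple[np.ndarray, float]:
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)  # avoid div-by-zero
    return np.round(x / scale).astype(np.int8), scale   # 8 bits vs. 32

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.random.randn(1024).astype(np.float32)
q, s = quantize(x)
x_hat = dequantize(q, s)   # lossy reconstruction after communication
```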
Accelerating broadcast communication with gpu compression for deep learning workloads
With the rapidly increasing model sizes, state-of-the-art Deep Learning (DL) models rely on
multiple GPU nodes to run distributed training. Large message communication of GPU data …
Reducing communication costs in collective I/O in multi-core cluster systems with non-exclusive scheduling
K Cha, S Maeng - The Journal of Supercomputing, 2012 - Springer
As the number of nodes in high performance computing (HPC) systems increases, collective
I/O becomes an important issue and I/O aggregators are the key factors in improving the …
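For context, a minimal mpi4py collective-write example is shown below; it illustrates the collective I/O path that aggregators optimize, not the non-exclusive scheduling scheme the paper proposes.

```python
# Minimal collective-write example with mpi4py (run under mpiexec).
# It shows the collective I/O path only; the paper's aggregator
# scheduling is not reproduced here.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.full(1024, rank, dtype=np.float64)   # each rank's slice
fh = MPI.File.Open(comm, "out.bin", MPI.MODE_CREATE | MPI.MODE_WRONLY)
# Write_at_all is collective: the MPI-IO layer can funnel data through
# a few aggregator ranks instead of issuing one file op per process.
fh.Write_at_all(rank * local.nbytes, local)
fh.Close()
```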
Arrangements for storing more data in faster memory when using a hierarchical memory structure
JG Gonzalez, SA Fonseca, RC Nunez - US Patent 10,114,554, 2018 - Google Patents
Data employed in computations is processed so that during computations more of the data
can be fit into or maintained in a smaller but higher-speed memory than an original source of …
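A loose sketch of the idea follows, using lossless compression so more data fits in a small fast tier; the codec and array contents are assumptions, not the patent's mechanism.

```python
# Loose sketch: keep data compressed so more of it fits in a small, fast
# memory tier; zlib and the payload here are assumptions, not the patent's.
import zlib
import numpy as np

x = np.zeros(1_000_000, dtype=np.float32)          # compressible payload
blob = zlib.compress(x.tobytes())                  # store the small form
print(f"{len(blob)} bytes compressed vs {x.nbytes} uncompressed")

# Decompress on demand when the data is needed for computation.
y = np.frombuffer(zlib.decompress(blob), dtype=np.float32)
assert np.array_equal(x, y)                        # lossless round trip
```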
Optimizations to enhance sustainability of MPI applications
Ultrascale computing systems are likely to reach speeds two or three orders of magnitude
greater than today's computing systems. However, to achieve this level of performance, we …
Using DCT-based approximate communication to improve MPI performance in parallel clusters
Communication overheads in distributed systems constitute a large fraction of the total
execution time, and limit the scalability of applications running on these systems. We …
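The sketch below shows the generic DCT-truncation idea: keep only low-frequency coefficients before sending and reconstruct on receipt. The keep fraction and function names are assumptions, not the paper's parameters.

```python
# Generic DCT-truncation sketch; the 25% keep fraction is an assumption,
# not the paper's setting.
import numpy as np
from scipy.fft import dct, idct

def compress(x: np.ndarray, keep: float = 0.25) -> np.ndarray:
    coeffs = dct(x, norm="ortho")
    k = max(1, int(len(coeffs) * keep))
    return coeffs[:k]                  # send only low-frequency terms

def decompress(coeffs: np.ndarray, n: int) -> np.ndarray:
    full = np.zeros(n)
    full[: len(coeffs)] = coeffs
    return idct(full, norm="ortho")    # approximate reconstruction

x = np.sin(np.linspace(0, 8 * np.pi, 1024))
x_hat = decompress(compress(x), len(x))
print(float(np.max(np.abs(x - x_hat))))   # small approximation error
```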
A model of checkpoint behavior for applications that have I/O
Due to the increasing size and complexity of computer systems, reducing the overhead of fault
tolerance techniques has become important in recent years. One technique in fault tolerance …
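As background only (not necessarily the model this paper proposes), the classic first-order estimate for the checkpoint interval is Young's formula, sketched below.

```python
# Young's classic first-order checkpoint-interval estimate, shown only as
# background; it is not necessarily the model proposed in this paper.
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Optimal time between checkpoints ~= sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: 60 s to write a checkpoint, one-day mean time between failures.
print(young_interval(60.0, 24 * 3600.0) / 60.0, "minutes between checkpoints")
```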
A methodology for designing energy-aware systems for computational science
Energy consumption is currently one of the main issues in large distributed systems. More
specifically, the efficient management of energy without losing performance has become a …