Implementation and performance analysis of non-blocking collective operations for MPI
T Hoefler, A Lumsdaine, W Rehm - Proceedings of the 2007 ACM/IEEE …, 2007 - dl.acm.org
Collective operations and non-blocking point-to-point operations have always been part of
MPI. Although non-blocking collective operations are an obvious extension to MPI, there …
Message progression in parallel computing - to thread or not to thread?
T Hoefler, A Lumsdaine - 2008 IEEE International Conference …, 2008 - ieeexplore.ieee.org
Message progression schemes that enable communication and computation to be
overlapped have the potential to improve the performance of parallel applications. With …
Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications
JC Sancho, KJ Barker, DJ Kerbyson… - Proceedings of the 2006 …, 2006 - dl.acm.org
The design and implementation of a high performance communication network are critical
factors in determining the performance and cost-effectiveness of a large-scale computing …
Netgauge: A network performance measurement framework
T Hoefler, T Mehlan, A Lumsdaine, W Rehm - … and Communications: Third …, 2007 - Springer
This paper introduces Netgauge, an extensible open-source framework for implementing
network benchmarks. The structure of Netgauge abstracts and explicitly separates …
An OpenCL framework for heterogeneous multicores with local memory
In this paper, we present the design and implementation of an Open Computing Language
(OpenCL) framework that targets heterogeneous accelerator multicore architectures with …
CAMP: fast and efficient IP lookup architecture
A large body of research literature has focused on improving the performance of longest
prefix match IP-lookup. More recently, embedded memory based architectures have been …
Shared memory programming for large scale machines
C Barton, C Casçaval, G Almási, Y Zheng… - ACM SIGPLAN …, 2006 - dl.acm.org
This paper describes the design and implementation of a scalable run-time system and an
optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on …
Optimizing the Use of Static Buffers for DMA on a CELL Chip
T Chen, Z Sura, K O'Brien, JK O'Brien - … Orleans, LA, USA, November 2-4 …, 2007 - Springer
The CELL architecture has one Power Processor Element (PPE) core, and eight Synergistic
Processor Element (SPE) cores that have a distinct instruction set architecture of their own …
Towards ultra-high resolution models of climate and weather
We present a speculative extrapolation of the performance aspects of an atmospheric
general circulation model to ultra-high resolution and describe alternative technological …
Optimizing non-blocking collective operations for InfiniBand
T Hoefler, A Lumsdaine - 2008 IEEE International Symposium …, 2008 - ieeexplore.ieee.org
Non-blocking collective operations have recently been shown to be a promising
complementary approach for overlapping communication and computation in parallel …