Implementation and performance analysis of non-blocking collective operations for MPI

T Hoefler, A Lumsdaine, W Rehm - Proceedings of the 2007 ACM/IEEE …, 2007 - dl.acm.org
Collective operations and non-blocking point-to-point operations have always been part of
MPI. Although non-blocking collective operations are an obvious extension to MPI, there …

Message progression in parallel computing - to thread or not to thread?

T Hoefler, A Lumsdaine - 2008 IEEE International Conference …, 2008 - ieeexplore.ieee.org
Message progression schemes that enable communication and computation to be
overlapped have the potential to improve the performance of parallel applications. With …

Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications

JC Sancho, KJ Barker, DJ Kerbyson… - Proceedings of the 2006 …, 2006 - dl.acm.org
The design and implementation of a high performance communication network are critical
factors in determining the performance and cost-effectiveness of a large-scale computing …

Netgauge: A network performance measurement framework

T Hoefler, T Mehlan, A Lumsdaine, W Rehm - … and Communications: Third …, 2007 - Springer
This paper introduces Netgauge, an extensible open-source framework for implementing
network benchmarks. The structure of Netgauge abstracts and explicitly separates …

An OpenCL framework for heterogeneous multicores with local memory

J Lee, J Kim, S Seo, S Kim, J Park, H Kim… - Proceedings of the 19th …, 2010 - dl.acm.org
In this paper, we present the design and implementation of an Open Computing Language
(OpenCL) framework that targets heterogeneous accelerator multicore architectures with …

CAMP: fast and efficient IP lookup architecture

S Kumar, M Becchi, P Crowley, J Turner - Proceedings of the 2006 ACM …, 2006 - dl.acm.org
A large body of research literature has focused on improving the performance of longest
prefix match IP-lookup. More recently, embedded memory based architectures have been …

Shared memory programming for large scale machines

C Barton, C Caşcaval, G Almási, Y Zheng… - ACM SIGPLAN …, 2006 - dl.acm.org
This paper describes the design and implementation of a scalable run-time system and an
optimizing compiler for Unified Parallel C (UPC). An experimental evaluation on …

Optimizing the Use of Static Buffers for DMA on a CELL Chip

T Chen, Z Sura, K O'Brien, JK O'Brien - … Orleans, LA, USA, November 2-4 …, 2007 - Springer
The CELL architecture has one Power Processor Element (PPE) core, and eight Synergistic
Processor Element (SPE) cores that have a distinct instruction set architecture of their own …

Towards ultra-high resolution models of climate and weather

M Wehner, L Oliker, J Shalf - The International Journal of …, 2008 - journals.sagepub.com
We present a speculative extrapolation of the performance aspects of an atmospheric
general circulation model to ultra-high resolution and describe alternative technological …

Optimizing non-blocking collective operations for InfiniBand

T Hoefler, A Lumsdaine - 2008 IEEE International Symposium …, 2008 - ieeexplore.ieee.org
Non-blocking collective operations have recently been shown to be a promising
complementary approach for overlapping communication and computation in parallel …