The anatomy of MPI implementation optimizations

MPI implementations are large, complex beasts.  By definition, they span many different layers ranging from the user-level implementation all the way down to sending 1s and 0s across a physical network medium.

However, not all MPI implementations actually “own” the code at every level.  Consider: a TCP-based MPI implementation only “owns” the user-level middleware.  It cannot see or change anything in the TCP stack (or below).  Such an implementation is limited to optimizations at the user space level.

That being said, there certainly are many optimizations possible at the user level.  In fact, user space is probably where the largest number of optimizations are typically possible.  Indeed, nothing can save your MPI_BCAST performance if you’re using a lousy broadcast algorithm.
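To make the broadcast point concrete, here is a minimal sketch that compares a naive linear broadcast (the root sends to every other rank in turn) with a binomial-tree broadcast. It does not call MPI at all and is not taken from any particular MPI implementation. It only counts communication rounds, under the simplifying assumptions that a rank can send to one peer per round and that every send costs the same. The function names are made up for illustration.

```python
def linear_bcast_rounds(nprocs):
    """Rounds needed when the root sends to each other rank, one per round."""
    return max(nprocs - 1, 0)


def binomial_bcast_schedule(nprocs, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver).

    Ranks are numbered relative to the root. In each round, every rank that
    already holds the data forwards it to one rank that does not, so the
    number of ranks holding the data doubles per round.
    """
    rounds = []
    have = 1  # how many (relative) ranks hold the data so far
    while have < nprocs:
        pairs = []
        for rel in range(have):
            peer = rel + have
            if peer < nprocs:
                sender = (rel + root) % nprocs
                receiver = (peer + root) % nprocs
                pairs.append((sender, receiver))
        rounds.append(pairs)
        have *= 2
    return rounds


if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        tree = len(binomial_bcast_schedule(n))
        print(f"{n} ranks: linear={linear_bcast_rounds(n)} rounds, "
              f"binomial tree={tree} rounds")
```

With 8 ranks, the linear version needs 7 rounds and the tree needs 3. The gap grows with the rank count, and no amount of tuning below user space will close it. Real broadcast algorithms also account for message size and network topology, which this sketch ignores.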

However, lower-layer optimizations are just as important, and can deliver many things that simply cannot be effected from user space.

If not RDMA, then what?

In prior blog posts, I talked about some of the challenges that are associated with implementing MPI over RMA- or RDMA-based networks.  The natural question then becomes, “What’s the alternative?”

There are at least two general classes of alternatives:

  • General purpose networks (e.g., Ethernet — perhaps using TCP/IP or even UDP)
  • Special purpose networks (i.e., built specifically for MPI)

This doesn’t even mention shared memory, but let’s return to shared memory as an MPI transport in a future post.

EuroMPI 2011 Call for Participation

WHAT: EuroMPI 2011 Conference
WHERE: Santorini, Greece
WHEN: September 18-21, 2011


EuroMPI is the primary meeting where the users and developers of MPI and other message-passing programming environments can interact. The 18th European MPI Users’ Group Meeting will be a forum for the users and developers of MPI, but will also welcome hybrid programming models that combine message passing with programming of modern architectures such as multi-core processors or accelerators.

Through the presentation of contributed papers, poster presentations and invited talks, attendees will have the opportunity to share ideas and experiences to contribute to the improvement and furthering of message-passing and related parallel programming paradigms.

Registered Memory (RMA / RDMA) and MPI implementations

In a prior blog post, I talked about RMA (and RDMA) networks, and what they mean to MPI implementations.  In this post, I’ll talk about one of the consequences of RMA networks: registered memory.

Registered memory is something that most HPC administrators and users have at least heard of, but may not fully understand.

Let me clarify it for you: registered memory is both a curse and a blessing.

It’s more of the former than the latter, if you ask me, but MPI implementations need to use (and track) registered memory to get high performance on today’s high-performance networking API stacks.
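As a rough illustration of what "tracking" registered memory can look like, here is a toy registration cache. It is a sketch under stated assumptions, not how any specific MPI implementation or network stack does it. It assumes registration is expensive enough to be worth caching, that a buffer is identified by an (address, length) pair, and that least-recently-used eviction is acceptable. The `_register` and `_deregister` methods are hypothetical stand-ins for real network API calls; here they only count invocations.

```python
from collections import OrderedDict


class RegistrationCache:
    """Toy LRU cache mapping (address, length) to a registration handle."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (addr, length) -> handle
        self.registrations = 0
        self.deregistrations = 0

    def _register(self, addr, length):
        # Hypothetical stand-in for an expensive registration call.
        self.registrations += 1
        return ("handle", addr, length)

    def _deregister(self, handle):
        # Hypothetical stand-in for the matching deregistration call.
        self.deregistrations += 1

    def lookup(self, addr, length):
        """Return a handle for the buffer, registering it only on a miss."""
        key = (addr, length)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        if len(self.entries) >= self.capacity:
            # Evict the least recently used registration to make room.
            _, old_handle = self.entries.popitem(last=False)
            self._deregister(old_handle)
        handle = self._register(addr, length)
        self.entries[key] = handle
        return handle


if __name__ == "__main__":
    cache = RegistrationCache(capacity=2)
    for _ in range(1000):
        cache.lookup(0x1000, 4096)  # same buffer reused for every send
    print("registrations performed:", cache.registrations)
```

Reusing the same buffer a thousand times costs one registration in this sketch. The "curse" part is everything the sketch leaves out, such as noticing when the application frees or remaps a buffer that is still in the cache.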

“RDMA” — what does it mean to MPI applications?

RDMA stands for Remote Direct Memory Access.  The acronym is typically associated with OpenFabrics networks such as iWARP, IBoE (a.k.a. RoCE), and InfiniBand.  But “RDMA” is just today’s flavor du jour of a more general concept: RMA (remote memory access), or directly reading and writing to a peer’s memory space.

RMA implementations (including RDMA-based networks, such as OpenFabrics) typically include one or more of the following technologies:

  1. Operating system bypass: userspace applications directly communicate with network hardware.
  2. Hardware offload: network activity is driven by the NIC, not the main CPU.
  3. Hardware or software notification: the application is told when messages finish sending or are received.

How are these technologies typically used in MPI implementations?

