MPI: Messages, Not Streams

August 27, 2012 - 1 Comment

Periodically, new MPI developers get confused about MPI because they’re coming from environments where they’re used to dealing with streams for inter-process communication: TCP sockets, bi-directional POSIX pipes, etc.

Streams are simple flows of bytes. For example, when you write a 32-byte buffer down a TCP socket, it’s just an in-order sequence of bytes. When the receiver actually tries to receive the message, it may receive some, all, or none of those 32 bytes, depending on the receiver’s timing.

MPI presents a simpler abstraction to applications: the application receives nothing until an entire incoming message has arrived.

Let me explain.

Note that many transports — including both TCP sockets and MPI — have blocking and non-blocking modes.

In blocking mode, both TCP and MPI wait for the entire incoming set of bytes to be received. Hence, in this case, they’re both roughly the same.

However, TCP still receives a simple sequence of bytes into a receiver-supplied buffer. MPI, on the other hand, receives a message. A message is not just an atomic unit (representing some set of bytes); it also carries type and memory-layout information with it.

Hence, when you tell MPI to receive 37 integers, it actually will deliver 37 integers back to you (not just 37*sizeof(int) bytes). More generally, you can also tell MPI to receive a specific struct type — which may include memory “holes” between members. MPI can even receive an array of those structs.
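Concretely, MPI's derived datatypes let you describe such a struct once and then receive whole arrays of it, holes and all. Here is a minimal sketch; the `particle` struct and its field names are hypothetical, and the source/tag/communicator arguments are elided just as in the snippets below:

```c
/* Hypothetical struct with an alignment "hole": on most platforms,
   padding follows 'flag' so that 'value' is 8-byte aligned */
struct particle {
    int    id;
    char   flag;
    double value;
};

MPI_Datatype ptype;
int          blocklens[3] = { 1, 1, 1 };
MPI_Aint     displs[3]    = { offsetof(struct particle, id),
                              offsetof(struct particle, flag),
                              offsetof(struct particle, value) };
MPI_Datatype types[3]     = { MPI_INT, MPI_CHAR, MPI_DOUBLE };

MPI_Type_create_struct(3, blocklens, displs, types, &ptype);
MPI_Type_commit(&ptype);

/* Now MPI can receive an array of 10 particles directly */
struct particle particles[10];
MPI_Recv(particles, 10, ptype, ...);
```

Because the datatype records the displacements (including the padding), MPI lays the received data out correctly in memory without the application copying or repacking anything.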

The difference is more pronounced in non-blocking mode. Applications that care about performance typically use non-blocking network modes because the network is usually several orders of magnitude slower than the main CPU. Hence, a best practice is to start a network communication (either a send or a receive), go do other CPU-based work, and then check for completion of the network communication later.

Streams-based applications (e.g., those using TCP sockets) may not get the entire incoming message in one downcall because not all of the bytes may have been received by the network layer yet. Non-blocking streams-based receiving code may look something like this:

int ret;
size_t recvd_so_far = 0, len = 37 * sizeof(int);

while (recvd_so_far < len) {
    ret = read(fd, buf + recvd_so_far, len - recvd_so_far);
    if (ret > 0) {
        recvd_so_far += ret;
    }
    /* (error handling omitted) */
    /* Do some other work */
}
In MPI, there’s no need to keep attempting to receive new bytes. Instead, an application tells MPI what message to receive, and then can either poll repeatedly or block waiting for completion. So there still may be a loop, but there’s no internal accounting of how many bytes have been received: the message has either arrived or it has not. Perhaps something like this:

int flag = 0;
MPI_Request request;
MPI_Irecv(buf, 37, MPI_INT, ..., &request);

do {
    MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
    if (0 == flag) {
        /* Do some other work */
    }
} while (0 == flag);
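And if the application has no other work to overlap, it can skip the polling loop entirely and block until the message arrives (a sketch, assuming the same `request` from an earlier MPI_Irecv):

```c
/* Block until the entire message has been received */
MPI_Wait(&request, MPI_STATUS_IGNORE);
```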

In short, the main differences between streams and messages are abstraction and convenience. Of course an application can effect message-like semantics over streams — we do this for TCP-based MPI implementations, after all. But the point is that the application shouldn’t have to deal with the added accounting and messiness of streams.

From an application’s point of view, messages are simpler to handle.




  1. I think you hit the nail on the head with “the application shouldn’t have to deal with the added accounting and messiness of streams”. Note that you could replace “streams” with “shared memory” or “RDMA networking”, etc. Those things are messy, and MPI is surprisingly good at hiding the mess while still delivering the performance.