Why MPI is Good for You (part 3)

I’ve previously posted on “Why MPI is Good for You” (blog tag: why-mpi-is-good-for-you). The short version is that it shields the typical application programmer from lots and lots of underlying network stuff; stuff that they really, really don’t want to be involved in.

Here’s another case study…

Cisco’s upcoming ultra-low latency MPI transport is implemented over an “unreliable” transport: raw Ethernet L2 frames. For latency reasons, it’s using the OpenFabrics verbs operating-system bypass API. These two facts mean that a) userspace is directly talking to the NIC hardware, and b) we don’t have a driver thread running down in the kernel that can service incoming frames regardless of what the MPI application is doing.

As a direct result, Open MPI itself has to handle all the retransmission and ACK issues. This is NOT something an application developer should ever need to (directly) deal with!
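To give a feel for the kind of bookkeeping involved, here is a minimal sketch of sender-side retransmission over an unreliable transport. It is a toy model, not Open MPI's actual code: the `ReliableSender` class, its `timeout` parameter, and the `progress()` hook are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReliableSender:
    """Toy sender-side bookkeeping for an unreliable transport (hypothetical)."""
    timeout: float                               # assumed retransmit timeout
    unacked: dict = field(default_factory=dict)  # seq -> time the frame was last sent
    wire: list = field(default_factory=list)     # log of (time, seq) transmissions

    def send(self, seq, now):
        # Transmit a frame and remember it until the peer ACKs it.
        self.unacked[seq] = now
        self.wire.append((now, seq))

    def on_ack(self, seq):
        # An ACK just tells the sender it may stop retransmitting this frame.
        self.unacked.pop(seq, None)

    def progress(self, now):
        # Called from the library's progress loop: resend anything overdue.
        overdue = [s for s, t in self.unacked.items() if now - t >= self.timeout]
        for seq in overdue:
            self.send(seq, now)
        return overdue


sender = ReliableSender(timeout=1.0)
sender.send(0, now=0.0)
sender.send(1, now=0.0)
sender.on_ack(0)                   # frame 0 arrived; frame 1 (or its ACK) was lost
resent = sender.progress(now=1.5)  # only frame 1 gets retransmitted
```

Note that nothing gets resent unless something calls `progress()`. With no kernel thread to do that for us, the call has to happen from inside the MPI library itself.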

Here’s an excellent example why. The following is a graph showing the latency from the IMB “sendrecv” (bidirectional bandwidth) benchmark on a pair of Intel E5-2690-based servers connected by a single low-latency Ethernet switch, each with 1 Cisco VIC port active. The two lines on this graph reflect two different transport implementation schemes, which I’ll explain below (lower is better):

The graph starts at message size=10,000 bytes.

So what are those spikes? Note that the spikes don’t occur until “large-ish” messages (around 80K).

First, let me explain the lines on the graph: they’re two different schemes for handling the receipt of large messages that were split across many frames.

“No immediate put ACK”: shows the receiver deferring sending back the last ACK (indicating that it received all the frames of the large message).
“Immediate put ACK”: shows immediately sending the last ACK back to the sender.

Although it may seem counter-intuitive, deferring ACKs can result in lower round-trip latency. Specifically, in applications that send large messages frequently, the “no immediate put ACK” scheme allows us to defer sending the ACK until the next pass through our progression engine, enabling us to piggy-back the ACK on the receiver’s next send back to the sender. This means we send fewer frames, which reduces our overall short message latency and is generally a Good Thing.
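The frame-count argument can be sketched with a toy ping-pong model. Everything here is an assumption made for illustration (the `Receiver` class, the dictionary-shaped "frames", the strict one-reply-per-message traffic pattern); it only counts frames and says nothing about real timing.

```python
class Receiver:
    """Toy endpoint that can piggy-back pending ACKs on outgoing data."""

    def __init__(self):
        self.pending_acks = []  # sequence numbers received but not yet ACKed
        self.frames_sent = []   # every frame this endpoint puts on the wire

    def on_frame(self, seq, immediate_ack):
        if immediate_ack:
            # Dedicated ACK-only frame, sent right away.
            self.frames_sent.append({"acks": [seq], "payload": None})
        else:
            # Defer: remember the ACK and wait for a ride.
            self.pending_acks.append(seq)

    def send_data(self, payload):
        # Any deferred ACKs travel along with the next data frame.
        self.frames_sent.append({"acks": self.pending_acks, "payload": payload})
        self.pending_acks = []


def frames_for_ping_pong(rounds, immediate_ack):
    peer = Receiver()
    for seq in range(rounds):
        peer.on_frame(seq, immediate_ack)  # message arrives from the sender
        peer.send_data(f"reply {seq}")     # application answers it
    return len(peer.frames_sent)


deferred = frames_for_ping_pong(4, immediate_ack=False)
immediate = frames_for_ping_pong(4, immediate_ack=True)
```

In this model the deferred scheme puts half as many frames on the wire, but only because a reply is always about to go out. That assumption is exactly what breaks down later.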

Sidenote: keep in mind the difference between needing to send network ACKs and actual MPI completion semantics! Network ACKs, in this case, are only necessary to prevent retransmission; they are NOT directly related to the receiver’s MPI completion semantics.

What this data shows us is that our defer-to-the-next-progression-cycle scheme is not effective for the last incoming frame of a large message. Specifically, the next progression cycle may not happen immediately. MPI may return from the receive before we’re able to send the ACK, effectively inserting a large delay before the ACK is finally sent. This, in turn, delays MPI completion on the sender.

Sad panda.

So we changed our scheme: the “immediate put ACK” line reflects us sending back the last ACK immediately upon receipt of all the frames of a large message without waiting for the next progression cycle.
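Here is a rough sketch of the difference between the two schemes for that final ACK. It is a simplified model under stated assumptions: the `PutReceiver` class, the `immediate_put_ack` flag, and the timestamps are hypothetical, and it only tracks the ACK for the completed message rather than any per-frame ACKs.

```python
class PutReceiver:
    """Toy receiver for one large message split across several frames."""

    def __init__(self, total_frames, immediate_put_ack):
        self.total_frames = total_frames
        self.immediate_put_ack = immediate_put_ack
        self.received = set()
        self.ack_pending = False
        self.acks_sent_at = []  # times at which the final ACK left

    def on_frame(self, index, now):
        self.received.add(index)
        if len(self.received) == self.total_frames:
            if self.immediate_put_ack:
                # New scheme: ACK as soon as the whole message is here.
                self.acks_sent_at.append(now)
            else:
                # Old scheme: leave it for the next progression cycle.
                self.ack_pending = True

    def progress(self, now):
        # Runs only when the application is back inside the MPI library.
        if self.ack_pending:
            self.acks_sent_at.append(now)
            self.ack_pending = False


def final_ack_time(immediate_put_ack):
    rx = PutReceiver(total_frames=3, immediate_put_ack=immediate_put_ack)
    for index, arrival in enumerate([10, 11, 12]):
        rx.on_frame(index, now=arrival)
    rx.progress(now=50)  # the application re-enters MPI much later
    return rx.acks_sent_at[0]


deferred_time = final_ack_time(immediate_put_ack=False)
immediate_time = final_ack_time(immediate_put_ack=True)
```

With these made-up timestamps, the deferred ACK leaves at time 50 while the immediate one leaves at time 12, when the last frame arrives. The gap is entirely up to when the application next calls into MPI.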

You can see the huge improvement. Notice, too, what happens on the bidirectional bandwidth for the same test (higher is better):

Also notice how smooth both the latency and bandwidth lines got after this improvement.

Happy panda!

This is exactly the type of underlying network stuff that a well-tuned MPI implementation does an excellent job of hiding from the application developer.

MPI is good for you.