Overlap of communication and computation (part 1)

I’ve mentioned computation / communication overlap before (e.g., here, here, and here).

Various types of networks and NICs have long since had some form of overlap. Some had better quality overlap than others, from an HPC perspective.

But with MPI-3, we’re really entering a new realm of overlap. In this first of two blog entries, I’ll explain some of the various flavors of overlap and how they are beneficial to MPI/HPC-style applications.

First, let’s discuss exactly what is meant by “traditional” HPC communication/computation overlap. Here’s an example:

CPU: Hello NIC, dear chap. I’ve got this message to send to my good friend over at XYZ. Could you send it for me?
NIC: Why certainly, my good fellow! I’ll take care of everything for you. Check back in a bit; I’ll tell you when I’ve finished sending it.
CPU: Excellent! I’ll nip off and compute some FFTs while you’re sending, and check with you later.
…time passes…
CPU: I’ve finished some FFTs. NIC, old fellow; have you finished sending yet?
NIC: Not quite; check back momentarily.
…more time passes…
CPU: I’ve finished a few more FFTs. NIC, old fellow; have you finished sending yet?
NIC: Right! All finished sending. Here’s your receipt.
CPU: Smashing!

Notice how the CPU goes off and does other things while NIC handles all the details of sending. In other words: CPU has offloaded the task of sending the message to NIC.
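The conversation above boils down to a post / compute / poll loop. Here's a toy, single-threaded Python sketch of that pattern. Everything in it is made up for illustration: `ToyNic`, its `post_send`, `tick`, and `test` methods, and the bytes-per-tick rate are hypothetical, and time is a simulated counter rather than a real clock. A real MPI application would typically express the same idea with a non-blocking send and a completion test.

```python
class ToyNic:
    """Hypothetical NIC model: drains posted bytes at a fixed rate per tick."""

    def __init__(self, bytes_per_tick):
        self.bytes_per_tick = bytes_per_tick
        self.remaining = 0  # bytes still waiting to be sent

    def post_send(self, nbytes):
        # The CPU hands the message over and returns immediately.
        self.remaining += nbytes

    def tick(self, ticks=1):
        # Simulated time passing: the NIC makes progress on its own.
        self.remaining = max(0, self.remaining - ticks * self.bytes_per_tick)

    def test(self):
        # Non-blocking completion check ("have you finished sending yet?").
        return self.remaining == 0


def send_with_overlap(nic, nbytes, ticks_per_fft):
    """Post a send, then compute FFTs until the NIC reports completion."""
    nic.post_send(nbytes)
    ffts_done = 0
    while not nic.test():
        ffts_done += 1           # stand-in for one unit of real computation
        nic.tick(ticks_per_fft)  # the NIC keeps sending while the CPU computes
    return ffts_done


nic = ToyNic(bytes_per_tick=100)
ffts = send_with_overlap(nic, nbytes=1000, ticks_per_fft=2)
print(ffts)
```

The point of the sketch is that `ffts_done` is work the CPU got done "for free" while the send was in flight. With a blocking send, that count would be zero.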

Offloading typically works best for “long” messages, where the combined cost of starting the send and (potentially repeatedly) checking for its completion later is less than the cost of actually sending the message itself. Put differently: there is a substantial amount of time between the initiation of the send and the completion of the send.
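A back-of-the-envelope way to see this trade-off is to subtract the offload overhead from the transfer time. The function name and all of the numbers below are invented for illustration; they are not measurements from any real NIC.

```python
def cpu_time_freed(transfer_us, init_us, poll_us, n_polls):
    """Microseconds of CPU time available for other work during one send.

    A negative result means the bookkeeping cost more than the send itself,
    so overlapping this one message did not pay off.
    """
    overhead_us = init_us + n_polls * poll_us
    return transfer_us - overhead_us


# Hypothetical "long" message: the transfer dwarfs the overhead.
long_msg = cpu_time_freed(transfer_us=1000.0, init_us=1.0, poll_us=0.5, n_polls=4)

# Hypothetical "short" message: same overhead, tiny transfer.
short_msg = cpu_time_freed(transfer_us=2.0, init_us=1.0, poll_us=0.5, n_polls=4)

print(long_msg, short_msg)
```

In this made-up example, the long message frees up nearly the whole transfer time, while the short one comes out negative.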

That being said, remember that CPUs typically operate at much higher frequencies than NIC processors. Hence, if an application is sending a large number of short messages in a small amount of time, the cost of separating the initiation and completion of the short messages no longer matters.

What matters is that there is a lengthy amount of time between when the last send is initiated and when it has actually finished sending (because the send queue got backed up by all the prior sends). Hence, the overlap in this case comes from sending lots of little messages (vs. one big message).

In this lengthy period of time, the CPU can go off and do other, non-network-ish things (e.g., compute FFTs). Hence, offloading to the NIC can be much more efficient and timely in this case than relying on the CPU to continually check when it is possible to initiate the next (short) send.

There are lots of variations on this basic idea, too. For example, an MPI implementation may split the initiation and completion of sends simply to centralize common processing.

Or maybe the MPI implementation intentionally reaps send completions lazily so that it can perform other time-critical functions first.

…and so on.

There are other ways of amortizing the cost of offloading, too. For example, in the above case of sending a large number of small messages, the MPI can ask for a completion notification from only the last message sent. When that notification is received, the MPI knows that all the prior messages have completed sending, too. This can be significantly more efficient than receiving a notification for each short message.
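Here's a toy Python model of that last trick. It assumes a single send queue that the NIC processes strictly in FIFO order, which is what lets one notification stand in for all the earlier sends. The function names and the `(msg_id, signaled)` queue entries are hypothetical, not any real NIC or MPI interface.

```python
from collections import deque


def post_batch(n_messages, signal_every_message):
    """Enqueue n short sends; each entry is (msg_id, signaled)."""
    queue = deque()
    for msg_id in range(n_messages):
        is_last = msg_id == n_messages - 1
        # Either ask for a notification on every send, or only on the last one.
        queue.append((msg_id, signal_every_message or is_last))
    return queue


def drain_send_queue(queue):
    """Toy NIC: sends entries in FIFO order, reporting only signaled ones."""
    notifications = []
    while queue:
        msg_id, signaled = queue.popleft()
        if signaled:
            notifications.append(msg_id)
    return notifications


def completed_ids(notifications):
    """With in-order sending, a notification for k implies 0..k are done."""
    if not notifications:
        return []
    return list(range(max(notifications) + 1))


per_message = drain_send_queue(post_batch(1000, signal_every_message=True))
last_only = drain_send_queue(post_batch(1000, signal_every_message=False))
print(len(per_message), len(last_only), len(completed_ids(last_only)))
```

Both runs send the same 1000 messages, but the second one hands the MPI a single notification to process instead of 1000, and the MPI can still conclude that everything has been sent.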

Some NICs allow the enqueueing of a large message and will automatically fragment it before sending the individual packets out. And if you combine that capability with direct data placement (DDP) at the peer, you have what is commonly referred to as Remote Direct Memory Access (RDMA).
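To make the fragment-then-place idea concrete, here's a small Python sketch. It is only a software model of something a NIC would do in hardware: the `(offset, payload)` packet layout, the function names, and the 4096-byte MTU are all assumptions made for the example.

```python
def fragment(message, mtu):
    """Split a message into (offset, payload) packets of at most mtu bytes."""
    return [(off, message[off:off + mtu]) for off in range(0, len(message), mtu)]


def place_directly(packets, buffer):
    """Write each payload straight into the target buffer at its offset."""
    for off, payload in packets:
        buffer[off:off + len(payload)] = payload


message = bytes(range(256)) * 40      # 10240 bytes of sample data
packets = fragment(message, mtu=4096)  # the sender enqueues one big message

target = bytearray(len(message))       # buffer on the receiving peer
# Deliver in reverse order: in this toy model each packet carries its own
# offset, so the order of arrival does not matter.
place_directly(reversed(packets), target)

print(len(packets), bytes(target) == message)
```

Note that the sender only ever sees one "send" operation, and the receiver's buffer fills in without the receiver's CPU copying individual packets around.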

All of these point-to-point overlap technologies (and more!) have been in common use for years. And they’re great.

But MPI-3 came along and added more requirements. More creative types of overlap are now required. Stay tuned.