
In part 1 of this series, I discussed various pairwise technologies and techniques that MPI implementations typically use for communication / computation overlap.

MPI-3.0, published in 2012, forced a change in the overlap game.

Specifically: most prior overlap work had been in the area of individual messages between a pair of peers.  These were very helpful for point-to-point messages, especially those of the non-blocking variety.  But MPI-3.0 introduced the concept of non-blocking collective (NBC) operations.  This fundamentally changed the requirements for network hardware offload.

Let me explain.

An MPI collective operation potentially involves more than two peers, and all manner of algorithms can be used to implement it.

For example, consider a simple tree-based broadcast from a root to N peers.  The root will send its message to M peers.  Each of those peers will then send the message to M more peers.  Rinse, repeat until the message has been delivered to all N peers.

In this case, something must happen at each non-root MPI process before messages are sent on to the next set of MPI processes.  This is different from typical split-initiation-and-completion overlap semantics: we must wait for a specific event before something else can occur.

Here’s some simplified pseudocode showing how a broadcast can be implemented:

if (I_have_a_parent) {
    /* Wait for the message to arrive from my parent */
    MPI_Recv(message, …, parent_rank, tag, comm, &status);
}
if (I_have_children) {
    /* Forward the message to each of my children */
    for (i = 0; i < num_children; ++i) {
        MPI_Send(message, …, child_ranks[i], tag, comm);
    }
}

That is, non-root MPI processes receive the message from their parent, and, if they have children of their own, send the message to each of them.

Question: How would this be offloaded to NIC hardware?
Answer: With existing overlap concepts, it can’t.  New overlap concepts are needed.

The reason existing concepts can’t offload this pattern is that they focused on simple, unconditional sends and receives; they had no notion of logic, conditional actions, or deferred operations.

Enter the concept of the triggered send.

A triggered send is pretty much exactly what it sounds like: a send that doesn’t occur until some event happens.  As applied to our simple tree-based broadcast above, the “if I have children” block can be a set of triggered sends; the trigger can be the message receipt from the parent.

Fun note: if the NIC is powerful enough, the triggered sends can re-use the same (payload) buffer that was just received from the parent.

Hence, the whole tree-based broadcast can be offloaded to hardware, and CPU / NIC overlap is achieved.  Huzzah!

There are tricky parts, of course, such as:

  • What happens when an MPI process with a parent in the broadcast tree is late to join the broadcast op, such that the message from the parent has already arrived (i.e., the trigger has already occurred)?
  • How expensive is it to set up triggered operations?  (i.e., how much time does it take)
  • How fast can the hardware respond to triggered events?
  • How to handle resource exhaustion on the NIC when a triggered operation is only partially complete?  (e.g., what if there are no send credits)
  • How well do these triggered operations scale (e.g., how many pending NBCs can be outstanding before a) resource exhaustion, and/or b) performance degradation)?

…and so on.  This is still an area of active research.

The point here is that MPI-3’s NBC operations have changed the game, and introduced new ideas into how to achieve communication / computation overlap.  More new ideas will undoubtedly emerge over time as researchers and vendors continue to explore this space.

This is an area to keep watching.



Authors

Jeff Squyres

The MPI Guy

UCS Platform Software