From my blog post about the upcoming MPI-3 draft spec, Geoffrey Irving asked a followup question:
Can you describe the difference between the current situation and true background progression? Does the lack of background progression mean having to occasionally explicitly relinquish control to MPI in order to let one-sided operations proceed? Once true background progression is in place, would it involve extra threads and context switching, or use some other mechanism?
A great question. He asked it in the context of the new MPI-3 one-sided stuff, but it’s generally applicable for other MPI operations, too (even MPI_SEND / MPI_RECV).
Progressing long-running MPI operations, such as a non-blocking send of a long message, is difficult. There’s what the MPI standard says, and then there’s what (typically) happens in reality.
The MPI-2.2 document says the following in section 3.5 (my emphasis added):
If a pair of matching send and receives have been initiated on two processes, then at least one of these two operations will complete, independently of other actions in the system: the send operation will complete, unless the receive is satisfied by another message, and completes; the receive operation will complete, unless the message sent is consumed by another matching receive that was posted at the same destination process.
There is no mention of how the message passing progress occurs — it just says that it must occur.
Specifically, it doesn’t say that an MPI implementation may rely on calls to MPI functions to trip the internal progress engine. Most MPI implementations use this strategy as a cheap/easy way to keep progress occurring, but the above text makes it pretty clear that an MPI implementation is not allowed to rely solely on such mechanisms for ultimate completion.
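To make the "trip the progress engine from inside MPI calls" strategy concrete, here is a minimal sketch in plain C. It is a toy model, not code from any real MPI implementation: `toy_request`, `toy_isend`, `toy_test`, `toy_wait`, and the 4096-byte `TOY_CHUNK` are all made-up names and values. The only point is that bytes move when the application enters the library, and never in between.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CHUNK ((size_t)4096)  /* bytes moved per progress call (arbitrary) */

/* Hypothetical toy request: NOT a real MPI_Request. */
typedef struct {
    size_t remaining;  /* bytes still to transfer */
    int    complete;   /* 1 once every byte has moved */
} toy_request;

/* The "progress engine": moves at most one chunk per invocation. */
static void toy_progress(toy_request *req)
{
    if (req->complete)
        return;
    size_t n = req->remaining < TOY_CHUNK ? req->remaining : TOY_CHUNK;
    req->remaining -= n;
    if (req->remaining == 0)
        req->complete = 1;
}

/* Analogue of a non-blocking send: records the work and returns at once. */
static void toy_isend(toy_request *req, size_t nbytes)
{
    assert(req != NULL);
    req->remaining = nbytes;
    req->complete  = (nbytes == 0);
}

/* Analogue of a test call: entering the library trips the engine once. */
static int toy_test(toy_request *req)
{
    toy_progress(req);
    return req->complete;
}

/* Analogue of a wait call: spins the engine until the request is done.
 * Returns how many progress calls that took. */
static int toy_wait(toy_request *req)
{
    int calls = 0;
    while (!req->complete) {
        toy_progress(req);
        calls++;
    }
    return calls;
}
```

In this model, a request posted with `toy_isend` just sits there until the application calls `toy_test` or `toy_wait`. An application that computes for an hour between those calls leaves the message untouched for an hour, which is exactly the behavior the progress rule is meant to rule out.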
However, there is the law, and then there is the letter of the law. :-)
Many MPI implementations do not have fully asynchronous progress for all cases. One reason is that a progress thread running in the background can (severely) hurt latency and/or resource consumption. For example, an asynchronous thread requires locks in the critical message passing code paths, and it can thrash caches, incur context switching costs, consume more resources, etc. The effects of progress threads, particularly for short messages, are… complex, at best.
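Here is a rough sketch of where those locks come from, using POSIX threads. Again, this is a toy under stated assumptions and not how any particular MPI implementation does it: `toy_engine`, `toy_isend`, `toy_wait`, and friends are hypothetical names, and the engine tracks just one request. Notice that once a background thread can touch the request, even the code that merely posts a send has to take the mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_CHUNK ((size_t)4096)  /* bytes moved per step (arbitrary) */

/* Hypothetical toy engine with a single in-flight request. */
typedef struct {
    pthread_mutex_t lock;     /* guards every field below */
    pthread_cond_t  changed;  /* signaled on post, completion, shutdown */
    pthread_t       thread;
    size_t remaining;
    int    posted, complete, shutdown;
} toy_engine;

/* Background thread: advances the request with no help from the caller. */
static void *toy_progress_thread(void *arg)
{
    toy_engine *e = arg;
    pthread_mutex_lock(&e->lock);
    for (;;) {
        /* Sleep until there is work to do or we are told to stop. */
        while (!e->shutdown && (!e->posted || e->complete))
            pthread_cond_wait(&e->changed, &e->lock);
        if (e->shutdown)
            break;
        size_t n = e->remaining < TOY_CHUNK ? e->remaining : TOY_CHUNK;
        e->remaining -= n;
        if (e->remaining == 0) {
            e->complete = 1;
            pthread_cond_broadcast(&e->changed);
        }
    }
    pthread_mutex_unlock(&e->lock);
    return NULL;
}

static int toy_engine_start(toy_engine *e)
{
    assert(e != NULL);
    pthread_mutex_init(&e->lock, NULL);
    pthread_cond_init(&e->changed, NULL);
    e->remaining = 0;
    e->posted = e->complete = e->shutdown = 0;
    return pthread_create(&e->thread, NULL, toy_progress_thread, e);
}

/* Even the "fast path" of posting a send now has to take the lock. */
static void toy_isend(toy_engine *e, size_t nbytes)
{
    pthread_mutex_lock(&e->lock);
    e->remaining = nbytes;
    e->posted = 1;
    e->complete = 0;
    pthread_cond_broadcast(&e->changed);
    pthread_mutex_unlock(&e->lock);
}

/* Blocks until done; the caller never drives progress itself. */
static void toy_wait(toy_engine *e)
{
    pthread_mutex_lock(&e->lock);
    while (!e->complete)
        pthread_cond_wait(&e->changed, &e->lock);
    pthread_mutex_unlock(&e->lock);
}

static void toy_engine_stop(toy_engine *e)
{
    pthread_mutex_lock(&e->lock);
    e->shutdown = 1;
    pthread_cond_broadcast(&e->changed);
    pthread_mutex_unlock(&e->lock);
    pthread_join(e->thread, NULL);
    pthread_cond_destroy(&e->changed);
    pthread_mutex_destroy(&e->lock);
}
```

The message completes even if the application never calls back into the library, but every send pays for a lock acquisition and a thread wakeup. For a multi-megabyte message that cost is noise; for an 8-byte message it can easily dominate.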
However, progress threads have been shown to be an acceptable way to get some types of asynchronous progress for large messages, particularly when most of the work is handled by hardware — not software. When dealing with large messages, the added costs of context switching (etc.) don’t matter as much.
Looking at it one way, MPI implementations have weaseled out of doing the hard work of true asynchronous progression. Looking at it another way, MPI implementations have stayed away from that (very) difficult feature because it typically adds to short message latency, which is one of the first metrics that anyone looks at in an MPI implementation.
If your MPI implementation has high short message latency, no one will care if you have true asynchronous progress. That’s a cynical statement, but it’s true (some MPI implementations went out of business because it’s true!).
All that being said, modern networking and co-processor hardware can help in many cases. For example, an MPI implementation can (sometimes) hand off a long message to capable NICs and let the hardware handle the entire transmission (and/or receipt). Once the network action is complete, the hardware can notify the software so that the MPI layer can mark the corresponding MPI_Request as complete.
Hence, most MPI implementations try to co-opt hardware to assist as much as possible (e.g., for offloading message passing operations) and provide the cheapest way possible to honor the MPI progress rule.
But that’s usually still not enough. I know of (at least) one MPI implementation that raised SIGALRM at least once a day to asynchronously trip its progression engine.
But it honored the MPI progress rule. And that MPI implementation was still able to have low latency because it wasn’t encumbered by locks for progress threads, etc.
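For the curious, the signal trick can be sketched in a few lines of POSIX C. I have not seen that implementation's source, so treat this purely as an illustration of the idea: `toy_chunks_left`, `toy_on_alarm`, `toy_install_handler`, `toy_post`, and `toy_is_complete` are invented names, and the "message" is just a counter of chunks.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for sigaction() under strict -std= modes */
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Chunks left in one hypothetical in-flight message. sig_atomic_t is the
 * type a signal handler can portably read and write. */
static volatile sig_atomic_t toy_chunks_left = 0;

/* SIGALRM handler: "trips the progress engine" by moving one chunk. */
static void toy_on_alarm(int signo)
{
    (void)signo;
    if (toy_chunks_left > 0)
        toy_chunks_left = toy_chunks_left - 1;
}

/* Installs the handler; returns 0 on success, -1 on failure. */
static int toy_install_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = toy_on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  /* restart interrupted system calls */
    return sigaction(SIGALRM, &sa, NULL);
}

/* Posts a message of nchunks chunks; no locks on this path. */
static void toy_post(int nchunks)
{
    assert(nchunks >= 0 && nchunks <= 127);  /* portable sig_atomic_t range */
    toy_chunks_left = nchunks;
}

static int toy_is_complete(void)
{
    return toy_chunks_left == 0;
}
```

In a real program, a timer such as `alarm()` or `setitimer()` would deliver SIGALRM on whatever schedule you pick; calling `raise(SIGALRM)` by hand is a convenient way to watch the handler move one chunk per signal. The posting path takes no locks, which is why this style keeps short message latency low, at the price of progress that is only as prompt as the timer.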
But let’s tie this back to the original question — why was Geoffrey asking about asynchronous progress in terms of MPI-3 one-sided stuff?
Because the MPI-3 one-sided working group people tell me that the new MPI-3.0 one-sided functionality will pretty much force the issue of asynchronous progress on all MPI implementations. I honestly don’t know the details (the new MPI-3.0 one-sided chapter scares me!), but I believe them.
Meaning: all of us MPI implementers are going to have to figure out how to do true asynchronous progress — including that of short messages — without adding latency. Yowzers.
Hope that helps explain things!