MPI_REQUEST_FREE is Evil

It was pointed out to me that in my last blog post (Don’t leak MPI_Requests), I failed to mention the MPI_REQUEST_FREE function.

True enough — I did fail to mention it. But I did so on purpose, because MPI_REQUEST_FREE is evil.

Let me explain…

MPI_REQUEST_FREE is described in MPI-3 section 3.7.3 as:

Mark the request object for deallocation and set request to MPI_REQUEST_NULL. An ongoing communication that is associated with the request will be allowed to complete. The request will be deallocated only after its completion.

Sounds like a pretty good way to fire a non-blocking send and forget about it, right? You could do something like this:

MPI_Isend(buffer, …, &request);
MPI_Request_free(&request);

Looks great!

…except that it isn’t. 🙁

The issue is knowing when the send has completed: when is it safe to modify or free the send buffer? The usual argument is that you can tell when the send has finished by having the receiver send an ACK back to the sender once the message has been received.

Believe it or not, that is not sufficient!

Just because the message has been received at the far side does not mean that the sending side has completed all of its internal accounting and stopped using the send buffer.

To be totally clear: MPI may still be using the send buffer, even though the message has been received at the destination.

That may seem totally counter-intuitive, but it’s true.

In the above example, suppose the MPI implementation needs to register the buffer with the network hardware before sending it. Remember that registering memory takes time; it’s “slow”. De-registering memory is even slower. So even after the local network hardware has finished sending the buffer, the MPI implementation may choose to de-register the memory, which not only takes time but also likely involves updating critical state in the implementation’s internal tables.

And since the request was freed, the MPI implementation may conclude that completing all work associated with this request is (very) low priority. Since the app doesn’t care about this request any more, why not complete all other (non-freed) requests first?

Additionally, this de-registration / updating work may happen asynchronously with respect to the user’s application, perhaps even outside of the main application thread.

Hence, even if the main application thread receives an ACK from the receiver, work such as de-registration may not yet be complete. Freeing the memory before that de-registration / updating work is complete could be catastrophic.

The fact of the matter is: if you free an ongoing send request, you have only one guarantee as to when MPI will be finished with that buffer: when MPI_FINALIZE completes.

A similar situation occurs with freeing non-blocking receive requests. How will you know for sure when a) the message has been fully received, and b) the MPI implementation is finished with the receive buffer? Remember that only the matching of MPI messages is ordered, not their completion: you can’t know if a message has been fully received, even if a subsequent message of the same signature has been matched.

The moral of the story: only call MPI_REQUEST_FREE on a non-blocking request if you don’t need to do anything with its buffer until after MPI_FINALIZE returns.