Do you read MPI error messages?
A friend approached me the other day asking what this Open MPI error message meant:
Memory <address> cannot be freed from the registration cache. Possible memory corruption.
Open MPI was displaying this message late in the application’s run, so it was a pretty safe bet that the message was printed well after the actual error occurred. “Memory corruption” is a phrase that sends shivers down developers’ spines.
The message itself is, unfortunately, not very helpful. It turns out that this message is from Open MPI v1.4, which is now the prior production series of Open MPI. We’ve recently superseded it by releasing Open MPI v1.6. And in v1.6, we replaced this error message with something much more helpful.
When my friend upgraded to Open MPI v1.6, he saw the following during MPI_FINALIZE:
Open MPI intercepted a call to free memory that is still being used by an ongoing MPI communication. This usually reflects an error in the MPI application; it may signify memory corruption. Open MPI will now abort your job.
Open MPI then prints the hostname, address, and length of the buffer in question.
Note that this error message is an artifact of using registered memory. And while this second message is much more helpful than the first, what does it really mean?
It took only moments for my friend to figure it out — “Oh… we forgot to TEST or WAIT on a request!”
Bingo. What had happened was that the application called MPI_ALLOC_MEM to allocate some memory for use in a non-blocking communication (I forget whether it was a send or a receive), but then never completed that communication with one of the variants of MPI_TEST or MPI_WAIT.
During MPI_FINALIZE, Open MPI tried to free (and deregister) the MPI_ALLOC_MEM’ed memory — but that memory was still being used by the request that had never formally been completed. Cue error message, abort.
It then only took a few minutes to track down the errant request, put in an appropriate TEST or WAIT, and be up and running.
The moral of the story here is that error messages are important. Do you have any favorite helpful (or unhelpful!) error messages? Let me know in the comments below.