High Performance Computing Networking

A friend approached me the other day asking what this Open MPI error message meant:

Memory <address> cannot be freed from the registration cache. Possible memory corruption.

Open MPI was displaying this message late in the application’s run — it was a pretty safe bet that when the message was printed, it was well after the actual error.  “Memory corruption” is a phrase that sends shivers down developers’ spines.

The message itself is, unfortunately, not very helpful.  It turns out that this message is from Open MPI 1.4, which is now the prior production series of Open MPI.  We’ve recently deprecated it by releasing Open MPI v1.6.  And in v1.6, we replaced this error message with something much more helpful.

When my friend upgraded to Open MPI v1.6, he saw the following during MPI_FINALIZE:

Open MPI intercepted a call to free memory that is still being used by  an ongoing MPI communication. This usually reflects an error in the MPI application; it may signify memory corruption. Open MPI will now abort your job.

After which, Open MPI prints the hostname, address, and length of the buffer in question.

Note that this error message is an artifact of using registered memory.  And while this second message is much more helpful than the first, what does it really mean?

It took only moments for my friend to figure it out — “Oh… we forgot to TEST or WAIT on a request!”

Bingo.  What had happened was that the application called MPI_ALLOC_MEM to allocate some memory to use in a non-blocking communication (I forget whether it was a send or a receive).  But then they forgot to complete it with one of the variants of MPI_TEST or MPI_WAIT.

During MPI_FINALIZE, Open MPI tried to free (and deregister) the MPI_ALLOC_MEM’ed memory — but that memory was still being used by the request that had never formally been completed.  Cue error message, abort.
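
To make that concrete, here is a minimal sketch (my own reconstruction, not my friend’s actual code; the counts and ranks are made up) of the kind of pattern that triggers the abort:

    /* Hypothetical reconstruction of the bug -- not the real application */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double *buf;
        MPI_Request req;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Allocate the buffer through MPI; on RDMA-style networks, Open MPI
           may register this memory with the NIC */
        MPI_Alloc_mem(1024 * sizeof(double), MPI_INFO_NULL, &buf);

        if (0 == rank) {
            /* Start a non-blocking send from the MPI_Alloc_mem'ed buffer... */
            MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            /* BUG: no MPI_Test / MPI_Wait on req -- the request is never
               completed, so the buffer is still "in use" */
        } else if (1 == rank) {
            MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        /* MPI_Finalize tries to free (and deregister) buf; on rank 0 the
           buffer is still owned by the never-completed request -- cue the
           error message and abort */
        MPI_Finalize();
        return 0;
    }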

It then only took a few minutes to track down the errant request, put in an appropriate TEST or WAIT, and be up and running.
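
In terms of the sketch above, the fix amounts to completing the request before the buffer is released (and, while we’re at it, freeing what was allocated):

    /* Complete the outstanding request; the buffer can then safely be
       freed and deregistered */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Free_mem(buf);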

w00t!

The moral of the story here is that error messages are important.  Do you have any favorite helpful (or unhelpful!) error messages?  Let me know in the comments below.


2 Comments.


  1. Bad error messages are one of my pet peeves.

    Why didn’t MPI_Finalize complain that the MPI_ALLOC_MEM buffer wasn’t MPI_FREE_MEM’d? That seems like the most helpful message in this particular case, as the app didn’t free a buffer it allocated. Then, the error for MPI_FREE_MEM would be what you returned from MPI_FINALIZE: the memory is still in use and cannot be freed.

    Outside of this particular case, returning an error from MPI_Finalize when requests are leaked would probably help developers too – anything that can help them write proper applications, really. The only worry would be that apps that used to work fine (and still work with other implementations) start failing in MPI_Finalize after an upgrade…


  2. There is no doubt that our error messages could be better — this post is evidence of that.

    Part of the issue is that there are *so many* error messages that it can be daunting to know where to start.

    Additionally, Open MPI actually has a mode where it can notify you during MPI_FINALIZE of leaked MPI handles. But it requires extra overhead and resources, so it’s not enabled by default. It’s the classic balance of features vs. performance, and users have told us many times — right or wrong — that they tend to prefer performance over features (or, more specifically, features that don’t cost anything in performance).
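
    If memory serves, the knob is the mpi_show_handle_leaks MCA parameter (check ompi_info on your installation to confirm the exact name), e.g.:

        mpirun --mca mpi_show_handle_leaks 1 -np 4 ./my_app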

