
ummunotify hits the -mm kernel tree

May 11, 2010 at 12:00 pm PST

The “ummunotify” functionality was added to the “-mm” Linux kernel tree yesterday.

/me does a happy dance

Granted, getting into the -mm tree doesn’t guarantee anything about getting into Linus’ tree.  But it’s definitely a step in the right direction.

Let me tell you why this is a Big Deal: memory management for networks based on OS-bypass techniques is a nightmare.  Ummunotify makes it slightly less of a nightmare.  This is good for MPI implementations and good for real-world MPI applications.

Until ummunotify, Linux MPI implementations have had to use one of various ugly tricks to intercept the freeing of registered memory; one such trick is sketched below.
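For the curious, here is a minimal sketch of one classic trick: glibc’s malloc hooks.  The “evict from pin cache” step is just a comment here, and real implementations went considerably further (Open MPI, for example, has shipped its own copy of ptmalloc2 largely to catch exactly these events):

    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Intercept free() via glibc's (deprecated) __free_hook so the MPI
     * library can notice when the application releases a buffer. */
    static void (*old_free_hook)(void *, const void *);

    static void my_free_hook(void *ptr, const void *caller)
    {
        (void) caller;                 /* unused */
        __free_hook = old_free_hook;   /* uninstall to avoid recursion */
        /* ...evict ptr from the "already pinned" cache here... */
        fprintf(stderr, "free(%p) intercepted\n", ptr);
        free(ptr);
        old_free_hook = __free_hook;
        __free_hook = my_free_hook;    /* reinstall */
    }

    int main(void)
    {
        old_free_hook = __free_hook;
        __free_hook   = my_free_hook;

        void *buf = malloc(64);
        free(buf);                     /* lands in my_free_hook() */
        return 0;
    }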

“What’s registered memory?” you ask.

I’m glad you asked.  One of the premises of OS-bypass networks is that the NIC moves data in and out of memory without the OS’s knowledge.  To do this properly, the MPI implementation must first “pin” the virtual memory in the OS.  Pinning tells the OS, “don’t move or page out this virtual memory” — it allows the NIC to read/write that memory without fear of the OS moving it in the middle of an operation.
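To make “pinning” concrete: on OpenFabrics-style networks it takes the form of memory registration.  A minimal sketch using libibverbs (assuming a protection domain pd was created with ibv_alloc_pd() during startup):

    #include <infiniband/verbs.h>
    #include <stdio.h>

    /* Pin ("register") a buffer so the NIC can DMA into and out of it
     * without OS involvement. */
    static struct ibv_mr *pin_buffer(struct ibv_pd *pd, void *buf, size_t len)
    {
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (mr == NULL)
            perror("ibv_reg_mr");
        return mr;   /* the caller eventually unpins with ibv_dereg_mr(mr) */
    }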

“What’s this got to do with ummunotify?” you ask.

Let me tell you.  “Pinning” memory is required before handing a buffer to the underlying network hardware for sending or receiving, but it can be slow.  It’s an old MPI implementation optimization to “pin” memory the first time it sees a given buffer in an MPI_SEND (or equivalent).  The next time the MPI implementation sees the same buffer, it knows that the buffer is already pinned and can initiate the underlying network operation immediately.  In other words: the first MPI_SEND on a given buffer is “slow” — subsequent sends are “fast”.
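Here is an illustrative sketch of that “lazy pinning” cache.  The linear-scan table and the function name are mine, purely for illustration; as noted below, real implementations use a hash table (or tree):

    #include <stddef.h>
    #include <infiniband/verbs.h>

    /* Toy pin cache: pin on first use, hit the cache every time after. */
    #define PIN_CACHE_SIZE 128
    static struct {
        void *base;
        size_t len;
        struct ibv_mr *mr;
    } pin_cache[PIN_CACHE_SIZE];
    static int n_pinned;

    struct ibv_mr *get_registration(struct ibv_pd *pd, void *buf, size_t len)
    {
        for (int i = 0; i < n_pinned; i++)
            if (pin_cache[i].base == buf && pin_cache[i].len >= len)
                return pin_cache[i].mr;      /* fast path: already pinned */

        /* Slow path: first time we've seen this buffer, so pin it now. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (mr != NULL && n_pinned < PIN_CACHE_SIZE) {
            pin_cache[n_pinned].base = buf;
            pin_cache[n_pinned].len  = len;
            pin_cache[n_pinned].mr   = mr;
            n_pinned++;
        }
        return mr;
    }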

“And that’s related to ummunotify… how?” you ask.

Right, I’m getting there.  The problem is that MPI implementations have to track all the buffers that they’ve seen and pinned.  So when an application calls MPI_SEND, the MPI typically looks up that buffer in some kind of hash table to see if it’s already pinned.  But the important point here is that MPI (usually) does not control the allocation (and subsequent freeing) of buffers that are used for sending.  So if an application mallocs a buffer and calls MPI_SEND on it, the MPI implementation may pin it.  But then the application may free() that buffer — but a) it’s still pinned, and b) it’s still listed in the MPI’s hash table of “already pinned” addresses.

“Ewwww!!” you say.

‘zactly.  But it’s even worse.  The application may malloc() a different buffer and may get exactly the same virtual address back, but have it correspond to a different physical address.  If the application calls MPI_SEND with that buffer, MPI will think it’s exactly the same buffer — it has no way of knowing that the physical address behind that virtual address is different than it was before.  Yow!
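You can watch this address recycling happen yourself; with a typical glibc allocator, a malloc() of the same size will very often hand back the pointer you just freed:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        void *a = malloc(4096);
        printf("first buffer:  %p\n", a);  /* MPI pins this and caches the address */
        free(a);                           /* ...but MPI never hears about this */

        void *b = malloc(4096);
        printf("second buffer: %p\n", b);  /* frequently the exact same address */

        /* If b == a, a stale pin cache will "recognize" b, skip pinning, and
         * reuse the old registration, even though the virtual-to-physical
         * mapping behind that address may now be completely different. */
        return 0;
    }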

“But wait, Uncle Jeff.  How does ummunotify solve this?” you ask.

I’m glad you asked.  Currently, MPI implementations use one (or more) gross, disgusting hacks to track when memory is freed so that they can know when to update their hash tables.  Ummunotify gives user-level code visibility into the kernel MMU notification system, meaning that MPI middleware can be notified when memory is given back to the OS.  It allows an MPI implementation to build a robust, reliable system to ensure that the “already pinned buffer” cache contains “good” virtual addresses.  
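To make that concrete, here is an illustrative sketch of the proposed userspace interface: open the ummunotify character device, then tell it which address ranges you care about.  The struct layout and ioctl numbers below are my recollection of the posted patch; treat them as assumptions and consult the patch itself before relying on them:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    /* Region-registration request, after the proposed <linux/ummunotify.h>.
     * The ioctl numbers here are illustrative, not authoritative. */
    struct ummunotify_register_ioctl {
        uint64_t start;        /* first byte of the region to watch */
        uint64_t end;          /* last byte of the region to watch */
        uint64_t user_cookie;  /* opaque tag echoed back in events */
        uint32_t flags;
        uint32_t reserved;
    };
    #define UMMUNOTIFY_REGISTER_REGION   _IOW('U', 1, struct ummunotify_register_ioctl)
    #define UMMUNOTIFY_UNREGISTER_REGION _IOW('U', 2, uint64_t)

    /* Ask the kernel to tell us if any part of [buf, buf+len) leaves the
     * process (munmap(), memory handed back to the OS, etc.). */
    static int watch_region(int fd, void *buf, size_t len, uint64_t cookie)
    {
        struct ummunotify_register_ioctl req = {
            .start       = (uintptr_t) buf,
            .end         = (uintptr_t) buf + len - 1,
            .user_cookie = cookie,
        };
        return ioctl(fd, UMMUNOTIFY_REGISTER_REGION, &req);
    }

    int main(void)
    {
        int fd = open("/dev/ummunotify", O_RDONLY);
        if (fd < 0)
            return 1;

        void *buf = malloc(4096);
        watch_region(fd, buf, 4096, 42);
        /* ...later, read() events from fd to learn which cookies fired... */
        return 0;
    }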

Yay!



3 Comments.


  1. I’m going to join Jeff’s happy dance!!! I really hope that some day we will see it in the default kernel. It is a long-awaited feature… BTW, Jeff does not say in the blog that he is actually THE person who has been pushing this effort for the last few years…


  2. How would something like this affect mixing MPI with other technologies that use pinned memory? I am thinking of memory transfers to and from accelerators, as done in the CUDA toolkit.


  3. Jeff Squyres

    ummunotify is not tied to MPI, so it can be mixed with anything. Open MPI uses ummunotify as follows:

    * When Open MPI registers memory with OpenFabrics, it also sends a “hey, I care about memory region X-Y” down to ummunotify.

    * If Open MPI unregisters memory with OpenFabrics, it sends an “I no longer care about memory region X-Y” down to ummunotify.

    * When Open MPI is about to use its registered memory cache, it checks with ummunotify to see if anything has “changed” (there’s a quick/cheap way to do this; see the sketch below).

    * If ummunotify indicates that memory mappings may have changed, Open MPI performs a more expensive operation to find out exactly what changed, and updates its registration cache accordingly.

    * Open MPI then checks the registration cache for the buffer in question and proceeds with normal operation.

    As you can see, ummunotify has nothing to do with memory transfers or networking hardware — all it does is track when memory leaves a process. If you have previously registered “I care about memory region X-Y” with ummunotify, and then some or all of region X-Y leaves your process, ummunotify will tell you about it. That’s all.

    But to directly answer the other part of your question: yes, ummunotify can be used by different parts of the same application without conflict. Hence, MPI can use ummunotify internally, and some other part of the process can use ummunotify as well (e.g., to monitor accelerator pinned memory).
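    For the curious, the “quick/cheap” check works roughly like the sketch below. As best I recall from the proposed patch (so treat the details as assumptions), you mmap() a page from the ummunotify file descriptor that holds a 64-bit generation counter, which the kernel bumps on every event:

        #include <stdint.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static volatile uint64_t *gen;  /* generation counter mapped from the fd */
        static uint64_t last_seen;

        int map_counter(int ummunotify_fd)
        {
            gen = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ,
                       MAP_SHARED, ummunotify_fd, 0);
            return gen == MAP_FAILED ? -1 : 0;
        }

        /* The cheap check: a single memory load.  Only when the counter has
         * moved do we pay for read()ing the events and repairing the cache. */
        int cache_may_be_stale(void)
        {
            uint64_t now = *gen;
            if (now == last_seen)
                return 0;   /* nothing changed: trust the registration cache */
            last_seen = now;
            return 1;       /* events pending: read() them and fix the cache */
        }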
