The “ummunotify” functionality was added to the “-mm” Linux kernel tree yesterday.
/me does a happy dance
Granted, getting into the -mm tree doesn’t guarantee anything about getting into Linus’ tree. But it’s definitely a step in the right direction.
Let me tell you why this is a Big Deal: memory management for networks based on OS-bypass techniques is a nightmare. Ummunotify makes it slightly less of a nightmare. This is good for MPI implementations and good for real-world MPI applications.
Until ummunotify, Linux MPI implementations have had to use one of several ugly tricks to intercept the freeing of registered memory.
“What’s registered memory?” you ask.
I’m glad you asked. One of the premises of OS-bypass networks is that the NIC moves data in and out of memory without the OS’s knowledge. To do this properly, the MPI implementation must first “pin” the virtual memory in the OS. Pinning tells the OS, “don’t move or page out this virtual memory.” That allows the NIC to read and write that memory without fear of the OS moving it in the middle of an operation.
“What’s this got to do with ummunotify?” you ask.
Let me tell you. “Pinning” memory is required before calling the underlying network hardware for sending or receiving, but it can be slow. It’s an old MPI implementation optimization to “pin” memory the first time it sees a given buffer in an MPI_SEND (or equivalent). The next time the MPI implementation sees the same buffer, it knows that the buffer is already pinned and can hand it to the underlying network hardware immediately. In other words: the first MPI_SEND on a given buffer is “slow”; subsequent sends are “fast”.
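To make the “slow first, fast afterwards” idea concrete, here is a minimal sketch in C. Everything in it is made up for illustration: `send_buffer`, `pin_memory`, and the fixed-size array are hypothetical names, not code from any real MPI implementation, and `pin_memory` is just a stand-in that counts calls instead of talking to a NIC. A linear scan replaces the hash table to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 64  /* hypothetical fixed capacity for this sketch */

/* One remembered registration: the buffer's address and length. */
struct pin_entry {
    const void *addr;
    size_t len;
    bool valid;
};

static struct pin_entry cache[CACHE_SLOTS];
static int slow_pins = 0;  /* counts how often the slow path ran */

/* Stand-in for the expensive call that registers memory with the NIC. */
static void pin_memory(const void *addr, size_t len) {
    (void)addr;
    (void)len;
    slow_pins++;
}

/* Returns true on a cache hit (fast path), false if the buffer
 * had to be pinned first (slow path). */
static bool send_buffer(const void *addr, size_t len) {
    assert(addr != NULL);
    int free_slot = -1;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].addr == addr && cache[i].len >= len)
            return true;                 /* already pinned */
        if (!cache[i].valid && free_slot < 0)
            free_slot = i;               /* remember first empty slot */
    }
    pin_memory(addr, len);               /* first time we see this buffer */
    if (free_slot >= 0)
        cache[free_slot] = (struct pin_entry){ addr, len, true };
    return false;
}
```

Calling `send_buffer` twice on the same buffer takes the slow path once and the fast path the second time, which is the whole point of the optimization.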
“And that’s related to ummunotify… how?” you ask.
Right, I’m getting there. The problem is that MPI implementations have to track all the buffers that they’ve seen and pinned. So when an application calls MPI_SEND, the MPI implementation typically looks up that buffer in some kind of hash table to see if it’s already pinned. But the important point here is that MPI (usually) does not control the allocation (and subsequent freeing) of buffers that are used for sending. So if an application mallocs a buffer and calls MPI_SEND on it, the MPI implementation may pin it. But then the application may free() that buffer, and now a) it’s still pinned, and b) it’s still listed in the MPI implementation’s hash table of “already pinned” addresses.
“Ewwww!!” you say.
‘zactly. But it’s even worse. The application may malloc() a different buffer and may get exactly the same virtual address back, but have it correspond to a different physical address. If the application calls MPI_SEND with that buffer, MPI will think it’s exactly the same buffer, because it has no way of knowing that the physical address behind that virtual address is different than it was before. Yow!
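Here is a toy simulation of that failure in C. It is a sketch under loud assumptions: there is no real OS or NIC involved, the “page table” is a single integer, and every name (`sim_phys_page`, `sim_free_and_realloc`, `nic_target_page`, `send_is_stale`) is invented for the example. It only shows why a cache keyed on virtual addresses cannot notice the change by itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simulated OS state for one virtual address (sketch only). */
static uintptr_t sim_vaddr = 0x1000;
static int sim_phys_page = 7;   /* physical page currently behind sim_vaddr */

/* What the MPI layer remembers from the time it pinned the buffer. */
struct pinned {
    uintptr_t vaddr;
    int phys_page;
    bool valid;
};
static struct pinned entry;

/* Pin: record which physical page backs the address right now. */
static void pin(uintptr_t vaddr) {
    assert(vaddr == sim_vaddr);
    entry = (struct pinned){ vaddr, sim_phys_page, true };
}

/* The application frees the buffer, and a later malloc() returns the
 * same virtual address backed by a different physical page. */
static void sim_free_and_realloc(void) {
    sim_phys_page = 42;
}

/* Returns the physical page the NIC would use for a send from vaddr. */
static int nic_target_page(uintptr_t vaddr) {
    if (entry.valid && entry.vaddr == vaddr)
        return entry.phys_page;   /* cache hit: trusts the old registration */
    pin(vaddr);
    return entry.phys_page;
}

/* True if the NIC would touch a page that no longer backs vaddr. */
static bool send_is_stale(uintptr_t vaddr) {
    return nic_target_page(vaddr) != sim_phys_page;
}
```

The first send is fine. After the simulated free-and-reallocate, the lookup still hits, so the send goes to the old physical page.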
“But wait, Uncle Jeff. How does ummunotify solve this?” you ask.
I’m glad you asked. Currently, MPI implementations use one (or more) gross, disgusting hacks to track when memory is freed so that they can know when to update their hash tables. Ummunotify gives user-level code visibility into the kernel MMU notification system, meaning that MPI middleware can be notified when memory is given back to the OS. It allows an MPI implementation to build a robust, reliable system to ensure that the “already pinned buffer” cache contains “good” virtual addresses.
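As a rough sketch of what an MPI implementation might do with such a notification, here is a C example that drops every cached entry overlapping a released address range. To be clear, this is not the ummunotify interface: `on_memory_released`, `cache_add`, and `cache_contains` are hypothetical names, and how the notification actually reaches user space is left out entirely. A real implementation would also unpin the memory where this sketch only clears a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 16  /* hypothetical fixed capacity for this sketch */

/* One "already pinned" range: [start, start + len). */
struct pin_entry {
    uintptr_t start;
    size_t len;
    bool valid;
};
static struct pin_entry cache[CACHE_SLOTS];

/* Record a pinned range; returns false if the cache is full. */
static bool cache_add(uintptr_t start, size_t len) {
    assert(len > 0);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (!cache[i].valid) {
            cache[i] = (struct pin_entry){ start, len, true };
            return true;
        }
    }
    return false;
}

/* True if [start, start + len) lies entirely inside a cached range. */
static bool cache_contains(uintptr_t start, size_t len) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && start >= cache[i].start &&
            start + len <= cache[i].start + cache[i].len)
            return true;
    }
    return false;
}

/* Hypothetical handler for "this range went back to the OS".
 * Invalidates every overlapping entry and returns how many were dropped. */
static int on_memory_released(uintptr_t start, size_t len) {
    int dropped = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid &&
            cache[i].start < start + len &&
            start < cache[i].start + cache[i].len) {
            cache[i].valid = false;   /* a real implementation would unpin here */
            dropped++;
        }
    }
    return dropped;
}
```

With a handler like this in place, the next send on a reused virtual address misses the cache and gets pinned again, instead of trusting a stale entry.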
(clipart used by permission of arthursclipart.org)