Better Linux memory tracking
Yesterday morning, we (Open MPI) entered what is hopefully a final phase of testing for a “better” implementation of the “leave registered” optimization for OpenFabrics networks. I briefly mentioned this work in a prior blog entry; it’s now finally coming to fruition. Woo hoo!
Roland Dreier has pushed a new Linux kernel module upstream for helping user-level applications track when memory leaves their process (it’s not guaranteed that this kernel module will be accepted, but it looks good so far). This kernel module allows MPI implementations, for example, to be alerted when registered memory is freed — a critical operation for certain optimizations and proper under-the-covers resource management.
What does this mean to the average MPI application user? It means that future versions of Open MPI (and other MPI implementations) will finally have a solid, bulletproof way to implement the “leave registered” optimization for large message passing. Prior versions of this optimization required nasty, ugly, dirty Linux hacks that sometimes broke real-world applications.
The new way will not break any applications because it gets help from the underlying operating system (rather than trying to go around or hijack certain operating system functions).
Let’s back up a few steps to explain the problem…
In some types of networks (e.g., OpenFabrics), you have to “register” memory with the NIC and the operating system (OS) before you can send or receive with that memory.
Registration is “slow”: you have to trap down to the OS and potentially communicate across the PCI bus to the NIC. Since registration is so expensive, MPI implementations have long offered the “leave registered” optimization: once a buffer is passed to an MPI communication call (such as MPI_SEND), the MPI registers it and then leaves it registered. Therefore, the next time that buffer is passed to MPI_SEND, it doesn’t have to be registered again and can be used immediately. This is a fairly important optimization for applications that repeatedly re-use the same communication buffers, especially for large message latency and bandwidth; its effects are easily observable in popular ping-pong benchmarks.
The problem, however, is that the application may free the buffer at any time. For various reasons, freeing registered memory is a Bad Thing: not only is it a memory leak, it can also trick the MPI into thinking that new buffers are actually registered when they’re not. To be clear: MPI has no direct visibility into when the application frees the buffer, and therefore doesn’t know to de-register it.
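To make the optimization concrete, here is a minimal sketch in C of a registration cache. It is not Open MPI’s code: slow_register() is a stand-in for the real NIC/OS registration call, and the names toy_send, cache_lookup, and cache_insert are hypothetical, invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 16

/* Hypothetical cache entry describing one registered buffer. */
struct reg_entry {
    uintptr_t base;
    size_t len;
    bool valid;
};

static struct reg_entry cache[CACHE_SLOTS];
static int next_slot;
static int slow_registrations;   /* how often the slow path was taken */

/* Stand-in for the expensive registration with the NIC and the OS. */
static void slow_register(const void *addr, size_t len)
{
    (void)addr;
    (void)len;
    slow_registrations++;
}

/* True if [addr, addr + len) lies inside an already-registered entry. */
static bool cache_lookup(const void *addr, size_t len)
{
    uintptr_t start = (uintptr_t)addr;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && start >= cache[i].base &&
            start + len <= cache[i].base + cache[i].len)
            return true;
    }
    return false;
}

/* Remember a registration; slots are simply overwritten round-robin.
 * A real cache would de-register whatever it evicts. */
static void cache_insert(const void *addr, size_t len)
{
    assert(len > 0);
    cache[next_slot] = (struct reg_entry){ (uintptr_t)addr, len, true };
    next_slot = (next_slot + 1) % CACHE_SLOTS;
}

/* Toy send: register on first use, then leave the buffer registered. */
static void toy_send(const void *buf, size_t len)
{
    if (!cache_lookup(buf, len)) {
        slow_register(buf, len);
        cache_insert(buf, len);
    }
    /* ... the registered buffer would be handed to the network here ... */
}
```

Calling toy_send() twice on the same buffer takes the slow path only once; the second call finds the entry in the cache and skips registration entirely.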
What to do?
In the Open MPI project, we’ve gone through several iterations of ugly hacks to intercept various forms of free(), munmap(), etc. Our latest iteration, initially released in version 1.3.2, is perhaps the best version we’ve done so far: we hijack the underlying glibc allocator at run-time using the __malloc_initialize_hook (admittedly, we took some inspiration from the MX and OpenMX projects).
In this way, the underlying calls to malloc(), free(), and other related functions call back up into Open MPI to actually do all the work. Hence, Open MPI is notified when memory leaves the process and therefore needs to be de-registered.
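The idea of being called back when memory is released can be sketched in plain C. This is a toy model, not the glibc hook mechanism itself: tracked_malloc() and tracked_free() are hypothetical wrappers that stand in for the hijacked allocator, and release_hook is a hypothetical single callback slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callback type: invoked just before memory is released. */
typedef void (*release_hook_fn)(void *addr);

/* A single hook slot, standing in for the hijacked allocator. */
static release_hook_fn release_hook;

/* Toy "registration cache" holding at most one buffer (0 = empty). */
static uintptr_t registered_addr;
static int deregistrations;

/* Pretend to register a buffer and leave it registered. */
static void toy_register(void *addr)
{
    assert(addr != NULL);
    registered_addr = (uintptr_t)addr;
}

/* The MPI side of the hook: drop the cached registration if it matches. */
static void mpi_on_release(void *addr)
{
    if (registered_addr != 0 && (uintptr_t)addr == registered_addr) {
        registered_addr = 0;
        deregistrations++;
    }
}

/* Wrapper allocator; it just forwards to the C library. */
static void *tracked_malloc(size_t len)
{
    return malloc(len);
}

/* Wrapper free: notify the hook first, then actually release the memory. */
static void tracked_free(void *addr)
{
    if (addr != NULL && release_hook != NULL)
        release_hook(addr);
    free(addr);
}
```

With release_hook set to mpi_on_release, freeing a registered buffer through tracked_free() de-registers it before the memory goes away. Note that there is only one hook slot in this sketch: whoever assigns it last silently displaces any earlier owner.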
It’s elegant, doesn’t require strange linker tricks, and seems to work in all cases.
Except when someone else also hijacks the underlying glibc allocator.
That is, Open MPI can’t guarantee that it will be able to hijack the allocator, and therefore it can’t guarantee that it will be able to use the “leave registered” optimization.
Roland’s “ummunotify” kernel module effectively gives a user-level process access to the kernel MMU notification API. Through some simple ioctls, a process can indicate that it wants to be notified when specific memory regions leave its memory space.
There’s no hijacking of the underlying allocator involved, no dirty linker tricks, and multiple entities within a single process can elect to receive notifications on the same memory regions (e.g., multiple different middleware packages).
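The notification model can be illustrated with a small user-space simulation in C. To be clear about the assumptions: this does not talk to the kernel module and does not show ummunotify’s actual ioctl interface. The names watch_region and notify_unmapped, and the struct watch layout, are hypothetical and exist only to show several subscribers being told about the same region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WATCHES 8

/* Hypothetical record of one subscriber's interest in a memory region. */
struct watch {
    uintptr_t start, end;   /* half-open range [start, end) */
    int owner;              /* id of the subscribing library */
    int events;             /* invalidations delivered so far */
    int active;
};

static struct watch watches[MAX_WATCHES];

/* Subscribe `owner` to [start, end); returns a slot index, or -1 if full. */
static int watch_region(int owner, uintptr_t start, uintptr_t end)
{
    assert(start < end);
    for (int i = 0; i < MAX_WATCHES; i++) {
        if (!watches[i].active) {
            watches[i] = (struct watch){ start, end, owner, 0, 1 };
            return i;
        }
    }
    return -1;
}

/* Simulate [start, end) leaving the process: every overlapping watch
 * gets one event. Returns the number of watches that were notified. */
static int notify_unmapped(uintptr_t start, uintptr_t end)
{
    int delivered = 0;
    for (int i = 0; i < MAX_WATCHES; i++) {
        if (watches[i].active &&
            start < watches[i].end && watches[i].start < end) {
            watches[i].events++;
            delivered++;
        }
    }
    return delivered;
}
```

Two different owners can watch the same range, and a single unmap of that range notifies both. Nothing is displaced, which is the key difference from a single allocator hook.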
While primarily developed at Cisco, the prototype Open MPI code that uses the new ummunotify code is being tested by various Open MPI community members.
We expect to bring this new feature into the Open MPI mainline Subversion repository in the not-too-distant future, hopefully in time for the upcoming 1.5 series (note that the prototype code link will likely go stale once it is merged to the Open MPI mainline).