Registered Memory (RMA / RDMA) and MPI implementations
In a prior blog post, I talked about RMA (and RDMA) networks, and what they mean to MPI implementations. In this post, I’ll talk about one of the consequences of RMA networks: registered memory.
Registered memory is something that most HPC administrators and users have at least heard of, but may not fully understand.
Let me clarify it for you: registered memory is both a curse and a blessing.
It’s more of the former than the latter, if you ask me, but MPI implementations need to use (and track) registered memory to get high performance on today’s high-performance networking API stacks.
Registered memory is used by operating system bypass networks. It works like this:
- The application calls MPI_SEND with buffer XYZ.
- The back-end to MPI_SEND registers buffer XYZ.
- Registering both pins the memory in the operating system (i.e., tells the OS “never swap this page out or move it”) and notifies the NIC of the virtual-to-physical address mapping of this memory.
- MPI_SEND can then tell the networking hardware “send the buffer at virtual memory address XYZ.”
- The NIC hardware retrieves the virtual-to-physical address mapping, finds the physical address, and sends the data stored there.
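To make those steps a bit more concrete, here is a toy Python model of the flow. It is only a sketch: `ToyOS`, `ToyNIC`, `toy_send`, and the 4 KiB page size are all made up for illustration, and a real stack does this work in the kernel and in NIC hardware, not in a dictionary.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class ToyNIC:
    """Hypothetical NIC: stores virtual-page -> physical-page mappings."""

    def __init__(self):
        self.translation_table = {}

    def install_mapping(self, vpage, ppage):
        self.translation_table[vpage] = ppage

    def dma_read(self, physical_memory, vaddr, length):
        """Translate page by page and copy the bytes, without asking the OS."""
        out = bytearray()
        addr, end = vaddr, vaddr + length
        while addr < end:
            vpage, offset = divmod(addr, PAGE_SIZE)
            ppage = self.translation_table[vpage]  # KeyError if never registered
            chunk = min(PAGE_SIZE - offset, end - addr)
            start = ppage * PAGE_SIZE + offset
            out += physical_memory[start:start + chunk]
            addr += chunk
        return bytes(out)


class ToyOS:
    """Hypothetical OS: owns the page table and the set of pinned pages."""

    def __init__(self, page_table):
        self.page_table = page_table  # vpage -> ppage
        self.pinned = set()

    def register(self, nic, vaddr, length):
        first = vaddr // PAGE_SIZE
        last = (vaddr + length - 1) // PAGE_SIZE
        for vpage in range(first, last + 1):
            self.pinned.add(vpage)  # "never swap this page out or move it"
            nic.install_mapping(vpage, self.page_table[vpage])


def toy_send(os_, nic, physical_memory, vaddr, length):
    """Stand-in for the back-end of MPI_SEND: register, then let the NIC read."""
    os_.register(nic, vaddr, length)
    return nic.dma_read(physical_memory, vaddr, length)
```

Note that the buffer is addressed by its virtual address the whole way through; the only place a physical address shows up is inside the NIC's own translation step.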
In principle, this is a fine and reasonable mechanism. The NIC hardware is able to directly obtain the buffer from memory without bothering either the OS or the main CPU, leading to low latency and asynchronous message transfer “in the background.”
But here are three problems with it:
- Registration can be a slow process (e.g., traversing the PCI bus — potentially multiple times — to notify the NIC of the virtual-to-physical mapping). To avoid adding registration time to the overall latency of MPI_SEND, MPI implementations tend to cache registered memory after it is used the first time. Meaning: the registered memory is not de-registered when MPI_SEND returns.
- There is only so much registered memory available: it’s limited either by physical memory or by NIC resources. The MPI implementation must track how much memory is registered, and sometimes evict previously registered memory in order to register new memory.
- The user application may free memory that has been registered by MPI. MPI must therefore intercept when memory is returned to the OS and both de-register the memory and update its internal cache of registered memory. This is evil. E. V. I. L. Particularly in cases where the OS and/or the network stack don’t provide adequate mechanisms to intercept freed memory.
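Here is a minimal Python sketch of the bookkeeping those three points imply: cache registrations, evict the least recently used one when a byte budget is exceeded, and drop entries when the application frees the underlying memory. The names (`RegistrationCache`, `register_fn`, `deregister_fn`, `on_free`) are hypothetical, and exact-match `(address, length)` keys are a simplifying assumption; this is not how any particular MPI implementation does it.

```python
from collections import OrderedDict


class RegistrationCache:
    """Toy LRU cache of registered buffers, keyed by (address, length)."""

    def __init__(self, max_registered_bytes, register_fn, deregister_fn):
        self.max_bytes = max_registered_bytes
        self.register_fn = register_fn      # assumed: (addr, length) -> handle
        self.deregister_fn = deregister_fn  # assumed: handle -> None
        self.entries = OrderedDict()        # least recently used entry first
        self.registered_bytes = 0

    def get(self, addr, length):
        """Return a registration handle, registering only on a cache miss."""
        key = (addr, length)
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        if length > self.max_bytes:
            raise ValueError("buffer exceeds the registration budget")
        # Evict old registrations until the new one fits in the budget.
        while self.entries and self.registered_bytes + length > self.max_bytes:
            self._drop(next(iter(self.entries)))
        handle = self.register_fn(addr, length)
        self.entries[key] = handle
        self.registered_bytes += length
        return handle

    def _drop(self, key):
        handle = self.entries.pop(key)
        self.deregister_fn(handle)
        self.registered_bytes -= key[1]

    def on_free(self, addr, length):
        """Call when [addr, addr + length) is returned to the OS."""
        stale = [k for k in self.entries
                 if k[0] < addr + length and addr < k[0] + k[1]]
        for key in stale:
            self._drop(key)
```

The hard part in real life is not this class; it is making sure something actually calls `on_free` every time memory goes back to the OS.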
(Keep in mind that this is an already-too-long blog post; many details have been omitted from the above three points!)
In practice, since extracting high performance from many modern networks requires the use of registered memory, real-world MPI implementations have worked around the above issues, but usually at the expense of significant amounts of complex code dedicated to them.
My $0.02 is that I wish OSes and/or network stacks would either obviate the need for registered memory or somehow make its use transparent (and fast).
That would be wonderful. 🙂