In prior blog posts, I’ve talked about the implications of registered memory for both MPI applications and implementations.
Here’s another fun implication that was discovered within the last few months by Nathan Hjelm and Samuel Gutierrez at Los Alamos National Laboratory: registered memory imbalances.
As an interesting side note: as far as we can tell, no other MPI implementation attempts either to balance registered memory between MPI processes or to handle the performance implications of grossly imbalanced registered memory consumption.
Let’s review a few key points before defining what registered memory imbalances are.
Recall that registering memory means two things (see the short verbs sketch after this list):
- Pinning virtual memory to specific physical memory locations (so that the OS will not move or swap out those pages)
- Notifying one or more entities of this virtual-to-physical memory mapping (e.g., notifying an OS-bypass capable NIC, such as an InfiniBand HCA)
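For the concrete-minded, here is roughly what that looks like at the verbs level on Linux. This is a minimal sketch (not Open MPI code): it registers a plain malloc’ed buffer with the first verbs-capable device it finds, with error handling trimmed for brevity.

```c
/* Minimal sketch: registering a buffer with an InfiniBand HCA via the
 * Linux verbs API.  Error handling is abbreviated for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices;
    struct ibv_device **devices = ibv_get_device_list(&num_devices);
    if (NULL == devices || 0 == num_devices) {
        fprintf(stderr, "No verbs-capable devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devices[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Allocate an ordinary buffer... */
    size_t len = 1 << 20;
    void *buf = malloc(len);

    /* ...and register it: the kernel pins the underlying pages, and the
     * HCA is notified of the virtual-to-physical mapping.  This is the
     * step that consumes the (finite!) registered memory resource. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (NULL == mr) {
        fprintf(stderr, "ibv_reg_mr failed -- registration limit hit?\n");
        return 1;
    }

    /* Deregistering releases the pinned pages back to the shared pool. */
    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devices);
    free(buf);
    return 0;
}
```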
In any given system, there is a limit on how much memory can be registered. There are multiple sources of this limit (e.g., hardware resources on the NIC, tunable parameters for the NIC, operating system limits, the amount of physical RAM, etc.), but they all resolve down to one thing: you can only register a fixed, finite amount of memory at a time.
This limit can be close to the amount of physical RAM in the machine, or it can be much lower (e.g., I’m told that recent InfiniBand HCAs prefer you to use Linux huge pages, and therefore default to only allowing a relatively low number of pages to be registered).
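If you’re curious where your own limits sit, a sketch like the following shows two of the relevant knobs: the OS locked-memory rlimit and the HCA’s registration limits as reported by the verbs device query. (Which limit bites first varies by system; this only inspects the values, it doesn’t compute the effective cap.)

```c
/* Sketch: inspecting two of the limits that bound registered memory.
 * Assumes a verbs-capable device is present. */
#include <stdio.h>
#include <sys/resource.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* OS limit: how much memory this process is allowed to pin (lock) */
    struct rlimit rl;
    getrlimit(RLIMIT_MEMLOCK, &rl);
    printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);

    /* NIC limits: how many memory regions, and how large they may be */
    int n;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (devs && n > 0) {
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_device_attr attr;
        if (0 == ibv_query_device(ctx, &attr)) {
            printf("HCA max registrations:     %d\n", attr.max_mr);
            printf("HCA max registration size: %llu bytes\n",
                   (unsigned long long) attr.max_mr_size);
        }
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
    }
    return 0;
}
```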
Regardless, it’s a finite resource. And it’s shared among all the MPI processes running on the same machine.
As a consequence, Nathan and Samuel discovered that a small number of MPI processes can consume a large portion of the available registered memory, thereby starving other MPI processes on the same machine. Indeed, in their experiments, MPI processes that were close to the HCA (NUMA-wise, that is) had a much greater chance of consuming inordinately more registered memory than other MPI processes.
In hindsight, this is a fairly obvious race condition and consequence of a shared resource. But there are no good tools to detect such a situation (Nathan only found it by adding instrumentation deep within the bowels of Open MPI), which is one reason we assume no one has discovered this specific issue before.
This registered memory imbalance between MPI processes can actually cause serious performance degradation. For example, MPI processes can be forced to avoid RDMA-based protocols and fall back to send/receive protocols (which tend to be less efficient on InfiniBand hardware).
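To make that fallback concrete, here is a hypothetical sketch (emphatically not Open MPI’s actual protocol code, and the helper names are invented for illustration): if registering the user’s buffer fails because the shared registration pool is exhausted, the send falls back to pipelining through pre-registered bounce buffers.

```c
/* Hypothetical sketch: how an MPI implementation might fall back from
 * an RDMA protocol to a slower copy-in/copy-out send/receive protocol
 * when the shared registered-memory pool is exhausted. */
#include <stddef.h>
#include <infiniband/verbs.h>

/* Placeholder helpers, invented for illustration only */
static int send_via_rdma(struct ibv_mr *mr, void *buf, size_t len)
{
    /* ...post RDMA work requests referencing mr... */
    (void) mr; (void) buf; (void) len;
    return 0;
}

static int send_via_copy_in_copy_out(void *buf, size_t len)
{
    /* ...pipeline buf through small, pre-registered bounce buffers... */
    (void) buf; (void) len;
    return 0;
}

int send_large_message(struct ibv_pd *pd, void *buf, size_t len)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (NULL != mr) {
        /* Fast path: register the user buffer and RDMA it directly. */
        int rc = send_via_rdma(mr, buf, len);
        ibv_dereg_mr(mr);
        return rc;
    }

    /* Slow path: registration failed -- perhaps because a peer process
     * on this host has consumed the shared registration resource. */
    return send_via_copy_in_copy_out(buf, len);
}
```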
In the upcoming Open MPI 1.6.1 release, we will fix a bug related to registered memory imbalances, but we will not try to address the consequent performance issues. In the 1.7 series, we have a few ideas about how to ensure that MPI processes don’t (permanently) consume inordinately more registered memory than their peers.
It’s a tricky problem: you periodically want individual MPI processes to be able to use “too much” registered memory (i.e., more than their “fair share” compared to their peers running on the same machine) in order to absorb bursty MPI traffic. But that “burst” of registered memory must eventually be returned in order to keep long-term registered memory consumption balanced.
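One way to picture the kind of accounting involved is a per-process quota with a burst allowance. The sketch below is purely hypothetical (none of these names come from Open MPI); it just illustrates the “borrow during a burst, give back afterwards” idea.

```c
/* Hypothetical sketch of burst-tolerant fair-share accounting for
 * registered memory.  Invented for illustration; not Open MPI code. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t fair_share;   /* long-term per-process limit */
    size_t burst_limit;  /* short-term ceiling during bursts */
    size_t in_use;       /* currently registered by this process */
} reg_quota_t;

/* May this process register 'len' more bytes right now?  Bursting
 * above fair_share is allowed, but only up to burst_limit. */
bool quota_try_acquire(reg_quota_t *q, size_t len)
{
    if (q->in_use + len > q->burst_limit) {
        return false;
    }
    q->in_use += len;
    return true;
}

void quota_release(reg_quota_t *q, size_t len)
{
    q->in_use -= len;
}

/* Once the burst subsides, registrations above fair_share are
 * candidates for eager deregistration so that peer processes can use
 * the shared pool again. */
size_t quota_excess_to_return(const reg_quota_t *q)
{
    return (q->in_use > q->fair_share) ? (q->in_use - q->fair_share) : 0;
}
```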