Cisco Blogs

Registered memory imbalances

June 23, 2012 - 2 Comments

In prior blog posts, I’ve talked about the implications of registered memory for both MPI applications and implementations.

Here’s another fun implication, discovered within the last few months by Nathan Hjelm and Samuel Gutierrez at Los Alamos National Laboratory: registered memory imbalances.

As an interesting side note: as far as we can tell, no other MPI implementation attempts to either balance registered memory between MPI processes, or handle the performance implications that occur with grossly imbalanced registered memory consumption.

Let’s review a few key points before defining what registered memory imbalances are.

Recall that registering memory means two things:

  1. Pinning virtual memory in place to a specific physical memory location
  2. Notifying one or more entities of this virtual-to-physical memory mapping (e.g., notifying an OS-bypass capable NIC, such as an InfiniBand HCA)

In any given system, there is a limit on how much memory can be registered. There are multiple sources of this limit (e.g., hardware resources on the NIC, tunable parameters for the NIC, operating system limits, amount of physical RAM, etc.), but they all resolve down to one thing: you can only register a fixed, finite amount of memory at a time.

This limit can be close to the amount of physical RAM in the machine, or it could be much lower (e.g., I’m told that recent InfiniBand HCAs prefer you to use Linux huge pages, and therefore default to only allowing a relatively low number of pages to be registered).
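As a rough illustration of just one of these limits, here is a minimal Python sketch that reads the operating system's locked-memory limit (RLIMIT_MEMLOCK) for the current process on a POSIX system. The helper names `locked_memory_limit` and `describe` are made up for this example. Note the assumption: this only covers the OS side of the story, and says nothing about limits imposed by the NIC hardware or its driver parameters.

```python
import resource


def locked_memory_limit():
    """Return (soft, hard) RLIMIT_MEMLOCK in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def to_bytes(value):
        # The resource module reports "no limit" as RLIM_INFINITY.
        return None if value == resource.RLIM_INFINITY else value

    return to_bytes(soft), to_bytes(hard)


def describe(limit):
    """Format a limit (bytes or None) for printing."""
    return "unlimited" if limit is None else f"{limit / 2**20:.1f} MiB"


if __name__ == "__main__":
    soft, hard = locked_memory_limit()
    print("locked memory soft limit:", describe(soft))
    print("locked memory hard limit:", describe(hard))
```

If the soft limit reported here is small, memory registration is likely to hit the OS limit long before it hits the amount of physical RAM.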

Regardless, it’s a finite resource.  And it’s a shared resource between all MPI processes that are running on the same machine.

As a consequence, Nathan and Samuel discovered that a small number of MPI processes can consume a large portion of the available registered memory, thereby starving other MPI processes on the same machine. Indeed, in their experiments, MPI processes that were close to the HCA (NUMA-wise, that is) were much more likely to consume a disproportionate share of registered memory than other MPI processes.
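To make the starvation effect concrete, here is a toy simulation in Python. It is not based on Nathan and Samuel's instrumentation or data. It simply assumes a fixed per-node pool handed out first-come, first-served in equal chunks, and it models "close to the HCA" as a higher chance of winning each registration race. The function name `simulate` and the `speed_ratio` parameter are hypothetical.

```python
import random


def simulate(pool_bytes, num_procs, chunk_bytes, near_nic, speed_ratio, seed=0):
    """Hand out a shared pool chunk by chunk; return bytes registered per rank.

    Assumption: ranks listed in near_nic win each registration race
    speed_ratio times as often as the other ranks.
    """
    rng = random.Random(seed)
    weights = [speed_ratio if rank in near_nic else 1.0
               for rank in range(num_procs)]
    registered = [0] * num_procs
    remaining = pool_bytes
    while remaining >= chunk_bytes:
        # Whoever wins the race takes the next chunk; nobody gives any back.
        rank = rng.choices(range(num_procs), weights=weights)[0]
        registered[rank] += chunk_bytes
        remaining -= chunk_bytes
    return registered


if __name__ == "__main__":
    shares = simulate(pool_bytes=1024 * 4096, num_procs=8, chunk_bytes=4096,
                      near_nic={0, 1}, speed_ratio=4.0)
    for rank, nbytes in enumerate(shares):
        print(f"rank {rank}: {nbytes // 4096} chunks")
```

Even with a modest advantage, the ranks nearest the NIC end up holding most of the pool by the time it is exhausted, and the remaining ranks have nothing left to register.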

In hindsight, this is a fairly obvious race condition and consequence of a shared resource.  But there are no good tools to detect such a situation (Nathan only found it by adding instrumentation deep within the bowels of Open MPI), which is one reason we assume no one has discovered this specific issue before.

This registered memory imbalance between MPI processes can actually cause serious performance degradation.  For example, MPI processes can be forced to avoid RDMA-based protocols and fall back to send/receive protocols (which tend to be less efficient on InfiniBand hardware).

In the upcoming Open MPI 1.6.1 release, we will fix a bug related to registered memory imbalances, but will not try to address the resulting performance issues. In the 1.7 series, we have a few ideas about how to ensure that MPI processes don’t (permanently) consume inordinately more registered memory than their peers.

It’s a tricky problem, because you periodically want individual MPI processes to be able to use “too much” registered memory (i.e., more than their “fair share” compared to their peers running on the same machine) to be able to absorb bursty MPI traffic.  But then that “burst” of registered memory must be returned in order to enforce long-term stability of registered memory consumption.
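One way to picture the "burst, then give it back" idea is the per-process accounting sketch below. To be clear, this is not Open MPI's design, which had not been settled at the time of writing. It is a hypothetical illustration that assumes each process knows the per-node pool size and the number of local peers, and that the oldest registered buffers are idle and safe to deregister. The class name `RegistrationBudget` and the `burst_factor` parameter are invented for this example.

```python
class RegistrationBudget:
    """Track one process's registered memory against a fair share.

    Assumptions (hypothetical): fair share = pool / local process count,
    and a process may temporarily grow to burst_factor times that share.
    """

    def __init__(self, pool_bytes, num_procs, burst_factor=2.0):
        self.fair_share = pool_bytes // num_procs
        self.burst_cap = int(self.fair_share * burst_factor)
        self.buffers = []  # sizes of registered buffers, oldest first

    @property
    def registered(self):
        return sum(self.buffers)

    def register(self, nbytes):
        """Account for a new registration; refuse it beyond the burst cap."""
        if self.registered + nbytes > self.burst_cap:
            return False
        self.buffers.append(nbytes)
        return True

    def return_burst(self):
        """Release the oldest buffers until usage is back within fair share."""
        released = 0
        while self.buffers and self.registered > self.fair_share:
            released += self.buffers.pop(0)
        return released


if __name__ == "__main__":
    budget = RegistrationBudget(pool_bytes=800, num_procs=4)
    print(budget.register(150), budget.register(150), budget.register(150))
    print("released:", budget.return_burst(), "now holding:", budget.registered)
```

A real implementation would also need some coordination between processes, because if every process bursts at once, the sum of the burst caps exceeds the pool. This sketch deliberately ignores that.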




  1. MSMPI has some tunable parameters to control how large the memory registration cache can grow. It’s not perfect, though, and can still leave large amounts of memory registered because we currently only flush the cache when we register new buffers, rather than also flushing at the end of an RDMA transfer. Our default registration cache size limit is half of the per-core physical memory per process.

    Windows provides an API, CreateMemoryResourceNotification, which returns a handle that is signalled when the requested condition is met. The notification object can be queried (non-blocking), or can be passed to any of the OS wait routines for a blocking notification, and can be used to detect low physical memory conditions.

    Another routine that can be handy is the GlobalMemoryStatusEx function, as it returns total and available physical memory, as well as a ‘memory load’ that indicates how much of physical memory is in use as a percentage.

    • Open MPI has a less-fine-grained approach for registered memory: you can cap the amount of registered memory in a given process (to a fixed number of bytes). It defaults to unlimited, however, which is one reason you can get into these registered memory imbalances.

      We definitely need some work in this area; we’ll be looking at that during the v1.7 series.