There was a great comment chain on my prior post (“Unexpected Linux Memory Migration”) which brought out a number of good points. Let me clarify a few things from my post:
- My comments were definitely about HPC types of applications, which are admittedly a small subset of applications that run on Linux. It is probably a fair statement to say that the OS’s treatment of memory affinity will be just fine for most (non-HPC) applications.
- Note, however, that Microsoft Windows and Solaris do retain memory affinity information when pages are swapped out. When the pages are swapped back in, if they were bound to a specific locality before swapping, they are restored to that same locality. This is why I was a bit surprised by Linux’s behavior.
- More specifically, Microsoft Windows and Solaris seem to treat memory locality as a binding decision — Linux treats it as a hint.
- Many (most?) HPC applications are designed not to cause paging. However, at least some do. A side point of this blog is that HPC is becoming commoditized — not everyone is out at the bleeding edge (meaning: some people willingly violate the “do not page” HPC mantra and are willing to give up a little performance in exchange for the other benefits that swapping provides).
To be clear, Open MPI has a few cases where it has very specific memory affinity needs that almost certainly fall outside the realm of just about all OSes’ default memory placement schemes. My point is that other applications may also have similar requirements, particularly as core counts are going up, and therefore communication between threads / processes on different cores will become more common.
Open MPI sets up a series of queue data structures in shared memory between processes on the same server. The queues are uni-directional; they are used to send from process A to process B. These queues are used to exchange control and data messages.
Using shared memory alone is enough to introduce variability into the physical placement of the pages involved: which process “owns” the pages, and therefore determines the policy for where they should be physically located?
Here are two cases where specific memory placement is important in this scenario. In each case, the shared memory buffer may be located on the sender’s NUMA node, on the receiver’s NUMA node, or some other NUMA node:
- The sender copies the message data from its source buffer to the shared memory. When ready, the receiver then copies the message data from the shared memory to its target buffer. If the shared memory is on either the sender’s or receiver’s NUMA node, then the data movement will traverse the inter-NUMA-node memory interconnect at most once.
- After copying the message to shared memory, the sender tweaks a flag in the shared memory to let the receiver know that a new message has been delivered. The receiver spins on this flag. It is fastest if the flag is located in shared memory on the receiver’s NUMA node — the lookup is local and it does not cause any traffic to flow across the inter-NUMA-node memory interconnect.
In both cases, unexpected memory locality may even impose performance penalties on unrelated processes by consuming both local memory on their NUMA nodes and bandwidth across the inter-NUMA-node memory interconnect.
In the latency- and bandwidth-benchmark centric HPC community, such performance losses can have a disastrous effect on an MPI implementation’s reputation. Hence, MPI needs to ensure that shared memory pages are located exactly where they need to be — implicit OS memory placement is not sufficient.
Granted, I’m painting an intentionally dire picture here. In today’s reality, particularly outside of the HPC community, the OS does a pretty good job of keeping memory local, and these kinds of effects either don’t happen or don’t matter in today’s fast architectures.
But consider May’s Law — today’s minor performance sins may become magnified in tomorrow’s computing architectures.
It’s all about location, location, LOCATION!