

High Performance Computing Networking

There was a great comment chain on my prior post (“Unexpected Linux Memory Migration”), which brought out a number of good points.  Let me clarify a few things from my post:

To be clear, Open MPI has a few cases where it has very specific memory affinity needs that almost certainly fall outside the realm of just about any OS’s default memory placement scheme.  My point is that other applications may have similar requirements, too, particularly as core counts keep rising and communication between threads / processes on different cores therefore becomes more common.

Open MPI sets up a series of queue data structures in shared memory between processes on the same server.  The queues are uni-directional; they are used to send from process A to process B.  These queues are used to exchange control and data messages.

Using shared memory alone is enough to introduce variability into the specific memory placement of the pages involved: which process “owns” the pages, and therefore determines the policy for where they should be physically located?

Here are two cases where specific memory placement is important in this scenario.  In each case, the shared memory buffer may be located on the sender’s NUMA node, on the receiver’s NUMA node, or some other NUMA node:

  1. The sender copies the message data from its source buffer to the shared memory.  When ready, the receiver then copies the message data from the shared memory to its target buffer.  If the shared memory is on either the sender’s or receiver’s NUMA node, then the data movement will traverse the inter-NUMA-node memory interconnect at most once.
  2. After copying the message to shared memory, the sender tweaks a flag in the shared memory to let the receiver know that a new message has been delivered.  The receiver spins on this flag.  It is fastest if the flag is located in shared memory on the receiver’s NUMA node — the lookup is local and it does not cause any traffic to flow across the inter-NUMA-node memory interconnect.

In both cases, unexpected memory locality may even impose performance penalties on unrelated processes by consuming both local memory on their NUMA nodes and bandwidth across the inter-NUMA-node memory interconnect.

In the latency- and bandwidth-benchmark-centric HPC community, such performance losses can have a disastrous effect on an MPI implementation’s reputation.  Hence, MPI needs to ensure that shared memory pages are located exactly where they need to be — implicit OS memory placement is not sufficient.

Granted, I’m painting an intentionally dire picture here.  In today’s reality, particularly outside of the HPC community, the OS does a pretty good job of keeping memory local, and these kinds of effects either don’t happen or don’t matter in today’s fast architectures.

But consider May’s Law — today’s minor performance sins may become magnified in tomorrow’s computing architectures.

It’s all about location, location, LOCATION!
