After years of discussion, the upcoming release of Open MPI 1.7.4 will change how processes are laid out (“mapped”) and bound by default. Here are the specifics:
- If the number of processes is <= 2, processes will be mapped by core
- If the number of processes is > 2, processes will be mapped by socket
- Processes will be bound to core
- MPI_COMM_WORLD ranks will be assigned by slot
These are all just default values; they can, of course, be changed by the user via mpirun CLI options, environment variables, etc.
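For concreteness, here is roughly how a user could spell out (or override) these defaults on the mpirun command line. This is a hedged sketch: --map-by, --rank-by, and --bind-to are the 1.7-series option spellings, ./my_mpi_app is just a placeholder, and you should check your installed mpirun(1) man page for the exact syntax your version supports.

    # Spell out the new >2-process defaults explicitly
    mpirun -np 16 --map-by socket --rank-by slot --bind-to core ./my_mpi_app

    # Or turn binding off entirely, restoring the old pre-1.7.4 behavior
    mpirun -np 16 --bind-to none ./my_mpi_app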
Why did we (finally) make this change?
Two main reasons:
- Enabling processor affinity by default is definitely helpful to many kinds of applications, especially benchmarks (yes, I’m being quite blunt here).
- Other MPI implementations bind by default, and then use that to bash Open MPI’s “out of the box” performance.
Enabling processor affinity is beneficial because OSes, including modern Linux, still tend to move processes around between cores, even in steady state. That is: there is typically a small-but-noticeable performance difference between binding MPI processes and not binding them. Indeed, in some cases, the difference is large.
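If you want to see the effect on your own cluster, a simple way to check where processes actually land is Open MPI’s --report-bindings option (a hedged sketch; ./my_mpi_app is a placeholder and the exact output format varies by version):

    # Print each rank's binding as the job launches (output format varies)
    mpirun -np 4 --report-bindings ./my_mpi_app

    # Compare timings of a bound run vs. an unbound run of the same app
    mpirun -np 4 --bind-to core ./my_mpi_app
    mpirun -np 4 --bind-to none ./my_mpi_app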
The catch, however, is that only the MPI application knows which mapping, binding, and MPI_COMM_WORLD rank-ordering patterns are best for it.
There definitely seems to be a spectrum of possibilities and applications:
- Doing nothing is definitely not harmful to any application, but it can leave performance on the table, so to speak
- 2-process MPI jobs tend to be benchmarks, and the (bind-to-core, map-by-core) pattern tends to be good for them
- >2-process MPI jobs tend to be real applications, and can benefit from the greater memory bandwidth availability of (bind-to-core + map-by-socket)
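To make the difference concrete, consider a hypothetical 2-socket server with 8 cores per socket (the node shape and application name here are assumptions for illustration, not measurements):

    # map-by core: 4 processes fill 4 adjacent cores on socket 0 and all
    # share that one socket's memory bandwidth (fine for small latency-style
    # benchmarks, less good for bandwidth-hungry apps)
    mpirun -np 4 --map-by core --bind-to core ./my_mpi_app

    # map-by socket: the same 4 processes are spread 2-per-socket, so each
    # one competes with fewer neighbors for its socket's memory bandwidth
    mpirun -np 4 --map-by socket --bind-to core ./my_mpi_app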
Are these patterns right for all apps? Definitely not.
For example, these patterns are definitely not good for apps that only use half the cores in a server because of memory bandwidth constraints. However, we’re told that users who need this pattern already set their own mapping, binding, and ordering patterns via mpirun CLI options. So these new defaults won’t affect them at all.
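For example (a sketch, assuming a cluster of 2-socket, 16-core nodes and Open MPI’s -npernode option; the process counts and app name are made up), such a user might already be launching something like:

    # Use only half the cores of each node: 8 processes per 16-core node,
    # spread across both sockets, with each process bound to a whole socket
    mpirun -np 64 -npernode 8 --map-by socket --bind-to socket ./my_mpi_app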
That being said, the two default patterns we’re using tend to leave less performance on the table than doing nothing. So we’ll see how they work out in the real world.
Let us know what you think.
- Do these patterns work for you? Why or why not?
- Do you even care? I.e., do you already set your own mapping, binding, and/or ordering?
What if there’s a single process per node? Do you put it near the NIC by default instead of on the first core?
No, not as of yet. There’s the question of “which NIC?”, which can really only be answered by a human…
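That said, a human who knows which NIC they care about can already do this by hand today. A rough sketch (assuming hwloc’s lstopo and hwloc-bind tools are installed, that the NIC turns out to sit near socket 1, and with ./my_mpi_app as a placeholder):

    # Show the node topology, including which socket / NUMA node the
    # NIC(s) are attached to
    lstopo

    # Launch one process with Open MPI's binding disabled, and let
    # hwloc-bind pin it to the socket closest to the NIC
    mpirun -np 1 --bind-to none hwloc-bind socket:1 -- ./my_mpi_app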
“For example, these patterns are definitely not good for apps that only use half the cores in a server because of memory bandwidth constraints. However, we’re told that users who need this pattern already set their own mapping, binding, and ordering patterns via mpirun CLI options. So these new defaults won’t affect them at all.”
This is far from my experience in the wild and I expect that a number of application users are going to be hosed by this change.
Jeff —
Can you explain more? Are you saying that the applications that run on (for example) only half the cores do *not* use mapping or binding options, and just expect the MPI implementation and/or OS to figure it out?