Jeff is lazy this week, so he asked that I provide some notes on the process binding options available in the Open MPI (OMPI) v1.5 release series.
First, though, a caveat. The binding options in the v1.5 series are pretty much the same as in the prior v1.4 series. However, future releases (beginning with the v1.7 series) will have significantly different options providing a broader array of controls. I won’t address those here, but will do so in a later post.
Binding in the v1.5 series is limited to the socket and core levels (CLI options exist for board-level binding, but they were just placeholders intended for the future). The
--bind-to-socket option binds each process to a socket, which raises the question: which socket is it going to be bound to? You have two choices:
- The default is to bind each MPI process to a socket on a “per-core” basis. For example, suppose we have a host with 2 sockets, each having 4 cores. If we launch 5 processes on this host with
--bind-to-socket, then 4 processes will be bound to (all the cores in) socket 0, and one process will be bound to (all the cores in) socket 1. Obviously, this doesn’t balance the load very well! And so we also have…
- The --bysocket option. Adding this option causes the processes to be split across the sockets on a round-robin basis. Think of it as load balancing the processes across the sockets. In the prior example, adding
--bysocket would cause MPI_COMM_WORLD (MCW) ranks 0,2,4 to be bound to socket 0, and MCW ranks 1,3 to be bound to socket 1.
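If it helps to see the two choices side by side, here is a minimal sketch in Python that models the placement described above. It is only a toy model of the rule, not OMPI's actual mapper: the function name `bind_to_socket` and its parameters are made up for illustration, and the wrap-around when a host is oversubscribed is my own assumption.

```python
def bind_to_socket(num_procs, sockets=2, cores_per_socket=4, bysocket=False):
    """Toy model: return {socket: [MCW ranks bound to that socket]}.

    Hypothetical helper for illustration only; it is not part of Open MPI.
    """
    layout = {s: [] for s in range(sockets)}
    for rank in range(num_procs):
        if bysocket:
            # Round-robin across the sockets.
            socket = rank % sockets
        else:
            # "Per-core" default: one rank per core, filling socket 0 first.
            # Wrapping around when oversubscribed is an assumption of this model.
            socket = (rank // cores_per_socket) % sockets
        layout[socket].append(rank)
    return layout


print(bind_to_socket(5))                 # {0: [0, 1, 2, 3], 1: [4]}
print(bind_to_socket(5, bysocket=True))  # {0: [0, 2, 4], 1: [1, 3]}
```

The two printed layouts correspond to the 2-socket, 4-cores-per-socket example with 5 processes.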
Okay, so now what if I have four MPI processes on the host, but I want MCW ranks 0,1 to share the first socket and MCW ranks 2,3 to share the second socket? This might be desirable for NUMA reasons when the communication pattern favors communicating between ranks of the same odd/even flavor. Well, the
--npersocket option was created for that reason.
--npersocket 2 in this case would result in the desired pattern. The option causes OMPI to place the specified number of MPI processes on each socket in a sequential manner, filling each socket with a number of processes equal to the number of underlying cores before moving on to the next one. Of course, you can limit the total number of processes in the job by combining it with the
-np option. Combining
--npersocket with the
--bysocket option causes OMPI to still assign the specified number of MPI processes on each socket, but to do so on a round-robin basis across the sockets instead of sequentially filling each socket. Thus, a command line of
--npersocket 2 --bysocket would result in MCW ranks 0,2 on the first socket, and MCW ranks 1,3 on the second. Omitting the
--bysocket option would result in MCW ranks 0,1 on the first socket, and MCW ranks 2,3 on the second.
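The difference between the two orderings can be sketched in a few lines of Python. Again, this is a toy model under stated assumptions rather than anything taken from OMPI itself: `npersocket_layout` is a hypothetical name, and the model simply refuses jobs that would exceed the per-socket limit.

```python
def npersocket_layout(num_procs, npersocket, sockets=2, bysocket=False):
    """Toy model: return {socket: [MCW ranks placed on that socket]}.

    Hypothetical helper for illustration only; it is not part of Open MPI.
    """
    if num_procs > npersocket * sockets:
        raise ValueError("more processes than npersocket * sockets allows")
    layout = {s: [] for s in range(sockets)}
    for rank in range(num_procs):
        if bysocket:
            socket = rank % sockets       # round-robin across the sockets
        else:
            socket = rank // npersocket   # fill one socket, then the next
        layout[socket].append(rank)
    return layout


print(npersocket_layout(4, 2))                 # {0: [0, 1], 1: [2, 3]}
print(npersocket_layout(4, 2, bysocket=True))  # {0: [0, 2], 1: [1, 3]}
```

Both calls model four processes with two per socket; only the order in which ranks are handed out changes.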
Okay, now let’s look at the core binding options. The
--bind-to-core option does what you would expect – it binds one MPI process to each core in a sequential fashion. So the first MPI process gets bound to the first core of the first socket, the second MPI process gets bound to the second core of the first socket, etc. Once the first socket is filled, the procedure continues on to sequentially fill the next socket.
As before, this results in a “front loaded” layout – i.e., the first socket gets loaded first, while subsequent sockets may not be fully utilized. This is frequently undesirable, and so you can combine
--bind-to-core with
--bysocket to direct OMPI to spread the ranks across the sockets. In this case, the first MPI process would still fall on the first core of the first socket, but the second MPI process would go on the first core of the second socket, etc.
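Here is a rough Python sketch of the two core-binding orders, assuming the same 2-socket, 4-cores-per-socket host. The helper name `bind_to_core` is hypothetical. The round-robin case also assumes that the third process lands on the second core of the first socket, which is my reading of the "etc." above rather than something I have checked against the code.

```python
def bind_to_core(num_procs, sockets=2, cores_per_socket=4, bysocket=False):
    """Toy model: return a list of (rank, socket, core_within_socket).

    Hypothetical helper for illustration only; it is not part of Open MPI.
    """
    if num_procs > sockets * cores_per_socket:
        raise ValueError("this model binds at most one process per core")
    placements = []
    for rank in range(num_procs):
        if bysocket:
            # Spread ranks across sockets, advancing the core every full round.
            socket = rank % sockets
            core = rank // sockets
        else:
            # Fill the first socket core by core, then move to the next socket.
            socket = rank // cores_per_socket
            core = rank % cores_per_socket
        placements.append((rank, socket, core))
    return placements


for rank, socket, core in bind_to_core(5, bysocket=True):
    print(f"rank {rank}: socket {socket}, core {core}")
```

Dropping `bysocket=True` shows the front-loaded layout, with ranks 0-3 on socket 0 and rank 4 alone on socket 1.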
There are two additional options that affect core-level binding. We recognize that users sometimes have a need to bind a process to more than one core – e.g., to support a multi-threaded process. The
--cpus-per-proc N (a.k.a. its better-known but less accurate name,
--cpus-per-rank N) option was created for that purpose. To illustrate its use, let’s return to our prior example and add a request for 2 cores/rank:
$ mpirun -np 4 --npersocket 2 --bind-to-core --cpus-per-proc 2 ...
Given this input, OMPI will place two MPI processes on each socket of every host. Each process will be bound to 2 cores within its socket. So MCW rank 0 will be bound to cores 0,1 on socket 0, MCW rank 1 will be bound to cores 2,3 on socket 0, MCW rank 2 will be bound to cores 0,1 on socket 1, and MCW rank 3 will be bound to cores 2,3 on socket 1.
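The arithmetic behind that layout can be sketched as follows. This is a toy model only: `cores_per_rank` and its parameters are names I made up for illustration, and it assumes ranks fill each socket sequentially (no --bysocket) with cores numbered from 0 within each socket.

```python
def cores_per_rank(num_procs, npersocket, cpus_per_proc, cores_per_socket=4):
    """Toy model: return {rank: (socket, [cores within that socket])}.

    Hypothetical helper for illustration only; it is not part of Open MPI.
    """
    if npersocket * cpus_per_proc > cores_per_socket:
        raise ValueError("not enough cores per socket for this request")
    bindings = {}
    for rank in range(num_procs):
        socket = rank // npersocket        # sequential fill of each socket
        slot = rank % npersocket           # position of the rank on its socket
        first_core = slot * cpus_per_proc  # each rank gets a contiguous block
        cores = list(range(first_core, first_core + cpus_per_proc))
        bindings[rank] = (socket, cores)
    return bindings


# Models: mpirun -np 4 --npersocket 2 --bind-to-core --cpus-per-proc 2 ...
for rank, (socket, cores) in cores_per_rank(4, 2, 2).items():
    print(f"MCW rank {rank}: socket {socket}, cores {cores}")
```

The check at the top reflects the obvious constraint that two ranks of two cores each need at least four cores per socket.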
Finally, there are the
--cpu-set and
--rankfile options. These are topics in their own right – suffice it to say, they can be used to provide detailed mapping and binding patterns on a process-by-process basis. I’ll leave that for some future note as few people use it, but will talk briefly about a particular use of
--cpu-set that can be helpful. The
--cpu-set option can, among other things, be used to limit OMPI to a set of cores or sockets on each host in your allocation. To see how this could be useful, consider the case where you have obtained an allocation of hosts and wish to run several applications on them at the same time. You want to bind your processes to get performance, but obviously don’t want the processes from the different applications to be bound on top of each other. Unfortunately, each invocation of mpirun is independent and has no idea what any other invocation has done vis-a-vis assigning processes to cores or sockets. So by default, two invocations of mpirun on the same hosts will overlay their associated processes on top of each other.
What you can do to prevent this is to run each invocation of mpirun with a different cpu set. For example:
$ mpirun --cpu-set S0 ...
$ mpirun --cpu-set S1 ...
would limit the first invocation to using only socket 0 on each host, and the second invocation to only socket 1. You can similarly constrain mpirun to using specific cores and core ranges – e.g.,
--cpu-set 0-2 would limit mpirun to using cpus 0-2 on each host. Obviously, you would want to take such limits into consideration when assigning the number of MPI processes to each host! The
--cpu-set option can be used in combination with all of the prior binding options to create the desired binding patterns within the specified constraint.
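When planning several concurrent mpirun invocations, it can be handy to sanity-check on paper that the cpu sets you intend to hand out do not overlap. The Python sketch below does that for the two spec styles shown above ("S0" and "0-2"). It is not OMPI's parser: `expand_cpu_set` is a hypothetical helper, the comma-separated list handling is an assumption, and so is the core numbering (socket 0 owning cores 0-3, socket 1 owning cores 4-7), which real hosts frequently do not follow.

```python
def expand_cpu_set(spec, cores_per_socket=4):
    """Toy model: expand a spec such as 'S0' or '0-2' into a set of core ids.

    Hypothetical helper for illustration only; it is not part of Open MPI.
    Assumes cores are numbered consecutively, socket by socket.
    """
    cores = set()
    for part in spec.split(","):  # comma-separated lists are an assumption
        part = part.strip()
        if part.startswith("S"):
            socket = int(part[1:])
            start = socket * cores_per_socket
            cores.update(range(start, start + cores_per_socket))
        elif "-" in part:
            low, high = part.split("-")
            cores.update(range(int(low), int(high) + 1))
        else:
            cores.add(int(part))
    return cores


first, second = expand_cpu_set("S0"), expand_cpu_set("S1")
print(first.isdisjoint(second))                             # True
print(sorted(expand_cpu_set("0-2") & expand_cpu_set("S0")))  # [0, 1, 2]
```

The second line of output shows the kind of collision to avoid: under this numbering, an invocation limited to cpus 0-2 would land on top of one limited to socket 0.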
Don’t despair if you find this confusing – sometimes, the best option is to just try a few example configurations until you find the one you like (i.e., the one that gives the best performance). All kinds of factors can come into play, such as the cache sizes and NUMA layout of each host. Some experimentation may be necessary. The
lstopo tool from the Hardware Locality (hwloc) toolset may be of enormous help in understanding your host’s hardware. Finally, the
--report-bindings option will show you where OMPI actually bound your processes.