Core counts are going up. Cisco’s C460 rack-mount server series, for example, can have up to 32 Nehalem EX cores. As a direct result, we may well be returning to the era of running more than one MPI job per server. This has long been true on “big iron” parallel resources, but commodity Linux HPC clusters have tended towards the one-MPI-job-per-server model in recent history.
Because of this trend, I have an open-ended question for MPI users and cluster administrators: how do you want to bind MPI processes to processors? For example: what kinds of binding patterns do you want? How many hyperthreads / cores / sockets do you want each process to bind to? How do you want to specify what process binds where? What level of granularity of control do you want / need? (…and so on)
We are finding that every user we ask seems to have slightly different answers. What do you think? Let me know in the comments, below.
I ask because we are in the midst of revamping Open MPI’s processor and memory affinity support (i.e., how Open MPI binds processes and memory to underlying resources). It is turning out to be a surprisingly complicated issue.
Other MPI implementations seem to mainly support binding processes to single cores. We’re assuming that this may not always be desirable, particularly as core counts increase and multiple MPI jobs are run on a single server. For example, it may be desirable to bind to something as wide as an entire socket (or maybe even wider!).
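To make "wider than a single core" concrete, here is a minimal sketch of what binding looks like at the operating system level. It assumes Linux and uses Python's `os.sched_setaffinity`; the helper name `bind_self` is made up for illustration, and this is not how Open MPI itself implements binding.

```python
import os


def bind_self(cpus):
    """Bind the calling process to the given CPU ids (Linux only).

    `cpus` is any iterable of logical CPU numbers. Returns the affinity
    set the kernel reports after the call.
    """
    # pid 0 means "the calling process".
    os.sched_setaffinity(0, cpus)
    return os.sched_getaffinity(0)
```

Passing a one-element set such as `{0}` gives the single-core binding; passing every core on a socket gives the wider one. Which CPU numbers belong to which socket depends on the machine, so the sketch leaves that choice to the caller.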
Here are a few leading questions to start the conversation:
- Do you ever want to bind an MPI process to more than one core / hyperthread? If so, how “wide” do you need the binding to be? An entire core? Socket? Whatever cores share a common L2? …?
- Will you always bind your MPI processes in a regular pattern? (e.g., every process gets 2 cores) Or does your application have irregular needs? (e.g., process A gets 2 cores, but process B gets an entire socket)
- After binding to processors, how do you want MPI processes ordered in MPI_COMM_WORLD? (e.g., round-robin by core, by socket, by …?)
- How would you specify the process-to-processor mapping on a command line?
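As a conversation starter for the questions above, here is a small sketch of a regular binding pattern with two possible MPI_COMM_WORLD orderings. Everything in it is hypothetical: the function name `map_ranks`, its parameters, and the `"core"` / `"socket"` ordering labels are illustrative assumptions, not an existing Open MPI interface. It also assumes identical sockets and only handles the regular case (every process gets the same number of cores).

```python
def map_ranks(num_sockets, cores_per_socket, width, order="core"):
    """Return a list whose entry r is the tuple of (socket, core) pairs
    that rank r would be bound to.

    width -- number of cores given to every process
    order -- "core":   fill socket 0 completely, then socket 1, ...
             "socket": round-robin across sockets
    """
    if width < 1 or cores_per_socket % width != 0:
        raise ValueError("width must evenly divide cores_per_socket")
    slots_per_socket = cores_per_socket // width

    if order == "core":
        keys = [(s, k) for s in range(num_sockets)
                for k in range(slots_per_socket)]
    elif order == "socket":
        keys = [(s, k) for k in range(slots_per_socket)
                for s in range(num_sockets)]
    else:
        raise ValueError("order must be 'core' or 'socket'")

    # Slot k on socket s covers `width` contiguous cores of that socket.
    return [tuple((s, k * width + c) for c in range(width))
            for (s, k) in keys]
```

With two sockets of four cores each and two cores per process, `order="core"` places ranks 0 and 1 on socket 0 and ranks 2 and 3 on socket 1, while `order="socket"` alternates: ranks 0 and 2 land on socket 0, ranks 1 and 3 on socket 1. Irregular needs (process A gets 2 cores, process B gets a socket) would need a richer description than these three numbers.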
Keep in mind that complicating all of this is the fact that resource managers (RMs) may not let MPI have the entire set of processors on a given server (e.g., if multiple MPI jobs have been assigned to a single server). And you may not know this until the MPI job is actually launched. Ouch!
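One way to picture the problem: whatever binding was requested has to be checked at launch time against the processors the process is actually allowed to use. The sketch below assumes Linux, where `os.sched_getaffinity(0)` reports the allowed set for the calling process. `clip_to_allowed` is a hypothetical helper name, and whether a real RM expresses its restrictions through the affinity mask depends on the RM and how it is configured.

```python
import os


def clip_to_allowed(wanted, allowed=None):
    """Split the requested CPU ids into (usable, missing).

    wanted  -- CPU ids the user asked to bind to
    allowed -- CPU ids this process may run on; if omitted, ask the
               kernel via os.sched_getaffinity (Linux only)
    """
    if allowed is None:
        allowed = os.sched_getaffinity(0)
    wanted = set(wanted)
    usable = wanted & set(allowed)
    missing = wanted - usable
    return usable, missing
```

A non-empty `missing` set is the "Ouch!" case: the launcher has to decide whether to abort, warn, or fall back to a narrower binding.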
That being said, in an RM-based environment, I think it should be safe to assume that if you want to be able to bind MPI processes to entire sockets, it is the user’s responsibility to ask the RM for entire sockets. Hence, the RM won’t launch your job until it has an entire socket for each process.
Good or bad assumption — what do you think?