Cisco Blogs

Sockets, cores, and hyperthreads… oh my!

October 15, 2010

Core counts are going up.  Cisco’s C460 rack-mount server series, for example, can have up to 32 Nehalem EX cores.  As a direct result, we may well be returning to the era of running more than one MPI process per server.  This has long been true in “big iron” parallel resources, but commodity Linux HPC clusters have tended towards the one-MPI-job-per-server model in recent history.

Because of this trend, I have an open-ended question for MPI users and cluster administrators: how do you want to bind MPI processes to processors?  For example: what kinds of binding patterns do you want?  How many hyperthreads / cores / sockets do you want each process to bind to?  How do you want to specify what process binds where?  What level of granularity of control do you want / need?  (…and so on)

We are finding that every user we ask seems to have slightly different answers.  What do you think?  Let me know in the comments, below.

I ask because we are in the midst of revamping Open MPI’s processor and memory affinity support (i.e., how Open MPI binds processes and memory to underlying resources).  It is turning out to be a surprisingly complicated issue.

Other MPI implementations seem to mainly support binding processes to single cores.  We’re assuming that this may not always be desirable, particularly as core counts increase and multiple MPI jobs are run on a single server.  For example, it may be desirable to bind to something as wide as an entire socket (or maybe even wider!).

Here are a few leading questions to start the conversation:

  • Do you ever want to bind an MPI process to more than one core / hyperthread?  If so, how “wide” do you need the binding to be?  An entire core?  Socket?  Whatever cores share a common L2?  …?
  • Will you always bind your MPI processes in a regular pattern?  (e.g., every process gets 2 cores)  Or does your application have irregular needs? (e.g., process A gets 2 cores, but process B gets an entire socket)
  • After binding to processors, how do you want MPI processes ordered in MPI_COMM_WORLD?  (e.g., round-robin by core, by socket, by …?)
  • How would you specify the process-to-processor mapping on a command line?
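To make the ordering question concrete, here is a toy Python sketch (my own illustration; `order_by_core` and `order_by_socket` are invented names, not Open MPI options) of two common ways to order MCW ranks across a machine's sockets and cores:

```python
# Toy illustration of MPI_COMM_WORLD rank ordering (not Open MPI code).
# Each placement is a (socket, core) pair; rank i gets the i-th pair.

def order_by_core(sockets, cores):
    # "Depth-first": fill each socket completely before moving to the next.
    return [(s, c) for s in range(sockets) for c in range(cores)]

def order_by_socket(sockets, cores):
    # "Breadth-first": round-robin across sockets -- rank 0 on socket 0,
    # rank 1 on socket 1, and so on, wrapping around.
    return [(s, c) for c in range(cores) for s in range(sockets)]

# On a 2-socket machine with 2 cores per socket:
print(order_by_core(2, 2))    # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(order_by_socket(2, 2))  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Same hardware, same binding width, but neighboring ranks end up with very different localities, which matters if neighbors communicate frequently.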

Keep in mind that complicating all of this is the fact that resource managers (RMs) may not let MPI have the entire set of processors on a given server (e.g., if multiple MPI jobs have been assigned to a single server).  And you may not know this until the MPI job is actually launched.  Ouch!

That being said, in an RM-based environment, I think it should be safe to assume that if you want to be able to bind MPI processes to entire sockets, it is the user’s responsibility to ask the RM for an entire socket per process.  Hence, the RM won’t launch your job until it has an entire socket for each process.

Good or bad assumption — what do you think?



  1. The vast majority of my jobs are limited by memory performance, so binding to NUMA nodes (usually = socket) is absolutely required. I rarely see a difference between binding to a socket and binding to individual cores; there may be a benefit of less daemon noise when binding to a socket, but that conclusion was within measurement error. Binding more locally than per-socket, such as to shared L2 caches, would be useful. Sometimes we run fewer processes per socket than there are cores per socket, because peak bandwidth is often attained with fewer processes (vendors are busy chasing Linpack instead of meaningful performance).

    • Excellent!  You just provided justification for an example binding pattern that I have been arguing to my fellow OMPI’ers that we need to support.  Pretend you have a 4-socket machine, each socket with 4 cores, and we can represent it like this:

      _ _ _ _ / _ _ _ _ / _ _ _ _ / _ _ _ _

      Further, let’s assume I have a memory-intensive, lightly-threaded MPI application (each MPI proc has 2 threads). Given peak memory bandwidths, I might want to bind it like this (each number represents an MCW rank):

      1 1 _ _ / 2 2 _ _ / 3 3 _ _ / 4 4 _ _

      I want to keep the procs bound to 2 cores on the socket because those 2 cores share an L2 cache — I wouldn’t want the threads bouncing over to the other 2 cores because that might end up thrashing my L3 cache that is shared between all 4 cores on the socket.
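A small Python sketch of this pattern (my own illustration; the function names are invented, not part of Open MPI): bind each rank to the first `width` cores of its own socket and leave the rest of the socket idle.

```python
# Illustrative sketch (not Open MPI code): each rank is pinned to the
# first `width` cores of its own socket, as in "1 1 _ _ / 2 2 _ _ / ...".

def binding(rank, cores_per_socket, width):
    """Global core indices that rank `rank` (1-based) is bound to."""
    base = (rank - 1) * cores_per_socket   # first core of this rank's socket
    return list(range(base, base + width))

def draw(nranks, cores_per_socket, width):
    """Render the binding pattern in the slash-separated notation above."""
    cells = ["_"] * (nranks * cores_per_socket)
    for r in range(1, nranks + 1):
        for core in binding(r, cores_per_socket, width):
            cells[core] = str(r)
    sockets = [" ".join(cells[s:s + cores_per_socket])
               for s in range(0, len(cells), cores_per_socket)]
    return " / ".join(sockets)

print(draw(4, 4, 2))  # 1 1 _ _ / 2 2 _ _ / 3 3 _ _ / 4 4 _ _
```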


  2. There are a few issues here:

    1. Does the application support threading? If not, then one has to do one MPI rank per core to utilize the machine efficiently. This also means the user sucks and needs to get a book on OpenMP ASAP.

    2. If the application supports threads, then the number of MPI ranks per node should be determined by the ratio of application communication to network resources. If one rank saturates NIC bandwidth, then one rank per node is appropriate. Given that most applications are not so communication-intensive, I think a good starting point is one MPI rank per socket if multithreading is done properly.
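The heuristic in that point can be sketched as a toy calculation (my own illustration, not from the comment; the function and bandwidth numbers are invented): cap the ranks per node at whatever the NIC can absorb, starting from one rank per socket.

```python
# Toy heuristic (invented for illustration): pick MPI ranks per node so
# the aggregate injection bandwidth of the ranks does not exceed the NIC.

def ranks_per_node(nic_bw_gbps, per_rank_bw_gbps, sockets):
    # How many ranks' worth of traffic fits through the NIC (at least 1)?
    limit = max(1, int(nic_bw_gbps // per_rank_bw_gbps))
    # Start from one rank per socket, capped by the NIC limit.
    return min(limit, sockets)

print(ranks_per_node(40, 5, 4))   # 4 -> one rank per socket fits the NIC
print(ranks_per_node(40, 30, 4))  # 1 -> a single rank saturates the NIC
```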

    3. Binding of processes to cores should be determined by the memory hierarchy. AMD tends to have pretty nasty NUMA effects and one has to be careful where the communication is generated from. If one has threads, it is critical to bind the processes which generate MPI communication to the cores closest to the NIC. If the memory hierarchy is flat, then binding is less important, but I don’t see any point to letting the MPI processes float around on the cores.

    4. Ordering of MPI_COMM_WORLD should be user-defined at boot like Blue Gene/P. For most systems, depth-first and breadth-first are sufficient, but one can get any regular configuration by setting stride parameters.
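The stride idea in point 4 can be sketched in a few lines of Python (my own illustration; `strided_order` is an invented name, not a BG/P or MPI interface): one parameter sweeps between depth-first and breadth-first orderings.

```python
# Toy sketch of stride-based rank ordering (invented for illustration).
# `stride` must divide `nslots`; stride=1 gives depth-first order, and
# stride=cores_per_socket gives breadth-first (round-robin by socket).

def strided_order(nslots, stride):
    """Slot visited by each rank when stepping through slots by `stride`."""
    ncols = nslots // stride
    return [(i % ncols) * stride + i // ncols for i in range(nslots)]

# 2 sockets x 4 cores = 8 slots:
print(strided_order(8, 1))  # [0, 1, 2, 3, 4, 5, 6, 7]  (depth-first)
print(strided_order(8, 4))  # [0, 4, 1, 5, 2, 6, 3, 7]  (breadth-first)
```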

    I may be answering slightly different questions than you asked because you didn’t bring up the thread issue, but at least for the modern users (not necessarily in HPC, where threads are still treated with skepticism by many) the critical issue is how to optimize MPI in the presence of threads.

    • You’re right, I didn’t explicitly mention threads, but it is definitely on our minds (regardless of whether it’s OpenMP or something else) — otherwise, binding to wider than a core wouldn’t be as useful.

      Without threaded MPI processes, a) binding to a core is likely sufficient, and b) mapping of MPI_COMM_WORLD ranks is (presumably) really a function of communication locality: put procs that communicate frequently close to each other.

      With threaded MPI processes, it’s more complicated because we have to find an area to place a given process that has both the right width (e.g., enough cores/PUs/whatever) and the right locality. For example: in a NUMA world, which proximity is more important: a shared L2/L3/NUMA node with local procs, or the NIC (especially when dealing with multiple MPI procs on the same server)? “It depends” is the only answer I can think of that makes sense; it depends on what the application(s) want(s)/need(s).

      …where “we” in the above paragraph might well be the MPI implementation, the resource manager, and/or both.

      Your point about BG/P is a good one; need to go think about that…