Cisco Blogs

SGE debuts topology-aware scheduling

January 23, 2010

I just ran across a great blog entry about SGE debuting topology-aware scheduling.  Dan Templeton does a great job of describing the need for processor topology-aware job scheduling within a server.  Many MPI jobs fit exactly within his description of applications that have “serious resource needs”: they typically require lots of CPU and/or network (or other I/O).  Hence, scheduling an MPI job intelligently not only across the network, but also across the resources inside each server, is pretty darn important.  It’s all about location, location, location!

Particularly as core counts in individual servers are going up.

Particularly as networks get more complicated inside individual servers. 

Particularly if heterogeneous computing inside a single server becomes popular.

Particularly as resources are now pretty much guaranteed to be non-uniform within an individual server.

These are exactly the reasons that, even though I’m a network middleware developer, I spend time with server-specific projects like hwloc — you really have to take a holistic approach in order to maximize performance.

I think it’s fabulous that the resource managers are starting to take up the topology-aware scheduling banner.  We’ve offered some degree of topology-aware deployment in Open MPI jobs for a long time (see Open MPI’s mpirun(1) man page for details), but we’d be extremely happy to hand off such responsibilities to a real resource manager in a heartbeat, if possible.  We initially did these things in Open MPI because other entities were not doing them, but it’s still a complex issue even if resource managers join in the topology-aware fray.

Linux (and other OSs) “sort of” try to keep individual processes on specific cores, but their (current) efforts are not good enough.  Without direct and consistent process pinning, a process can’t be absolutely guaranteed that the resources that it uses will be “close by” in a NUMA/NUNA environment.  Consider the following scenario:

  1. Process A starts on core X. 
  2. Process A looks around, determines that it’s close to resources J, K, and L, and therefore starts using them. 
  3. But then the OS moves process A to core Y — which happens to be far away from J, K, and L. 

Yoinks!  Sure, process A could tear down all its resource usage and start over, but wow — that’s a major pain in the keister.  And also very error prone.  And also race condition prone.  And potentially harmful from an overall performance perspective (e.g., drain, tear down, and re-create network connections).  Yoinks, indeed.

Far better would be to pin processes down from the very beginning so that locality-based choices can be made once and left in place for the duration of the process (and likely by extension, the entire parallel job).  This makes for simpler code, fewer race conditions, and better overall performance (at least from an HPC perspective!).
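
To make "pin from the very beginning" concrete, here is a minimal sketch of a process pinning itself before it makes any locality-based choices.  It assumes Linux and Python's os.sched_getaffinity() / os.sched_setaffinity(), which are not available on every platform.  The pin_to_core() helper is a name I made up for illustration, and defaulting to the lowest-numbered allowed core is an arbitrary choice; a real job would get its core from a launcher or resource manager.

```python
import os

def pin_to_core(core=None):
    """Pin the calling process to a single core and return that core's ID.

    Hypothetical helper for illustration.  If no core is given, the
    lowest-numbered core the process may currently run on is used.
    """
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    if core is None:
        core = min(allowed)
    elif core not in allowed:
        raise ValueError(f"core {core} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {core})  # from here on the OS may not migrate us
    return core

if __name__ == "__main__":
    chosen = pin_to_core()
    print(f"pinned to core {chosen}; affinity is now {sorted(os.sched_getaffinity(0))}")
```

Once the affinity set contains a single core, step 3 of the scenario above can no longer happen, so whatever the process decides about "nearby" resources stays valid for its lifetime.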

To be clear: this issue exposes a religious debate between application/middleware and operating system developers. 

  • The OS developers feel that the OS should magically figure all this stuff out for each process.  It should be able to watch a process for a short while, determine its resource use, and then schedule it to be near the resources that it can use.  Indeed, a process’ placement can change over time a) as its resource usage changes, and b) as the needs of other processes change over time.  After all, only the OS has a holistic view of the entire server: it is the only entity that can reasonably determine an optimal scheduling policy for all running processes because it’s the only one with all the knowledge.
  • Application / middleware developers feel that they have better a priori knowledge of what resources they will need.  Tools like hwloc are emerging that allow applications to query the server, see what resources are there, and then set themselves up to utilize those resources well — even if a particular job spans multiple threads or even multiple processes (which is typical for MPI jobs).  As described above, the OS approach of “wait to see what the process does and then schedule it appropriately” has been empirically shown to degrade performance — sometimes significantly.
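
As a rough illustration of the "query the server, see what resources are there" step, the sketch below reads which cores belong to which NUMA node.  To be clear, this is not hwloc and does not use its API; it is a much cruder stand-in that assumes a Linux system exposing /sys/devices/system/node/nodeN/cpulist files.  The function names parse_cpulist() and numa_nodes() are my own, and hwloc reports far more (caches, sockets, I/O devices) in a portable way.

```python
import glob
import os
import re

def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list (e.g., a node with no CPUs) yields nothing
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def numa_nodes(root="/sys/devices/system/node"):
    """Return {node_id: [core, ...]} read from sysfs, or {} if nothing is found."""
    nodes = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        match = re.fullmatch(r"node(\d+)", os.path.basename(path))
        if not match:
            continue
        with open(os.path.join(path, "cpulist")) as f:
            nodes[int(match.group(1))] = parse_cpulist(f.read())
    return nodes

if __name__ == "__main__":
    for node, cores in sorted(numa_nodes().items()):
        print(f"NUMA node {node}: cores {cores}")
```

An application could combine a map like this with its own knowledge of which threads share data in order to decide where each one should run.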

Granted, my view on this topic is somewhat skewed by typical HPC use-cases: processes are compute-bound and require extremely low latency to various resources (e.g., network I/O).  Hence, HPC apps tend to have very specific ideas and requirements about location.  HPC apps also tend to take over an entire server, making the scheduling and placement decisions in the OS somewhat moot; in such cases, the application has full knowledge of all relevant processes and can therefore make intelligent placement and resource allocation decisions.
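
When a job owns the whole server, the placement decision can be as simple as a table computed once at startup.  The sketch below is one hypothetical policy, not how Open MPI or any resource manager actually does it: it hands out cores to the job's local ranks by cycling across NUMA nodes, so that memory bandwidth is spread out.  The place_ranks() name and the {node: [cores]} input format are assumptions for illustration.

```python
def place_ranks(nodes, nranks):
    """Assign local ranks 0..nranks-1 to distinct cores, cycling across NUMA nodes.

    nodes maps a NUMA node ID to the list of core IDs on that node.
    Returns {rank: (node_id, core_id)}.
    """
    # Copy the core lists so the caller's data is not modified.
    free = {node: list(cores) for node, cores in sorted(nodes.items())}
    total = sum(len(cores) for cores in free.values())
    if nranks > total:
        raise ValueError(f"{nranks} ranks requested but only {total} cores available")
    placement = {}
    rank = 0
    while rank < nranks:
        for node, cores in free.items():
            if rank == nranks:
                break
            if cores:  # skip nodes whose cores are all taken
                placement[rank] = (node, cores.pop(0))
                rank += 1
    return placement

if __name__ == "__main__":
    topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
    for rank, (node, core) in place_ranks(topology, 4).items():
        print(f"rank {rank} -> NUMA node {node}, core {core}")
```

Whether spreading ranks across nodes or packing them together is better depends on the application; the point is only that the application can decide up front and then pin accordingly.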

Long ago, I had discussions with OS developers about exactly these kinds of topics.  We even tried to find some middle ground: perhaps a resource-hungry application could have some sort of handshake with the OS at or shortly after startup, telling the OS what kinds of resources it needs to be near and then letting the OS handle all the nitty-gritty placement details.  In this way, the application can share its a priori knowledge with the OS and still let the OS handle the overall placement/scheduling optimization across all processes on the machine.  We unfortunately never finished this discussion; it turned out to be surprisingly complex and full of monsters.

For example, the OS needs to share detailed knowledge of what resources are available (some OSs do this, some do not, although hwloc is starting to fill that void).  Even better would be for the OS to share utilization information for each resource.  But even creating definitions in this area is difficult at “best” because of the wide variety of types of resources.  In a trivial categorization, some resources are software and some are hardware.  How do you define “utilization” consistently across both?  Tough questions.

What do you think, gentle reader?  I’d love to hear your thoughts; please share any budding research and/or experimentation work in this area!
