Polling vs. blocking message passing progress

April 20, 2012 at 6:17 am PST

Here’s a not-uncommon question that we get on the Open MPI mailing list:

Why do MPI processes consume 100% of the CPU when they’re just waiting for incoming messages?

The answer is rather straightforward: because each MPI process polls aggressively for incoming messages (as opposed to blocking and letting the OS wake it up when a new message arrives).  Most MPI implementations do this by default, actually.
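
To make that concrete, here is a minimal sketch of the situation (a toy program, not taken from any particular MPI implementation): rank 1 is "just waiting" in a blocking MPI_Recv(), but top will typically show it at ~100% CPU for the whole wait, because the MPI library underneath is polling for the incoming message.

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            sleep(30);                    /* pretend to compute for a while */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* "Just waiting" -- but most MPI implementations poll inside
               this call, so this process burns ~100% CPU for 30 seconds. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }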

The reasons why they do this are a little more complicated, but loosely speaking, one reason is that polling helps get the lowest possible latency for short messages.

As a side note, this aggressive polling behavior is exactly why it is a terrible idea to run more MPI processes than you have processors on your machine (usually meaning “processor cores”, but sometimes meaning “hardware threads”).

But let’s talk about why MPI implementations poll aggressively for network traffic, because this flies in the face of conventional userspace / operating system wisdom.

Let’s divide this up into two categories: TCP/UDP sockets-based APIs and most other kinds of network APIs.  And then let’s further divide those into two more categories: latency and bandwidth optimizations.

1. TCP latency: Ethernet NICs, switches, and Linux kernel traps have gotten a lot faster over the past few years.  And Ethernet NICs have also modernized how their kernel drivers check for progress (e.g., memory-mapping queue doorbell locations).  Hence, non-blocking polling for TCP socket progress is a reasonable mechanism for decreasing short message latency on modern hardware.  While polling and blocking incur the same kernel trap, blocking de-schedules the process and then later re-schedules it to run.  Modern Linux kernels have gotten really good (i.e., fast) at this, but it’s still a non-zero amount of time compared to not being de-scheduled at all.  Plus, CPU caches and registers may have to be re-loaded if a process is de-scheduled for a while.  All of these overheads from blocking add up and can increase short message latency.  (See the first sketch after this list.)

2. TCP bandwidth: Blocking used to be the best way to get the highest bandwidth (usually for long messages) over TCP.  But the above-mentioned modern Ethernet NIC cheap-progress-checking mechanisms have driven down the cost of polling on TCP networks.  That being said, blocking still likely results in marginally better bandwidth due to lack of kernel/userspace interference (i.e., the kernel may be able to keep send queues filled better if it isn’t contending with userspace for CPU cycles).

3. Other network latency: The key difference between traditional TCP sockets and other high-performance network APIs is that the latter expose NIC hardware controls to userspace.  Hence, userspace can directly control the network device by enqueuing new messages to send, checking for the completion of incoming messages, etc.  Specifically: a) checking for progress is fast/cheap, and b) there’s no kernel interaction at all.  Userspace’s best policy for knowing when new traffic has arrived is therefore to continually poll a hardware progress indicator, allowing it to be notified as soon as possible when a message arrives.  This strategy leads to the best (i.e., smallest) short message passing latency possible — 1 or 2 microseconds for half-round-trip MPI short messages on some networks.  (See the second sketch after this list.)

4. Other network bandwidth: Active polling on hardware progress doesn’t necessarily affect the maximum achievable bandwidth.  Once a message is given to a NIC, the NIC usually wholly takes care of sending it, regardless of whether userspace is checking for progress or not.  Active polling can help maximize bandwidth, however, if many short messages are being sent.  Active polling can both keep completion queues drained and keep injecting new messages into send queues as space becomes available.  Hence, the sending pipeline is kept as full as possible.  (See the third sketch below.)
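
Here are the sketches promised above.  First, for #1: the same readiness check on a TCP socket, done two ways (sockfd stands in for an already-connected socket; this is not code from any particular MPI implementation).  The only difference between the two calls is the timeout, but the blocking variant de-schedules the process while it waits.

    #include <poll.h>

    /* Polling: ask the kernel whether data is ready and return immediately.
       The process is never de-scheduled. */
    int tcp_check_once(int sockfd)
    {
        struct pollfd pfd = { .fd = sockfd, .events = POLLIN };
        return poll(&pfd, 1, 0);     /* timeout of 0: just a quick check */
    }

    /* Blocking: sleep in the kernel until data arrives.  The process is
       de-scheduled and later re-scheduled, and may lose warm caches along
       the way. */
    int tcp_wait_blocking(int sockfd)
    {
        struct pollfd pfd = { .fd = sockfd, .events = POLLIN };
        return poll(&pfd, 1, -1);    /* timeout of -1: block indefinitely */
    }

Second, for #3: a sketch of userspace polling on a high-performance NIC, using the OpenFabrics verbs API as one example (cq is assumed to be an already-created completion queue; error handling is omitted).  The progress check is just a userspace read of NIC state -- no kernel trap at all.

    #include <infiniband/verbs.h>

    void wait_for_completion(struct ibv_cq *cq)
    {
        struct ibv_wc wc;

        /* ibv_poll_cq() reads a completion queue that the NIC updates and
           that is mapped into userspace; no system call is involved. */
        while (ibv_poll_cq(cq, 1, &wc) == 0) {
            /* spin -- this loop is the 100% CPU you see in top */
        }

        /* wc now describes the completed send or receive. */
    }

Third, for #4: a sketch of the pipelining pattern (post_next_send() and reap_completions() are hypothetical stand-ins for whatever the underlying network API provides; the queue depth is arbitrary).  The point is simply that active polling lets the sender refill the send queue as soon as slots free up.

    /* Keep up to MAX_OUTSTANDING sends in flight by alternating between
       draining completions and posting new sends. */
    #define MAX_OUTSTANDING 64

    void send_many(int num_msgs,
                   int (*post_next_send)(void),   /* returns 0 on success */
                   int (*reap_completions)(void)) /* returns # completed  */
    {
        int posted = 0, completed = 0;

        while (completed < num_msgs) {
            /* Drain the completion queue so that send-queue slots free up. */
            completed += reap_completions();

            /* Refill the send queue while there is room and work left. */
            while (posted < num_msgs &&
                   posted - completed < MAX_OUTSTANDING &&
                   post_next_send() == 0) {
                ++posted;
            }
        }
    }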

There are other reasons why MPI implementations poll for progress, too.  Some networks don’t have blocking progress mechanisms (e.g., when using shared memory between two processes for communication — as opposed to a NIC and network).  Other networks have shallow send queues, so the MPI needs to poll on send message completion in order to keep injecting new messages into the NIC.

And so on.
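
For the shared memory case just mentioned, here is a sketch of why there is nothing to block on (the flag is assumed to live in a shared-memory region that both processes have mapped): the "message arrived" indicator is just a word in memory, so there is no file descriptor or interrupt to hand to the OS, and the only way to notice the message promptly is to keep reading that word.

    #include <stdatomic.h>

    /* flag points into a shared-memory region; the sender sets it to 1
       after copying its message into a shared buffer. */
    void wait_for_shmem_message(atomic_int *flag)
    {
        /* Nothing to block on -- just poll the flag. */
        while (atomic_load(flag) == 0) {
            /* spin */
        }
    }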

Some MPI implementations will actively poll for progress for a while, and if nothing happens, shift to a blocking mode.  For MPI applications that spend significant time waiting for messages, that may even translate to some power savings if the CPU is able to spin down to a lower frequency.
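
A sketch of that "poll for a while, then block" idea (the spin count is arbitrary, sockfd is again a stand-in for whatever the transport gives you to block on, and real MPI implementations hook this into their progress engines):

    #include <poll.h>

    void progress_wait(int sockfd)
    {
        struct pollfd pfd = { .fd = sockfd, .events = POLLIN };

        /* Phase 1: poll aggressively for a bounded number of iterations,
           keeping latency low if a message shows up soon. */
        for (int i = 0; i < 100000; ++i) {
            if (poll(&pfd, 1, 0) > 0) {
                return;              /* something arrived while spinning */
            }
        }

        /* Phase 2: nothing arrived; give up the CPU and block until the
           kernel wakes us -- possibly letting the CPU drop to a lower
           frequency in the meantime. */
        poll(&pfd, 1, -1);
    }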

That being said, if an MPI application is waiting a long time for messages, perhaps its message passing algorithm should be re-designed to be a bit more efficient in terms of communication and computation overlap.  :-)
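
For example (a generic sketch, not specific to any application; do_useful_work() is a hypothetical placeholder for the application's computation), overlap usually means posting the receive early, computing while the message arrives, and only then waiting:

    #include <mpi.h>

    void recv_with_overlap(double *buf, int count, int src,
                           void (*do_useful_work)(void))
    {
        MPI_Request req;

        /* Post the receive early... */
        MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req);

        /* ...compute while the message is (hopefully) arriving... */
        do_useful_work();

        /* ...and only wait (i.e., poll) once there is nothing else to do. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }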



4 Comments.


  1. Perhaps you should first answer this question:

    Why do MPI users put blocking MPI_Recv calls into their code if they expect calling processes to wait for incoming messages long enough to notice the time spent polling?

    :-)


  2. April 23, 2012 at 7:19 am

    I think the more common case we get asked about is for oversubscription (although that isn’t very common, either). I.e., users want to run in a test/development mode on their desktop/laptop with oodles of MPI processes — waaaay more than they have x86-64 cores, for example.


  3. What about the mpi_yield_when_idle option in Open MPI? Does it go back to the interrupt method? Should this be used by default when oversubscribing?
    http://www.msg.chem.iastate.edu/GAMESS/

    Specifically calls for oversubscription in their code, as some processes don’t do much, I guess. Wish I knew more about it to help my users.


  4. April 23, 2012 at 1:38 pm

    mpi_yield_when_idle was only ever semi-useful. All it did was call sched_yield() in the middle of its progression loop (it did not switch into a blocking mode). Calling sched_yield() would allow MPI processes to voluntarily yield the scheduling slot on the CPU; presumably, the OS would then pick a different process to run. If you oversubscribe (i.e., run more MPI processes than you have processors), that scheme might allow some modicum of progress across all your MPI processes, but there was no guarantee of fairness. And your processes would likely migrate all over the machine, destroying any cache and memory locality that they might have had. And if you ended up swapping, fuggedaboudit.
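
    Roughly, the idea was just this (a sketch only, not the actual Open MPI code; "progress" stands in for the library's internal progress function, which returns the number of events it completed):

        #include <sched.h>

        void progress_until_event(int (*progress)(void), int yield_when_idle)
        {
            while (progress() == 0) {
                if (yield_when_idle) {
                    /* No blocking here: just offer the CPU to another
                       runnable process (which modern kernels may largely
                       ignore). */
                    sched_yield();
                }
            }
        }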

    Of course, if you’re just developing and debugging on a single node (e.g., a laptop) with toy problem sizes, these might be acceptable trade-offs.

    That being said, sched_yield() got pretty well neutered in recent Linux kernels. That is, there was a big kerfuffle on the LKML about what exactly sched_yield() was actually supposed to do, particularly on each of the different Linux kernel schedulers. The conclusion that they came to was that sched_yield() has essentially become a no-op.
