Polling vs. blocking message passing progress
Here’s a not-uncommon question that we get on the Open MPI mailing list:
Why do MPI processes consume 100% of the CPU when they’re just waiting for incoming messages?
The answer is rather straightforward: because each MPI process polls aggressively for incoming messages (as opposed to blocking and letting the OS wake it up when a new message arrives). Most MPI implementations do this by default, actually.
The reasons why they do this is a little more complicated, but loosely speaking, one reason is that polling helps get the lowest latency possible for short messages.
As a sidenote, this aggressive polling behavior is exactly why it is a terrible idea to run more MPI processes than you have processors on your machine (usually meaning “processor cores”, but sometimes meaning “hardware threads”).
But let’s talk about why MPI implementations poll aggressively for network traffic, because this flies in the face of conventional userspace / operating system wisdom.
Let’s divide this up in to two categories: TCP/UDP sockets-based APIs and most other kinds of network APIs. And then let’s further divide those into two more categories: latency and bandwidth optimizations.
1. TCP latency: Ethernet NICs, switches, and Linux kernel traps have gotten a lot faster over the past few years. And Ethernet NICs have also modernized how their kernel drivers check for progress (e.g., memory-mapping queue doorbell locations). Hence, non-blocking polling for TCP socket progress is a reasonable mechanism for decreasing short message latency on modern hardware. While polling and blocking effect the same kernel trap, blocking de-schedules and the later re-schedules the process to run. Modern Linux kernels have gotten really good (i.e., fast) at this, but it’s still a non-zero amont of time compared to not being de-scheduled at all. Plus, CPU caches and registers may have to be re-loaded if a process is de-scheduled for a while. All of these overheads from blocking add up and can increase short message latency.
2. TCP bandwidth: Blocking used to be the best way to get the highest bandwidth (usually for long messages) over TCP. But the above-mentioned modern Ethernet NIC cheap-progress-checking mechanisms have driven down the cost of polling on TCP networks. That being said, blocking still likely results in marginally better bandwidth due to lack of kernel/userspace interference (i.e., the kernel may be able to keep send queues filled better if it isn’t contending with userspace for CPU cycles).
3. Other network latency: The key difference between traditional TCP sockets and other high-performance network APIs is that the latter exposes NIC hardware controls to userspace. Hence, userspace can directly control the network device by enqueing new messages to send, checking for the completion of incoming messages, etc. Specifically: a) checking for progress is fast/cheap, and b) there’s no kernel interaction at all. Userspace’s best policy for knowing when new traffic has arrived is therefore to continually poll a hardware progress indicator, allowing it to be notified as soon as possible when a message arrives. This strategy leads to the best (i.e., smallest) short message passing latency possible — 1 or 2 microseconds for half-round-trip MPI short messages on some networks.
4. Other network bandwidth: Active polling on hardware progress doesn’t necessarily affect the maximum achievable bandwidth. Once a message is given to a NIC, the NIC usually wholly takes care of sending it, regardless of whether userspace is checking for progress or not. Active polling can help maximize bandwidth, however, if many short messages are being sent. Active polling can both keep completion queues drained and keep injecting new messages into send queues as space becomes available. Hence, the sending pipeline is kept as full as possible.
There are other reasons why MPI poll for progress, too. Some networks don’t have blocking progress mechanisms (e.g., when using shared memory between two processes for communication — as opposed to a NIC and network). Other networks have shallow send queues, so the MPI needs to poll on send message completion in order to keep injecting new messages into the NIC.
And so on.
Some MPI’s will actively poll for progress for a while, and if nothing happens, shift to a blocking mode. For MPI applications with significant time spent waiting for messages, that may even translate to some power savings if the CPU is able to spin down to a lower frequency.
That being said, if an MPI application is waiting a long time for messages, perhaps its message passing algorithm should be re-designed to be a bit more efficient in terms of communication and computation overlap.Tags: