MPI newbie: What is "operating system bypass"?

The term “operating system bypass” (or “OS bypass”) is typically tossed around in MPI and HPC conversations; it’s generally something that is considered a “must have” in order to get good performance with many MPI applications.

But what is it? And if it’s good for performance, why don’t all applications use OS bypass?

The usual model for accessing networking hardware (e.g., Network Interface Cards, or NICs) is to make userspace API calls, such as the TCP socket calls bind(2), connect(2), accept(2), read(2), and write(2), which then trap down into the operating system. Eventually, a device driver is invoked that knows how to talk to the specific NIC hardware that is present in the computer.

This is a well-proven model, and is how nearly all applications work outside of HPC.

There are (at least) two big reasons that HPC applications would prefer not to use this model:

1. Trapping down into the kernel, traversing the entire OS networking software stack, and ultimately ending up in a specific device driver is… “slow.” I say “slow” in quotes because it’s not actually slow; it works great for 99.99999% of the world’s applications. But HPC applications that need ultra-low latency for short network messages see the time added by these actions and think, “We can avoid all of that.”
2. While the spectrum of requirements from the entire HPC ecosystem is quite large, many HPC applications share some common characteristics. For example: a single running HPC job does not need to interoperate with a wide variety of hardware, it does not need to communicate over the WAN, it typically only communicates with a small number of peers, and so on. In short: many assumptions can be made about what a typical HPC application will not do, and therefore much of the handling in the OS’s general-purpose networking stack is unnecessary.

Put differently: the specialized nature of HPC applications obviates the need for general-purpose networking behavior, thereby allowing the use of smaller, highly specialized, and extremely efficient network stacks (that are constrained to a specific set of assumptions).

These software stacks live in userspace middleware libraries (such as MPI), and can therefore expose extremely high levels of network I/O performance to HPC applications. Since these libraries communicate directly with NIC hardware, they effectively bring mini specialized “device drivers” up into userspace.

As a userspace “device driver,” such libraries directly inject network traffic into NIC hardware resources. Likewise, high performance NICs typically can steer inbound MPI traffic directly to the target MPI process. Meaning: there is no need to dispatch inbound traffic to the final target MPI process in software (which would be slower).

Bypassing the OS network stack in this way can result in extremely low latency for short messages, which can be a key factor in overall HPC application performance (remember: many HPC applications need to exchange short messages frequently).

These performance gains definitely have a cost: the loss of flexibility.

For example, modern MPI libraries tend to make assumptions about being able to fully utilize CPU cores to spin on network hardware resources to check for progress. This is great for HPC applications where there will only be one process per CPU core, but would be horrible outside of those assumptions (e.g., in a heavily oversubscribed virtualized environment).

Additionally, the level of wire protocol interoperability is usually quite low: an individual process in a running MPI job, for example, typically assumes that all its peers are speaking the same wire protocol. It may even assume that all of its peers are using NICs from the same vendor — possibly even the exact same firmware level.

Such assumptions lead to simplifications in performance-critical code paths, which helps further reduce the latency of short messages.

Because of these kinds of factors, OS-bypass techniques (and the code path simplifications and other optimizations that typically accompany them) are only suitable in controlled environments where many assumptions and restrictions can be made. While this is fine for HPC applications, it is simply not practical for general-purpose networking applications.