Network hardware offload
Sorry for the lack of activity here this month, folks. As usual, December is the month to recover from SC and catch up on everything else you were supposed to be doing. So I’ll try to make up for it with a small-but-tasty Christmas morsel. Then I’ll disappear for a long winter’s nap; you likely won’t see me until January (shh! don’t tell my wife that I’m working today!).
The topic of my musing today is one that has come up multiple times in conversation over the past two weeks. Although I’m certainly not the only guy to talk about this on the interwebs, today’s topic is server-side hardware offload of network communications.
“Ah!” says the typical cluster-HPC user. “You mean RDMA, like InfiniBand!” (some people might even remember to cite OpenFabrics, which includes iWARP).
No, that’s not what I mean, and that’s one of the points of this entry.
The hardware offload that I’m referring to is a host-side network adapter that does most of the networking “work” so that the server’s main CPU(s) don’t have to. In this way, you can have dedicated (read: very fast/optimized) hardware do the heavy lifting while the rest of the server’s resources are free to do other stuff. Among other things, this means that the main CPU(s) don’t have to process all that network traffic, protocol, and other random associated overhead. Depending on the network protocol used, offloading to dedicated hardware may or may not save a lot of processing cycles. Sending and receiving TCP data, for example, may take a lot of cycles in a software-based protocol stack. Sending and receiving raw Ethernet frames may not (YMMV, of course, depending on your networking hardware, server hardware, operating system, yadda yadda yadda).
That being said, it’s not just processor cycles that are saved. Caches, both instruction and data, are less likely to be thrashed. Interrupts may be fired less frequently. There may be (slightly) less data transferred across internal buses. …and so on. All of these things add up: server-side network hardware offload is a Good Thing; it can make a server generally more efficient because of the combination of several effects.
Hardware offload is frequently associated with operating system (OS) bypass techniques. The rationale here is that trapping down into the operating system is sometimes “slow” — you can save a little time by skipping the OS layer and communicating directly with networking hardware from user space. This is somewhat of a contested topic; some people [fervently] believe that OS bypass is necessary for high performance. Others believe that modern OS’s provide fast enough access from user space to the networking drivers such that the complexities of OS-bypass methods just aren’t worth it. This is actually quite an involved topic; I won’t attempt to unravel it today.
Where were we? …oh yes, network offload.
Over the years, many MPI implementations have benefited from one form of network offload or another. MPI implementations that take advantage of hardware offload typically not only increase efficiency as described above, but also provide true communication / computation overlap (C/C overlap issues have been discussed in academic literature for many years). True overlap allows a well-coded MPI application to start a particular communication and then go off and do other application-meaningful stuff while the MPI implementation (really the network offload hardware, for the purposes of this blog entry) progresses most, if not all, of the message passing independent of the main server processor(s).
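The start / do-other-stuff / complete pattern can be sketched without any real network. To be clear, this is a toy Python simulation and not MPI: a background thread stands in for the offload hardware, and fake_transfer, compute, and overlapped are hypothetical names I made up for illustration. In a real MPI application, the analogous steps would be a nonblocking call such as MPI_Isend followed later by MPI_Wait.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_transfer(payload: bytes) -> int:
    """Hypothetical stand-in for the offload hardware moving a message."""
    time.sleep(0.05)      # pretend the wire is busy for a while
    return len(payload)   # number of bytes "sent"


def compute(n: int) -> int:
    """Application-meaningful work that does not depend on the message."""
    return sum(i * i for i in range(n))


def overlapped(payload: bytes, n: int):
    # One worker thread plays the role of the dedicated network hardware.
    with ThreadPoolExecutor(max_workers=1) as offload:
        request = offload.submit(fake_transfer, payload)  # start the communication
        result = compute(n)                               # do other work meanwhile
        sent = request.result()                           # block until it completes
    return result, sent


if __name__ == "__main__":
    print(overlapped(b"x" * 4096, 1000))
```

The point of the sketch is only the shape of the code: the application never touches the message between starting it and waiting on it, which is what gives the offload engine room to make progress on its own.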
Network offload is typically most beneficial with long messages: sending a short message in its entirety can frequently cost just as much as starting a communication action and then polling it for completion later. The effective overlap for short messages can be negligible (or even negative). Hence, the biggest “win” of hardware offload is for well-coded applications that send and receive large messages. That being said, small messages may also benefit from hardware offload, such as when it comes with deep server-side hardware buffering and the ability to continue progressing flow control and protocol issues independent of the main processor.
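The short-versus-long argument can be made concrete with a back-of-the-envelope model. Everything in it is an assumption for illustration: the latency, bandwidth, and the fixed costs of starting and polling a communication are invented numbers, not measurements of any real network, and the function names are hypothetical.

```python
def blocking_time(nbytes, compute_s, latency_s=1e-6, bandwidth_Bps=1e9):
    """Send the whole message, wait for it to finish, then compute."""
    return latency_s + nbytes / bandwidth_Bps + compute_s


def overlapped_time(nbytes, compute_s, latency_s=1e-6, bandwidth_Bps=1e9,
                    start_s=0.6e-6, poll_s=0.6e-6):
    """Start the send, compute while it progresses, then poll for completion."""
    transfer_s = latency_s + nbytes / bandwidth_Bps
    # The CPU pays to start and to poll; transfer and compute run side by side.
    return start_s + max(transfer_s, compute_s) + poll_s


def overlap_gain(nbytes, compute_s):
    """Seconds saved by overlapping (negative means overlap made things worse)."""
    return blocking_time(nbytes, compute_s) - overlapped_time(nbytes, compute_s)


if __name__ == "__main__":
    for nbytes in (8, 1_000_000):
        print(nbytes, overlap_gain(nbytes, compute_s=1e-3))
```

With these made-up defaults, the 8-byte message comes out slightly negative (the start and poll costs exceed the transfer they hide), while the 1 MB message hides nearly the whole transfer behind the computation. Change the assumed numbers and the crossover point moves, which is the YMMV part.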
All this being said, note that Remote Direct Memory Access (RDMA) is a popular / well-known flavor of hardware offload these days, but it is only one of many. Vendors have churned out various forms of network hardware offload over the past 20+ years. Indeed, there has been plenty of academic discussion over the past few years about returning to the idea of using a “normal” CPU/processor in “dedicated network” mode (fueled by the fires of manycore, of course): if you have scads and scads of cores, who’s going to miss one [or more?] of them? Dedicate a few of them to act as the network proxies for the rest of the cores. Such schemes have both benefits and drawbacks, of course (and it’s been tried before, but not necessarily in exactly the same context as manycore). The jury’s still out on how both the engineering and market forces will affect these ideas.
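The “dedicate a core as a network proxy” idea can be sketched as a single worker that drains a queue of send requests on behalf of everyone else. Again, this is only a toy under loud assumptions: a Python thread stands in for the dedicated core, a plain list stands in for the wire, and NetworkProxy, post_send, shutdown, and demo are hypothetical names, not any real library’s API.

```python
import queue
import threading


class NetworkProxy:
    """Toy 'dedicated network core': one thread progresses all sends."""

    def __init__(self):
        self._requests = queue.Queue()
        self.wire = []  # hypothetical stand-in for the network
        self._worker = threading.Thread(target=self._progress, daemon=True)
        self._worker.start()

    def post_send(self, dest, payload):
        """Hand a message to the proxy; returns an Event set on completion."""
        done = threading.Event()
        self._requests.put((dest, payload, done))
        return done

    def _progress(self):
        # Runs only on the proxy thread, so compute threads never touch the wire.
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            dest, payload, done = item
            self.wire.append((dest, payload))
            done.set()

    def shutdown(self):
        self._requests.put(None)
        self._worker.join()


def demo():
    proxy = NetworkProxy()
    handles = [proxy.post_send(rank, f"hello {rank}".encode()) for rank in range(4)]
    # ... the compute cores would go do other work here ...
    for handle in handles:
        handle.wait()
    proxy.shutdown()
    return proxy.wire


if __name__ == "__main__":
    print(demo())
```

The drawback side of the ledger is visible even in the toy: every message now pays for a queue handoff, and the proxy core does no application work at all.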
MPI will use whatever is available when trying to attain high performance, including hardware offload (such as RDMA). But to be totally clear: RDMA is not what enables high performance in MPI; hardware offload is one way to attain high performance in an MPI implementation. RDMA just happens to be among the most recent flavors of network hardware offload.