Sockets vs. MPI
I briefly mentioned in a prior blog entry that I’m on a panel at the Hot Interconnects conference this Wednesday evening entitled, “Stuck with Sockets: Why is the network programming interface still from the 1980s?”.
The topic is an interesting one: sockets are, by far, the dominant user-level networking abstraction. Countless millions (billions?) of lines of code exist that use various forms and features of the BSD sockets API (there are other sockets APIs, but let’s limit the discussion here to just the BSD API for the sake of brevity). But networking hardware (and software!) has advanced significantly since the sockets API was introduced. Several attempts have been made to advance the state of the art of the sockets API, up to and including replacing it with something else, but none have succeeded. Why? Is the inertia of existing sockets code simply too great to overcome?
MPI made significant inroads in the HPC community and has effectively displaced sockets (and just about anything else) as the predominant user-level message passing API.
- How did this happen, especially given that sockets have not been displaced anywhere else?
- Why hasn’t MPI displaced sockets in any other environment?
At its core, MPI provides a convenient, high-level abstraction for sending and receiving messages (not streams!). Just call MPI_SEND with an integer identifier of the receiver and the message you want to send, and magic occurs: the message arrives at the destination. The MPI application programmer need not manage connections, streams, pipelining, multiplexing, … or any of a hundred other issues that must be handled to get high network throughput. MPI application programmers can therefore focus on the computational problem that they’re trying to solve, not the underlying network.
When pairing the simplicity of the MPI API with its portability across multiple platforms and hard-earned reputation for high performance, MPI became an obvious choice for many HPC applications. For example, why would you want to deal with the complexity of streams when you can use messages instead? This is how MPI “won” in the HPC space.
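To make the "complexity of streams" point a bit more concrete, here is a rough sketch of what a sockets programmer typically has to write just to get message boundaries back out of a byte stream. It is plain Python over a local `socket.socketpair()`, and the helper names (`send_msg`, `recv_exact`, `recv_msg`, `demo`) and the 4-byte length prefix are my own assumptions for illustration, not anything taken from an MPI implementation.

```python
import socket
import struct


def send_msg(sock, payload: bytes) -> None:
    """Send one message: a 4-byte big-endian length, then the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes; a stream recv() may return fewer than asked."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock) -> bytes:
    """Receive one message framed by send_msg()."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def demo():
    """Send two messages over a stream and recover them as two messages."""
    a, b = socket.socketpair()  # a connected pair of stream sockets
    try:
        send_msg(a, b"hello")
        send_msg(a, b"world")
        return [recv_msg(b), recv_msg(b)]
    finally:
        a.close()
        b.close()
```

Without the length prefix and the read loop, the receiver could just as easily see `b"hellowor"` followed by `b"ld"`. With MPI, the library owns all of this bookkeeping; the application only ever sees whole messages.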
Another driving force for MPI in HPC (instead of sockets) is that customized, proprietary networks were created to give extremely high throughput and low latency. Such networks were specifically designed with different abstractions than those offered by sockets; emulating sockets on these networks was therefore inefficient and sometimes outright slow. Under the covers, MPI implementations could use the “native” network abstractions to achieve much higher performance than sockets.
This is still true of many of today’s HPC-oriented networks.
However, one major factor holding back the use of MPI in other types of environments is that MPI implementations tend to lack robustness in the event of failure. Indeed, the MPI standard specifically shies away from defining error behavior (although this may change in MPI-3). If you can’t provide the same level of robustness that the rest of the world (i.e., non-HPC programmers) can expect with sockets, there is little incentive to displace sockets.
Robustness is paramount outside of HPC: the crash of a single client cannot be allowed to take down a production database, for example. MPI applications tend to be single-user/job in nature; the issues of multi-user behavior have not been too relevant to MPI implementation correctness and robustness. However, the growing scale of MPI applications is finally forcing the issue. When running an MPI job on thousands of servers, failures happen, and MTBF (mean time between failures) issues simply cannot be ignored. The most common solution to MTBF issues is “fail the job and restart it,” which leads to the corollary that applications must checkpoint themselves in order to be able to restart in the middle of their job instead of starting completely over again.
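For readers who haven't seen the "fail the job and restart it" pattern, here is a minimal, single-process sketch of application-level checkpointing. Everything in it is an assumption made for illustration: the JSON file format, the function names, the `every` interval, and the `fail_at` knob that simulates a failure. A real MPI job would have to coordinate checkpoints across all of its processes, which this sketch does not attempt.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint
    so that a crash mid-write never leaves a half-written file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, every=10, fail_at=None):
    """Run (or resume) the job, checkpointing every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += step  # stand-in for the real computation
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If a run dies partway through, calling `run()` again with the same path picks up from the last saved step rather than from step 0; only the work done since the last checkpoint is repeated.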
Given that the MPI Forum is looking into robustness and error recovery issues for MPI-3, will MPI become a compelling story for non-HPC networking codes?
I don’t know. But it sure is interesting to think about.