Today we feature a deep-dive guest post from Scott Atchley, HPC Systems Engineer in the Technology Integration Group at Oak Ridge National Laboratory. This post is part 1 of 2.
In the world of high-performance computing, we jump through hoops to extract the last bit of performance from our machines. The vast majority of processes use the Message Passing Interface (MPI) to handle communication. Each MPI implementation abstracts the underlying network away, depending on the available interconnect(s). Ideally, the interconnect offers some form of operating system (OS) bypass and remote memory access in order to provide the lowest possible latency and highest possible throughput. If not, MPI typically falls back to TCP sockets. The MPI's network abstraction layer (NAL) then optimizes the MPI communication pattern to match that of the interconnect's API. For similar reasons, most distributed, parallel filesystems such as Lustre, PVFS2, and GPFS, also rely on a NAL to maximize performance.
Each of these software stacks maintains its own NAL and incorporates new interconnect APIs as needed. Any new applications that wish to run on high-performance clusters or supercomputers must choose between using MPI, TCP sockets, or the interconnect's native API. Each has their strengths and weaknesses.
MPI: As outlined above, MPI is the default choice for most HPC applications. MPI provides a rich interface of point-to-point communication, collective operations, and remote memory access. Every MPI implementation tunes its NAL to achieve good performance. Unfortunately, MPI is a specialized API -- most programmers have not used it. The matching interface can be quite powerful, but also confusing to new programmers. The connection model is communication groups (i.e., communicators) which implementations have not made robust (e.g., any single communication error between a pair of hosts can bring down the whole MPI job).
TCP sockets: The TCP sockets interface, on the other hand, has been around since the 80s and runs everywhere. It is well known and most programmers have experience with it. SOCK_STREAM (TCP) provides a reliable, in-order stream semantic that mirrors the UNIX "everything is a file" semantic. It also provides a simple to understand connection model with client and server. By default, the send and receive calls are blocking, but sockets can be used in a non-blocking mode. The downside is that sockets was designed in the 80s when networks were a lot slower and it was tailored for IP networking (especially TCP and UDP). All communication goes through the kernel which increases latency and all communication requires multiple copies which reduces throughput. Buffering for SOCK_STREAM is per connection as is checking for incoming messages.
Native: Lastly, a developer can port to the interconnect's native API. This generally will lead to the best performance, but will require more work in the future if he ever wants to run on any other modern interconnect. These APIs are typically more complicated to use than the sockets API since they are designed with a particular set of hardware specifications in mind. Even fewer programmers have experience using native APIs than those using MPI.
We believe that there is a need for a common NAL that can serve both HPC and more general applications. A new NAL would need to be simple to use, portable to any interconnect (and provide optimal performance for that interconnect), scale to the largest HPC systems, and provide robustness in the presence of faults.
Want to know what this common NAL is? Stay tuned for part 2!