Today we feature part 2 of 2 in a deep-dive guest post from Scott Atchley, HPC Systems Engineer in the Technology Integration Group at Oak Ridge National Laboratory.
Given the goals described in part 1, we are developing the Common Communication Interface (CCI) as an open-source project for use by any application that needs a NAL. Note that CCI does not replace MPI since it does not provide matching or collectives. But it can be used by an MPI, probably as its NAL (likewise by a parallel file system). For applications that rely on the sockets API, it can provide improved performance when run on systems with high-performance interconnects and fall back on actual sockets when not.
How does the CCI design meet the previously-described criteria for a new NAL?
Simplicity: The API is compact. It uses a client/server semantic — like sockets — except that every endpoint can initiate and accept connections by default. An application can send a message (MSG) up to a certain size or initiate a remote memory access (RMA) of any size. The API is asynchronous and event-driven. The application can poll the endpoint for completed events or block on a OS-specific handle before retrieving events.
Portability: We have implemented multiple transports under the top-level CCI API for use everywhere, including: sockets, OFA Verbs for InfiniBand and RoCE (i.e., IB frames over lossless Ethernet), Cray GNI for Gemini, Cray Portals for SeaStar, and a native CCI transport over Myricom Ethernet. We are also working on a Linux Ethernet driver that can be used with any Ethernet NIC and does not require lossless Ethernet. It bypasses the IP stack and interacts directly with the Ethernet interface layer in the Linux kernel.
Performance: Our goal is to impose as little overhead on the native API as possible. By default, CCI is thread-safe — which therefore requires locking. We typically see an overhead of 400-600 ns when compiled as thread-safe. Without thread-safety, CCI adds 100-150 ns overhead. When using RMA, CCI’s throughput performs as well as the native API. When run over UDP sockets, we typically see 18-20 us half-round trip (HRT) latency. The above-mentioned Linux Ethernet kernel transport incurs 6-7 us HRT latency. RoCE gives about 3-5 us HRT latency, and vendor-specific Ethernet mechanisms (such as the Myricom native CCI transport) incurs about 2 us HRT latency.
Scalability: The CCI endpoint is the container of all network resources (e.g., send and receive buffers, completion queue, etc.); they do not increase with the number of connections. Although CCI uses connections, the typical connection state is limited to around 100 bytes. The application need only poll on the endpoint regardless of the number of connections to that endpoint.
Robustness: Because CCI uses connections, a single fault only disrupts that connection and not all other communication involving other peers. The application can optionally request that CCI send periodic keep-alive messages to determine if the peer process is still running. Reliable communications will time out if CCI cannot deliver them, and receivers can indicate when they do not have resources available to receive.
This barely scratches the surface of what CCI offers, but hopefully you get the general idea of what we are working to achieve. CCI is an open-source project under development by several organizations including Oak Ridge National Laboratory, Inria, the University of Tennessee at Knoxville, Myricom, Parallel Scientific, and others.
Please visit www.cci-forum.com for more information.