I met Jeff at EuroMPI in September, and he has invited me to write a few words on my experience of developing an MPI library.
My PhD involved building a message-passing library in C#: not a wrapper that accesses an existing MPI library from C# code, but a brand new MPI library written entirely in pure C#. The result is McMPI (Managed-code MPI), which is compliant with MPI-1, as far as it can be given that the MPI Standard has no language bindings for C#. It also performs reasonably well in micro-benchmarks for latency and bandwidth, in both shared-memory and distributed-memory configurations.
The “C# Issue”
Let’s first deal with the C# issue: what is it? Why use it? Is it any good?
C# and the .NET ecosystem were originally created by Microsoft. The specifications for C# and for the Common Language Infrastructure (CLI), which Microsoft implements as the Common Language Runtime (CLR), have been published and standardised by ECMA and ISO. This has enabled other compilers and runtimes to be created, such as Portable.NET and Mono, opening up access to platforms other than Windows.
C# is similar to Java in both concept and practice. Common features include:
- object-oriented approach
- extensive library of objects to assist the programmer with day-to-day programming tasks
- write and compile code once, run the binary executable anywhere
- compilation to portable intermediate binary form (byte-code for Java, CIL for C#)
- just-in-time compilation enables excellent runtime performance
I chose C# for my PhD because there was no MPI library for C# except MPI.NET, which delegates to an existing MPI library that is external to C# and must be separately installed, configured and maintained.
In general, my experience of programming in C# is that it is easy to get something working but harder to optimise performance. For many tasks, the .NET Framework offers multiple ways of doing the same thing, but the achievable performance is different for each of them. In my opinion, C# makes it easy to implement a good design, but how maintainable the code is depends on how good the design is in the first place, and good design is just plain hard.
In designing my MPI library I embraced the object-oriented style, using a layered and modular architecture and incorporating six different common software design patterns (described by the “Gang of Four” in “Design Patterns: Elements of Reusable Object-Oriented Software”). I exploited a variety of software engineering design techniques, from structured analysis to UML diagrams. Hopefully, this means the source code is easy to read, easy to maintain and easy to extend (for example, to add functionality from MPI-2 and MPI-3).
There are three layers: the interface layer exposes an MPI-like API to application programmers, the protocol layer deals with MPI point-to-point semantics (matching, progress and so on) and the third layer is a collection of modules, each of which implements an actual communication mechanism.
There are two communication mechanisms currently implemented: the first uses shared-variables to communicate between threads within an OS process and the second uses sockets to communicate between machines within a TCP/IP network.
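To make the three layers concrete, here is a minimal sketch of how such a layering could look. It is written in Java rather than C# (the two are close enough for this purpose), and every class and method name in it is hypothetical: none of them is taken from McMPI's source. The modules only record what they are asked to deliver, and the protocol layer only routes, leaving out matching and progress entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// All names here are hypothetical; the post does not show McMPI's real classes.
final class LayeredSketch {

    /** Bottom layer: each module implements one communication mechanism. */
    interface CommModule {
        boolean canReach(int destRank);
        void deliver(int destRank, byte[] payload);
    }

    /** Stand-in for a real mechanism: it only records what it was asked to deliver. */
    static final class RecordingModule implements CommModule {
        final String name;
        final Set<Integer> reachableRanks;
        final List<String> log = new ArrayList<>();

        RecordingModule(String name, Set<Integer> reachableRanks) {
            this.name = name;
            this.reachableRanks = reachableRanks;
        }

        @Override public boolean canReach(int destRank) {
            return reachableRanks.contains(destRank);
        }

        @Override public void deliver(int destRank, byte[] payload) {
            log.add(name + "->" + destRank + ":" + payload.length);
        }
    }

    /** Middle layer: would own matching and progress; here it only routes. */
    static final class ProtocolLayer {
        private final List<CommModule> modules;

        ProtocolLayer(List<CommModule> modules) { this.modules = modules; }

        void send(int destRank, byte[] payload) {
            for (CommModule module : modules) {
                if (module.canReach(destRank)) {
                    module.deliver(destRank, payload);
                    return;
                }
            }
            throw new IllegalArgumentException("no module reaches rank " + destRank);
        }
    }

    /** Top layer: the MPI-like API seen by application code. */
    static final class Communicator {
        private final ProtocolLayer protocol;

        Communicator(ProtocolLayer protocol) { this.protocol = protocol; }

        void send(byte[] buffer, int destRank) { protocol.send(destRank, buffer); }
    }

    public static void main(String[] args) {
        RecordingModule threads = new RecordingModule("shared-variable", Set.of(0, 1));
        RecordingModule sockets = new RecordingModule("tcp", Set.of(2, 3));
        Communicator world = new Communicator(
                new ProtocolLayer(List.<CommModule>of(threads, sockets)));
        world.send(new byte[8], 1);   // routed to the shared-variable module
        world.send(new byte[16], 3);  // routed to the TCP module
        System.out.println(threads.log + " " + sockets.log);
    }
}
```

The point of the split is that a new transport (InfiniBand, say) would only need a new `CommModule`, with the two layers above it left unchanged.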
The astute reader will now be thinking “shared-variables – don’t you mean shared-memory?” to which the answer is, of course, no. I have chosen to allow each thread in an application to call MPI_INIT_THREAD and become its own MPI process. Each OS process might contain several of these MPI processes, e.g. one per physical core. All the MPI processes that share the same OS process can communicate with each other using a shared variable. Anything that is declared “static” is visible to all threads in the OS process. McMPI uses this fact to pass messages between local MPI processes. This avoids one of the memory copies that is normally essential for shared-memory message-passing, which improves both the latency and the bandwidth of this type of communication.
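The idea can be illustrated with a small sketch, again in Java as a stand-in for C#. The mailbox-per-rank design, the `Envelope` class and the method names are my assumptions for illustration, not a description of McMPI's internals. Two threads act as ranks 0 and 1. The sender publishes a reference to its buffer through a static variable, and the receiver performs the only copy, straight from the sender's buffer into its own. A conventional shared-memory transport would instead copy into a shared segment and then out of it again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: threads as MPI processes, messages passed via static state.
final class SharedVariableSketch {

    /** A message in flight: a reference to the sender's buffer, not a copy of it. */
    static final class Envelope {
        final int source;
        final byte[] senderBuffer;
        final CountDownLatch copied = new CountDownLatch(1);

        Envelope(int source, byte[] senderBuffer) {
            this.source = source;
            this.senderBuffer = senderBuffer;
        }
    }

    // Static state is visible to every thread in the OS process: one mailbox per rank.
    private static volatile List<BlockingQueue<Envelope>> mailboxes;

    static void init(int ranks) {
        List<BlockingQueue<Envelope>> boxes = new ArrayList<>();
        for (int i = 0; i < ranks; i++) {
            boxes.add(new LinkedBlockingQueue<>());
        }
        mailboxes = boxes;
    }

    /** Blocking send: publish a reference, then wait until the receiver has copied. */
    static void send(int source, int dest, byte[] buffer) throws InterruptedException {
        Envelope envelope = new Envelope(source, buffer);
        mailboxes.get(dest).put(envelope);
        envelope.copied.await(); // the sender's buffer must stay untouched until here
    }

    /** Blocking receive: the single copy goes from sender buffer to receiver buffer. */
    static int recv(int self, byte[] buffer) throws InterruptedException {
        Envelope envelope = mailboxes.get(self).take();
        int count = Math.min(buffer.length, envelope.senderBuffer.length);
        System.arraycopy(envelope.senderBuffer, 0, buffer, 0, count);
        envelope.copied.countDown();
        return count;
    }

    /** Runs two threads as ranks 0 and 1 and returns what rank 1 received. */
    static byte[] demo() {
        init(2);
        byte[] received = new byte[4];
        Thread rank0 = new Thread(() -> {
            try {
                send(0, 1, new byte[] {1, 2, 3, 4});
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread rank1 = new Thread(() -> {
            try {
                recv(1, received);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        rank0.start();
        rank1.start();
        try {
            rank0.join();
            rank1.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo()));
    }
}
```

The latch makes this a synchronous send: the sender does not return until the receiver has finished copying, which is what makes it safe to share the buffer by reference rather than staging it somewhere in between.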
Typically, latency between MPI processes that share the same physical memory address space is
- 1200ns for MPICH2,
- 900ns for MS-MPI and
- 400ns for McMPI
(These figures were obtained using a dual Xeon E5420 2.5GHz server machine with 8 cores in total, DDR 667MHz memory and a 1333MHz front-side bus.)
So far, McMPI has only been tested on Gigabit Ethernet networks. In theory, the same code should run and perform well on 10G Ethernet with no changes, but supporting InfiniBand will require a new communication module. The performance of McMPI was compared with MPICH2 and MS-MPI on three different Windows systems, including a cluster maintained by the University of Oxford, UK. On the Oxford cluster, McMPI introduced a latency overhead of approximately 8% for small messages but had comparable or better bandwidth for large messages.