Cisco Blog > High Performance Computing Networking

Open MPI v1.6 released

May 14, 2012 at 9:29 am PST

Marking the end of over 2 years of active development, the Open MPI project has released a new “stable” series of releases starting with v1.6.

Specifically, Open MPI maintains two concurrent release series:

  • Odd number releases are “feature development” releases (e.g., 1.5.x).  They’re considered to be stable and test, but not yet necessarily “mature” (i.e., have lots of real-world usage to shake out bugs).  New features are added over the life of feature development releases.
  • Even number releases are “super stable” releases (e.g., 1.6.x).  After enough time, feature development releases transition into super stable releases — the new functionality has been vetted by enough real world usage to be considered stable enough for production sites.

Conceptually, it looks like this:

Read More »

Tags: ,

The Architecture of Open Source Applications, Volume II

May 8, 2012 at 10:48 am PST

AOSA 2 book cover

It’s finally out!  The Architecture of Open Source Applications, Volume II, is now available in dead tree form (PDFs will be available for sale soon, I’m told).

Additionally, all content from the book will also be freely available on aosabook.org next week sometime (!).

But know this: all royalties from the sales of this book go to Amnesty International.  So buy a copy; it’s for a good cause.

Both volumes 1 and 2 are excellent educational material for seeing how other well-known open source applications have been architected.  What better way to learn than to see how successful, widely-used open source software packages were designed?  Even better, after you read about each package, you can go look at the source code itself to further grok the issues.

Read More »

Tags: , , ,

The Common Communication Interface (CCI)

April 30, 2012 at 5:00 am PST

Today we feature part 2 of 2 in a deep-dive guest post from Scott Atchley, HPC Systems Engineer in the Technology Integration Group at Oak Ridge National Laboratory.

Given the goals described in part 1, we are developing the Common Communication Interface (CCI) as an open-source project for use by any application that needs a NAL. Note that CCI does not replace MPI since it does not provide matching or collectives. But it can be used by an MPI, probably as its NAL (likewise by a parallel file system). For applications that rely on the sockets API, it can provide improved performance when run on systems with high-performance interconnects and fall back on actual sockets when not.

How does the CCI design meet the previously-described criteria for a new NAL?

Simplicity: The API is compact. It uses a client/server semantic — like sockets — except that every endpoint can initiate and accept connections by default. An application can send a message (MSG) up to a certain size or initiate a remote memory access (RMA) of any size. The API is asynchronous and event-driven. The application can poll the endpoint for completed events or block on a OS-specific handle before retrieving events.

Portability: We have implemented multiple transports under the top-level CCI API for use everywhere, including: sockets, OFA Verbs for InfiniBand and RoCE (i.e., IB frames over lossless Ethernet), Cray GNI for Gemini, Cray Portals for SeaStar, and a native CCI transport over Myricom Ethernet. We are also working on a Linux Ethernet driver that can be used with any Ethernet NIC and does not require lossless Ethernet. It bypasses the IP stack and interacts directly with the Ethernet interface layer in the Linux kernel.

Performance: Our goal is to impose as little overhead on the native API as possible. By default, CCI is thread-safe — which therefore requires locking. We typically see an overhead of 400-600 ns when compiled as thread-safe. Without thread-safety, CCI adds 100-150 ns overhead. When using RMA, CCI’s throughput performs as well as the native API. When run over UDP sockets, we typically see 18-20 us half-round trip (HRT) latency. The above-mentioned Linux Ethernet kernel transport incurs 6-7 us HRT latency. RoCE gives about 3-5 us HRT latency, and vendor-specific Ethernet mechanisms (such as the Myricom native CCI transport) incurs about 2 us HRT latency.

Scalability: The CCI endpoint is the container of all network resources (e.g., send and receive buffers, completion queue, etc.); they do not increase with the number of connections. Although CCI uses connections, the typical connection state is limited to around 100 bytes. The application need only poll on the endpoint regardless of the number of connections to that endpoint.

Robustness: Because CCI uses connections, a single fault only disrupts that connection and not all other communication involving other peers. The application can optionally request that CCI send periodic keep-alive messages to determine if the peer process is still running. Reliable communications will time out if CCI cannot deliver them, and receivers can indicate when they do not have resources available to receive.

This barely scratches the surface of what CCI offers, but hopefully you get the general idea of what we are working to achieve. CCI is an open-source project under development by several organizations including Oak Ridge National Laboratory, Inria, the University of Tennessee at Knoxville, Myricom, Parallel Scientific, and others.

Please visit www.cci-forum.com for more information.

Tags:

Network APIs: the good, the bad, and the ugly

April 27, 2012 at 5:00 am PST

Today we feature a deep-dive guest post from Scott Atchley, HPC Systems Engineer in the Technology Integration Group at Oak Ridge National Laboratory.  This post is part 1 of 2.

In the world of high-performance computing, we jump through hoops to extract the last bit of performance from our machines. The vast majority of processes use the Message Passing Interface (MPI) to handle communication. Each MPI implementation abstracts the underlying network away, depending on the available interconnect(s). Ideally, the interconnect offers some form of operating system (OS) bypass and remote memory access in order to provide the lowest possible latency and highest possible throughput. If not, MPI typically falls back to TCP sockets. The MPI’s network abstraction layer (NAL) then optimizes the MPI communication pattern to match that of the interconnect’s API. For similar reasons, most distributed, parallel filesystems such as Lustre, PVFS2, and GPFS, also rely on a NAL to maximize performance.

Each of these software stacks maintains its own NAL and incorporates new interconnect APIs as needed. Any new applications that wish to run on high-performance clusters or supercomputers must choose between using MPI, TCP sockets, or the interconnect’s native API. Each has their strengths and weaknesses.

MPI: As outlined above, MPI is the default choice for most HPC applications. MPI provides a rich interface of point-to-point communication, collective operations, and remote memory access. Every MPI implementation tunes its NAL to achieve good performance. Unfortunately, MPI is a specialized API — most programmers have not used it. The matching interface can be quite powerful, but also confusing to new programmers. The connection model is communication groups (i.e., communicators) which implementations have not made robust (e.g., any single communication error between a pair of hosts can bring down the whole MPI job).

TCP sockets: The TCP sockets interface, on the other hand, has been around since the 80s and runs everywhere. It is well known and most programmers have experience with it. SOCK_STREAM (TCP) provides a reliable, in-order stream semantic that mirrors the UNIX “everything is a file” semantic. It also provides a simple to understand connection model with client and server. By default, the send and receive calls are blocking, but sockets can be used in a non-blocking mode. The downside is that sockets was designed in the 80s when networks were a lot slower and it was tailored for IP networking (especially TCP and UDP). All communication goes through the kernel which increases latency and all communication requires multiple copies which reduces throughput. Buffering for SOCK_STREAM is per connection as is checking for incoming messages.

Native: Lastly, a developer can port to the interconnect’s native API. This generally will lead to the best performance, but will require more work in the future if he ever wants to run on any other modern interconnect. These APIs are typically more complicated to use than the sockets API since they are designed with a particular set of hardware specifications in mind. Even fewer programmers have experience using native APIs than those using MPI.

We believe that there is a need for a common NAL that can serve both HPC and more general applications. A new NAL would need to be simple to use, portable to any interconnect (and provide optimal performance for that interconnect), scale to the largest HPC systems, and provide robustness in the presence of faults.

Want to know what this common NAL is?  Stay tuned for part 2!

Tags:

Hiring Linux Kernel hackers

April 22, 2012 at 7:02 pm PST

Just in case you didn’t see my tweet: my group is hiring!

We need some Linux kernel hackers for some high-performance networking stuff.  This includes MPI and other verticals.

I believe that the official job description is still working its way through channels before it appears on the official external Cisco job-posting site, but the gist of it is Linux kernel work for Cisco x86 servers (blades and rack-mount) and NICs in high performance networking scenarios.

Are you interested?  If so, send me an email with your resume — I’m jsquyres at cisco dot com.

Tags: , ,

Polling vs. blocking message passing progress

April 20, 2012 at 6:17 am PST

Here’s a not-uncommon question that we get on the Open MPI mailing list:

Why do MPI processes consume 100% of the CPU when they’re just waiting for incoming messages?

The answer is rather straightforward: because each MPI process polls aggressively for incoming messages (as opposed to blocking and letting the OS wake it up when a new message arrives).  Most MPI implementations do this by default, actually.

The reasons why they do this is a little more complicated, but loosely speaking, one reason is that polling helps get the lowest latency possible for short messages.

Read More »

Tags: ,

EuroMPI 2012: Call for Papers

March 30, 2012 at 5:00 am PST

It’s that time of year again — time to submit EuroMPI 2012 papers!

The conference will be in Vienna, Austria on 23-26 September, 2012.  Please come join us!  It’s an excellent opportunity to hear how real-world users are actually using MPI, find out about bleeding-edge MPI-based research, and hear what the MPI Forum is up to.

Here’s the official EuroMPI 2012 CFP:

BACKGROUND AND TOPICS

EuroMPI is the preeminent meeting for users, developers and researchers to interact and discuss new developments and applications of message-passing parallel computing, in particular in and related to the Message Passing Interface (MPI). The annual meeting has a long, rich tradition, and the 19th European MPI Users’ Group Meeting will again be a lively forum for discussion of everything related to usage and implementation of MPI and other parallel programming interfaces. Traditionally, the meeting has focused on the efficient implementation of aspects of MPI, typically on high-performance computing platforms, benchmarking and tools for MPI, short-comings and extensions of MPI, parallel I/O and fault tolerance, as well as parallel applications using MPI. The meeting is open towards other topics, in particular application experience and alternative interfaces for high-performance heterogeneous, hybrid, distributed memory systems.

Read More »

Tags: ,

The last new things in MPI-3

March 28, 2012 at 5:15 am PST

I know we’ve been talking about new MPI-3 things for forever.  But this is the last list of new things.

I promise.

Really.

I can say this with certainly because the Forum’s March meeting was the deadline for all new proposals to make it into the MPI-3 standard.  Anything else will have to be in MPI-<next> (where <next> may be 3.1, or 4, or …11.  Shrug).

Because of the deadline, we had a blizzard of proposals finally get into shape to be presented to the entire Forum.  Let’s talk about some of the more interesting ones…

Read More »

Tags: ,

New Fortran MPI bindings are “in”! And other MPI-3 stuff…

March 26, 2012 at 8:33 am PST

As of March 7, 2012, the new “use mpi_f08″ bindings have been officially voted in to the MPI-3 standard.

Woo hoo!!

A few other minor corrections made it into MPI-3 at the same meeting, but they’re boring / not worth discussing.

What is worth discussing, however, are some proposals that passed their first (of two) formal votes to make it into MPI-3 at that same meeting:

Let’s give a few details on each of these…

Read More »

Tags: ,

Open MPI v1.5 processor affinity options

March 9, 2012 at 5:00 am PST

Today we feature a deep-dive guest post from Ralph Castain, Senior Architecture in the Advanced R&D group at Greenplum, an EMC company.

Jeff is lazy this week, so he asked that I provide some notes on the process binding options available in the Open MPI (OMPI) v1.5 release series.

First, though, a caveat. The binding options in the v1.5 series are pretty much the same as in the prior v1.4 series. However, future releases (beginning with the v1.7 series) will have significantly different options providing a broader array of controls. I won’t address those here, but will do so in a later post.

Read More »

Tags: , , , ,

The New MPI-3 Remote Memory Access (One Sided) Interface

February 29, 2012 at 11:34 pm PST

Today we feature a deep-dive guest post from Torsten Hoefler, the Performance Modeling and Simulation lead of the Blue Waters project at NCSA, and Pavan Balaji, computer scientist in the Mathematics and Computer Science (MCS) Division at the Argonne National Laboratory (ANL), and as a fellow of the Computation Institute at the University of Chicago.

Despite MPI’s vast success in bringing portable message passing to scientists on a wide variety of platforms, MPI has been labeled as a communication model that only supports “two-sided” and “global” communication. The MPI-1 standard, which was released in 1994, provided functionality for performing two-sided and group or collective communication. The MPI-2 standard, released in 1997, added support for one-sided communication or remote memory access (RMA) capabilities, among other things. However, users have been slow to adopt such capabilities because of a number of reasons, the primary ones being: (1) the model was too strict for several application behavior patterns, and (2) there were several missing features in the MPI-2 RMA standard. Bonachea and Duell put together a more-or-less comprehensive list of areas where MPI-2 RMA falls behind. A number of alternate programming models, including Global Arrays, UPC and CAF have gained popularity filling this gap.

That’s where MPI-3 comes in.

Read More »

Tags: , , ,

MPI-3 Fortran bindings prototype now available

February 27, 2012 at 5:00 am PST

At long last, Craig Rasmussen (from Los Alamos National Laboratory) and I are ready to publish our prototype implementation of the MPI-3 Fortran bindings.  The new MPI-3 Fortran bindings are coming up for their second vote at the upcoming MPI Forum meeting in Chicago; this public release satisfies the “must implement all new proposed behavior” requirement for proposals to get in MPI-3.

The good stuff:

Please download and give this implementation a whirl! We’d love to hear your feedback.

So let’s dive a little deeper into the details…

Read More »

Tags: , , ,

MPI-3 small new function: MPI_GET_LIBRARY_VERSION

February 25, 2012 at 8:16 am PST

There’s been some discussion of Big New Features in MPI-3 recently (and more are coming!).  But today, I want to talk about a little new feature.  Something small, but still useful.

I’m talking about a new function named MPI_GET_LIBRARY_VERSION.

Its only purpose in life is to shed some light on to the implementation that you’re using.  It returns a simple string, and is intended to be human-readable (vs. being machine-parseable). The format of the string is not mandated, but it is assumed that MPI implementations will include reasonably detailed information about their specific version.

Read More »

Tags: ,

Top 10 reasons why buffered sends are evil

February 13, 2012 at 5:00 am PST

I made an offhand remark in my last entry about how MPI buffered sends are evil.  In a comment on that entry, @brockpalen asked me why.

I gave a brief explanation in a comment reply, but the subject is enough to warrant its own blog entry.

So here it is — my top 10 reasons why MPI_BSEND (and its two variants) are evil:

  1. Buffered sends generally force an extra copy of the outgoing message (i.e., a copy from the application’s buffer to internal MPI storage).  Note that I said “generally” — an MPI implementation doesn’t have to copy.  But the MPI standard says “Thus, if a send is executed and no matching receive is posted, then MPI must buffer the outgoing message…”  Ouch.  Most implementations just always copy the message and then start processing the send. Read More »

Tags: ,

How many ways to send?

February 11, 2012 at 4:00 am PST

Pop quiz, hotshot: how many types of sends are there in MPI?

Most people will immediately think of MPI_SEND.  A few of you will remember the non-blocking variant, MPI_ISEND (where I = “immediate”).

But what about the rest — can you name them?

Here’s a hint: if I run “ls -1 *send*c | wc -l” in Open MPI’s MPI API source code directory, the result is 14.  MPI_SEND and MPI_ISEND are two of those 14.  Can you name the other 12?

Read More »

Tags: ,