Why MPI is Good for You (part 2)
A while ago, I posted “Why MPI is Good For You,” describing a one-byte change in Open MPI’s code base that fixed an incredibly subtle IPv6-based bug.
The point of that blog entry was that MPI represents an excellent layered design: it lets application developers focus on their applications while shielding them from all the complex wildebeests that roam under the covers in the implementation.
MPI implementors like me don’t know — and don’t really want to know — anything about complex numerical analysis, protein folding, seismic wave propagation, or any one of a hundred other HPC application areas. And I’m assuming that MPI application developers don’t know — and don’t want to know — about the tricky underpinnings of how modern MPI implementations work.
Today, I present another motivating example for this thesis.
Let me give a little background…
Recently, my MPI development cluster at Cisco suffered a catastrophic failure. The Linux “Out Of Memory” (OOM) killer fired during my normal nightly MPI regression testing, and randomly killed the MySQL database daemon in order to get some memory back. This left the MySQL tables on disk in an unrecoverable state (MySQL is used by the Bright Computing cluster manager, <shameless plug>which is the awesome tool that I use to administer my cluster</shameless plug>).
Even though Bright keeps a rolling 1-week window of SQL backups, the cluster actually managed to keep running for about two weeks before I discovered the issue, so by then every backup in that window had been taken after the failure. I eventually ended up reverting to an ancient backup that required a lot of time and effort to bring back to my current state.
What on earth does this have to do with MPI?
I couldn’t figure out why the OOM killer had fired. I’ve been running my development cluster for years and never had a problem like this. True, it was random bad luck that OOM killed a sensitive piece of my cluster’s infrastructure, and even more bad luck that I didn’t notice it until too late.
But what caused my cluster head node to run out of memory in the first place?
It turns out that there was a subtle bug in the development version of Open MPI: there was a fairness/priority issue in funneling the stdout/stderr from MPI processes to that of mpirun. Specifically, the issue was in the code that allows you to “mpirun -np 4 a.out > output.txt” (i.e., send the output from all your MPI processes to the output of mpirun so that you can redirect it to a file).
The way this works is that each MPI process (effectively) bundles up its stdout/stderr and sends it across the network to mpirun. mpirun receives these messages and displays them on its own stdout/stderr. Simple, right?
Actually, it’s quite complex — the MPI processes themselves don’t bundle up their stdout/stderr; it’s all intercepted by an MPI helper daemon, tagged, and then sent to mpirun. If we didn’t do it this way, it would be really tricky (and unportable) to intercept/re-route an MPI process’ output before it invoked MPI_INIT and after it invoked MPI_FINALIZE. Yowza!
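To make the "intercepted, tagged, and then sent" idea concrete, here is a minimal Python sketch. It is not Open MPI's code: the `tag_output` function, the `[rank,stream]` tag format, and the use of a local pipe in place of a network connection are all assumptions made purely for illustration. The point is only that the child process writes to its ordinary stdout and knows nothing about the tagging.

```python
from typing import BinaryIO, Iterator


def tag_output(rank: int, stream_name: str, stream: BinaryIO) -> Iterator[bytes]:
    """Yield each line read from `stream`, prefixed with a (rank, stream) tag.

    Hypothetical helper: the tag format is invented for this example.
    """
    prefix = f"[{rank},{stream_name}] ".encode()
    for line in stream:
        # Make sure every forwarded record ends in a newline, even a final
        # partial line written just before the process exited.
        if not line.endswith(b"\n"):
            line += b"\n"
        yield prefix + line


if __name__ == "__main__":
    import subprocess
    import sys

    # Stand-in for an MPI process: it only knows about its own stdout.
    child = subprocess.Popen(
        [sys.executable, "-c", "print('hello'); print('world')"],
        stdout=subprocess.PIPE,
    )
    for record in tag_output(0, "stdout", child.stdout):
        # A real helper daemon would send `record` over the network to the
        # launcher; this sketch just writes it locally.
        sys.stdout.buffer.write(record)
    child.wait()
```

Because the interception happens outside the child, it works no matter when the child writes, which is the property the helper-daemon design is after.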
The bug was that we had inadvertently set mpirun’s priority of receiving network stdout/stderr messages to be the same as writing to mpirun’s stdout/stderr. As a result, if MPI processes continually hammered mpirun with new stdout/stderr, mpirun would not get a chance to actually display the output. mpirun would therefore buffer the incoming messages for later display.
This meant that if an MPI application continually displayed a LOT of output, the mpirun process could blow up to consume enormous amounts of memory, and eventually trigger the Linux OOM killer.
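The failure mode is easy to reproduce in a toy model. The sketch below is a deliberately simplified simulation, not the actual Open MPI event loop: it assumes a launcher that can do a fixed number of operations per "tick", and it models the buggy behavior as "writes only happen when there is nothing left to receive". The function name `peak_buffered` and all of its parameters are hypothetical.

```python
from collections import deque


def peak_buffered(ticks: int, arrivals_per_tick: int, budget: int,
                  writes_first: bool) -> int:
    """Return the largest number of messages held in memory at any one time.

    Each tick, `arrivals_per_tick` messages become available on the network
    and the launcher may perform at most `budget` operations (one receive or
    one write each). Messages not yet received wait in `network`, which
    stands in for senders being held back by backpressure.
    """
    network = 0         # messages sent but not yet received
    buffered = deque()  # messages received but not yet written out
    peak = 0
    for _ in range(ticks):
        network += arrivals_per_tick
        for _ in range(budget):
            if writes_first and buffered:
                buffered.popleft()       # write one buffered message out
            elif network:
                network -= 1
                buffered.append(b"msg")  # receive one message into memory
                peak = max(peak, len(buffered))
            elif buffered:
                buffered.popleft()       # write only when nothing to receive
    return peak


if __name__ == "__main__":
    print("writes starved:", peak_buffered(100, 10, 10, writes_first=False))
    print("writes first:  ", peak_buffered(100, 10, 10, writes_first=True))
```

With senders that always have more output ready, the starved-writes case ends up holding all 1000 messages in memory, while the writes-first case never holds more than one. The unreceived messages in the second case are the senders' problem, which is exactly where you want the pressure to land.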
We have since fixed the issue in Open MPI (the bug wasn’t in any released version). It was a simple, 2-line code fix. But the knowledge needed to make that simple, 2-line fix was hard-won, and the journey to get it included the catastrophic failure on my cluster.
This story underscores my original premise: implementing complex network IPC middleware is just plain hard. MPI is a Very Good Thing for shielding all of this junk from you.
You want your app developers focusing on the methods and the science of their applications. You do not want your app developers worrying about subtleties of network hardware, distributed run-time job control, and other systems-based issues. That’s what MPI is for.
SIDENOTE: if you’re an HPC cluster sysadmin, you might want to disable the Linux OOM killer on your cluster, or perhaps ensure that it doesn’t kill critical infrastructure daemons (yes, even on your compute nodes).
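For the "don't kill critical infrastructure daemons" option, one approach is the per-process `oom_score_adj` knob that Linux exposes under `/proc`. The sketch below is a minimal example, assuming you already know the daemon's PID and are running with enough privilege to lower the value (normally root). The `protect_from_oom` helper and its `proc_root` parameter are hypothetical names, and `proc_root` exists only so the function can be tried against a scratch directory.

```python
from pathlib import Path

# Per proc(5), -1000 is the minimum oom_score_adj value and exempts the
# process from being selected by the OOM killer.
OOM_SCORE_ADJ_MIN = -1000


def protect_from_oom(pid: int, proc_root: str = "/proc") -> None:
    """Write the minimum oom_score_adj for `pid` so the OOM killer skips it."""
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    path.write_text(f"{OOM_SCORE_ADJ_MIN}\n")


if __name__ == "__main__":
    import sys

    # Usage (as root): python protect.py <pid-of-database-daemon>
    protect_from_oom(int(sys.argv[1]))
```

If your daemons are managed by systemd, the `OOMScoreAdjust=` setting in the unit file is likely a cleaner way to get the same effect, since it is applied every time the service starts.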