Most people immediately think of short message latency, or perhaps large message bandwidth when thinking about MPI.
But have you ever thought about what your MPI implementation has to do before your application even calls MPI_INIT?
Hint: it’s pretty crazy complex, from an engineering perspective.
Think of it this way: operating systems natively provide a runtime system for individual processes. You can launch, monitor, and terminate a process with that OS’s native tools. But now think about extending all of those operating system services to gang-support N processes exactly the same way one process is managed. And don’t forget that those N processes will be spread across M servers / operating system instances.
In short, that’s the job of a parallel runtime system: coordinate the actions of, and services provided to, N individual processes spread across M operating system instances.
It’s hugely complex.
Parallel runtime environments have been a topic of much research over the past 20 years. There have been tremendous advancements made, largely driven by the needs of the MPI and greater HPC communities.
When I think of MPI runtime environments, I typically think of a spectrum:
- On one end of the spectrum, there are environments that provide almost no help to an MPI implementation — they provide basic “launch this process on that server” kind of functionality. ssh is a good example in this category (see the sketch just after this list).
- On the other end of the spectrum are environments that were created specifically to launch, initialize, and manage large-scale parallel applications. These environments do everything behind the scenes for the MPI implementation; the bootstrapping functionality in MPI_INIT can be quite simple.
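To make the “almost no help” end of the spectrum concrete, here’s a minimal sketch in C of what an ssh-style launcher boils down to: fork one ssh per server, exec the application remotely, and wait for everything to finish. The hostnames and application path are made up for illustration, and real launchers also have to wire up stdio, forward environment variables, and react to failures; none of that is shown here.

```c
/* Bare-bones ssh-style launcher sketch: one ssh child per remote server.
   Hostnames and the application path are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *hosts[] = { "node01", "node02", "node03", "node04" };
    const int num_hosts = 4;
    pid_t pids[4];

    /* Launch: fork + exec one ssh per remote server */
    for (int i = 0; i < num_hosts; ++i) {
        pids[i] = fork();
        if (pids[i] == 0) {
            execlp("ssh", "ssh", hosts[i], "/path/to/mpi_app", (char *) NULL);
            perror("execlp");   /* only reached if exec fails */
            exit(1);
        }
    }

    /* Monitor: wait for every remote process to terminate */
    for (int i = 0; i < num_hosts; ++i) {
        int status;
        waitpid(pids[i], &status, 0);
        printf("process on %s exited with status %d\n",
               hosts[i], WEXITSTATUS(status));
    }
    return 0;
}
```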
Put differently, there are many services that an MPI job requires at runtime. Some entity has to provide these services — either a native runtime system, or the MPI implementation itself (or a mixture of both).
Here’s a few examples of such services:
- Identification of the servers / processors where the MPI processes will run
- Launch of the individual MPI processes (which are usually individual operating system processes, but may be individual threads, instead)
- Allocation and distribution of the network addresses used by each of the individual MPI processes (sketched below)
- Standard input, output, and error gathering and redirection
- Distributed signal handling (e.g., if a user hits control-C, propagate it to all the individual MPI processes; also sketched below)
- Monitor each of the individual MPI processes and check for both successful and unsuccessful termination (and then decide what to do in each case)
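As a concrete example of one item from that list, here’s a sketch of the address-distribution service written against the PMI-1-style key/value interface (pmi.h) that some MPI implementations use under the covers of MPI_INIT. The exact calls and buffer sizes here are my best recollection of PMI-1 and should be treated as illustrative, not as any particular implementation’s code; error handling is omitted.

```c
#include <stdio.h>
#include <pmi.h>   /* PMI-1 header, provided by the process manager */

#define KVS_NAME_LEN 256
#define VAL_LEN      256

/* Hypothetical helper: publish my address, then look up everyone else's */
int exchange_addresses(const char *my_net_address)
{
    int spawned, rank, size;
    char kvsname[KVS_NAME_LEN], key[64], value[VAL_LEN];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvsname, KVS_NAME_LEN);

    /* Publish my network address under a per-rank key */
    snprintf(key, sizeof(key), "addr-%d", rank);
    PMI_KVS_Put(kvsname, key, my_net_address);
    PMI_KVS_Commit(kvsname);

    /* Everyone must publish before anyone reads */
    PMI_Barrier();

    /* Look up every peer's address and hand it to the network layer */
    for (int peer = 0; peer < size; ++peer) {
        snprintf(key, sizeof(key), "addr-%d", peer);
        PMI_KVS_Get(kvsname, key, value, VAL_LEN);
        /* ... give "value" to the interconnect code for this peer ... */
    }
    return 0;
}
```

The pattern that matters is put / commit / barrier / get: every process publishes its own address, and only after a global synchronization can anyone safely look up a peer’s address. That barrier is itself a runtime service that has to scale.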
That is a lot of work to do.
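To give a flavor of one more item from the list, here’s roughly what the control-C propagation looks like from the launcher’s side: catch the signal locally and forward it to every child you started. The child-tracking globals are hypothetical; a real runtime forwards the signal through its daemons to each remote MPI process.

```c
#include <signal.h>
#include <string.h>
#include <sys/types.h>

#define MAX_CHILDREN 1024

/* Hypothetical bookkeeping: pids of the child processes we launched */
static pid_t child_pids[MAX_CHILDREN];
static int   num_children = 0;

/* Forward whatever signal we caught to every child (kill() is
   async-signal-safe, so calling it from a handler is legal). */
static void forward_signal(int signum)
{
    for (int i = 0; i < num_children; ++i)
        kill(child_pids[i], signum);
}

/* Call once after launching the children. */
void install_signal_forwarding(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = forward_signal;
    sigaction(SIGINT, &sa, NULL);    /* control-C */
    sigaction(SIGTERM, &sa, NULL);   /* polite termination request */
}
```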
Oh, and by the way, these tasks need to be done scalably and efficiently (this is where the bulk of the last few decades of research has been spent). There are many practical, engineering issues that are just really hard to solve at extreme scale.
For example, it’d be easy to have a central controller and have each MPI process report in (this was a common model for MPI implementations in the 1990s). But you can easily visualize how that doesn’t scale beyond a few hundred MPI processes — you’ll start to run out of network resources, you’ll cause lots of network congestion (including contending with the application’s own MPI traffic), etc.
So use tree-based network communications, and distribute the service decisions among multiple places in the computational fabric. Easy, right?
Errr… no.
Parallel runtime researchers are still investigating the practical complexities of just how to do these kinds of things. What service decisions can be distributed? How do they efficiently coordinate without sucking up huge amounts of network bandwidth?
And so on.
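To make the tree idea a bit more concrete, here’s a toy sketch of how a runtime daemon might compute its parent and children in a k-ary fan-out tree rooted at rank 0. Commands flow down such a tree; status, stdio, and heartbeats flow back up. The fan-out of 4 and the little printing loop are arbitrary choices for illustration.

```c
#include <stdio.h>

#define K 4   /* fan-out: each node has up to K children */

/* Parent of `rank` in a K-ary tree rooted at 0 (-1 for the root). */
static int tree_parent(int rank)
{
    return (rank == 0) ? -1 : (rank - 1) / K;
}

/* Fill `children` with the ranks of my children; return how many. */
static int tree_children(int rank, int nprocs, int children[K])
{
    int count = 0;
    for (int i = 1; i <= K; ++i) {
        int child = rank * K + i;
        if (child < nprocs)
            children[count++] = child;
    }
    return count;
}

int main(void)
{
    int nprocs = 20000, children[K];
    for (int rank = 0; rank < 5; ++rank) {   /* just show a few ranks */
        int n = tree_children(rank, nprocs, children);
        printf("rank %d: parent %d, %d children\n",
               rank, tree_parent(rank), n);
    }
    return 0;
}
```

With a fan-out of 4, even a 20,000-process job is only about eight levels deep, so no single daemon talks directly to more than a handful of peers. The hard research questions aren’t in this arithmetic; they’re in deciding which services and decisions can safely live at which level of the tree.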
Fun fact: a sizable amount of the research into how to get to exascale involves figuring out how to scale the runtime system.
Just look at what is needed today: users are regularly running MPI jobs with (tens of) thousands of MPI processes. Who wants an MPI runtime that takes 30 minutes to launch a 20,000-process job? A user will (rightfully) view that as 29 minutes of wasted CPU time on 20,000 cores (roughly 10,000 core-hours burned before the application even gets going).
Indeed, each of the items in the list above is worthy of its own dissertation; they’re all individually complex.
So just think about that the next time you run your MPI application: there’s a whole behind-the-scenes support infrastructure in place just to get your application to the point where it can invoke MPI_INIT.
A big additional challenge in this space is making the whole assemblage fault tolerant, so that it can potentially keep running if some of those M servers (each running a bunch of the N processes) cease functioning.
I strongly disagree with the statement that a 30-minute job launch time is unacceptable. This ignores the benefits of doing detailed diagnostics during job launch that dramatically reduce the unexpected job failure rate and improve performance reproducibility. Launching a job on 100K processes of Blue Gene/P took around 15 minutes but it was absolutely worth it compared to similarly sized machines that booted much faster but which had much lower reliability and reproducibility. The time the user loses for booting jobs is more than made up for by the reduction in faults by virtue of detecting hardware issues before MPI_INIT is called.
I think you’re comparing apples and oranges.
Sure, having the option of a slower, fully instrumented launch is a good thing. But most of the time, there isn’t a failure during launch, so why pay the penalty?
I think having a reasonable launch speed with the ability to report “common” errors is Good Enough, combined with a slower launch mode that can report detailed errors for those who want/need more information.
Put it this way — if you give the user the following choice, “You can have a fast launch at scale that has less-detailed errors vs. a slower launch with more detailed errors”, in the common case (i.e., day-to-day runs), they’ll choose the faster launch every time.