I had a good conversation with an ISV yesterday who makes a popular MPI-based simulation application. One of the things I like to do in these kinds of conversations is ask the ISV engineers two questions:
- What new features do you want from the MPI implementations that you use?
- What new features or changes do you want from the MPI API itself?
You know — talk to the actual users of MPI and see what they want, both from an implementation perspective and from a standards perspective. Shocking!
One of the big items that came out of our discussion was a desire for better fault tolerance and/or resilience in MPI applications.
To be fair: fault tolerance is a big topic, and full of both difficult and contentious issues. But the big point that they wanted was actually surprisingly simple in concept:
When an MPI process fails (for whatever reason), guarantee that all other MPI processes that are stuck in blocking MPI API calls involving the dead process return with some kind of reasonable error code.
They didn’t care too much about continuing MPI after that — they just wanted to know that an error occurred so that they could save some state to stable storage, perhaps print a helpful error message for the end user, or otherwise clean up after the run. This is a considerably smaller goal than other fault tolerance efforts (e.g., to be able to continue an MPI job after a failure).
So let’s talk about fault detection.
It’s usually easy enough for an MPI implementation to figure out when the remote peer in a blocking send/receive operation has failed. Especially when the MPI is using some form of reliable network communication, because the networking layer will tell the MPI implementation when it can no longer reach a peer.
…but not always. Consider:
- Perhaps the network has totally failed between the two peers, such that not even negative acknowledgements (NAKs) can flow between them (i.e., one process can’t tell the other that it has failed). Put differently: in the steady state of an MPI job, silence between peers rarely means process failure.
- Perhaps the MPI implementation is using unreliable data transports (e.g., UDP or other unreliable datagrams). Losses are then both common and expected — meaning that NAKs can get corrupted or lost.
- The remote peer may not be in the MPI library, or otherwise may not be actively sending traffic to the local peer (e.g., the remote peer may not have posted the matching send or receive yet). Again: silence may not mean failure.
Many of these kinds of issues can be resolved in an “out of band” control network — e.g., the run-time system can monitor the individual processes in an MPI job, and can signal its peers in the event of an unexpected death. …but there are scalability issues with this kind of approach, too. Let’s not forget prior blog entries where I have discussed scalability challenges in MPI/HPC runtime systems.
The situation gets even more complex if there are non-blocking communications ongoing involving many peers, some of whom may have failed.
And it gets further complexified (I just made up that word; deal with it) when your processes fail partway through collective, dynamic process, or one-sided operations. Hardware support (potentially from the network) may be required to handle such failure detection efficiently. Or, put differently: we do not want to penalize the performance of the far-more-common case of success by adding a lot of invasive and potentially performance-costing infrastructure to check for failure during MPI operations.
I should note that a flavor of this kind of failure detection is currently included in the MPI Forum Fault Tolerance Working Group’s (FTWG) proposal for MPI-4 (in addition to other FT-related provisions). This is quite promising.
But there’s still much discussion that must occur; other users want more than “simple” failure detection, for example — they want some kind of recovery (different models of which are under hot debate).
What kinds of failure detection and/or recovery would you find useful in your application?