Cisco Blogs
Share
tweet

More crazy MPI ideas: Fault detection and recovery

- November 14, 2015 - 4 Comments

I had a good conversation with an ISV yesterday who makes a popular MPI-based simulation application.  One of the things I like to do in these kinds of conversations is ask the ISV engineers two questions:

  1. What new features do you want from the MPI implementations that you use?
  2. What new features or changes do you want from the MPI API itself?

You know — talk to the actual users of MPI and see what they want, both from an implementation perspective and from a standards perspective.  Shocking!

One of the big items that came out of our discussion was a desire for better fault tolerance and/or resilience in MPI applications.

To be fair: fault tolerance is a big topic, and full of both difficult and contentious issues.  But the big point that they wanted was actually surprisingly simple in concept:

When an MPI process fails (for whatever reason), guarantee that all other MPI processes that are stuck in blocking MPI API calls involving the dead process return with some kind of reasonable error code.

They didn’t care too much about continuing MPI after that — they just wanted to know that an error occurred so that they could save some state to stable storage, perhaps print a helpful error message for the end user, or otherwise clean up after the run.  This is a considerably smaller goal than other fault tolerance efforts (e.g., to be able to continue an MPI job after a failure).

So let’s talk about fault detection.

It’s usually easy enough for an MPI implementation to figure out when the remote peer in a blocking send/receive operation has failed.  Especially when the MPI is using some form of reliable network communication, because the networking layer will tell the MPI implementation when it can no longer reach a peer.

…but not always.  Consider:

  • Perhaps the network has totally failed between the two peers, such that not even negative acknowledgements (NAKs) can flow between them (i.e., one process can’t tell the other that it has failed).  Put differently: in the steady state of an MPI job, silence between peers rarely means process failure.
  • Perhaps the MPI implementation is using unreliable data transports (e.g., UDP or other unreliable datagrams).  Losses are then both common and expected — meaning that NAKs can get corrupted or lost.
  • The remote peer may not be in the MPI library, or otherwise may not be actively sending traffic to the local peer (e.g., the remote peer may not have posted the matching send or receive yet).  Again: silence may not mean failure.

Many of these kinds of issues can be resolved in an “out of band” control network — e.g., the run-time system can monitor the individual processes in an MPI job, and can signal its peers in the event of an unexpected death.   …but there are scalability issues with this kind of approach, too.  Let’s not forget prior blog entries where I have discussed scalability challenges in MPI/HPC runtime systems.

The situation gets even more complex if there are non-blocking communications ongoing involving many peers, some of whom may have failed.

And it gets further complexified (I just made up that word; deal with it) when your processes fail partway through collective, dynamic process, or one-sided operations.  Hardware support (potentially from the network) may be required to handle such failure detection efficiently.  Or, put differently: we do not want to penalize the performance of the far-more-common case of success by adding a lot of invasive and potentially performance-costing infrastructure to check for failure during MPI operations.

I should note that a flavor of this kind of failure detection is currently included in the MPI Forum Fault Tolerance Working Group’s (FTWG) proposal for MPI-4 (in addition to other FT-related provisions).  This is quite promising.

But there’s still much discussion that must occur; other users want more than “simple” failure detection, for example — they want some kind of recovery (different models of which are under hot debate).

What kinds of failure detection and/or recovery would you find useful in your application?

Tags:
Leave a comment

We'd love to hear from you! To earn points and badges for participating in the conversation, join Cisco Social Rewards. Your comment(s) will appear instantly on the live site. Spam, promotional and derogatory comments will be removed.

In an effort to keep conversations fresh, Cisco Blogs closes comments after 60 days. Please visit the Cisco Blogs hub page for the latest content.

4 Comments

  1. I would love to be able to cheaply ask from every process in e.g. a communicator, if anything bad happened. I actually don't need anything more fine grained than that, but asking should be "relatively" fast, and scale to very large number of processes. Why? My apps use non-blocking communication exclusively. In a nutshell I have an "mpi-request" scheduler on its own thread, that stores a request, an identifier, and a call back to execute when the request is finished (which can schedule further non-blocking operations). At some synchronization points my other threads just wait for all request with a given identifier to finish before moving on. If a process stops responding it might take a while for everything to collapse (which wastes computing time, the core/hours just keep ticking). I would love to be able to ask at those synchronization points e.g. every 10 seconds, if everything is all right or something bad happen. A more fine grained solution would be to store a time stamp in the scheduler and, if after a certain amount of time an mpi request hasn't finished, ask its process if everything is all right. In my applications if an MPI process dies (or anything bad happens) I have data loss. The only thing I can do is fail as fast as possible to save computing time and display/log an error message. Wether the error message is "something bad happened with MPI" or "process XXX died" its not really that relevant.

      Agreed; a cheap cleanup is frequently the best way to go. Many MPI implementations do this already, but without intervention from the application. E.g., if the MPI implementation detects a process failure before it calls MPI_FINALIZE, it decides that the entire job should be killed, and does so. This works well when an individual MPI process has unexpectedly dies, but gets complicated if an entire node unexpectedly drops offline (e.g., someone kicks out a power cord). In that case, some kind of heartbeat mechanism is needed -- but care has to be taken that such a heartbeat mechanism scales and does not steal valuable network resources from the MPI application itself.

  2. While we are at crazy ideas: Has anyone ever thought about using an aproach similar to RAID for this kind of collective operation fault tolerance? I mean, specifying some kind of data redundancy and data striping, such that the collecitve operation can return as soon as there are just enough responses from other (potentially not all) MPI processes involved in the operation?

      There has been quite a bit of research over the years about redundant computation; such work also has touches upon distributed computing (vs. parallel computing -- both are similar, but have different characteristics). The idea of computing some unit of work N times and then deciding which is the "right" answer, or perhaps only *conditionally* re-computing work when it fails to be performed within a timeout (e.g., a manager/worker-style model) -- these are both common fault tolerant schemes. At least one problem with these approaches, however, is that they tend to be expensive: e.g., you have to buy additional hardware to perform the same amount of work. There can also be scalability issues (e.g., to scale manager/worker models, you really need multiple managers, and then the managers need to coordinate with each other, and depending on the coherence algorithms used, they may have difficulty scaling, too). The problem is that there is no silver bullet that works with all problems. A distributed manager/worker model with a trivial coherence protocol can potentially scale to arbitrary numbers of workers, but data movement can then become the scalability bottleneck. Additionally, not all computational problems can fit into a manager/worker paradigm.

Share
tweet