Making MPI survive process failures
Arguably, one of the biggest weaknesses of MPI is its lack of resilience — most (if not all) MPI implementations will kill an entire MPI job if any individual process dies. This is in contrast to the reliability of TCP sockets, for example: if a process on one side of a socket suddenly goes away, the peer just gets a stale socket.
This lack of resilience is not entirely the fault of MPI implementations; the MPI standard itself lacks some critical definitions about behavior when one or more processes die.
I talked to Joshua Hursey, Postdoctoral Research Associate at Oak Ridge National Laboratory and a leading member of the MPI Forum’s Fault Tolerance Working Group to find out what is being done to make MPI more resilient.
Josh tells me that the Fault Tolerance WG is working on multiple proposals, one of which specifically deals with definitions for what happens when an MPI process dies.
The MPI Forum standardization body created the Fault Tolerance Working Group to define a set of MPI semantics and interfaces to enable fault tolerant applications and libraries to be portably constructed on top of the MPI standard. One proposal is focused on surviving failures, and is based in two principles:
- A failed processes only affects those alive processes with which it communicates directly (e.g., point-to-point operations) or indirectly (e.g., collective operations) after the process fails.
- Once a process is informed of the failure of a peer process — e.g., via return code from an MPI function — it will continue receiving an error on subsequent communication operations involving that failed peer process until the failure is explicitly “recognized” by the process.
In short, once a process fails, point-to-point MPI calls involving that peer will safely fail. Once the failure is “recognized,” the failed process will be treated as if it is MPI_PROC_NULL — point-to-point communications to that peer will be a no-op, but return successfully.
Additionally, collective operations are safely disabled on all communicators that include the failed peer. Once all surviving processes recognize the failed process, collective operations are re-enabled and will simply ignore the failed process(es). For example, an MPI_GATHER on a communicator with a recognized failed process will leave its buffer position unaltered in all other processes.
There are many more details to the proposals, of course.
But Josh’s points give me hope that MPI will be able to be used in faulty environments — potentially even outside of the static, homogeneous, usually-error-free, datacenter-centric, traditional parallel computing environments (to include scaling up to tens of thousands of processors, or even non-HPC applications).