Cisco Blogs


Cisco Blog > High Performance Computing Networking

Making MPI survive process failures

February 27, 2011 at 12:00 pm PST

Arguably, one of the biggest weaknesses of MPI is its lack of resilience: most (if not all) MPI implementations will kill an entire MPI job if any individual process dies. This is in contrast to the reliability of TCP sockets, for example: if a process on one side of a socket suddenly goes away, the peer just gets a stale socket.

This lack of resilience is not entirely the fault of MPI implementations; the MPI standard itself lacks some critical definitions about behavior when one or more processes die.
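
To make the socket side of that contrast concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not anything from an MPI implementation: `socket.socketpair()` stands in for two separate processes connected over TCP, and the function name `peer_sees_disconnect` is made up for this example. It also shows only an orderly close; when a remote process actually crashes, the surviving peer may see a connection-reset error instead of a clean end-of-stream.

```python
import socket


def peer_sees_disconnect():
    """Show that the surviving end of a socket keeps running after its peer goes away."""
    # Assumption: socketpair() gives two connected endpoints in one process,
    # standing in for two processes that talk over a TCP connection.
    a, b = socket.socketpair()

    # Normal traffic while both sides are alive.
    a.sendall(b"hello")
    first = b.recv(1024)

    # One side goes away (modeled here as an orderly close).
    a.close()

    # The survivor is not killed; recv() returns an empty bytes object,
    # which is how a stream socket reports end-of-stream.
    after = b.recv(1024)
    b.close()
    return first, after


first, after = peer_sees_disconnect()
print(first, after)
```

The point of the sketch is only that the surviving endpoint gets to observe the failure and decide what to do next, which is the behavior the default MPI error handling does not offer.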

I talked to Joshua Hursey, Postdoctoral Research Associate at Oak Ridge National Laboratory and a leading member of the MPI Forum’s Fault Tolerance Working Group, to find out what is being done to make MPI more resilient.
