Debugging parallel applications is hard. There’s no way around it: bugs get far more complex when you have not just one thread of control, but N processes, each with M threads, all running simultaneously. Printf-style debugging is simply not sufficient: when a process runs on a remote compute node, even the output of a print statement takes time to cross the network and appear on your screen. That delay can mask the actual problem, because the output shows up significantly later than the event that produced it.
Tools are vital for parallel application development, and there are oodles of good ones out there. I just wanted to highlight one really cool open source (free!) tool today called “Padb”. Written by Ashley Pittman, it’s a small but surprisingly useful tool. One scenario where I find Padb helpful is when an MPI job “hangs”: it seems to stop making progress, but does not die or abort. Padb can find all the individual MPI processes, attach to them, and generate stack traces along with variable and parameter dumps for each process in the MPI job. This lets a developer see where the application is hung, which is an important first step in troubleshooting.
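To make the "where is it hung?" idea concrete, here is a toy sketch in Python. It is not Padb and it does not use MPI: it assumes two threads standing in for two ranks, with hypothetical names (`rank_main`, `dump_stacks`, `run_demo`) invented for this illustration. Each "rank" waits to receive before it sends, so neither makes progress, and a per-rank stack dump shows both stuck in the receive call. A tool like Padb gives you the same kind of view, but across real processes on real nodes.

```python
import queue
import sys
import threading
import time
import traceback


def rank_main(inbox, outbox, stop):
    """Toy 'rank': waits to receive before it sends, so two of them deadlock."""
    while not stop.is_set():
        try:
            msg = inbox.get(timeout=0.05)  # blocking "receive"
        except queue.Empty:
            continue                       # nothing arrived: the job is "hung"
        outbox.put(msg)                    # never reached in this sketch
        return


def dump_stacks(threads):
    """Return {thread name: formatted stack} for each live thread given."""
    frames = sys._current_frames()         # CPython-specific snapshot of thread frames
    report = {}
    for t in threads:
        frame = frames.get(t.ident)
        if frame is not None:
            report[t.name] = "".join(traceback.format_stack(frame))
    return report


def run_demo():
    """Start two mutually waiting 'ranks', dump their stacks, then shut down."""
    a_to_b, b_to_a = queue.Queue(), queue.Queue()
    stop = threading.Event()
    ranks = [
        threading.Thread(target=rank_main, name="rank0", args=(b_to_a, a_to_b, stop)),
        threading.Thread(target=rank_main, name="rank1", args=(a_to_b, b_to_a, stop)),
    ]
    for t in ranks:
        t.start()
    time.sleep(0.2)                        # give both ranks time to get stuck
    report = dump_stacks(ranks)
    stop.set()                             # release the toy ranks so the demo exits
    for t in ranks:
        t.join()
    return report


if __name__ == "__main__":
    for name, stack in run_demo().items():
        print(f"--- {name} ---\n{stack}")
```

Running it prints one stack per simulated rank, and each stack passes through `rank_main` into the queue's `get` call. That is exactly the clue you want from a hung job: every rank is waiting for a message that nobody is going to send.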
Ashley and some engineers from Sun worked on integrating Padb into Open MPI’s nightly regression testing framework (called the MPI Testing Tool, or MTT). Now, when a test hangs in our nightly regression runs, Padb is used to generate stack traces and other diagnostics. Here’s some text from Ashley’s announcement:
Thanks to Sun who have integrated padb with MTT there are now padb traces in the automated testing logs of OpenMPI for cases where the tests have timed out. These traces show the contents of MPI message queues, stack traces and where available local variables and parameters to functions.
Very cool (and genuinely useful) stuff!