A recent exchange on the Open MPI users’ list turned up a minor bug in our code base. The bug had to do with how Open MPI reported a settings value through our configuration querying tool (“ompi_info”).
The code using the configuration value in question was doing the Right Things, but the tool was effectively reporting the wrong value. This led to some confusion on the mailing list, resulting in a bug fix being pushed upstream and the user concluding, “Trust, but verify.”
As applied to HPC: You need to trust your hardware and software vendors, but verify both that your system is working the way that you expect it to, and that your applications are getting the performance that they should.
MPI implementations, just like any network stack, are large, complex pieces of software. We try very hard to deliver perfect software, but even though we don’t like to admit it, bugs happen. The bug revealed through this exchange with the user was pretty minor, but it did cause some confusion.
But sometimes the most nefarious bugs occur when users exercise our carefully developed software in ways that we didn’t anticipate, in environments that we didn’t expect. In these cases, you might unintentionally run across an untested code path, or trigger an unexpected combination of inputs.
The moral of the story here is: trust, but verify. When something goes wrong, or even when everything is going Right: verify, verify, verify.
It’s good science.