Here’s a great quote that I ran across the other day from an article entitled “A short history of btrfs” on lwn.net by Valerie Aurora. Valerie was specifically talking about benchmarking filesystems, but you could replace the words “file systems” with just about any technology:
When it comes to file systems, it’s hard to tell truth from rumor from vile slander: the code is so complex, the personalities are so exaggerated, and the users are so angry when they lose their data. You can’t even settle things with a battle of the benchmarks: file system workloads vary so wildly that you can make a plausible argument for why any benchmark is either totally irrelevant or crucially important.
This remark is definitely true in high performance computing realms, too. Let me use it to give a little insight into MPI implementer behavior, with a specific case study from the Open MPI project.
Competition is fierce in HPC; this was true even before the world economy shrank IT spending. Indeed, competition is rampant not only among research / academic groups (competing for funding), but also between different open source software projects (competing for industry backing, glory, and sometimes funding). It has become more crucial than ever to have “the best performance”, proving why someone should use your software/hardware rather than their software/hardware.
So let’s just tweak a few words in the previous quote:
When it comes to MPI implementations, it’s hard to tell truth from rumor from vile slander: the code is so complex, the personalities are so exaggerated, and the users are so angry when their jobs perform poorly. You can’t even settle things with a battle of the benchmarks: MPI workloads vary so wildly that you can make a plausible argument for why any benchmark is either totally irrelevant or crucially important.
Micro benchmarks are heavily used within the HPC community: ping pong latency and bandwidth, standalone collective operation performance, streaming behavior, time to complete parallel “hello world” at scale, etc. Numbers for such benchmarks are typically bandied about as conclusive proof that MPI xyz is better than MPI abc (“this benchmark is crucially important!”). While competition between HPC systems vendors and MPI implementations is arguably a Good Thing, the laser-sharp focus on micro benchmarks sometimes forces a difficult question: optimize for benchmarks or optimize for more realistic workloads?
Here’s a concrete example: in the Open MPI v1.3 release series, we changed the default behavior on OpenFabrics-based networks to give (sometimes dramatically) better performance if you re-use the same communications buffers iteratively for large messages. In simple terms, large message latency went down and large message bandwidth went up in point-to-point benchmarks.
Seems like this should be a no-brainer: why wouldn’t we have chosen to do this before?
We actually had good reasons for not enabling this behavior by default:
- This optimization is only useful for applications that re-use large communication buffers. Such behavior is common to benchmarks, but is not always common in real-world workloads.
- Enabling this behavior adds a good chunk of overhead to track the user’s message buffers, including the use of unsafe and unreliable Linux system-call intercept techniques. These unsafe (but necessary) practices are known to crash some real-world MPI applications (see [Footnote 1], below).
- In many workloads that do not re-use large communication buffers, there is negligible performance difference (specifically: the application is neither noticeably faster nor noticeably slower). But in at least some real-world MPI applications, there is a large performance impact — in the wrong direction (read: slower).
- Enabling this behavior consumes additional operating system and communication hardware resources, which can have subtle-yet-wide-reaching effects on overall performance.
- This benchmark-friendly behavior has always been available in Open MPI via a command-line parameter. Users could run any MPI job with or without this behavior quite easily.
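To make the trade-off in the list above a little more tangible, here is a toy cost model in Python. It is only a sketch under stated assumptions: it assumes the optimization behaves roughly like a cache of buffers that stay registered with the network hardware after their first use, and the class name `ToyMPI` and all three cost constants are made up for illustration. It is not Open MPI's implementation and the numbers are not measurements.

```python
from dataclasses import dataclass, field

# Hypothetical relative costs, chosen only for illustration (not measured).
REGISTER_COST = 100  # preparing a buffer for the network hardware
SEND_COST = 10       # transferring a buffer that is already prepared
LOOKUP_COST = 1      # bookkeeping needed to track the user's buffers


@dataclass
class ToyMPI:
    """Toy sender with an optional cache of already-registered buffers."""
    cache_enabled: bool
    cache: set = field(default_factory=set)
    total_cost: int = 0

    def send(self, buffer_id: int) -> None:
        if self.cache_enabled:
            # Every send pays the tracking overhead, hit or miss.
            self.total_cost += LOOKUP_COST
            if buffer_id not in self.cache:
                self.total_cost += REGISTER_COST
                self.cache.add(buffer_id)  # stays registered, holding resources
        else:
            # Without the cache, each send registers its buffer from scratch.
            self.total_cost += REGISTER_COST
        self.total_cost += SEND_COST


def run(cache_enabled: bool, buffer_ids) -> ToyMPI:
    """Send every buffer in buffer_ids once, in order, and return the model."""
    mpi = ToyMPI(cache_enabled)
    for buffer_id in buffer_ids:
        mpi.send(buffer_id)
    return mpi


benchmark_like = [0] * 1000        # the same buffer re-used for every send
fresh_buffers = list(range(1000))  # a different buffer for every send

for name, pattern in [("re-use", benchmark_like), ("fresh", fresh_buffers)]:
    off = run(False, pattern)
    on = run(True, pattern)
    print(f"{name}: cost without cache={off.total_cost}, "
          f"with cache={on.total_cost}, buffers held={len(on.cache)}")
```

In this model the re-use pattern gets much cheaper with the cache, while the fresh-buffer pattern gets slightly more expensive and ends up holding one registration per buffer. That mirrors the bullets above in spirit only; real behavior depends on the application, the network hardware, and the operating system.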
For all these reasons, Open MPI chose to exhibit this particular “non-benchmark” behavior out of the box. In short: we felt it was better for real-world applications, even though it didn’t look good in some benchmarks. Unfortunately, no matter how many times we explained this point, others would benchmark Open MPI in “non-benchmark mode” and then compare the results to another MPI implementation and say, “Look, Open MPI’s performance is terrible!” (read: other MPIs made the opposite decision; they chose to optimize default behavior for the benchmarks).
So we gave up and switched Open MPI’s default behavior. It’s still not something I’m pleased about (I do feel that some real-world applications will definitely experience a negative performance impact), but it had to be done to retain the appearance of being competitive. That is: Open MPI was always competitive; it just didn’t look that way because we chose the defaults that we honestly felt were better for real-world / non-benchmark performance. Bummer.
The moral of the story here is that MPI implementations are complex. While benchmarks are a good and necessary method of evaluating HPC systems, they are not the only method. Micro benchmarks can be a good indicator of simplistic application behavior, but real-world MPI applications tend to be much more complex. The best thing you can do is run your own application(s) and see what happens; your application performance is the main reason you’re buying an HPC system, right? Then dig deeper to figure out exactly why you are seeing the performance that you are seeing (these are complex systems, after all). Talk to your favorite MPI guru. Analyze the network traffic and resource consumption on the computational units. And remember that optimizing metric X may cause a degradation in metric Y.
It is very possible (and probable!) that you can optimize the system for your application’s performance. But be wary of making broad “this is bad” and “this is good” assumptions without a good measure of understanding of the underlying system. Come talk to us; you’ll likely find that your HPC integrator wants you to get good performance, since it’s good for business!
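If you want a starting point for the “run your own application and see what happens” advice, here is a minimal Python sketch that times a command several times under two configurations and reports the median. The helper names `time_command` and `summarize` are hypothetical, and the two commands are stand-ins that only launch a Python interpreter; substitute the launch lines for your own application and the settings you want to compare.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, repeats=5):
    """Run argv `repeats` times and return each wall-clock time in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so broken runs are not timed.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Report the median along with the spread, since single runs are noisy."""
    return {"median": statistics.median(times), "min": min(times), "max": max(times)}


# Stand-in commands (assumptions for illustration only): replace each list
# with the real launch line for your application under one configuration.
configs = {
    "configuration A": [sys.executable, "-c", "sum(range(200_000))"],
    "configuration B": [sys.executable, "-c", "sum(range(400_000))"],
}

for label, argv in configs.items():
    result = summarize(time_command(argv, repeats=3))
    print(f"{label}: median={result['median']:.3f}s "
          f"(min={result['min']:.3f}s, max={result['max']:.3f}s)")
```

Wall-clock time is only one metric. A sketch like this tells you that two configurations differ, not why, so it is a complement to the deeper digging described above rather than a replacement for it.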
[Footnote 1] Thanks to some work by Roland Dreier at Cisco, a Linux patch has been submitted upstream that provides a safe and reliable intercept interface (v3 of his patch is archived here). The patch hasn’t been accepted yet, but Roland is iteratively working on making the changes palatable to the Linux kernel community. I’m cautiously optimistic that with Roland’s new kernel module, these unsafe intercept practices will be a dark chapter in our history that we shall choose to never speak of again.