I actively read a bunch of HPC-related news feeds and blogs out there in the interwebs (how did I live before my RSS reader?). Two articles recently caught my attention:
Don’t get me wrong; I read lots of fine articles and blog entries. But these two seemed to resonate fairly well with some of the messages I’ve tried to convey here in my own blog.
In the “Renting HPC” article, Nicole Hemsoth talks about the proposition of renting time on HPC resources for organizations that don’t have the time, money, expertise, etc. to buy, install, and run their own.
Consider: some organizations only need to run a few HPC jobs now and then. Why have a dedicated resource? It doesn’t make fiscal sense. “Rentable” HPC businesses have become quite good at customizing environments for specific customer requirements, potentially even re-provisioning cluster nodes on a per-customer/per-job basis. One of Nicole’s points in the article is that these services sometimes market themselves with the trendy term “cloud”, which may be a bit confusing because HPC hasn’t yet embraced the idea of virtualization.
Indeed, I was initially a big nay-sayer of using virtualization in HPC. I thought it was a terrible idea because it just added lots more layers of software and hardware complexity for questionable gain. Proponents claimed that you got process checkpointing and migration because that’s all built into existing virtualization systems.
This point is somewhat true — but note that you still might need to involve the communications middleware in virtual machine checkpointing and migration. But then again, while many MPI solutions now include checkpoint/restart (and possibly migration) capabilities, they’re nowhere near as slick and easy to use as the point-n-click cluster-wide interfaces sported by popular commercial virtualization solutions.
I have come to realize over time that HPC is becoming commoditized. It’s not just the bleeding-edge types that want to run “big number crunching” jobs — even individual departments in big enterprise companies want to run simulation jobs. Such “HPC” users don’t necessarily value blazing-fast performance as the most important factor in their work: they submit their big number-crunching jobs when they go home at night and don’t really care if the jobs finish at midnight or at 4am — either way, they’re still asleep. As long as the jobs finish before they come in to work the next day, it’s all good.
Such users therefore don’t necessarily care about losing a few percent of performance when running under a virtual machine. They’re more interested in a solution that their server and network administrators already understand and know how to manage.
Virtualization for HPC might have a play in this kind of environment. Virtualization has a large ecosystem and is a sizable industry in itself; it has a lot of very nice, deeply-integrated tools, and a good amount of mindshare in IT departments. Even though MPI solutions can likely do checkpoint / restart / migration more efficiently than a virtualized solution, it might not matter. Virtualization has a huge head start with its pointy-clicky tools, integration into networking hardware, and corporate budgets.
It’s also an easy play for fault recovery: if a node fails (or even looks like it’s going to fail), an administrator, or perhaps even an automated resource manager, can save a parallel job to backing store and/or migrate it to nodes that aren’t failing. As compute clusters get larger, we all know that hardware failure becomes inevitable; having the ability to predict failures and proactively save jobs from needing to be restarted from a checkpoint could become quite the killer feature for enterprise-class HPC.
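To make that a bit more concrete, here is a minimal sketch of the kind of decision an automated resource manager might make. Everything in it is hypothetical: the `Node` class, the `failure_risk` score (imagine it comes from some health monitor), the `plan_recovery` function, and the 0.8 threshold are all made up for illustration and don't correspond to any real virtualization or MPI product. It only shows the policy: migrate off a suspect node if a healthy spare exists, otherwise fall back to saving the job to backing store.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    failure_risk: float  # hypothetical 0.0-1.0 score from a health monitor
    busy: bool = False


def plan_recovery(job_nodes, spare_nodes, risk_threshold=0.8):
    """Return a list of (action, node_name, target_name) tuples for one job.

    Assumed policy, for illustration only:
      - a node whose risk is at or above the threshold is "at risk";
      - if an idle, healthy spare exists, migrate off the at-risk node;
      - otherwise, checkpoint (save the job to backing store).
    """
    at_risk = [n for n in job_nodes if n.failure_risk >= risk_threshold]
    healthy_spares = [
        n for n in spare_nodes
        if not n.busy and n.failure_risk < risk_threshold
    ]

    plan = []
    for node in at_risk:
        if healthy_spares:
            # Take the first available spare; a real scheduler would
            # likely weigh topology, load, and so on.
            target = healthy_spares.pop(0)
            plan.append(("migrate", node.name, target.name))
        else:
            # No spare left: fall back to saving state to backing store.
            plan.append(("checkpoint", node.name, None))
    return plan


if __name__ == "__main__":
    job = [Node("n1", 0.1), Node("n2", 0.9), Node("n3", 0.95)]
    spares = [Node("s1", 0.05), Node("s2", 0.2, busy=True)]
    for step in plan_recovery(job, spares):
        print(step)
```

With one idle spare and two suspect nodes, the first suspect node gets migrated and the second falls back to a checkpoint. A real system would obviously have to do the hard part, which is actually moving the virtual machines (and possibly coordinating with the communications middleware).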
Huh. Could be interesting to see how this plays out.
…I was initially going to comment on Doug Eadline’s “Blowing the Doors…” editorial here, but this blog entry is already too long. I’ll save that for a future entry!