With my curiosity and passion for innovation, I often discover patterns around me that look like pendulum motion: from one extreme to the other, back and forth. Plotted on a timeline they may appear as a sine curve, but from an evolutionary point of view a spiral is a much better representation. Sometimes those patterns even gravitate towards a common endpoint, such as the center of a spiral.
One such pattern is now in motion and starting to gravitate towards a center.
As you probably know, I'm beginning my fifth year of swimming in HPC waters. Sometimes it feels more like drowning, since some aspects of HPC are positively archaic. The mode of operation is one of them. An HPC cluster is usually one big static resource, in front of which people with stacks of punch cards queue up and wait for their turn. Despite the fact that the cluster is composed of many individual machines, people are taught to look at it and use it as one single machine.
Yet some other aspects are bleeding-edge developments, and those are really exciting. These days it appears that even HPC people are becoming aware of their archaic points and are actively looking at developments in neighboring areas, such as clouds. Enter the "HPC cloud" arena.
There have already been some well-publicized examples where HPC-style problems were solved satisfactorily on what is today known as the "cloud". Unfortunately, they are very few. The cloud, it seems, was designed for a different role, and the majority of HPC jobs do not fit the cloud model well.
Let's take a look at how the cloud evolved to its current level. Back in the day when the web was young, scaling meant buying the largest machine you could buy. When even the largest machines became too small for some of the top sites of the dotcom bubble era, people realized that just throwing money at the hardware would not solve all their problems. So smart people started to think radically differently and implemented software solutions for horizontal scaling. This brought the need for a number of identical machines configured to play a specific role, with their number varying based on the requirements of the moment, such as the request rate. Developers were told to deploy each component to a specific server.
With commodity virtualization solutions this became relatively easy to achieve. Today you deploy your web app stack on a cloud, specify a min and max number of instances, configure an elastic load balancer and off you go.
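To make the min/max idea concrete, here is a minimal Python sketch of the kind of sizing rule such a setup might apply. Everything in it is an assumption made for illustration: the function name `desired_instances`, the per-instance capacity figure and the request rates are made up, and no real cloud provider's API is involved.

```python
import math


def desired_instances(request_rate, per_instance_capacity, min_instances, max_instances):
    """Return how many instances to run for the current request rate.

    Hypothetical helper: request_rate and per_instance_capacity are both
    in requests per second.
    """
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    # Enough instances to absorb the load, rounded up.
    needed = math.ceil(request_rate / per_instance_capacity)
    # Clamp to the configured floor and ceiling.
    return max(min_instances, min(max_instances, needed))


if __name__ == "__main__":
    # Made-up request rates; each instance is assumed to handle 200 req/s.
    for rate in (50, 900, 5000):
        print(rate, desired_instances(rate, 200, 2, 10))
```

The floor keeps some capacity around when traffic is quiet, and the ceiling keeps the bill bounded when it is not.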
While this is suitable for the large majority of web sites, some of the top players figured out that it is still not good enough for them. They were forced to tear their web app stack apart and rethink every piece of it. What they came up with resembles traditional HPC to a surprising degree.
What motivated me to write all this is that I recently discovered something called Mesos. They claim their product is a "datacenter operating system", but I'd wait a bit before putting out such bold claims. From my HPC perspective it is just a resource management and queuing system done right.
As an HPC operator, one of my largest complaints was that traditional HPC assumes a lot: all compute nodes are supposed to offer the same software environment; all jobs expect to run on bare metal without any hypervisor; and the user interface to queuing and resource management is rigid, tailored to manual interaction via the command line. Jobs have a hard limit on how long they are allowed to run, and the queue doesn't care whether they finish successfully or not. Because it's already enough work to keep one cluster and all of its components up and running, there were almost no experiments on how to adapt it to more general use. As a result, utilization of the majority of smaller clusters was way below commercially acceptable levels, which translated into no commercial interest in creating and offering HPC as a service. There were some attempts, but as far as I know, they are all limping along with support from state funds in one form or another. So what these HPC operators are thinking about very hard is how to enable their hardware to run existing cloud workloads, with the added benefits they can provide, such as fast storage and interconnects.
On the other side, large-scale web app providers are tearing apart their app stacks, splitting them into frontend tasks with realtime requirements and backend tasks with batch-oriented data processing. These batch tasks are so large that efficient use of infrastructure yields large savings in operating costs and therefore a better position on the market. That is motivation enough for these people to do it.
Mesos is one example of what they have come up with. The whole thing looks surprisingly like a modern HPC scheduler. Mesos itself provides just resource offers, and it is then up to the user to implement a resource manager on top of it. Some of them already exist and cover the most common use cases, such as Chronos (think of it as a global cron) and Marathon (for longer-living jobs). The Twitter people implemented their own, called Aurora, which covers their own needs. A "job" in Mesos can be anything from a single Unix command to a KVM instance. MPICH and Torque have been ported to run on top of Mesos, so you can set it up in a way that is very familiar to a user of a traditional HPC system. All these jobs execute in some form of container, with Docker being the most popular option these days. This enables one to prepare suitable environments for each job up front and stash them away neatly in a Docker repository.
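The split between resource offers and the resource managers built on top can be illustrated with a toy simulation. This is only a sketch of the idea in plain Python: it does not use the Mesos API, and the names `Offer`, `Task`, `ToyFramework` and `run_offer_round`, as well as the node sizes and task names, are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Free resources on one node, as advertised by the (toy) master."""
    node: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


@dataclass
class ToyFramework:
    """Hypothetical resource manager: keeps its own queue and decides per offer."""
    name: str
    pending: list = field(default_factory=list)

    def on_offer(self, offer):
        """Return the tasks to launch on this offer (empty list means decline)."""
        launched = []
        cpus, mem = offer.cpus, offer.mem_mb
        for task in list(self.pending):
            if task.cpus <= cpus and task.mem_mb <= mem:
                launched.append(task)
                self.pending.remove(task)
                cpus -= task.cpus
                mem -= task.mem_mb
        return launched


def run_offer_round(offers, frameworks):
    """Offer each node to the frameworks in turn until one of them uses it."""
    placements = []
    for offer in offers:
        for fw in frameworks:
            tasks = fw.on_offer(offer)
            if tasks:
                placements.extend((fw.name, offer.node, t.name) for t in tasks)
                break  # toy simplification: leftovers are not re-offered
    return placements


if __name__ == "__main__":
    offers = [Offer("node1", 4, 8192), Offer("node2", 2, 4096)]
    batch = ToyFramework("batch", [Task("mpi-rank", 4, 4096)])
    web = ToyFramework("web", [Task("frontend", 1, 1024), Task("cache", 1, 2048)])
    for placement in run_offer_round(offers, [batch, web]):
        print(placement)
```

The point of the design is that the lower layer only knows about free resources, while each framework applies its own policy for what to run on them. That is why a cron-like service, a long-running service manager and an MPI launcher can share the same machines.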
What does software like Mesos make possible? A single hardware infrastructure, capable of running HPC-style MPI jobs, cloud VMs, big data analytics, web app stacks and everything else that might appear in the future. Infrastructure that can again be addressed as one system. Something that has until now been more of a dream than a reality. I might be too excited about this right now, but I see this as an enabling technology for the era of utility computing.
And this is where the spirals join at the center.
Now off to find a sponsor ...