2020 is expected to be the breakthrough year for Machine Learning (ML). Using data to create insights, predictions, and probability scores provides businesses with major competitive advantages in our dynamic and digital world – think “recommended for you” notifications when using your favourite apps or shopping.

Data scientists are the architects of ML-powered software and users of ML technology, gravitating towards public cloud platforms where they can find specialized stacks of ML software and tooling. Typically, these platforms combine ML frameworks such as TensorFlow or PyTorch, which provide the software libraries and structure, with computational workspace tools such as Jupyter Notebooks to develop and deliver ML-driven outcomes.

In addition to offering the ML tools required by data scientists, public clouds also provide components such as data services, software delivery platforms and, of course, the underlying infrastructure, managed by the provider. These components are necessary to deliver an end-to-end ML development environment in a simple, consumable fashion.

Public cloud ML platforms are a great place to start and develop ML-based projects. However, data regulations and rising costs as projects expand mean that a growing number of organizations are forced to build their own ML platforms on-premises. This brings a fresh set of challenges, given that most IT teams lack the specialized skills necessary to implement and support them.

Why ML on-prem

Let’s expand on why some consider building ML stacks on-prem. An important part of the ML process is taking data and running it through a series of analytics, while applying algorithmic logic in a process known as “training”. This results in an artefact called a “model”, which is essentially a file that has been tuned to recognize certain types of patterns, for example determining what a person is asking for when using an automated voice-driven customer support system.
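To make the idea of training producing a model file concrete, here is a minimal sketch in plain Python. It is not how TensorFlow or PyTorch work internally; the toy data, the function names, and the file name toy_model.pkl are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def train(samples, epochs=5000, lr=0.05):
    """Fit y = w*x + b by gradient descent and return the learned parameters."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum((w * x + b - y) * x for x, y in samples) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in samples) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    return {"w": w, "b": b}


def predict(model, x):
    """Apply the trained parameters to a new input."""
    return model["w"] * x + model["b"]


# Hypothetical training data that follows y = 2x + 1.
samples = [(0, 1), (1, 3), (2, 5), (3, 7)]
model = train(samples)

# The "model" artefact is just a file holding the tuned parameters.
model_path = Path(tempfile.gettempdir()) / "toy_model.pkl"
model_path.write_bytes(pickle.dumps(model))

# Later (possibly on another machine) the file is loaded and used for inference.
restored = pickle.loads(model_path.read_bytes())
prediction = predict(restored, 4)
```

The training loop is the expensive part; the resulting file is small and can be shipped wherever predictions are needed.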

Creating this model, and ultimately deploying it to deliver an actual ML outcome, can be CPU/GPU intensive. This is why it can become costly to support an ML process on a pay-as-you-go basis in a public cloud.

There are also important aspects to be considered regarding the data that feeds ML processes. It may be subject to specific regional governance due to regulation, or the data set might be too large to move to the public cloud – which could mean more complexity and cost.

This is where businesses consider deploying ML stacks on-premises, as they can be compliant with regulations and make data access and processing more financially predictable and viable.

Public cloud shows the way

ML workloads have been developed and hosted on-premises for many years using established data science software platforms such as H2O and RapidMiner. For the separation of workloads and their dependencies, data scientists use tools like Anaconda and Python’s virtualenv. All of this has been successfully hosted on traditional data center infrastructure.

Nevertheless, public cloud offerings generally provide a better user experience, as they enable treatment of infrastructure using common APIs, offering easy integration with automation tools and making it easier to deploy functionality faster, aligning with DevOps principles. The declarative approach of defining “infrastructure as code” (IaC) automates many of the supporting tasks, which is especially useful when aligning the underlying infrastructure with ML development platforms and, ultimately, the applications they service.
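The declarative idea behind IaC can be sketched in a few lines of Python: the user states the desired infrastructure, and tooling works out the actions needed to get there. This is only an illustration of the principle, not the behaviour of any particular IaC tool; the resource names and the plan function are hypothetical.

```python
def plan(desired, actual):
    """Compare desired and actual state and list the actions that reconcile them."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical desired state, as it might be kept in version control.
desired = {
    "gpu-node": {"count": 2},
    "notebook": {"image": "jupyter"},
}

# Hypothetical state currently running in the data center.
actual = {
    "gpu-node": {"count": 1},
    "old-vm": {},
}

actions = plan(desired, actual)
for verb, resource in actions:
    print(f"{verb}: {resource}")
```

The user never writes the create, update, or delete steps by hand; they only edit the desired state and let the tooling derive the rest.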

In order to imitate more of a public cloud-like experience on-prem for ML projects, data center components need to be designed similarly and implemented with appropriate APIs and Software Development Kits (SDKs). The good news is that a growing number of API-driven compute, networking, and storage platforms now offer Infrastructure-as-a-Service layers in an automated fashion.

Despite this, an additional layer of abstraction is still required for ML users, as they don’t want to spend time patching together and managing different APIs from heterogeneous infrastructure. What is missing is an overarching, API-driven management platform that takes the burden of the underlying infrastructure away from developers.

Kubernetes and Kubeflow for on-prem ML projects


One such platform that can offer a common API for infrastructure is Kubernetes, a container orchestration system. This is a key requirement if you need to move some of the parts of an ML environment between different platforms – for example, training in a public cloud and inferencing on-prem or at the edge. As ML components are packaged into the containers managed by Kubernetes, they gain agility, portability, and application dependency management.

Kubernetes offers the flexibility and automation necessary for ML platforms, but ML configurations also require additional specialized components (e.g. components within TensorFlow or PyTorch) to create workflows in which multiple services come together in sequence and consume the underlying infrastructure.


This is where Kubeflow comes in. A project championed by Google with contributions from Cisco and other companies, it is dedicated to making ML on Kubernetes easy, portable, and scalable. Using pipelines, Kubeflow provides a straightforward way to spin up ML environments and manage the lifecycle of ML workflows, while offering the user experience necessary to both develop and deploy ML-driven projects. Effectively, Kubeflow acts as the platform on top of Kubernetes that “hosts” the additional ML-specific packages mentioned.
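The pipeline concept can be illustrated with a small plain-Python sketch in which steps declare their dependencies and are run in order, each consuming the outputs of the steps before it. This deliberately does not use the Kubeflow Pipelines SDK; the step names, the run_pipeline helper, and the toy data are assumptions for illustration only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(steps, deps):
    """Run each step after its dependencies, passing their results as inputs."""
    order = list(TopologicalSorter(deps).static_order())
    results = {}
    for name in order:
        inputs = {dep: results[dep] for dep in deps.get(name, ())}
        results[name] = steps[name](inputs)
    return order, results


# Hypothetical steps of an ML workflow; each one is a plain function here,
# whereas on a real platform each would typically run in its own container.
steps = {
    "prepare": lambda inputs: [1, 2, 3],
    "train": lambda inputs: sum(inputs["prepare"]) / len(inputs["prepare"]),
    "deploy": lambda inputs: f"serving model with mean={inputs['train']}",
}

# Each step maps to the set of steps it depends on.
deps = {
    "prepare": set(),
    "train": {"prepare"},
    "deploy": {"train"},
}

order, results = run_pipeline(steps, deps)
print(order)
print(results["deploy"])
```

Because each step only sees the declared outputs of its predecessors, individual steps can in principle be scheduled wherever suitable infrastructure is available.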

By design, the major advantage of Kubeflow is that it can run on top of any Kubernetes cluster, thus providing the same user experience regardless of the underlying infrastructure platform. Naturally, this helps overcome the challenges users experience when moving between various platforms, on-prem or in public clouds. Its abstracted and layered approach allows various components of the ML pipeline to be hosted in different locations (on-prem data center or edge, public clouds) so that users can take advantage of the specialized services offered by various platforms or meet data gravity needs.

As you can see, this is starting to look promising for delivering on-premises ML capabilities, with potential hybrid configurations giving data scientists the best of both worlds. It can still be hard to achieve, however, as configuring all of this remains a challenging task. In part two of this blog series, we will look at tools that can further enhance the on-premises ML experience.

To continue reading, check out part 2 of this blog series: Simplifying on-premises Machine Learning infrastructure with MLAnywhere.


Michael Doherty

Technical Solution Architect

WW Data Center Team