Data is already a first-class citizen in many enterprises. Today, data is used to derive insights, influence decisions, and automate and optimize operations. This can be done with deterministic rules and policies, or with Artificial Intelligence (AI)/Machine Learning (ML), wherein patterns are learned from observed data. In this move towards an AI/ML- and data-driven enterprise, decision makers face numerous challenges, including managing the AI lifecycle.

Brief Background: AI/ML is an iterative engineering process wherein observed data (from a system) is used to “train” a mathematical model that learns a rich representation of real-world entities (e.g. malware images) based on prior observations (e.g. a set of malware and benign images). The trained model is then used for “inference” on future data (e.g. is this image a malware image?). Contrast this with traditional computing, where one encodes a set of deterministic, procedural steps without insight from prior data. These AI/ML workloads are often very compute intensive (e.g. millions of matrix and tensor operations), especially during training.
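To make the train/inference split concrete, here is a minimal, illustrative sketch using TensorFlow’s Keras API. The tiny synthetic “images,” labels, and model architecture are hypothetical stand-ins for a real malware/benign dataset, not a recommended design.

    # Minimal train-then-infer sketch with TensorFlow/Keras.
    # The data is random noise standing in for malware/benign images.
    import numpy as np
    import tensorflow as tf

    # "Observed data": 1000 tiny 8x8 grayscale images with binary labels.
    x_train = np.random.rand(1000, 8, 8, 1).astype("float32")
    y_train = np.random.randint(0, 2, size=(1000,))

    # Training: the model learns a representation from prior observations.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(8, 8, 1)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, batch_size=32)

    # Inference: the trained model scores a new, unseen image.
    new_image = np.random.rand(1, 8, 8, 1).astype("float32")
    print("P(malware) =", float(model.predict(new_image)[0, 0]))

Note that training dominates the compute cost here; inference on a single new image is just one forward pass.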

The first phase in the AI lifecycle is “Data Definition and Cleaning,” in which one or more data science teams process and analyze relevant data from diverse data sources using standard ETL (Extract/Transform/Load) steps. The cycle then moves to the second phase, “Discovery” (or training), in which teams build and validate mathematical models from the extracted and cleansed data. In the final “Deployment” phase, these models are put into production. Since this is a lifecycle, the phases repeat as more data is collected, leading to better models and hence better insights and decision making.
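To make the loop concrete, here is a schematic, self-contained sketch of the three phases as plain Python functions. The in-memory data sources, the trivial majority-class “model,” and all names are illustrative placeholders, not a prescribed implementation.

    # Schematic sketch of the three AI lifecycle phases.
    import io
    import pandas as pd

    def define_and_clean(sources):
        """Phase 1: Data Definition and Cleaning via standard ETL steps."""
        frames = [pd.read_csv(io.StringIO(src)) for src in sources]  # Extract
        return pd.concat(frames).dropna().drop_duplicates()          # Transform/Load

    def discover(data):
        """Phase 2: Discovery -- build and validate a model from clean data."""
        # Stand-in "model": always predict the majority class in the data.
        majority = data["label"].mode().iloc[0]
        return lambda record: majority

    def deploy(model):
        """Phase 3: Deployment -- put the validated model into production."""
        print("prediction for a new record:", model({"bytes": 123}))

    # One turn of the lifecycle; in practice it repeats as data grows.
    raw = ["bytes,label\n10,benign\n99,malware\n99,malware\n",
           "bytes,label\n10,benign\n"]
    deploy(discover(define_and_clean(raw)))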

The natural questions asked at each step of the lifecycle are: 1) what infrastructure and tools should be used? 2) should these steps be split across different teams organizationally, or can a single team run the lifecycle as a continuous process? and 3) how can we manage costs and efficiency and obtain visibility? In addition, there are other important requirements, such as compliance, security, and privacy. At each stage, the line of business (LOB) decision maker is presented with a plethora of choices and questions. For example, for AI/ML workflows, what are the correct data sources to use for modeling (each with its own lifecycle questions around location, authorization, privacy, compliance, etc.)? What toolchains should we use, e.g. TensorFlow, Torch, Caffe, MXNet, or CNTK? How do we manage the metadata for all the models being discovered? For deployment, should a separate devops team put these models into production? Which devsecops toolchains do we use to deploy, e.g. Kubernetes, Docker, OpenShift, or Spinnaker? What would be a simple, consistent AI strategy that is future-proof and does not force lock-in? This combinatorial explosion of choices makes planning and decision making a non-trivial problem.

The opportunities afforded by the cloud are clear, but given the realities of the multicloud world we live in, our investments need to be consistent across multiple public clouds, on-premises infrastructure, and even all the way to the edge. We need to carefully choose technologies that let us deploy models anywhere, seamlessly, without having to reinvest in the expensive data discovery process. One hypothesis towards a #ConsistentAI world is as follows: 1) bet on containers; 2) bet on a robust AI/ML toolchain that can run everywhere; and 3) bet on a consistent AI/ML platform for lifecycle management that does not lock us into a single toolchain. The first bet is an easy one, as there is an industry-wide shift towards containers, and Kubernetes in particular. The second bet has several contenders, but we believe TensorFlow is a reasonable one, and it is also beginning to dominate the space. The third bet is Kubeflow.

Kubeflow is a toolchain that helps with the AI/ML lifecycle and aims to make deployments of ML models simple. I would argue it helps towards consistency too, for the following reasons. First, it leverages the Kubernetes ecosystem, which runs anywhere from edge devices all the way to public clouds. Second, it integrates tightly with TensorFlow while maintaining an open platform for other toolchains to plug into, with serious engineering commitment from Google. Third, it has a diverse and vibrant community, with members ranging from cloud providers across the world to enterprise companies and startups – a critical factor for sustained innovation. Our team believes in the Kubeflow mission and we are actively contributing to its success. We hope you join the movement too, and move closer towards #ConsistentAI!
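As a closing illustration, here is one hedged sketch of how the lifecycle could be wired together as a Kubeflow pipeline, assuming the Kubeflow Pipelines SDK (the v1 kfp dsl API). The container images and scripts named here are hypothetical placeholders; only the SDK calls themselves are real.

    # Sketch: ETL -> train -> deploy as a Kubeflow pipeline (kfp v1 dsl).
    import kfp
    from kfp import dsl

    @dsl.pipeline(name="consistent-ai-demo",
                  description="ETL, then training, then deployment")
    def lifecycle_pipeline():
        etl = dsl.ContainerOp(
            name="etl",
            image="example.com/etl:latest",          # hypothetical image
            command=["python", "/app/etl.py"])
        train = dsl.ContainerOp(
            name="train",
            image="tensorflow/tensorflow:latest",
            command=["python", "/app/train.py"]).after(etl)
        dsl.ContainerOp(
            name="deploy",
            image="example.com/deployer:latest",     # hypothetical image
            command=["python", "/app/deploy.py"]).after(train)

    if __name__ == "__main__":
        # Compile to a portable workflow spec that any Kubernetes
        # cluster running Kubeflow can execute.
        kfp.compiler.Compiler().compile(lifecycle_pipeline, "pipeline.yaml")

Because the pipeline compiles to a portable spec and every step is a container, the same definition runs on-premises, in any public cloud, or at the edge – which is the consistency argument in a nutshell.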

Authors

Debo Dutta

Distinguished Engineer

Cloud Services