Cisco UCS and HyperFlex for AI/ML Workloads in the Data Center
Data is the lifeblood of business. It helps drive deep insights and better decisions, improve processes, and offer a deeper understanding of customers, partners, and the business itself. Artificial intelligence and machine learning (AI/ML) enable us to learn from data, identify patterns, and make better decisions that augment human capabilities. This gives businesses new ways to grow revenue, attract and retain customers, and become more operationally efficient. Further, AI/ML can help automate tasks and uncover insights in previously unexplored areas. Industries in nearly every sector, from banking to healthcare and manufacturing, are trying to take advantage of these benefits.
There are a few different stages of machine learning and these require unique infrastructure and technology. The ‘training’ stage is when we are seeking to learn from existing data. The learning can be, for example, image recognition, voice recognition, buying preferences, etc. Training is compute intensive since the system must iterate numerous times on existing data with various parameters (and often multiple models) to generate and tune a final model. This model will be used later to drive insights from new data. The next step is ‘inference’, where you use the tuned model to draw quick inferences from new incoming data. Both training and inference can utilize GPUs, large memory footprints, and local storage to accelerate the process. However, the characteristics of the hardware and software platforms for training and inference are distinctly different – in terms of scalability, analytic power, programmability and economics.
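To make the two stages concrete, here is a minimal sketch in plain Python. It is illustrative only and not tied to any Cisco product or ML framework: a toy one-feature linear model is trained by gradient descent with several candidate learning rates (standing in for the "various parameters"), the best fit is kept as the final model, and that model is then applied to new inputs. The data, the learning rates, and all names (`train`, `select_model`, `predict`, `MODEL`) are hypothetical.

```python
"""Toy illustration of the training and inference stages (pure Python)."""

# Hypothetical "existing data", generated from y = 2x + 1.
TRAIN_X = [0.0, 1.0, 2.0, 3.0, 4.0]
TRAIN_Y = [1.0, 3.0, 5.0, 7.0, 9.0]

# Candidate parameters to try during training (assumed values).
LEARNING_RATES = [0.001, 0.01, 0.05]


def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the given data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, lr, epochs=2000):
    """Training: iterate many times over the data to fit w and b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def select_model(xs, ys, learning_rates):
    """Train once per candidate parameter and keep the lowest-error model."""
    best = None
    for lr in learning_rates:
        w, b = train(xs, ys, lr)
        loss = mse(w, b, xs, ys)
        if best is None or loss < best[0]:
            best = (loss, w, b)
    return best[1], best[2]


def predict(model, x):
    """Inference: one cheap evaluation of the tuned model on new data."""
    w, b = model
    return w * x + b


# The expensive step (many passes over the data) happens once, up front.
MODEL = select_model(TRAIN_X, TRAIN_Y, LEARNING_RATES)

if __name__ == "__main__":
    for new_x in [5.0, 10.0]:
        print(new_x, round(predict(MODEL, new_x), 2))
```

Even at this scale the asymmetry is visible: training runs thousands of passes per candidate setting, while each inference is a single multiply and add. Real workloads magnify that difference, which is why the two stages tend to call for different platforms.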
Role of Data
What is the role of data in an AI/ML project? AI/ML has captured our imagination, but it is simply the final few stages in the data pipeline. The process starts by pooling data, sometimes across multiple data silos, into a data lake, where the data is cleaned and prepared before it is sent further down the pipeline for analytics. Most of our customers have already started down this path. Data scientists will attest to the pain and diligent effort needed for data collection and preparation; they often spend 80% of their time on it. In general, both traditional and contemporary big data and analytics tools and frameworks are used in this data collection and preparation phase. To help our customers in this process, Cisco has created validated designs with technology partners in the data lake, big data, and analytics space that enable the line of business and IT to work more effectively, reducing time to value. Now that we have helped our customers in this data pooling, analysis, and preparation phase, we intend to replicate the same ease and time to value for AI/ML, both with our portfolio and with partners.
Building a production-ready AI/ML system can pose some significant challenges. In many instances, the data science team will want to develop an AI/ML model on sensitive data. Data gravity and security issues often mean that model training needs to happen on premises, where the data lives. The physical location and compute requirements can vary dramatically depending on the project. Hence, it is important to consider a platform that offers a robust mix of processors, form factors, and GPU support for AI/ML workloads. Our current portfolio includes our rack-optimized Cisco UCS C220, C240, and C480 servers and HyperFlex Systems that support from two to six NVIDIA GPUs. These systems can be located in close proximity to the on-premises infrastructure that holds the data.
Accelerating Time to Deployment
Data scientists and engineers also spend significant amounts of time architecting a solution that combines infrastructure, multiple machine learning stacks, and management software. This is a huge time investment and needs to be simplified for the enterprise. Cisco’s AI/ML approach builds on the proven and tested solution model that has enabled us to become the leader in market segments such as converged infrastructure and big data. With this model, Cisco changed traditional data center deployments, processes, and dynamics with a higher level of technology integration. This model has been shown to accelerate time to deployment, simplify management, and reduce operational costs.
As an example of extending this model, Cisco and Google are collaborating to combine UCS and HyperFlex platforms with industry-leading AI/ML software packages like Kubeflow from Google to deliver on-premises infrastructure for AI/ML workloads. The initial focus is validation of Kubeflow on UCS/HyperFlex platforms. Going forward, AI/ML with Kubeflow on UCS/HyperFlex in combination with the Cisco Container Platform extends the Cisco/Google open hybrid cloud vision, enabling the creation of symmetric development and execution environments between on-premises infrastructure and Google Cloud.
This collaboration is an example of the steps we are taking to help IT teams easily deploy AI/ML on premises, close to where the data resides, expediting time to insight. We are also engaging with other leading software partners in the AI/ML space to offer choice to customers.
Non-Siloed Operations
The IT industry moves fast. There are constantly new technologies, products, and solutions designed to drive better business outcomes. However, introducing new infrastructure, be it for AI/ML or any other workload, that is not operationally compatible with the existing IT environment can result in the creation of infrastructure and organizational silos.
We have taken a systems approach by designing UCS and HyperFlex from the ground up to deliver an easier way to deploy and manage IT infrastructure. We created a truly programmable, fabric-centric design that combines Cisco compute, network, and storage infrastructure into a single scale-out platform utilizing a simple, common management model. Our management approach is designed to simplify operations and reduce administrative costs by enabling new architectural solutions based on UCS to integrate seamlessly into existing environments. We are extending those principles to AI/ML infrastructure platforms as well.
With the recent introduction of Cisco Intersight, a cloud-based management platform, we are able to provide global management capabilities for Cisco UCS and Cisco HyperFlex systems. Intersight makes it easy to connect and manage nodes regardless of location, and it is itself augmented by analytics and machine learning capabilities to further automate and simplify the management of UCS-based solutions.
At Cisco, we believe AI/ML can help organizations uncover and address new opportunities, facilitate deeper customer engagement, and better compete. It’s our role to provide cutting-edge solutions that simplify the deployment, management, and support of new technologies.