
Harnessing data is crucial for success in today’s data-driven world, and the surge in AI/ML workloads is accelerating the need for data centers that can deliver that data with operational simplicity. While 84% of companies believe AI will have a significant impact on their business, just 14% of organizations worldwide say they are fully ready to integrate AI into their business, according to the Cisco AI Readiness Index.

The rapid adoption of large language models (LLMs) trained on huge data sets has introduced new complexities in managing production environments. What’s needed is a data center strategy that embraces agility, elasticity, and cognitive intelligence capabilities for greater performance and long-term sustainability.

Impact of AI on businesses and data centers

While AI continues to drive growth, reshape priorities, and accelerate operations, organizations often grapple with three key challenges:

  • How do they modernize data center networks to handle evolving needs, particularly AI workloads?
  • How can they scale infrastructure for AI/ML clusters with a sustainable paradigm?
  • How can they ensure end-to-end visibility and security of the data center infrastructure?
Figure 1: Key network challenges for AI/ML requirements

While AI visibility and observability are essential for supporting AI/ML applications in production, challenges remain. There is still no universal agreement on which metrics to monitor or what optimal monitoring practices look like. Furthermore, defining roles for monitoring and the best organizational models for ML deployments remain ongoing discussions for most organizations. And with data and data centers everywhere, security in distributed environments is imperative: colocation and edge sites, as well as traffic between sites and clouds, need encrypted connectivity using IPsec or similar services.

AI workloads, whether used for inferencing or retrieval-augmented generation (RAG), require distributed and edge data centers with robust infrastructure for processing, security, and connectivity. For secure communications between multiple sites, whether private or public cloud, enabling encryption is key for GPU-to-GPU, application-to-application, and traditional-workload-to-AI-workload interactions. Advances in networking are warranted to meet this need.

Cisco’s AI/ML approach revolutionizes data center networking

At Cisco Live 2024, we announced several advancements in data center networking, particularly for AI/ML applications. These include the Cisco Nexus One Fabric Experience, which simplifies configuration, monitoring, and maintenance for all fabric types through a single control point, the Cisco Nexus Dashboard. This solution streamlines management across diverse data center needs with unified policies, reducing complexity and improving security. Additionally, Cisco Nexus HyperFabric expands the Cisco Nexus portfolio with an easy-to-deploy, as-a-service approach that augments our private cloud offering.

Figure 2: Why the time is now for AI/ML in enterprises

Nexus Dashboard consolidates services, creating a more user-friendly experience that streamlines software installation and upgrades while requiring fewer IT resources. It also serves as a comprehensive operations and automation platform for on-premises data center networks, offering valuable features such as network visualizations, faster deployments, switch-level energy management, and AI-powered root cause analysis for swift performance troubleshooting.

As new buildouts focused on supporting AI workloads and their associated data trust domains continue to accelerate, much of the network focus has justifiably been on the physical infrastructure and the ability to build a non-blocking, low-latency, lossless Ethernet fabric. Ethernet’s ubiquity, component reliability, and superior cost economics will continue to lead the way with 800G and a roadmap to 1.6T.
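To make the non-blocking requirement concrete, here is a minimal sketch of the arithmetic behind it. The port counts and speeds are illustrative assumptions, not a Cisco reference design: a leaf switch is non-blocking when its server-facing bandwidth does not exceed its fabric-facing bandwidth.

```python
# Illustrative sketch: check whether a leaf-spine design is non-blocking.
# Port counts and speeds below are assumptions, not a Cisco reference design.

def oversubscription_ratio(downlinks: int, downlink_gbps: int,
                           uplinks: int, uplink_gbps: int) -> float:
    """Ratio of server-facing bandwidth to fabric-facing bandwidth on a leaf."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

# Example: 32 x 400G ports to GPU servers, 16 x 800G ports to the spines.
ratio = oversubscription_ratio(downlinks=32, downlink_gbps=400,
                               uplinks=16, uplink_gbps=800)
print(f"Oversubscription ratio: {ratio:.2f}:1")  # 1.00:1 -> non-blocking
```

Designs with a ratio above 1:1 are oversubscribed and can see congestion during the all-to-all traffic bursts typical of AI training jobs.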

Figure 3: Cisco’s AI/ML approach

By enabling the right congestion management mechanisms, telemetry capabilities, port speeds, and latency, operators can build out AI-focused clusters. Our customers are already telling us that the discussion is quickly moving toward fitting these clusters into their existing operating model so they can scale their management paradigm. That’s why it is essential to also innovate around simplifying the operator experience with new AIOps capabilities.
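As a simplified illustration of one such congestion management mechanism, the sketch below models RED/ECN-style marking of the kind used in lossless RoCE fabrics: packets are marked with rising probability as a switch queue fills, so senders back off before buffers overflow. The thresholds and marking curve are assumptions for illustration, not Cisco defaults.

```python
import random

# Minimal sketch of RED/ECN-style congestion marking: as a switch queue
# fills beyond K_MIN, packets are ECN-marked with rising probability so
# senders slow down before the buffer overflows and packets are dropped.
# Thresholds below are illustrative assumptions, not Cisco defaults.

K_MIN = 100   # queue depth (in packets) where marking starts
K_MAX = 400   # queue depth where every packet is marked
P_MAX = 0.8   # marking probability as the queue approaches K_MAX

def should_mark(queue_depth: int) -> bool:
    """Return True if a packet should carry an ECN congestion mark."""
    if queue_depth <= K_MIN:
        return False
    if queue_depth >= K_MAX:
        return True
    # Linear ramp between K_MIN and K_MAX.
    probability = P_MAX * (queue_depth - K_MIN) / (K_MAX - K_MIN)
    return random.random() < probability

# Example: marking becomes more likely as the queue grows.
for depth in (50, 150, 300, 450):
    marks = sum(should_mark(depth) for _ in range(10_000))
    print(f"queue depth {depth:>3}: ~{marks / 100:.0f}% of packets marked")
```

The marked packets signal endpoints to reduce their sending rate, which is what keeps a lossless fabric from resorting to drops under the bursty collective traffic of AI/ML jobs.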

With our Cisco Validated Designs (CVDs), we offer preconfigured solutions optimized for AI/ML workloads to help ensure that the network meets the specific infrastructure requirements of AI/ML clusters, minimizing latency and packet drops for seamless dataflow and more efficient job completion.

Figure 4: Lossless network with Uniform Traffic Distribution

Our goal is to protect and connect both traditional workloads and new AI workloads in a single data center environment (edge, colocation, public or private cloud) that exceeds customer requirements for reliability, performance, operational simplicity, and sustainability. We are focused on delivering operational simplicity and networking innovations such as seamless local area network (LAN), storage area network (SAN), AI/ML, and Cisco IP Fabric for Media (IPFM) implementations. In turn, you can unlock new use cases and greater value creation.

These state-of-the-art infrastructure and operations capabilities, including our platform vision, Cisco Networking Cloud, will be showcased at the Open Compute Project (OCP) Summit 2024. We look forward to seeing you there and sharing these advancements.

Register for the webinar:

You’re Ready for AI. Is your Data Center?

 



Authors

Murali Gandluru

Vice President of Product Management

Cisco Networking – Data Center Networking