The recent announcement of the Multipath Reliable Connection (MRC) protocol, developed by OpenAI, Microsoft, NVIDIA, AMD, Intel, and Broadcom, represents a major advancement in AI infrastructure. This deployment shows that full bisection bandwidth and near-perfect uptime are achievable at a scale of over 100,000 GPUs, a key milestone for training advanced models.

This announcement is notable not only for its scale but also for its architectural choices. MRC is built on SRv6 (RFC 8986), and its success highlights how foundational SRv6 innovation enabled this achievement. Cisco has been a primary architect of SRv6 from the beginning, co-authoring the core IETF standards (RFC 8402, RFC 8754, RFC 8986) and shipping SRv6-capable products since 2019, with large-scale deployments such as SoftBank. Cisco has also played a major role in driving the ongoing SRv6 AI backend work in the IETF and has been embedding SRv6 across our silicon and platforms. The paper “Resilient AI Supercomputer Networking using MRC and SRv6” offers a detailed technical overview and is recommended reading.

While MRC’s multi-path transport stack is the primary innovation, it is important to consider how the underlying SRv6 architecture, developed and refined over the years at Cisco, enabled this design.

1. Application-Driven Networking

A core principle of SRv6 is to give applications control over their network experience. Implementing MRC at the transport layer creates a programmable network where the transport stack selects paths for each packet. This approach aligns with the model described in “SRv6: From 5G Networks to AI Infrastructure – A Journey of Innovation,” which details SRv6’s evolution from telecommunications to AI infrastructure.

By distributing packets across many paths and planes, MRC avoids the flow-collision issues common in traditional ECMP-based deployments. This is application-defined networking in practice, enabled by the programmability introduced by SRv6.
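The contrast between flow-hash ECMP and per-packet spraying can be sketched in a few lines. This is an illustrative model only, not MRC's implementation; the path names and functions are assumptions made for the example.

```python
# Hypothetical sketch: flow-hash ECMP vs. per-packet spraying.
# A fabric of 4 planes x 4 paths, named for illustration only.
import hashlib

PATHS = [f"plane-{p}/path-{q}" for p in range(4) for q in range(4)]  # 16 paths

def ecmp_path(flow_id: str) -> str:
    """Traditional ECMP: one hash per flow pins every packet of that
    flow to a single path, so two elephant flows can collide."""
    h = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16)
    return PATHS[h % len(PATHS)]

def sprayed_path(packet_seq: int) -> str:
    """Per-packet spraying: the transport stack rotates successive
    packets across every available path."""
    return PATHS[packet_seq % len(PATHS)]

# One large flow under ECMP occupies exactly one path...
ecmp_used = {ecmp_path("gpu0->gpu1") for _ in range(1000)}
# ...while spraying spreads the same 1000 packets over all 16 paths.
spray_used = {sprayed_path(seq) for seq in range(1000)}
```

The point of the sketch: with source-routed spraying, link utilization no longer depends on how flow identifiers happen to hash, which is what removes the ECMP collision problem.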

2. Reliability Through Static Simplicity

MRC disables dynamic routing in switches and instead uses static SRv6 source-routing, reflecting the architectural principles in “SRv6 for AI Backend.” Statically allocated segments enable the transport stack to handle link flaps and switch failures within microseconds, thereby avoiding delays caused by control-plane convergence.

In summary, simplifying the switch control plane and moving intelligence to the edge, both core SRv6 design principles, directly increase reliability for large-scale training jobs.
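A minimal sketch of this edge-driven failover model, assuming a sender that holds a table of static segment lists (the segment IDs and helper names below are invented for illustration):

```python
# Hedged sketch: failover handled entirely at the edge over static
# SRv6 segment lists. No switch control-plane convergence is involved.
PATH_SEGMENTS = {
    "path-A": ["fc00:1::s1", "fc00:2::s5"],
    "path-B": ["fc00:1::s2", "fc00:2::s6"],
    "path-C": ["fc00:1::s3", "fc00:2::s7"],
}

failed_segments: set[str] = set()

def usable_paths() -> list[str]:
    """Return paths whose static segment lists avoid every
    segment currently marked as failed."""
    return [p for p, segs in PATH_SEGMENTS.items()
            if not failed_segments.intersection(segs)]

def on_link_failure(segment: str) -> None:
    """On a failure signal, the transport stack marks the segment
    and immediately stops selecting paths that traverse it."""
    failed_segments.add(segment)

on_link_failure("fc00:2::s6")  # path-B's second hop goes down
```

Because the segment lists are static and the failure state lives in the sender, rerouting is a local table update rather than a network-wide reconvergence event.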

3. Deterministic Visibility and Integrated Performance Measurement

As outlined in “IP Is Better Than Ever with Integrated Performance Measurement,” a key advantage of SRv6 is deterministic probe pinning. Since paths are encoded as segment lists, source-routed probes follow the same physical path as data packets, providing accurate visibility into network health.

  • Path pinning: Each probe is assigned to a specific physical path, eliminating the ambiguity introduced by ECMP hashing.
  • Deterministic feedback: Probes return real-time health data to the source, allowing MRC to quickly identify and avoid failed paths.

This capability is fundamental to MRC’s design and results directly from SRv6’s architectural choices.
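Probe pinning follows directly from the fact that the path is carried in the packet itself. A toy model, with invented structures, of why a probe cannot diverge from the data it measures:

```python
# Illustrative sketch (assumed structures): an SRv6 probe is pinned
# by carrying the exact segment list of the data path it measures,
# so ECMP hashing cannot steer it onto a different physical path.
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    segment_list: tuple[str, ...]  # SRv6 SRH: explicit hop-by-hop path
    is_probe: bool = False

DATA_PATH = ("fc00:1::s2", "fc00:3::s9", "fc00:7::s4")

def make_probe(data_path: tuple[str, ...]) -> Packet:
    """Build a probe by copying the data path's segment list verbatim."""
    return Packet(segment_list=data_path, is_probe=True)

probe = make_probe(DATA_PATH)
```

With hash-based forwarding, a probe and a data packet with different headers may hash onto different links; with segment lists, identical paths are guaranteed by construction.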

Looking Ahead

The strength of MRC lies in combining a robust SRv6 network foundation with an advanced RDMA-based transport implementation on the NIC, designed for the specific needs of AI workloads. This validates the SRv6 architectural vision of a programmable, application-driven fabric that meets today’s most demanding infrastructure requirements.

What makes SRv6 uniquely powerful for the next generation of AI supercomputers is that the same data plane operates holistically across:

  • Scale-Out — rack-to-rack connectivity within a data center, where MRC’s multi-path source routing over SRv6 eliminates ECMP hash collisions and delivers near-ideal bisection bandwidth across 100,000+ GPUs.
  • Scale-Across — data center-to-data center connectivity, where SRv6 removes the traditional fragmentation between DC and WAN domains, enabling distributed AI factories to operate as a single coherent cluster without state explosion in the underlay.

As AI supercomputers scale, the convergence of technologies such as SRv6 at the network layer and innovations like MRC at the transport layer will shape the future of AI infrastructure.

Read Will Eatherton’s blog about the SRv6 journey

Authors

Clarence Filsfils

Cisco Fellow

Engineering