As AI workloads shift from experimental to mission-critical, unexpected challenges test the assumptions underlying our networks, storage architectures, and security models. After nearly two decades of observing infrastructure evolution, I believe this moment is fundamentally different. We are not optimizing existing paradigms; we are rebuilding them.
The bandwidth wall and the rise of co-packaged optics
Modern AI training clusters have enormous bandwidth requirements. Training the most advanced models can involve tens or hundreds of thousands of GPUs exchanging data at speeds that were unimaginable just two years ago. Some clusters now exceed hundreds of petabits per second of aggregate bandwidth, pushing traditional pluggable optics to their physical limits.
The industry is quickly adopting 102.4Tbps silicon as the standard for large-scale AI factories. The main bottleneck is no longer just how much compute power we have, but how fast data can move between chips, nodes, and memory. With 102.4Tbps, new networking silicon finally provides enough bandwidth to keep GPUs working at full capacity, reducing idle time and improving efficiency for hyperscalers and neoclouds alike. Whether delivered through high-radix switching, advanced NICs, or co-packaged optics, 102.4Tbps is the new baseline for competitive AI clusters.
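To put that figure in perspective, a quick back-of-the-envelope calculation shows how switch radix scales with per-port speed at 102.4Tbps; the port speeds here are illustrative assumptions, not product specifications.

```python
# Back-of-the-envelope port math for a 102.4 Tb/s switch ASIC.
# Illustrative only: per-port speeds are assumptions, not vendor specs.

ASIC_BANDWIDTH_TBPS = 102.4

for port_speed_gbps in (400, 800, 1600):
    ports = int(ASIC_BANDWIDTH_TBPS * 1000 // port_speed_gbps)
    print(f"{port_speed_gbps}G ports per ASIC: {ports}")

# Output:
# 400G ports per ASIC: 256
# 800G ports per ASIC: 128
# 1600G ports per ASIC: 64
```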
As link speeds reach 800G, 1.6T, and beyond, the power consumed by discrete optical modules and the electrical losses between the switch ASIC and the front panel create inefficiencies that are difficult to manage at scale.
Linear-drive pluggable optics (LPO) is becoming more important. By removing the digital signal processor (DSP) typically found in optical transceivers, LPO lets the host chip drive the optical module directly. This can cut power use by up to 50% per link while also lowering latency and cost. For large operators building 800G and 1.6T connections to meet AI's bandwidth needs, LPO is quickly becoming a core part of their systems.
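A rough sketch of what that per-link claim adds up to at the switch level; the baseline wattage below is an assumed illustrative value, not a measurement, and only the 50% figure comes from the text.

```python
# Illustrative optics power savings for one fully populated switch.
# BASELINE_800G_MODULE_W is an assumed round number, not a measured value.

BASELINE_800G_MODULE_W = 15.0   # assumed power of a DSP-based 800G pluggable
LPO_SAVINGS_FRACTION = 0.5      # "up to 50%" per-link savings from the text
PORTS_PER_SWITCH = 128          # 102.4 Tbps / 800G, from the port math above

saved_w = BASELINE_800G_MODULE_W * LPO_SAVINGS_FRACTION * PORTS_PER_SWITCH
print(f"Optics power saved per fully populated switch: ~{saved_w:,.0f} W")
```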
Co-Packaged Optics (CPO) brings an even bigger shift in network design. By putting optical engines directly onto the switch package, CPO removes the electrical losses that limit bandwidth and efficiency. This leads to 30-40% less power use at the same speeds, better signal quality at higher data rates, and more ports than pluggable designs can offer.
CPO also expands network design possibilities. With sufficient radix, it can connect clusters of 512 GPUs in a single switching layer, or collapse larger fabrics from three tiers to two. This eliminates extra switches, reduces latency, and simplifies the network, as the sizing sketch below illustrates.
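The sketch below uses textbook non-blocking Clos approximations; the radix value is an assumption for illustration, and real designs also account for oversubscription, rail-optimized layouts, and NIC speeds.

```python
# Rough sizing of how many GPU endpoints fit under 1-, 2-, and 3-tier
# non-blocking Clos fabrics for a given switch radix.

def max_endpoints(radix, tiers):
    """Approximate non-blocking Clos capacity for a given switch radix."""
    if tiers == 1:
        return radix                # every GPU on a single switch layer
    if tiers == 2:
        return radix * radix // 2   # leaf-spine: half the ports face down
    if tiers == 3:
        return radix ** 3 // 4      # three-stage fat-tree
    raise ValueError("tiers must be 1, 2, or 3")

RADIX = 512                          # assumed radix of a CPO-enabled switch
for tiers in (1, 2, 3):
    print(f"{tiers} tier(s): up to {max_endpoints(RADIX, tiers):,} GPUs")

# Output:
# 1 tier(s): up to 512 GPUs
# 2 tier(s): up to 131,072 GPUs
# 3 tier(s): up to 33,554,432 GPUs
```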
Transitioning to CPO will take time and require new approaches to maintenance, cooling, and supply chain management. However, for large-scale AI, co-packaged optics are now essential.
Scale-across: Beyond the single cluster
AI networking has gone through several stages. Scale-up meant closely linking GPUs within a single system, using NVLink to treat an entire rack as a single computer. Scale-out took this further, using InfiniBand and Ethernet to connect thousands of GPUs across a data center, enabling today’s large clusters.
We are reaching the practical limits of scale-out. The largest training runs are now limited not by compute availability, but by the challenge of aggregating sufficient resources in a single location with adequate power, cooling, and network capacity. The next phase focuses on connecting clusters rather than simply building larger ones.
Scale-across treats compute resources across different locations as a single shared pool. This challenges long-held assumptions: traditional distributed training assumes uniform, low latency between all workers, but spreading a job across cities or continents introduces latency variations that break the synchronous communication patterns standard training relies on.
Meeting these needs requires large, secure routers with deep buffers that match the bandwidth and efficiency of switching silicon; routing and switching must converge into a single solution. Data centers that do not adapt to these AI traffic patterns risk performance problems and bottlenecks that slow down AI workloads and growth.
New solutions are also appearing. Smart aggregation algorithms now take the network’s layout into account and optimize for it. Tasks are split so GPUs can keep working while data moves between distant sites, reducing latency. Systems learn to handle small delays in syncing, rather than requiring perfect timing. The network’s job is shifting from just providing fast, equal connections to smartly routing traffic across different types of paths.
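As a minimal sketch of the topology-aware aggregation idea, the example below assumes a two-stage reduce: gradients are combined within each site over the fast local fabric first, so only one aggregate per site crosses the slower wide-area links. The names and structure are invented for illustration, not a specific framework's API.

```python
# Hierarchical (topology-aware) gradient aggregation sketch.
import numpy as np

def hierarchical_allreduce(grads_by_site):
    """Average gradients across sites in two stages: intra-site, then inter-site."""
    # Stage 1: intra-site reduction over the fast, low-latency local fabric.
    site_sums = {site: np.sum(grads, axis=0) for site, grads in grads_by_site.items()}
    # Stage 2: inter-site reduction, a few large transfers over WAN links.
    total = np.sum(list(site_sums.values()), axis=0)
    count = sum(len(grads) for grads in grads_by_site.values())
    return total / count            # averaged gradient, identical everywhere

# Toy example: two sites, each holding a handful of per-GPU gradients.
rng = np.random.default_rng(0)
grads = {
    "dc-east": [rng.standard_normal(4) for _ in range(3)],
    "dc-west": [rng.standard_normal(4) for _ in range(2)],
}
print(hierarchical_allreduce(grads))
```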
Networks must now do more than provide speed; they need to understand their own topology and make informed decisions about traffic routing. The control plane is as important as the data plane, and telemetry and observability are now essential components of network design.
Organizations that master scale-across will have access to computing power that single-cluster competitors cannot match.
Storage: The forgotten bottleneck
Most discussions about AI infrastructure focus on compute and networking, with storage often coming up later. This is an oversight.
AI storage requirements stress traditional architectures in unexpected ways. Training workloads combine sequential, read-heavy ingestion across petabytes of images, text, video, and multimodal content with frequent checkpoint writes/reads that can saturate storage fabrics during failure recovery.
Inference demands rapid access to model weights and KV caches under strict latency SLAs, and as context windows grow, KV cache updates add sustained write pressure. Storage has become a performance bottleneck, not just a capacity planning exercise. When ingestion starves GPUs of data, when checkpoint bursts block training progress, or when KV cache latency delays token generation, accelerator cycles go idle. The economics are unforgiving: idle GPUs cost the same as busy ones.
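A rough sizing sketch makes the checkpoint pressure concrete; the model size, optimizer-state multiplier, and time budget below are assumed illustrative values, not figures from any specific deployment.

```python
# Back-of-the-envelope checkpoint burst estimate. All inputs are assumptions.

PARAMS_BILLION = 70            # assumed model size
BYTES_PER_PARAM = 2            # bf16 weights
OPTIMIZER_MULTIPLIER = 6       # assumed: weights plus optimizer state
CHECKPOINT_WINDOW_S = 60       # assumed time budget before GPUs sit idle

checkpoint_bytes = PARAMS_BILLION * 1e9 * BYTES_PER_PARAM * OPTIMIZER_MULTIPLIER
required_gbps = checkpoint_bytes * 8 / CHECKPOINT_WINDOW_S / 1e9
print(f"Checkpoint size: {checkpoint_bytes / 1e12:.1f} TB, "
      f"sustained write rate needed: ~{required_gbps:.0f} Gb/s")
```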
In response, there has been a wave of new storage designs: distributed file systems built for AI, smart tiering that keeps active data on NVMe and moves older data to cheaper storage, and special caching layers between compute and storage. Network and storage are also converging, with RDMA-based protocols bypassing the usual OS layers to cut latency from milliseconds to microseconds.
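A minimal sketch of the tiering idea, with the recency threshold and tier names invented for illustration rather than taken from any particular product:

```python
# Access-recency tiering sketch: hot objects stay on local NVMe,
# cold ones are demoted to cheaper capacity storage.
import time

HOT_WINDOW_S = 24 * 3600   # assumed: objects touched within a day stay hot

def choose_tier(last_access_ts, now=None):
    """Return the storage tier for an object based on how recently it was read."""
    now = time.time() if now is None else now
    return "nvme" if (now - last_access_ts) < HOT_WINDOW_S else "object-store"

now = time.time()
print(choose_tier(now - 600, now))              # recently read shard -> nvme
print(choose_tier(now - 7 * 24 * 3600, now))    # week-old shard -> object-store
```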
The biggest change is that AI storage must be designed in from the beginning, not bolted on later. This requires the teams working on training frameworks and storage to collaborate closely. It also means learning how different models use data and optimizing storage for those patterns.
Security in an era of valuable weights
AI models are valuable. Training a leading model can cost hundreds of millions of dollars. The weights, the billions of parameters that define what the model can do, are both critical assets and potential security liabilities.
Model theft, whether through network exfiltration or insider misuse, presents risks that most security systems were not designed to address. Training clusters need to move large volumes of data quickly, which demands fast, broadly accessible connections and widens the attack surface. Multi-tenant inference must keep customers isolated while still delivering full performance from shared hardware.
Security architectures are evolving to meet AI's needs. They now include hardware roots of trust extending from the accelerator up through the software stack, confidential computing that protects weights even from system operators, and network segmentation that distinguishes legitimate training traffic from potential exfiltration.
As AI systems grow to thousands of GPUs, securing the front-end network for control, storage, and management becomes a major challenge. Modern SmartNICs and Data Processing Units (DPUs) help by handling firewall tasks directly on the card, freeing the main CPU.
A DPU keeps track of each connection in its own memory and enforces network rules such as IP filtering, session tracking, rate limiting, and denial-of-service protection, all at full line rate and in a secure domain separate from the host operating system. This hardware isolation makes DPUs a natural fit for zero-trust security.
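Conceptually, the offloaded logic looks something like the sketch below: per-flow state plus a token-bucket rate limit evaluated before traffic ever reaches the host. The field names, subnet, and thresholds are assumptions for illustration, not a vendor API.

```python
# Sketch of stateful filtering of the kind a DPU offloads: connection
# tracking, an IP allow-list, and per-flow token-bucket rate limiting.
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate_pps: float                 # allowed packets per second
    burst: float                    # bucket depth
    tokens: float = field(init=False)
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self):
        self.tokens = self.burst

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate_pps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

ALLOWED_SUBNET = "10.0."            # assumed allow-list prefix
conn_table = {}                     # per-flow state, kept in DPU memory

def admit(src_ip, dst_ip, dst_port):
    """Decide whether to forward a packet: IP filtering, then per-flow rate limit."""
    if not src_ip.startswith(ALLOWED_SUBNET):
        return False
    flow = (src_ip, dst_ip, dst_port)
    bucket = conn_table.setdefault(flow, TokenBucket(rate_pps=1000, burst=200))
    return bucket.allow()

print(admit("10.0.1.7", "10.0.2.9", 443))     # True: allowed subnet, under limit
print(admit("203.0.113.5", "10.0.2.9", 443))  # False: outside the allow-list
```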
As an industry, we are also building security systems for threats unique to AI. Attackers can craft adversarial inputs that trick models into making mistakes, poison training data to weaken a model before it is deployed, or probe a model's outputs to infer what private data it was trained on. These are not just theories; they are real risks and active areas of research.
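As a toy illustration of the first category, a small, targeted perturbation can flip a classifier's decision. The linear model and epsilon below are invented for illustration; real attacks target deep networks with the same basic idea.

```python
# Toy adversarial (evasion) input against a linear classifier sign(w @ x).
import numpy as np

w = np.array([1.0, -2.0, 0.5])          # toy model weights
x = np.array([0.3, 0.1, 0.2])           # clean input, score = +0.2 -> class A
eps = 0.15                               # assumed perturbation budget

x_adv = x - eps * np.sign(w)            # nudge each feature against the model
print("clean score:", w @ x)            # positive -> class A
print("adversarial score:", w @ x_adv)  # pushed negative -> class B
```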
Security for AI infrastructure is not just about meeting compliance rules. It is about protecting assets that may be worth more than the hardware they run on.
The path forward
Leading organizations are making infrastructure investments that reflect these realities. They are not only acquiring GPUs, but also building efficient connectivity, robust storage systems, and security architectures to protect the value they generate.
Decisions made in the coming years will determine which organizations can train and deploy the next generation of AI systems, and which will depend on external infrastructure.
For those building infrastructure, this is an exciting time. We are not simply maintaining legacy systems; we are laying the foundations for the future.
Dive deeper into the announcements we made this week at GTC.