Author: Techub News Compilation
In the latest episode of OpenAI's podcast, Mark Handley from the core networking team and load system engineer Greg Steinbrecher delved into the core challenges facing AI training infrastructure and shared the "Multi-Path Reliable Connection" (MRC) protocol they developed to address these issues. The conversation revealed the lesser-known systems engineering trade-offs behind cutting-edge AI research and showed how rethinking network protocols can pave the way for training the next generation of large models.
From Quantum Computing to AI Super Networks: A Physicist's Transformation
Greg Steinbrecher’s career began with a grand goal: to understand complex systems and build quantum computers. However, during his PhD studies, he realized that quantum computers had not yet become a reality and shifted his focus to chip design for controlling the photons in quantum computers. He had an epiphany: "This looks a bit like a network switch. What if we used it to make network switches?"
This idea led him into the world of data center networks. But he soon discovered that academia had little understanding of the real loads in data centers, and the models were overly simplified. To engage with real problems, he secured funding from an industrial company and began building initial network hardware to explore the true needs of data center networks.
He found that traditional data center network hardware still had huge room for optimization, even without his sophisticated optical chips. Around the same time, the AI wave was rising, and OpenAI needed to build massive GPU clusters and the networks connecting them. Greg's work shifted from writing simulation software to writing the software that lets GPUs communicate with each other, and he eventually joined OpenAI, moving closer to the actual work of model training. His team's core task is to keep GPUs efficiently utilized: Is model training running as fast as it can? Is the network becoming a bottleneck? How can training restart efficiently after a failure? How can hardware quirks be worked around? In short, it is about "extracting every bit of performance from the hardware."
AI Training: The "Worst" Challenge of Network Protocols
Mark Handley’s background is rooted in internet standards. He worked on making the internet support video conferencing, and the standards he helped develop are widely adopted by 4G/5G networks today. However, standardization requires global consensus, which is a lengthy process. Data center networks, on the other hand, only require consensus with the builders, which made him see an opportunity for innovation.
The rise of AI training has completely overturned the traditional design ideology of data centers. Traditional "hyperscale" data centers served the web era, with design teams separate from specific workloads, and the goal was simply to provide a "sea of compute." AI forced people to think in completely different ways. OpenAI particularly realized that system design is an integral part of model training, and that the infrastructure team must engage in "co-design" with the model teams. Greg's team sat side by side with researchers, discussing daily how to best match workloads to existing servers. They were on call for large training tasks, awakened in the middle of the night to handle failures. This close collaboration gave them a profound understanding of the pain points and led them to contemplate solutions for the next generation.
AI training poses a "worst-case" challenge for networks. Traditional internet communication relies on "statistical multiplexing": numerous independent communication streams share the network, and traffic tends to smooth out to an average. AI training is the exact opposite: thousands of the fastest GPUs collaborate on a single task. Communication between GPUs is itself part of the computation, and the GPUs must exchange data synchronously to agree on the result of each computation step. Mark pointed out that this is "the worst network load you can imagine."
The key point is that what matters is not average communication speed but the absolute worst case. When thousands of GPUs communicate simultaneously, generating tens or even hundreds of thousands of network flows, you must examine the entire network to find the most heavily loaded bottleneck link. Because everything operates in sync, the speed of that single link determines the data transfer time, and therefore the working speed, of all GPUs. The team can no longer rely on the averaging effect of the "law of large numbers"; it is constrained by the "tail of the tail," that is, the 100th percentile (P100), which is the slowest flow of all. This results in entirely different system requirements.
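To make the gap between average and worst case concrete, here is a minimal sketch. The flow counts and timings are hypothetical numbers chosen for illustration, not figures from the podcast; the only idea taken from the discussion is that a synchronous step ends when the slowest flow ends.

```python
import statistics

def step_time(flow_times_ms):
    """A synchronous training step finishes only when the slowest flow finishes."""
    return max(flow_times_ms)

# Hypothetical numbers: 9,999 flows finish in 1.0 ms, one congested flow takes 5.0 ms.
flow_times_ms = [1.0] * 9_999 + [5.0]

mean_ms = statistics.fmean(flow_times_ms)
worst_ms = step_time(flow_times_ms)

print(f"mean flow time : {mean_ms:.4f} ms")
print(f"step time (max): {worst_ms:.4f} ms")
```

The mean barely moves when one flow in ten thousand is slow, yet every GPU waits for that one flow, so the step takes five times longer than the average suggests.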
Another issue is the inevitable failures that come with scale. As systems become extremely large, link failures and switch reboots become the norm, and any failure affects network traffic. When only the "100th percentile" matters, a single link failure can open a long "failure window" before routing converges again, potentially causing a single transfer to fail and bringing down the entire training job with it. Hence, they needed a network protocol that could withstand transient congestion and hardware failures, so that the system keeps running "almost seamlessly" when something breaks.
The failure rate increases linearly with scale. Greg illustrated this with simple math: if failures are independent, doubling the system size halves the average time between failures. More importantly, the number of components in the network far exceeds the number of GPUs. A GPU connects to a network adapter, which may have multiple lasers in the optical transceiver, and the other end also has lasers. Just connecting the GPU to the first-hop switch results in a number of lasers that is already an order of magnitude greater than the number of GPUs. Coupled with multi-layer switching, the number of components inside the network exceeds that at the network edge (GPUs) by several orders of magnitude. To provide sufficient bandwidth, they had to build enormous networks, which meant having millions of optical links within the network. Failures are ubiquitous.
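Greg's arithmetic can be sketched in a few lines, assuming independent components that each fail at the same constant rate. The per-component figure and component counts below are hypothetical values for illustration only.

```python
def system_mtbf_hours(component_mtbf_hours, n_components):
    """Mean time between failures for a system of independent, identical components.

    Assumes each component fails at a constant rate of 1 / component_mtbf_hours,
    so the system-wide failure rate is simply n_components times that rate.
    """
    return component_mtbf_hours / n_components

# Hypothetical figure: each optical link fails once per 1,000,000 hours on average.
LINK_MTBF_HOURS = 1_000_000

for n_links in (100_000, 200_000, 400_000):
    mtbf = system_mtbf_hours(LINK_MTBF_HOURS, n_links)
    print(f"{n_links:>7} links -> one failure every {mtbf:.1f} hours on average")
```

Each doubling of the component count halves the expected time between failures, which is why a network with millions of optical links sees failures as routine events.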
MRC: A "Self-Healing" Network to Mitigate Congestion and Failures
To address these challenges, they developed the "Multi-Path Reliable Connection" (MRC). It rests on several core ideas.
Firstly, by "spraying" the packets of each flow across multiple paths, the load can be spread very evenly across the network. If the network topology has sufficient capacity, hotspots will not arise. But spraying brings its own challenge: packets sent over different paths may arrive out of order (reordering). If congestion occurs and packets are lost, it becomes hard to tell whether a missing packet was actually lost or is simply arriving late because of reordering.
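The load-balancing effect can be illustrated with a toy comparison between pinning each flow to one randomly chosen path and spraying each flow's packets over all paths. This is a sketch under stated assumptions: the round-robin spraying policy, the flow and path counts, and the function names are all hypothetical and are not taken from the MRC specification.

```python
import random
from collections import Counter

def max_over_mean(load_per_path, n_paths):
    """Ratio of the busiest path's load to the average load across all paths."""
    total = sum(load_per_path.values())
    return max(load_per_path.values()) / (total / n_paths)

def pin_flows(n_flows, pkts_per_flow, n_paths, rng):
    """Per-flow placement: every packet of a flow uses one randomly chosen path."""
    load = Counter()
    for _ in range(n_flows):
        load[rng.randrange(n_paths)] += pkts_per_flow
    return load

def spray_packets(n_flows, pkts_per_flow, n_paths, rng):
    """Per-packet spraying: each flow cycles its packets over all paths (assumed policy)."""
    load = Counter()
    for _ in range(n_flows):
        start = rng.randrange(n_paths)
        for i in range(pkts_per_flow):
            load[(start + i) % n_paths] += 1
    return load

N_FLOWS, PKTS_PER_FLOW, N_PATHS = 64, 160, 16
pinned = max_over_mean(pin_flows(N_FLOWS, PKTS_PER_FLOW, N_PATHS, random.Random(1)), N_PATHS)
sprayed = max_over_mean(spray_packets(N_FLOWS, PKTS_PER_FLOW, N_PATHS, random.Random(1)), N_PATHS)

print(f"busiest/average path load, flows pinned : {pinned:.2f}")
print(f"busiest/average path load, packets sprayed: {sprayed:.2f}")
```

Because each flow here sends a whole multiple of the path count, spraying loads every path identically, while pinned flows pile up on whichever paths happen to be chosen most often.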
Thus, they introduced a second technique: "packet trimming." When congestion occurs and queues overflow, the traditional approach is to discard entire packets, which creates uncertainty. MRC instead trims off the packet's payload and forwards only a very small header to the destination. The destination can then request a retransmission immediately, which removes the ambiguity over whether a packet was lost to congestion or merely delayed by reordering.
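A toy model of trimming might look like the following. It is only a sketch: the `Packet` type, the byte-based queue limit, and the assumption that trimmed headers take no buffer space are simplifications introduced here, not details from the MRC specification.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    seq: int
    payload: bytes
    trimmed: bool = False

def enqueue(queue, packet, capacity_bytes):
    """Queue a packet; if its payload does not fit, trim it and queue the header only.

    Simplification: a trimmed header is assumed to occupy no buffer space.
    """
    used = sum(len(p.payload) for p in queue)
    if used + len(packet.payload) > capacity_bytes:
        packet = replace(packet, payload=b"", trimmed=True)
    queue.append(packet)

def receive(queue):
    """Receiver side: full packets are delivered, trimmed headers trigger a retransmit request."""
    delivered, retransmit = [], []
    for p in queue:
        (retransmit if p.trimmed else delivered).append(p.seq)
    return delivered, retransmit

switch_queue = []
for seq in range(4):
    enqueue(switch_queue, Packet(seq, b"x" * 1000), capacity_bytes=3000)

delivered, retransmit = receive(switch_queue)
print("delivered :", delivered)
print("retransmit:", retransmit)
```

The receiver still hears about packet 3 even though its payload never arrived, so it knows to ask for that packet again rather than waiting to see whether it shows up late.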
What does MRC mean for end users? The most immediate benefit is that OpenAI will be able to deliver better and smarter models more quickly. MRC accelerates every aspect of research and deployment. It allows individual users to worry less about task failures, task scheduling, or performance variations due to co-locating with other tasks in the same rack. It makes the training of cutting-edge models faster and more reliable, and the overall pipeline operates faster and more stably. Users will see increasingly exciting product release pipelines.
MRC was not invented from scratch; it builds on decades of research and combines existing techniques into a working whole. Last year they finally deployed it, going from installed hardware to running training jobs in just a few months.
The results have been significant: they avoided the congestion issues previously discussed. More importantly, when a failure occurs somewhere in the network, all flows passing through that area may be affected, but the impact is minimal. Within several network round-trips, the system will stop using the faulty link. The issue of link failures causing network outages has been eliminated. Flows at the network interface automatically avoid failures during transmission, akin to "self-healing."
Mark added that in traditional networks, when a link goes down, one or both switches on either side need to notify all their neighbors, which in turn notify their neighbors. This is a distributed systems problem, typically handled by routing protocols such as the Border Gateway Protocol (BGP), which propagate updates hop by hop and need time to converge. MRC removes this coordination requirement: each endpoint independently and quickly detects that "this path should not be used" and immediately stops using it. This is also much faster than waiting for a central authority (itself a single point of failure) to distribute the information. Routing convergence can take seconds or even tens of seconds, while MRC lets every endpoint notice and react within milliseconds.
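The idea of an endpoint avoiding a bad path on its own can be sketched as follows. The episode does not describe MRC's actual detection logic, so the consecutive-failure threshold, the round-robin selection, and the `PathSelector` class are hypothetical stand-ins for illustration.

```python
class PathSelector:
    """Per-endpoint path choice that needs no coordination with switches or other hosts."""

    def __init__(self, path_ids, failure_threshold=2):
        self.active = list(path_ids)
        self.failures = {p: 0 for p in self.active}
        self.threshold = failure_threshold  # hypothetical: consecutive failures before eviction
        self._next = 0

    def pick(self):
        """Round-robin over the paths that are currently considered healthy."""
        path = self.active[self._next % len(self.active)]
        self._next += 1
        return path

    def report(self, path, ok):
        """Feed back the outcome of one packet sent on `path`."""
        if ok:
            self.failures[path] = 0
            return
        self.failures[path] += 1
        too_many = self.failures[path] >= self.threshold
        if too_many and path in self.active and len(self.active) > 1:
            self.active.remove(path)  # local decision, taken without waiting for routing

selector = PathSelector(range(4))
selector.report(2, ok=False)
selector.report(2, ok=False)  # second consecutive failure: path 2 is dropped
picks = [selector.pick() for _ in range(30)]
print("paths still in use:", sorted(set(picks)))
```

Nothing outside the endpoint has to change for traffic to move away from the broken path, which is why the reaction time is bounded by a few round-trips rather than by routing convergence.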
Greg excitedly described deployment scenarios: during data center construction, due to a massive amount of manual operations (like fiber connections), links fluctuated frequently, far exceeding natural failures. But they didn’t mind at all, and didn’t even notice. MRC automatically detected and switched paths, which was nothing short of magical.
Another benefit of MRC is that, because it can handle failures on its own, they decided to disable routing protocols in the network and use completely static routing, even at the largest scale. Some paths are broken? It doesn't matter; MRC will find paths that still work. This eliminates an entire chunk of complexity in network management. They no longer have to worry about whether the switch control planes have converged, because the switches never need to converge. Configuration is set at startup, and the routing table never changes afterward.
Open Standards and Industry Collaboration: Making Infrastructure a Shared Destiny
This significant achievement is the result of collaboration with numerous partners. They worked with Microsoft (responsible for building the Fairwater data center), NVIDIA, Broadcom, AMD, and Intel to standardize the MRC specification and build hardware for the new supercomputers.
The stability of MRC brings great advantages. Greg recalled that in the early days, the networking team often wore frowns because of training interruptions. Now, feedback on the stability of MRC clusters is "universally positive," and researchers no longer need to worry about it. The statistics show that failures still occur, but researchers never notice them.
Greg admitted that while they are still pushing the limits of infrastructure, the ideal world where researchers completely ignore infrastructure may never arrive, but each victory is marked by researchers no longer needing to know what network protocol a particular cluster is using. MRC has indeed helped them remove one of the key obstacles to continuing to scale and deliver better models.
They decided to open MRC for everyone to use. The specification will be published as an open standard through the Open Compute Project (OCP). OpenAI strongly believes in the power of open standards and open source. Their network is built on Ethernet, itself an open standard. When the rest of the industry can keep pace with their work in difficult areas, they benefit as well. If everyone deploys what they believe to be the best solution, everyone gains.
Mark expressed personally that if the supply chain for AI construction were to split due to investments in completely different technologies and underlying hardware in pursuit of minute advantages, it would be a real shame. He is very excited that MRC will become an open standard, which will not only benefit those outside of OpenAI but also help the entire industry move in the same direction. "Infrastructure is a shared destiny for the entire industry." Opening this technology and driving everyone forward is a very good thing.
Greg also agrees that in a context where computing resources are always lacking, maximizing collaboration to utilize resources is beneficial for everyone, much better than viewing it as a limited resource and each working in isolation. The history of protocols like Ethernet proves the enormous benefits of sharing. "What we need to do is already difficult enough; there's no need for everyone to reinvent the wheel from scratch."
Boundaries and Future of MRC: Simplifying Networks, Focusing on Efficiency
Where are the boundaries of MRC? It is a flexible standard built on Ethernet, so as Ethernet scales, MRC scales with it. Ethernet is the protocol devices use to communicate, and MRC sits on top of it, combining static routing with congestion control. Networking work is never finished; there are always ways to improve networks and make them fairer. Networks also face fundamental limits, such as the speed of light as the upper bound on signal speed. As the speed of each link keeps increasing, the operating points keep shifting, and ongoing engineering is needed to get the most out of each hardware generation. MRC gives them a flexible and powerful foundation for tackling those future challenges.
The key is that MRC is based on Ethernet. Ethernet itself has continually evolved over the past 40 years, and they are leveraging all the advancements in the global networking industry, hoping to continue riding that wave of innovation. Because MRC pushes intelligence to the edges of the network, as long as Ethernet continues to expand, their network core can also scale accordingly, with no apparent reason in the near future to stop this expansion.
One key piece of work they have done is removing complexity from the network. As mentioned earlier, they have disabled routing, with each packet effectively being source-routed through the network. They utilize IPv6 segment routing technology, allowing each packet's address to list its exact sequence of switches traversed through the network. This means that the switches themselves can be quite “dumb”. Simplifying the network core has significant benefits for reliably scaling systems.
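The following sketch shows why source routing lets switches stay simple: the packet carries its own hop list, and each switch only looks up one fixed entry. The topology, port labels, and `forward` function are hypothetical, and plain string labels stand in for the IPv6 segment identifiers a real segment-routing deployment would use.

```python
def forward(topology, ingress_switch, segments):
    """Walk a packet through the network using only the hop list carried in the packet.

    `topology` is a static wiring table (switch -> port -> next node); it is never
    updated at runtime, mirroring the "configure once at startup" approach.
    """
    node = ingress_switch
    visited = [node]
    for port in segments:
        node = topology[node][port]  # a fixed table lookup, no routing decision
        visited.append(node)
    return visited

# Hypothetical two-tier fabric with two spines between the leaves.
topology = {
    "leaf1": {"up0": "spine0", "up1": "spine1"},
    "spine0": {"down2": "leaf2"},
    "spine1": {"down2": "leaf2"},
    "leaf2": {"host": "gpu-b"},
}

# The sender chooses the path by writing the segment list; here it goes via spine1.
print(forward(topology, "leaf1", ["up1", "down2", "host"]))
```

Switching to the other spine only requires the sender to write a different segment list, so path selection lives entirely at the edge.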
They continue to build on Ethernet because it is an open standard adopted and promoted by the entire industry. They hope MRC can do the same, evolving as the next layer to meet the challenges of AI systems and being widely adopted. They believe that if MRC is limited to exclusive use by OpenAI, it would not be as effective.
Another significant advantage of MRC is that multi-path spraying lets them build networks from simpler, smaller, and fewer devices. This is not immediately obvious, but the resulting networks are flatter, have fewer switch layers, consume less power, and cost less. The useful work done per watt increases because power is not spent on extra switch layers and can instead go more directly to the GPUs doing the actual work.
As models evolve from text to multimodal, the demands on these systems are rising sharply. The volume of data to be moved has grown, and the latency constraints have tightened. The "regret" caused by a slightly slower network grows more severe as training clusters scale up and the rest of the system is optimized. OpenAI's advantage lies in many smart people pushing in the same direction: researchers optimize the work on the GPU to make it faster, which in turn tightens the time budget for network transfers. If the network lags behind, their work becomes irrelevant. This is why network work is never finished.
Greg specifically noted that without MRC, simply increasing the number of paths would actually worsen the tail statistics. Because you are throwing the same number of "balls" into more "buckets," the ratio of the fullest bucket to the average bucket gets worse. The deterministic routing and fine-grained load balancing that Mark described are crucial to staying out of that regime across such a large number of links. All layers of the tech stack are tightly coupled: network hardware engineers need to understand the workload layer, and workload engineers need to understand the internals of network switches. Without this vertical integration and a shared direction, pushing the boundaries of system scale would be impossible.
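The balls-and-buckets effect is easy to reproduce with a small simulation. This is an illustrative sketch with hypothetical counts, where each "ball" stands for a flow placed on a uniformly random path.

```python
import random
from collections import Counter

def worst_to_average(n_balls, n_buckets, seed=0):
    """Throw balls into buckets uniformly at random; return fullest bucket / average bucket."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n_buckets) for _ in range(n_balls))
    return max(counts.values()) / (n_balls / n_buckets)

N_BALLS = 100_000  # hypothetical number of flows
few = worst_to_average(N_BALLS, n_buckets=10)
many = worst_to_average(N_BALLS, n_buckets=10_000)

print(f"10 buckets    : fullest/average = {few:.2f}")
print(f"10,000 buckets: fullest/average = {many:.2f}")
```

With few buckets the fullest one sits very close to the average, but spreading the same balls over many buckets leaves some buckets far above average, which is the worsening tail Greg describes when placement is random rather than deliberately balanced.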
Finally, when discussing the geographical distribution of data centers and even the concept of space computing, Mark believes that the kind of training conducted at their Stargate data center would be difficult to implement in space; latency would be a significant issue, and failure rates would be problematic. Technicians on Earth repair equipment daily, which would be challenging to achieve in orbit. Greg thinks space computing is cool from the perspective of a physicist and dreamer, but from a practitioner's perspective, doing these things on Earth is already tough enough. Every day they are pushing limits on multiple dimensions; even starting MRC on Earth is a massive effort, requiring close collaboration with engineers from multiple companies and sometimes even requiring hands-on debugging of machines. Building, running, and optimizing these systems on Earth is already difficult enough, and adding extra complexity requires a strong justification. Therefore, the conclusion is: "Please build more ground computing centers." That is precisely their goal: to build a vast amount of computing power to increase the total intelligence of the world.