Original Author: Egor Shulgin, Gonka Protocol
The rapid development of AI has pushed training beyond the limits of any single physical location, forcing researchers to confront a fundamental challenge: how do you coordinate thousands of processors spread across different continents rather than along the same data center corridor? The answer lies in more communication-efficient algorithms. This shift, driven by breakthroughs in federated optimization, has crystallized in frameworks such as DiLoCo, enabling organizations to train models with billions of parameters over standard internet connections and opening up new possibilities for large-scale collaborative AI development.
1. Starting Point: Distributed Training in Data Centers
Modern AI training is inherently distributed. It is widely observed that scaling data, parameters, and compute reliably improves model performance, which makes it impossible to train foundation models with billions of parameters on a single machine. The industry's default answer is the "centralized distributed" model: dedicated data centers that house thousands of GPUs in a single location, interconnected by ultra-high-speed networks such as NVIDIA's NVLink or InfiniBand. These dedicated interconnects are several orders of magnitude faster than standard networks, allowing all GPUs to operate as one cohesive system.
In this environment, the most common training strategy is data parallelism: the dataset is split across multiple GPUs, each of which holds a full copy of the model. (Other methods, such as pipeline parallelism or tensor parallelism, split the model itself across GPUs; they are necessary for the very largest models but more complex to implement.) A training step with mini-batch stochastic gradient descent (SGD) works as follows (the same principle applies to the Adam optimizer); a minimal code sketch follows the list:
- Copy and Distribute: Load copies of the model onto each GPU. Split the training data into mini-batches.
- Parallel Computation: Each GPU independently processes a different mini-batch and computes gradients—the direction for adjusting model parameters.
- Synchronization and Aggregation: All GPUs pause, share their gradients, and average them to produce a single, unified update.
- Update: Apply this averaged update to every GPU's model copy, so all copies remain identical.
- Repeat: Move to the next mini-batch and start over.
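To make the loop concrete, here is a minimal PyTorch-style sketch of one synchronous data-parallel step (illustrative code only: model, optimizer, loss_fn, and local_batch are placeholders, and a real system would typically use a wrapper such as DistributedDataParallel rather than hand-written all-reduces):

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, loss_fn, local_batch):
    """One synchronous data-parallel step: compute gradients locally,
    average them across all workers, then apply the identical update."""
    optimizer.zero_grad()
    inputs, targets = local_batch
    loss = loss_fn(model(inputs), targets)   # forward pass on this worker's mini-batch
    loss.backward()                          # local gradients

    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients from every worker, then divide to get the average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()                         # identical update on every model copy
    return loss.item()
```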
Essentially, this is a continuous loop of parallel computation and enforced synchronization. The communication required after every single training step is only feasible over the expensive, high-speed links inside a data center. This reliance on frequent synchronization is the hallmark of centralized distributed training, and it runs perfectly only as long as training stays inside the "greenhouse" of the data center.
2. Hitting a Wall: Huge Communication Bottlenecks
To train the largest models, organizations now must build infrastructure at an astonishing scale, often spanning multiple data centers in different cities or even continents. This geographical separation creates a significant barrier: the algorithmic approach that works well within a data center, built around per-step synchronization, fails when stretched across the globe.
The problem lies in network speed. Within a data center, InfiniBand can achieve transfer speeds of 400 Gbps or more. In contrast, the wide area networks (WANs) connecting distant data centers typically operate at around 1 Gbps. This gap of several orders of magnitude is rooted in the fundamental limitations of distance and cost, and the near-instantaneous communication assumed by mini-batch SGD is at odds with this reality.
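A rough back-of-the-envelope calculation makes this concrete (an illustrative example assuming a 1-billion-parameter model whose gradients are exchanged in 16-bit precision, ignoring all-reduce topology and any overlap with computation):

```python
# Illustrative arithmetic: how long does one full gradient exchange take?
PARAMS = 1e9            # assumed model size: 1 billion parameters
BYTES_PER_PARAM = 2     # 16-bit (fp16/bf16) gradients
payload_bits = PARAMS * BYTES_PER_PARAM * 8

for name, bandwidth_bps in [("InfiniBand, 400 Gbps", 400e9),
                            ("WAN, 1 Gbps", 1e9)]:
    seconds = payload_bits / bandwidth_bps
    print(f"{name}: ~{seconds:.2f} s per synchronization")

# Prints roughly 0.04 s for InfiniBand versus 16 s for the WAN. Paying the
# WAN cost after every training step leaves the GPUs idle most of the time.
```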
This disparity creates severe bottlenecks. When model parameters must be synchronized after each step, powerful GPUs spend most of their time idle, waiting for data to crawl over the slow network. The result is that the AI community cannot leverage the vast computational resources distributed globally—from enterprise servers to consumer-grade hardware—because existing algorithms require high-speed, centralized networks. This represents a huge and untapped reservoir of computational power.
3. Algorithmic Shift: Federated Optimization
If frequent communication is the problem, then the solution is to reduce communication. This simple insight laid the foundation for an algorithmic shift that draws on techniques from federated learning—an area initially focused on training models on decentralized data from end devices (like smartphones) while preserving privacy. Its core algorithm, Federated Averaging (FedAvg), shows that by allowing each device to perform multiple local training steps before sending updates, the number of required communication rounds can be reduced by several orders of magnitude.
Researchers realized that this principle of doing more independent work between synchronizations is exactly what geographically distributed training needs to overcome its communication bottleneck. This led to the Federated Optimization (FedOpt) framework, which employs a dual-optimizer approach to decouple local computation from global communication.
The framework operates with two different optimizers (a minimal sketch follows the list):
- The internal optimizer (such as standard SGD) runs on each machine, performing multiple independent training steps on its local data slice. Each model copy makes significant progress on its own.
- The external optimizer handles infrequent global synchronization. After several local steps, each worker computes the total change in its model parameters relative to the last global model. These changes are averaged across workers, and the external optimizer applies this aggregated update to produce the global model for the next cycle.
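Below is a minimal sketch of one round of this dual-optimizer loop (hypothetical NumPy-style pseudocode; the worker objects, the local_train method, and the plain-SGD outer step are simplifying assumptions, not a reference implementation):

```python
import numpy as np

def federated_round(global_params, workers, inner_steps, outer_lr=1.0):
    """One FedOpt round: long, independent local training on each worker,
    followed by a single global update built from the averaged change."""
    deltas = []
    for worker in workers:
        # Inner optimizer: many local steps (e.g. SGD or AdamW) on the
        # worker's own data shard, starting from the current global model.
        local_params = worker.local_train(global_params.copy(), steps=inner_steps)
        deltas.append(local_params - global_params)

    # Outer optimizer (plain SGD here for simplicity): treat the averaged
    # parameter change as a pseudo-gradient. This is the only communication.
    pseudo_grad = np.mean(deltas, axis=0)
    return global_params + outer_lr * pseudo_grad
```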
This dual-optimizer architecture fundamentally changes the dynamics of training: instead of frequent, per-step communication among all nodes, there are long periods of independent computation punctuated by a single aggregation step. This algorithmic shift, which originated in privacy research, provides a crucial breakthrough for training over low-speed networks. The question is: can it be applied to large-scale language models?
[Figure: Schematic of the federated optimization framework: local training with periodic global synchronization. Source: Charles, Z., et al. (2025), "Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo," arXiv:2503.09799.]
4. Breakthrough Progress: DiLoCo Proves Its Large-Scale Feasibility
The answer came in the form of DiLoCo (Distributed Low-Communication), an algorithm that demonstrates the practical feasibility of federated optimization for large language models. DiLoCo specifies a concrete, carefully tuned recipe for training modern Transformer models over low-speed networks (a sketch of the outer step follows the list):
- Internal Optimizer: AdamW, the standard optimizer for large language models, runs multiple local training steps on each worker node.
- External Optimizer: Nesterov momentum, a simple but powerful method, handles the infrequent global updates.
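As a sketch of that outer step only (illustrative code: avg_delta stands for the averaged parameter change computed as in the previous sketch, and the momentum coefficient and outer learning rate are assumed values, not the paper's exact hyperparameters):

```python
import numpy as np

def outer_nesterov_step(global_params, avg_delta, momentum_buf,
                        outer_lr=0.7, beta=0.9):
    """DiLoCo-style outer update: the averaged local change acts as a
    pseudo-gradient, and Nesterov momentum smooths the global update."""
    # avg_delta = mean over workers of (local_params - global_params)
    momentum_buf = beta * momentum_buf + avg_delta
    # Nesterov look-ahead: combine the momentum with the fresh pseudo-gradient.
    update = beta * momentum_buf + avg_delta
    return global_params + outer_lr * update, momentum_buf
```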
Initial experiments indicate that DiLoCo can match the performance of fully synchronized data center training while reducing inter-node communication by up to 500 times. This serves as a practical proof that training giant models over the internet is feasible.
This breakthrough quickly garnered attention. The open-source implementation OpenDiLoCo reproduced the original results and integrated the algorithm into a true peer-to-peer framework built on the Hivemind library, making the technology far more accessible. This momentum culminated in organizations such as Prime Intellect, Nous Research, and Flower Labs successfully pre-training billion-parameter models over the internet using low-communication algorithms. These pioneering efforts turned DiLoCo-style training from a promising research paper into a validated method for building foundation models outside centralized providers.
5. Cutting-Edge Exploration: Advanced Technologies and Future Research
The success of DiLoCo has sparked a new wave of research focused on further enhancing efficiency and scale. A key step toward maturing this approach is the development of DiLoCo Scaling Laws, which establish that the performance of DiLoCo can scale predictably and robustly with the growth of model size. These scaling laws predict that as models become larger, a well-tuned DiLoCo can outperform traditional data parallel training under a fixed computational budget while using several orders of magnitude less bandwidth.
To handle models exceeding 100 billion parameters, researchers have extended DiLoCo's design with techniques like DiLoCoX, which combines the dual-optimizer approach with pipeline parallelism and enabled the pre-training of a 107-billion-parameter model over a standard 1 Gbps network. Further improvements include Streaming DiLoCo, which synchronizes the model in fragments and overlaps communication with computation to hide network latency, and asynchronous methods, which prevent a single slow node from stalling the entire system.
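As a highly simplified illustration of the fragment-wise idea, the sketch below applies the outer update to only one slice of the parameter vector per round, so communication is spread across rounds instead of arriving in one burst (this is not the actual Streaming DiLoCo schedule, which also overlaps each fragment's exchange with ongoing computation):

```python
import numpy as np

def fragment_sync(global_params, avg_delta, round_idx,
                  num_fragments=4, outer_lr=0.7):
    """Apply the outer update to only one fragment of the model this round.
    A different slice crosses the slow network each outer round, so the
    per-round payload shrinks by roughly a factor of num_fragments."""
    n = global_params.shape[0]
    frag = round_idx % num_fragments
    lo, hi = frag * n // num_fragments, (frag + 1) * n // num_fragments
    updated = global_params.copy()
    updated[lo:hi] += outer_lr * avg_delta[lo:hi]   # only this slice is communicated
    return updated
```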
Innovation is also occurring at the core algorithmic level. Research into new internal optimizers like Muon has led to MuLoCo, a variant that allows model updates to be compressed to 2 bits with negligible performance loss, achieving an 8-fold reduction in data transfer. Perhaps the most ambitious research direction is model parallelism over the internet, which involves splitting the model itself across different machines. Early studies in this area, such as SWARM parallelism, have developed fault-tolerant methods for distributing model layers across heterogeneous and unreliable devices connected by low-speed networks. Based on these concepts, teams like Pluralis Research have demonstrated the feasibility of training billion-parameter models, where different layers are hosted on GPUs located in completely different geographical locations, opening the door to training models on distributed consumer-grade hardware connected only by standard internet.
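To illustrate why low-bit compression shrinks traffic so dramatically, here is a generic sketch of uniform 2-bit quantization of an update tensor (not MuLoCo's actual scheme; moving from 16-bit values to 2-bit codes is where a roughly 8-fold reduction comes from):

```python
import numpy as np

def quantize_2bit(update):
    """Uniformly quantize a float update to 4 levels (2 bits per value)."""
    lo, hi = float(update.min()), float(update.max())
    scale = (hi - lo) / 3 if hi > lo else 1.0          # 4 levels span 3 intervals
    codes = np.clip(np.round((update - lo) / scale), 0, 3).astype(np.uint8)
    return codes, lo, scale                            # codes can then be bit-packed

def dequantize_2bit(codes, lo, scale):
    """Reconstruct an approximate update from the 2-bit codes."""
    return lo + codes.astype(np.float32) * scale

# Example: the reconstruction error per value is at most half a quantization step.
delta = (np.random.randn(8) * 1e-3).astype(np.float32)
codes, lo, scale = quantize_2bit(delta)
approx = dequantize_2bit(codes, lo, scale)
```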
6. Trust Challenges: Governance in Open Networks
As training shifts from controlled data centers to open, permissionless networks, a fundamental question arises: trust. In a truly decentralized system without a central authority, how do participants verify that the updates they receive from others are legitimate? How can malicious participants be prevented from poisoning the model, or lazy participants from claiming rewards for work they never completed? This governance issue is the final barrier.
One line of defense is Byzantine fault tolerance, a concept from distributed computing concerned with designing systems that keep functioning even when some participants fail or act maliciously. In centralized systems, the server can apply robust aggregation rules to discard malicious updates. This is harder to achieve in a peer-to-peer environment, where there is no central aggregator; instead, each honest node must evaluate the updates it receives from its neighbors and decide which to trust and which to discard.
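A simple example of a robust aggregation rule is the coordinate-wise median, which a small minority of faulty or malicious updates cannot drag arbitrarily far (a generic sketch; practical systems use more elaborate rules such as trimmed means or Krum):

```python
import numpy as np

def robust_aggregate(updates):
    """Coordinate-wise median of worker updates. Unlike the mean, a few
    wildly wrong (or deliberately poisoned) updates have bounded influence
    as long as honest workers form the majority."""
    return np.median(np.stack(updates), axis=0)

# Example: one poisoned update barely moves the aggregate.
honest = [np.array([0.10, -0.20, 0.05]) for _ in range(4)]
poisoned = [np.array([100.0, 100.0, 100.0])]
print(robust_aggregate(honest + poisoned))   # stays close to the honest updates
```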
Another approach involves cryptographic techniques that replace trust with verification. An early idea was Proof-of-Learning, which proposed that participants record training checkpoints to prove they have invested the necessary computation. Other techniques, such as zero-knowledge proofs (ZKPs), allow working nodes to prove they have correctly executed the required training steps without revealing the underlying data, although their current computational overhead remains a challenge for verifying the training of today's large foundational models.
Looking Ahead: The Dawn of a New AI Paradigm
The journey from towering data centers to the open internet marks a profound transformation in how artificial intelligence is created. We began at the physical limits of centralized training, where progress depended on access to expensive, co-located hardware. This led to the communication bottleneck, a wall that made training giant models over distributed networks impractical. That wall has been broken not by faster cables but by more efficient algorithms.
This algorithmic shift, rooted in federated optimization and embodied by DiLoCo, demonstrates that reducing the frequency of communication is the key. The breakthrough is being rapidly advanced on several fronts: establishing scaling laws, overlapping communication with computation, exploring new optimizers, and even parallelizing the model itself over the internet. The successful pre-training of billion-parameter models by a diverse ecosystem of researchers and companies is a testament to the power of this new paradigm.
As the trust challenges are addressed through robust defenses and cryptographic verification, the path is becoming clearer. Decentralized training is evolving from an engineering solution into a foundational pillar for a more open, collaborative, and accessible AI future. It heralds a world where the ability to build powerful models is no longer confined to a few tech giants but is distributed globally, unleashing the collective computational power and wisdom of all.
References
McMahan, H. B., et al. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. International Conference on Artificial Intelligence and Statistics (AISTATS).
Reddi, S., et al. (2021). Adaptive Federated Optimization. International Conference on Learning Representations (ICLR).
Jia, H., et al. (2021). Proof-of-Learning: Definitions and Practice. IEEE Symposium on Security and Privacy.
Ryabinin, M., et al. (2023). SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient. International Conference on Machine Learning (ICML).
Douillard, A., et al. (2023). DiLoCo: Distributed Low-Communication Training of Language Models.
Jaghouar, S., Ong, J. M., & Hagemann, J. (2024). OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training.
Jaghouar, S., et al. (2024). Decentralized Training of Foundation Models: A Case Study with INTELLECT-1.
Liu, B., et al. (2024). Asynchronous Local-SGD Training for Language Modeling.
Charles, Z., et al. (2025). Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo.
Douillard, A., et al. (2025). Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch.
Psyche Team. (2025). Democratizing AI: The Psyche Network Architecture. Nous Research Blog.
Qi, J., et al. (2025). DiLoCoX: A Low-Communication Large-Scale Training Framework for Decentralized Cluster.
Sani, L., et al. (2025). Photon: Federated LLM Pre-Training. Proceedings of the Conference on Machine Learning and Systems (MLSys).
Thérien, B., et al. (2025). MuLoCo: Muon is a practical inner optimizer for DiLoCo.
Long, A., et al. (2025). Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism.