A new crowdsourced training method for developing large language models (LLMs) over the internet may shock the AI industry later this year with a massive 100-billion-parameter model.
Researchers have used GPUs scattered around the globe, combined with private and public data, to train a new kind of large language model (LLM), suggesting that the mainstream approach to building artificial intelligence could be disrupted. Two startups, Flower AI and Vana, collaborated using unconventional methods to create the new model, named Collective-1.
Flower has developed technology that allows training to be distributed across hundreds of internet-connected computers. This technology has already been used by some companies to train AI models without the need for centralized computing resources or data. Vana provides data sources, including private messages from X, Reddit, and Telegram.
By modern standards, Collective-1 is relatively small, with 7 billion parameters—these parameters combine to give the model its capabilities—compared to today’s most advanced models like ChatGPT, Claude, and Gemini, which have hundreds of billions of parameters. Nic Lane, a computer scientist at the University of Cambridge and co-founder of Flower AI, stated that the distributed approach promises to far exceed the scale of Collective-1. Lane added that Flower AI is training a model with 30 billion parameters using conventional data and plans to train another model with 100 billion parameters later this year—close to the scale of industry leaders. “This could really change how people view AI, so we are working very hard on it,” Lane said. He noted that the startup is also incorporating images and audio into the training to create a multimodal model.
Distributed model building could also shake up the power dynamics shaping the AI industry. Currently, AI companies build models by combining vast amounts of training data with powerful computing capabilities centralized in data centers equipped with advanced GPUs and connected by high-speed fiber optic cables. They also heavily rely on datasets created by scraping publicly accessible (though sometimes copyrighted) materials, including websites and books.
This approach means that only the wealthiest companies and countries with large numbers of powerful chips can develop the most powerful and valuable models. Even open-source models, such as Meta's Llama and DeepSeek's R1, are built by companies with large data centers. The distributed approach could enable smaller companies and universities to build advanced AI by pooling different resources. Alternatively, it could allow countries lacking traditional infrastructure to network multiple data centers to build more powerful models.
Lane believes that the AI industry will increasingly seek new methods to break the training limitations of a single data center. He said, “The distributed approach allows you to scale computing capabilities in a more elegant way than data center models.”
Helen Toner, an AI governance expert at the Center for Security and Emerging Technology, stated that Flower AI's approach is “interesting and potentially very relevant” to AI competition and governance. “It may continue to struggle at the cutting edge of technology, but it could be an interesting fast-follower approach,” Toner said.
Divide and Conquer
Distributed AI training involves rethinking the division of computation used to build powerful AI systems. Creating an LLM involves feeding a large amount of text into the model, which adjusts its parameters to produce useful responses to prompts. Within a data center, the training process is divided so that parts can run on different GPUs and are then periodically merged into a master model.
The new method allows work typically done within large data centers to be performed on hardware that may be miles apart and connected via relatively slow or unstable internet connections.
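To make the general idea concrete, here is a minimal Python sketch of periodic parameter averaging (the "local SGD" / federated-averaging pattern described above): each worker trains for many local steps, and only the resulting weights are merged into a master model and broadcast back. This is an illustrative toy, not Flower AI's or Photon's actual code; the model, data, and hyperparameters are placeholders.

```python
# Illustrative sketch only: periodic parameter averaging across workers.
# NOT the actual Flower AI or Photon implementation.
import copy
import torch
import torch.nn as nn

def average_models(workers):
    """Merge worker replicas into one state dict by simple averaging."""
    merged = copy.deepcopy(workers[0].state_dict())
    for key in merged:
        merged[key] = torch.stack(
            [w.state_dict()[key].float() for w in workers]
        ).mean(dim=0)
    return merged

# A toy model standing in for an LLM.
master = nn.Linear(16, 4)
workers = [copy.deepcopy(master) for _ in range(3)]  # e.g. 3 remote GPUs

for sync_round in range(5):                  # periodic synchronization rounds
    for worker in workers:
        opt = torch.optim.SGD(worker.parameters(), lr=0.01)
        for _ in range(10):                  # many local steps between syncs,
            x = torch.randn(8, 16)           # so only averaged weights (not
            y = torch.randn(8, 4)            # every gradient) cross the slow
            loss = nn.functional.mse_loss(worker(x), y)   # network link
            opt.zero_grad()
            loss.backward()
            opt.step()
    merged_state = average_models(workers)   # merge into the master model
    master.load_state_dict(merged_state)
    for worker in workers:                   # broadcast merged weights back
        worker.load_state_dict(merged_state)
```

Because workers only exchange weights every few dozen or few hundred steps, this style of training tolerates the slow, unreliable links that connect machines outside a single data center.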
Some large companies are also exploring distributed learning. Last year, researchers at Google demonstrated a new computation partitioning and integration scheme called DIstributed PAth COmposition (DiPaCo), which makes distributed learning more efficient.
To build Collective-1 and other LLMs, Lane and academic collaborators from the UK and China developed a new tool called Photon to make distributed training more efficient. Lane said Photon improves on Google's method, with more efficient ways of representing the data in a model and of sharing and consolidating the training across machines. The process is slower than conventional training but more flexible, allowing new hardware to be added to accelerate training.
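The flexibility Lane describes can be illustrated with a hypothetical sketch of elastic, sample-weighted merging, in which a new machine simply reports in at the next round and is folded into the average. The schema and names below are invented for illustration and do not reflect Photon's actual design.

```python
# Hypothetical sketch of elastic aggregation: workers can join between rounds,
# and the merge averages whoever reported in, weighted by data seen.
from dataclasses import dataclass

@dataclass
class Update:
    worker_id: str
    num_samples: int      # how much data this worker trained on this round
    weights: dict         # parameter name -> list of floats (toy stand-in)

def merge(updates):
    """Sample-weighted average over whichever workers are currently present."""
    total = sum(u.num_samples for u in updates)
    merged = {}
    for name in updates[0].weights:
        merged[name] = [
            sum(u.weights[name][i] * u.num_samples / total for u in updates)
            for i in range(len(updates[0].weights[name]))
        ]
    return merged

# Round 1: two machines participate.
round1 = [
    Update("gpu-a", num_samples=100, weights={"w": [1.0, 2.0]}),
    Update("gpu-b", num_samples=300, weights={"w": [3.0, 4.0]}),
]
print(merge(round1))   # {'w': [2.5, 3.5]}

# Round 2: a new machine comes online and is included with no restart.
round2 = round1 + [Update("gpu-c", num_samples=200, weights={"w": [5.0, 6.0]})]
print(merge(round2))   # roughly {'w': [3.33, 4.33]}
```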
Photon was developed in collaboration with researchers from Beijing University of Posts and Telecommunications and Zhejiang University. The team released the tool last month under an open-source license, allowing anyone to use this method.
Flower AI collaborated with Vana to build Collective-1, with Vana developing new methods for users to share personal data with AI builders. Vana's software lets users contribute private data from platforms like X and Reddit to the training of a large language model, potentially specify the end uses they permit, and even benefit financially from their contributions.
Vana's co-founder Anna Kazlauskas stated that the idea is to make untapped data available for AI training while also giving users more control over how their information is used in AI. “This data is often not included in AI models because it is not publicly available,” Kazlauskas said. “This is the first time user-contributed data is being used to train a foundation model, and users own the AI models created from their data.”
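As a rough illustration of what specifying allowed end uses might look like, the hypothetical Python sketch below attaches permission metadata to a contributed record. The schema and field names are invented for this example and are not Vana's actual format.

```python
# Hypothetical illustration only: a user-contributed record carrying the
# end uses its contributor permits. Not Vana's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataContribution:
    contributor_id: str
    source: str                          # e.g. "reddit", "x", "telegram"
    payload: str                         # the contributed text itself
    allowed_uses: List[str] = field(     # end uses the contributor permits
        default_factory=lambda: ["foundation-model-training"]
    )

    def permits(self, use: str) -> bool:
        return use in self.allowed_uses

record = DataContribution(
    contributor_id="user-123",
    source="reddit",
    payload="example message text",
    allowed_uses=["foundation-model-training", "research"],
)
assert record.permits("research")
assert not record.permits("targeted-advertising")
```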
Mirco Musolesi, a computer scientist at University College London, stated that a key benefit of distributed AI training may be unlocking new types of data. “Scaling it up to cutting-edge models would enable the AI industry to leverage vast amounts of decentralized and privacy-sensitive data, such as that held in healthcare and finance, for training, without facing the risks associated with data centralization,” he said.