The competition in the large-scale model market is heating up: technology is not the threshold, data is.

巴比特
1 year ago
Original Source: Data Ape

Image Source: Generated by Wujie AI

In 2023, the most eye-catching topic in the Internet world was large models. With the launch of domestic large models such as Wenxin Yiyan, Xunfei Xinghuo, Baichuan, Tongyi Qianwen, and Hunyuan, "teasing" large models became a daily pastime for netizens. After several months of operation, however, many Chinese large language models have shown a phenomenon of "mutual reference" during training.

In early December, Google launched Gemini, billed as its largest and most capable model to date. Shortly after launch, however, netizens discovered that it seemed to have used other large models' corpora: in Gemini Pro, asking "hello" and then "who are you" in Simplified Chinese would make it answer "I am the Wenxin large model," and it would confidently claim that its founder is Baidu's CEO, Robin Li.

This issue has occurred more than once. In March 2023, Google's Bard was exposed for using content from ShareGPT as training data, and according to The Information, the incident led to Jacob Devlin leaving Google; in December 2023, ByteDance had its API access suspended by OpenAI because "ByteDance used ChatGPT to train its own AI, violating the terms of use."

According to statistics from the Institute of Scientific and Technical Information of China, at least 130 companies in China are researching large model products, and at least 79 large models with over 10 billion parameters have already been released: 78 companies are developing general large models and 52 are developing vertical large models, covering fields such as customer service, industry, medical, automotive, and gaming. Globally, even more LLMs are in training. Many companies will, intentionally or not, train on datasets used by other large models, or directly train on data generated by other large models.

The reason for "mutual reference" during training is that, as competition among large models has reached a white-hot stage, data has become the key to the contest. An expert in the field put it this way: the starting gun of the large model race has already fired, and "who does it faster" will not decide the competitive landscape; "who does it better" is the standard the market will validate. With architectures hard to tell apart, data becomes the key to "doing it well."

It's hard to distinguish the architecture of large models

"Who is stronger" is an important topic in the field of large language models (LLMs). Since the birth of large language models, countless developers and researchers have studied the question. Data engineer Chen Feng believes, "A large language model cannot be evaluated solely by the amount of training data it used. There are currently two relatively mature evaluation methods."

The first is to test the language model with a set of dialogues. These dialogues contain different questions and instructions covering semantic understanding and extraction, chit-chat, contextual dialogue, generation and creation, knowledge and encyclopedic questions, code, logical reasoning, calculation, role-playing, safety, and other dimensions, with scores given based on the correctness of the answers. Because the test spans multiple dimensions, it produces several rankings, including comprehensive ability and per-category ability.
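The aggregation behind this first scheme can be sketched in a few lines: per-dimension scores are averaged into the "classification ability" lists, and their mean forms the "comprehensive ability" ranking. The dimension names and scores below are hypothetical, not taken from any real benchmark.

```python
# Sketch of the first evaluation scheme: score a model's answers across
# capability dimensions, then aggregate into per-category and overall scores.
# All dimension names and per-question scores here are hypothetical.

from statistics import mean

# Hypothetical per-question scores (0-10) grouped by capability dimension.
scores = {
    "semantic understanding": [8, 7, 9],
    "logical reasoning":      [6, 5, 7],
    "code":                   [7, 8, 6],
    "safety":                 [9, 9, 8],
}

# Per-dimension averages form the "classification ability" lists.
dimension_avg = {dim: mean(vals) for dim, vals in scores.items()}

# The mean over all dimensions forms the "comprehensive ability" list.
comprehensive = mean(dimension_avg.values())

for dim, avg in sorted(dimension_avg.items(), key=lambda kv: -kv[1]):
    print(f"{dim}: {avg:.2f}")
print(f"comprehensive: {comprehensive:.2f}")
```

Real benchmarks such as C-Eval weight and sample questions far more carefully, but the shape of the output, per-dimension lists plus one comprehensive list, is the same.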

Most Chinese benchmarks adopt this evaluation scheme, such as CLiB (Chinese Large Model Capability Evaluation List), SuperCLUE (Chinese General Large Model Comprehensive Benchmark), C-Eval (Chinese Basic Model Evaluation Suite), etc.

The second is the "arena" mode, such as the LLM arena leaderboard released by UC Berkeley, where users converse with two different language models at the same time and mark the better one; more positive votes mean a higher rating.
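Arena-style leaderboards typically turn these pairwise votes into a ranking with an Elo-style rating update. The sketch below shows the basic mechanism; the K-factor of 32 and the starting rating of 1000 are illustrative choices, not the parameters of any published leaderboard.

```python
# Minimal sketch of the Elo-style rating behind an "arena" leaderboard:
# each pairwise user vote updates both models' ratings.

def expected(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one user vote."""
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Two models start equal; model A wins the first comparison.
a, b = 1000.0, 1000.0
a, b = elo_update(a, b, a_won=True)
print(a, b)  # A gains 16 points, B loses 16
```

Because the expected score depends on the rating gap, beating a much stronger model moves the ratings far more than beating a weaker one, which is what lets the leaderboard converge from noisy individual votes.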

Overall, GPT-4 has overwhelmingly won on almost all lists, with Claude and GPT-3.5 also ranking high. On the Chinese lists, Wenxin Yiyan v2.2, SenseTime SenseChat, Xunfei Xinghuo v1.5, Baichuan-53B, and other models are among the top performers, each with its own strengths in information extraction, reading comprehension, and data analysis.

The most noteworthy model on the lists is the open-source Llama 2, released by Meta on July 19, 2023, which caused a sensation in the field of large models. In tests of its three variants with 7 billion, 13 billion, and 70 billion parameters, Llama 2 beat almost every commercial model except GPT-4 and GPT-3.5. Chen Feng said, "Many manufacturers of self-developed large models are starting to consider whether to abandon self-development and use cheaper open-source models, or to develop on top of them." As a Google engineer put it in an internal memo in May 2023, "When free open-source models are of comparable quality to commercial models, people will not pay for restricted closed-source models."

More parameters are not necessarily better

Chen Feng believes that the open-source Llama 2 fundamentally levels the playing field among commercial models. Until there is an architectural breakthrough, competitiveness in large language models turns on the quality of training data.

In July 2023, a data leak reportedly from an OpenAI employee claimed that OpenAI trained GPT-4 on 13 trillion tokens and that the 120-layer network had 1.8 trillion parameters in total. By comparison, the parameter counts publicly announced by leading domestic companies are usually in the hundreds of billions, while those of other enterprises' or startups' models are usually in the tens of billions or billions.

Independent developer Wang Nan believes that there is nothing wrong with training AI with more data. The birth of large language models itself comes from the "emergence of intelligence" generated by stacking a large amount of data: when the data scale exceeds a certain limit, they will exhibit unprecedented new capabilities. Wang Nan said, "The parameter volume at which general large models exhibit the emergence of intelligence is generally considered to be 60 billion. Will more parameters once again lead to the emergence of intelligence? No one knows."

Adding more parameters to large models is very expensive. In addition to the cost of more data and a longer training period, the model also needs to be optimized as the number of parameters increases.

Large models cannot be developed overnight. To handle massive amounts of data, a model must be optimized for large-scale processing, and many engineering problems that never appear at small data volumes will surface. "It's like building a stadium: the problems faced by a stadium for 5,000 spectators and one for 100,000 are completely different," Chen Feng said. "Large models are the same. The more parameters, the more problems to solve, and the higher the development cost. And this cost does not grow linearly, but exponentially."

In contrast to the exponentially rising cost, increases in parameter count do comparatively little to improve model performance. "At the level of tens or hundreds of billions of parameters, increasing the parameter count has a significant effect. But at the trillion level, further increases do relatively little for the model's capabilities."

Therefore, keeping parameter counts in the hundreds-of-billions to trillion range is the inevitable result of weighing training cost against model capability.

Vertical large models become the commercial answer

When the architecture and parameter volume of large models are limited to a narrow range, where does the competitiveness in the field of large models come from?

Last year, The New York Times reported a bizarre case: a lawyer submitted a ChatGPT-generated brief to a court, and the judge quickly found that more than ten of the cases it cited were fictitious. This phenomenon of AI "talking nonsense" is known as "hallucination," and almost all current large language models exhibit it.

Undoubtedly, these hallucinations are one of the key factors hindering the application of large models, and the industry currently has almost no cure for the problem.

The source of the problem is data. Wang Nan believes, "If the selection and training of high-quality data for a large model are insufficient, the model's output quality suffers, and hallucinations follow. For general large models, however, turning all of human domain knowledge into high-quality training data is obviously wishful thinking. The only solution is to train for specific scenarios on top of general large models; the more vertical the model, the lower the probability of errors."

Based on this situation, outside of general large models, large models tailored to specific application scenarios in vertical domains have become the focus of competition in the large model field.

Wang Nan said, "Large models trained with high-quality data in vertical domains have stronger domain expertise and task specificity, and can better solve specific domain problems and provide more precise services."

Vertical large models have become the core of large model commercialization, and leading players in the industry have successively launched MaaS (Model as a Service) offerings built on their own data, hardware, and models. Baidu launched the Baidu Intelligent Cloud Qianfan Large Model Platform, Alibaba launched the ModelScope community, and Huawei released multiple models for different industries, such as Pangu NLP, Pangu CV, and Pangu multimodal.

Data quality determines the quality of vertical large models

In addition to hardware, the core of MaaS is a large amount of vertical-domain data.

The data used to train large language models is called an "NLP dataset": structured data built by classifying textual material from corpora, which serves as the "textbook" for the model. The datasets used by general large models often come from sources such as books, web pages, news, and social media, forming the model's "knowledge base."
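To make "structured data obtained by classifying textual materials" concrete, the sketch below shows what one record in an instruction-style NLP dataset might look like, stored as JSON Lines (one record per line). The field names and example content are hypothetical, not drawn from any real dataset.

```python
# Sketch of a single structured record in a hypothetical instruction-style
# NLP dataset, serialized as one JSON Lines entry.

import json

record = {
    "source": "web",                      # where the raw text came from
    "domain": "medical",                  # vertical-domain label
    "instruction": "Summarize the following patient note.",
    "input": "Patient reports mild headache for three days...",
    "output": "A three-day history of mild headache.",
}

# Datasets are commonly stored as JSON Lines: one record per line.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Labels like `domain` are what make such a corpus usable for the vertical-domain training discussed later: a medical model can be fine-tuned on just the records whose domain tag matches.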

Wang Nan stated that some of this data comes from publicly available sources on the internet, known as "open-source datasets." The most well-known open-source dataset comes from Wikipedia. "The text in Wikipedia is very valuable because it is rigorously cited, written in explanatory text, and spans multiple languages and domains." As of September 15, 2023, the English Wikipedia has over 6 million entries and 59 million pages, containing over 4 billion words, which, after cleaning and filtering, can provide 3 billion tokens for large models.

High-quality sources like Wikipedia are the exception, however. Web pages from other sites are also used to train large models, and their total volume is enormous, measured in petabytes, and obtainable for free from providers such as Common Crawl. The problem is that these pages contain a great deal of chaotic content: pornography, violence, scams, and machine-generated junk. Merely cleaning, filtering, and annotating this data requires significant manpower and resources.
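The cleaning step can be illustrated with a few simple heuristics. Real pipelines are far more elaborate (deduplication, language identification, model-based quality scoring); the thresholds and blocklist below are purely illustrative.

```python
# Sketch of heuristic filtering applied to raw web-crawl text before it
# becomes training data. Thresholds and the blocklist are illustrative.

import re

BLOCKLIST = {"viagra", "jackpot"}  # stand-in for a much larger list

def keep_page(text: str, min_words: int = 50,
              max_symbol_ratio: float = 0.3) -> bool:
    """Return True if a page passes simple quality heuristics."""
    words = text.split()
    if len(words) < min_words:            # too short to be useful prose
        return False
    symbols = len(re.findall(r"[^\w\s]", text))
    if symbols / max(len(text), 1) > max_symbol_ratio:  # likely markup junk
        return False
    lowered = text.lower()
    if any(term in lowered for term in BLOCKLIST):      # crude content filter
        return False
    return True

pages = ["click here!!! $$$ jackpot $$$", "word " * 100]
print([keep_page(p) for p in pages])  # → [False, True]
```

Even filters this crude discard a large share of a raw crawl, which is why turning petabytes of web pages into a few trillion clean tokens is a major cost center for model builders.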

High-quality open-source datasets are few and far between, especially datasets tailored to specific domains. Wang Nan said, "The few open-source datasets for vertical domains are often small in size, outdated, and difficult to use for building large models that can be used in specific application scenarios."

Therefore, the value of high-quality data has gradually become prominent in the era of large models, and data has become the core of competition for large model manufacturers.

Data is the moat of the large model era

Training large models tailored to specific application scenarios requires a large amount of proprietary data, such as dialogue, books, code, technical reports, and exam papers.

Leading models that rank high on capability evaluations, such as GPT-3.5, GPT-4, and PaLM, were trained on large amounts of proprietary data. According to publicly available information, GPT-3.5's training data includes 2TB of high-quality book data and a large amount of social media dialogue from platforms like Twitter and Reddit.

Proprietary data is often not open to the public. Last year, Reddit announced it would start charging for API access, letting external companies download and process large volumes of social network conversations for a fee. In July, the social network X (formerly Twitter) imposed daily read limits on users to curb AI companies scraping data for model training. In September, X updated its privacy policy and announced it would start selling corpora based on user-generated content.

The data that can be purchased is only a small part of proprietary data. Wang Nan cited the composition of the GPT-3.5 training data, which included 2TB of book data, while the open-source Books3 dataset provided by The Pile is only about 85GB, a small fraction of what OpenAI used.

Much high-quality data is in fact firmly held by internet companies. The massive data users generate while using their services has become these companies' "moat": the companies themselves can use it freely, but it is difficult for anyone else to obtain.

Take Meta as an example. Since its establishment, Meta has almost monopolized the global social media market through its social media platforms Facebook and Instagram. Services derived from social media, such as advertising, instant messaging, and VR, have also gradually taken a dominant position in the market. The data generated by users flows between different business lines of Meta, creating more value, ultimately creating a global internet giant.

In this system established by Meta, the technology developed by Meta itself and the data generated by users when using its products together form Meta's moat, making it difficult for other internet companies to provide the same services as Meta. In the field of language large models, Meta's open-source high-performance architecture Llama2 does not put Meta at a disadvantage in competition—Meta, with a large amount of high-quality data, inherently has a huge advantage, and there are few giants globally that can compete with Meta in this regard.

OpenAI is the same, but it has another moat: the data from user-AI conversations. One of the important reasons why OpenAI provides free access to ChatGPT to users is to collect this data for training new GPT models. This is also one of the reasons why major companies quickly provide free access to large language models to users.

As a Google engineer stated in an internal document, "We have no moat, and neither does OpenAI." With open-source large models performing so well, the model itself cannot be an internet company's moat; only data can give a company an advantage in the large model competition.
