Is ChatGPT the "Reference Answer"? Even ByteDance Is Accused of "Copying Homework"


As is well known, in the field of large AI models, OpenAI's ChatGPT is like an especially hard problem set by a teacher: while everyone else is still gathering their thoughts or struggling to understand the question, the top student in the class has already finished. As a result, most people turn to the top student for ideas, or simply copy their work.

Recent controversies suggest that many seemingly complex stories come down to the same thing. Earlier, Musk's Grok AI was suspected of plagiarizing ChatGPT, or even of being a "shell" wrapped around it, because of dataset contamination. Now ByteDance is suspected of violating OpenAI's terms of service and has had its account suspended.

ByteDance Caught in the Large Model Public Opinion Storm

Recently, The Verge reported that ByteDance used an OpenAI API account obtained through Microsoft to generate data for training its own artificial intelligence model, in violation of the usage terms of both Microsoft and OpenAI. Shortly after this report was published, The Verge further reported that OpenAI had suspended ByteDance's account.

So which terms did ByteDance violate? OpenAI's terms of service contain a clear provision: the model capabilities OpenAI provides may not be used to "develop any AI model that competes with its products and services."

According to The Verge, the evidence comes from ByteDance's internal documentation for a project codenamed "Project Seed," specifically chat records from Lark, the overseas version of Feishu.

The records indicate that ByteDance relied on OpenAI's API at almost every stage of developing "Project Seed," a large language model project that started about a year ago, including for training and evaluating models. The project mainly focuses on two products: Doubao, which has already launched in China, and a chatbot platform for business users, which is still under development.

Employees involved in "Project Seed" were well aware of the implications of relying so heavily on the OpenAI API, so they discussed how to "desensitize" the data to cover up the evidence. The usage was heavy enough that employees often hit the OpenAI API's maximum access limit.

According to The Verge, ByteDance issued a directive to "stop using text generated by GPT at any stage of model development" a few months ago.

However, it was around this time that ByteDance released its own large language model product, Doubao. According to its own introduction, Doubao offers functions such as a chatbot, a writing assistant, and an English learning assistant, and can answer questions and hold conversations to help people obtain information. It is available on the web, iOS, and Android. Doubao describes the assistance it provides as covering natural language processing, knowledge understanding, conversation, information retrieval, sentiment analysis, and machine learning.

Nevertheless, ByteDance continued to use the API in violation of the terms of service of OpenAI and Microsoft, including to evaluate the performance of the model behind Doubao. A person with first-hand knowledge of ByteDance's internal situation said, "They say they want to ensure everything is legal, but in fact, they just don't want to be caught."

Three Parties Speak Out in Succession, ByteDance is the Only One in a Hurry

ByteDance

After The Verge published this report, ByteDance spokesperson Jodi Seth responded as follows: "Data generated by GPT was used to annotate models in the early development of 'Project Seed,' and it was removed from ByteDance's training data around the middle of this year. ByteDance has been authorized by Microsoft to use the GPT API. We use GPT to support our products in non-Chinese markets; in the Chinese market, we use our self-developed model to support Doubao."

Yesterday afternoon, a relevant person in charge of ByteDance responded again, stating, "When using OpenAI services, the company emphasizes compliance with its usage terms. We are also in contact with OpenAI to clarify any misunderstandings that may arise from external reports."

ByteDance's account of its use of OpenAI services:

At the beginning of this year, when the technical team was just starting to explore large models, some engineers used GPT's API services in experimental research on smaller models. Those models were for testing only, were never planned for release, and were never used externally. This practice was discontinued after the company introduced compliance checks on GPT API calls in April.
As early as April of this year, the ByteDance large model team had issued a clear internal requirement not to add data generated by GPT models to the training datasets of ByteDance's large models, and had trained the engineering team to comply with the terms of service when using GPT.
In September, the company conducted another round of internal checks and took measures to further ensure that API calls to GPT complied with the specifications. For example, it sampled model training data in batches and checked its similarity to GPT output, to prevent data labeling personnel from using GPT without authorization.
In the next few days, we will conduct another comprehensive check to ensure strict compliance with the usage terms of the relevant services.

OpenAI

OpenAI spokesperson Niko Felix issued a statement confirming that ByteDance's account has been suspended. "All API customers must comply with our usage policy to ensure that our technology is used for good. Although ByteDance rarely uses our API, we have suspended their account during further investigation. If we find that their usage does not comply with company policy, we will require them to make necessary changes or terminate their account," Felix said.

Microsoft

In a statement, Microsoft spokesperson Frank Shaw said: "Azure OpenAI services and other Microsoft AI solutions are part of our limited access framework, which means that all customers must apply for and obtain Microsoft's approval to access. We have also established standards and provided resources to help our customers use these technologies responsibly and comply with our terms of service. We have also established processes to detect abusive behavior and stop their access when enterprises violate our code of conduct."

The three parties' statements show clear differences in tone. OpenAI is relatively restrained: it has only suspended ByteDance's account and says further measures will be decided after an investigation. Microsoft takes a "not my business" attitude, as if to say, "I am just a middleman; we have our own rules, and if there is a violation, we will cut off access." ByteDance seems the most anxious, since the "fire" is already burning at its door. It first issued a clarification and explanation, then immediately contacted OpenAI to "put out the fire" as quickly as possible.

ByteDance's AI Layout

Public information shows that as early as 2016, ByteDance established an AI lab focusing on research in natural language processing, machine learning, data mining, and other areas. ByteDance's products such as Douyin and Toutiao frequently incorporate AI-generated content (AIGC) features, continuously attracting traffic.

In 2023, ByteDance's moves in AI accelerated significantly. In June, ByteDance's Volcano Engine released the large model service platform "Volcano Ark," which provides enterprises with platform services covering model fine-tuning, evaluation, and inference.

In August, ByteDance's self-developed general-purpose large model "Yunque" appeared in the first batch of large models listed under the "Interim Measures for the Administration of Generative Artificial Intelligence Services."

On August 17, ByteDance launched the AI chatbot "Doubao," built on the Yunque large model and aimed at the consumer market.

Recently, while scaling back its gaming and XR businesses, ByteDance established a new AI department called Flow. Recruitment postings describe Flow as an AI innovation team under ByteDance that has already launched two products, "Doubao" in China and "Cici" overseas, and is incubating several other AI-related products.

At the same time, ByteDance ordered over $1 billion worth of GPUs from NVIDIA this year, and that order alone matched the total sales revenue of commercial GPUs in China last year. On the hiring side, ByteDance ranks first among the 10 companies posting the most new AIGC positions, accounting for 3.24% of all new AIGC positions.

All of this shows how much weight ByteDance places on AI and large models. Returning to the incident itself: would a company that values AI this highly really take such a big risk just to "overtake on a bend"?

Metaverse New Voice Has Something to Say

Since ChatGPT emerged, ByteDance, like many other Chinese tech giants, has been trying to keep pace with AI. It appears to be somewhat behind, however: Doubao, launched after ChatGPT, did not perform at a first-class level. If a model trained on ChatGPT output achieved only this level of performance, that would be hard to explain. If Doubao was not trained on ChatGPT output, this level of performance is about what one would expect.

When Musk's Grok AI was suspected of plagiarizing ChatGPT, AI researcher Simon Willison told Ars Technica, "Many large models have been fine-tuned on datasets generated by the OpenAI API, or scraped from ChatGPT itself."

Such practices appear to be widespread and to stay within reasonable limits, and ByteDance's case may be no different. As for whether ByteDance was too eager for quick results and overstepped those limits: a large internet company should know better than to risk so much for so little through this kind of plagiarism.
