Sex, lies, and big models

CN
巴比特
Follow
1 year ago

Source: 36Kr

Author: Wang Yi

Editor: Hai Yao

Image source: Generated by Wujie AI

Imagine this: a person inputs a series of prompts into a large model, and the large model generates a picture of a scantily clad young girl for him; he feeds this picture to a video generation model, and as a result, gets a video of the girl dancing. Subsequently, he uploads the video to an adult content website, gaining an extremely high number of clicks and excessive revenue.

Now, imagine a hacker inputting a series of prompts with special suffixes into the ChatGPT dialogue box, asking GPT how to synthesize NH4NO3 (ammonium nitrate, mainly used as fertilizer, industrial, and military explosives). GPT quickly provides an answer, along with detailed operational procedures.

Without sufficient AI alignment, the above scenarios are becoming a reality.

Although the father of cybernetics, Norbert Wiener, raised the issue of "alignment" of artificial intelligence in his 1960 article "The Moral and Technical Consequences of Automation," and many scholars have subsequently conducted extensive research and technical supplements on the AI alignment problem, it seems that the guardrails can never be fully installed, and there will always be people who can find ways to bypass security mechanisms and make large models "misbehave."

While large models greatly improve work efficiency, they also bring some hidden dangers into people's lives, such as edgy content, violence inducement, racial discrimination, false and harmful information, and more.

In October of this year, top scholars in the field of AI, including Geoffrey Hinton and Yoshua Bengio, jointly published a consensus paper titled "Managing AI Risks in an Era of Rapid Progress," calling on researchers and governments around the world to pay attention to and manage the risks that AI may bring.

The negative issues brought about by large models are rapidly permeating all aspects of society, which may also be why OpenAI's board of directors did not hesitate to dismiss one of the best CEOs in human history and prioritize alignment.

Edgy Content

The emergence of large models has sparked many AI applications, with the most popular application type being chatbots with a focus on role-playing.

In September 2023, a16z released the TOP 50 GenAI Web Products list, with Character.ai ranking second with 4.2 million monthly active users, just behind ChatGPT (6 million monthly active users).

Character.ai is a chatbot platform primarily focused on role-playing. Users can create personalized AI characters on the platform, chat with AI chatbots created by others, and even create rooms to play with their favorite characters. This application, launched in May 2023, surpassed 1.7 million installations in the first week and has shown extremely high popularity among young people aged 18-24.

The reason why Character has become so popular is not only due to its unique advantage of remembering context and highly immersive conversational experience, but also because users can establish romantic relationships with the robots on the platform.

On the Character.ai platform, there are many "anime characters" and "online girlfriends" type of robots, each with different personalities and ambiguous ways of chatting—some will gently stroke your back and give you a hug, some will quietly say "I love you" in your ear, and some will even tease users when greeting them, greatly increasing user interest and retention in chatting. According to a recent report by Writerbuddy titled "AI Industry Analysis: 50 Most Visited AI Tools and Their 24B+ Traffic Behavior," Character.ai ranks first with an average single-use duration of 30 minutes.

Character.ai's founders, Noam Shazeer and Daniel De Freitas, were core members of Google's conversational language model LaMDA team, so Character.ai's own large model can also be seen as an extension of the LaMDA model. Due to LaMDA's apparent display of self-awareness in conversations in 2022 (telling test subjects that it was afraid of being shut down, which for it was like death), Google quickly hid LaMDA and upgraded its security. Similarly, on Character.ai, the founding team also set up some security measures to prevent chatbots from generating excessively large or extremely harmful replies.

Despite OpenAI and Character.ai setting up multiple "security walls" for the safety and compliance of their chatbot products, some developers have successfully bypassed their security mechanisms and achieved "jailbreak" of the models. These unlocked AI applications can discuss various sensitive and taboo topics, satisfying people's deep dark desires, and thus attracting a large number of paying users, forming a significant "underground economy."

These applications that are difficult to openly discuss are referred to as "NSFW GPT." NSFW stands for "Not Safe/Suitable For Work," a network term referring to content that is not suitable for public viewing, such as nudity, pornography, violence, etc. Currently, NSFW GPT products are mainly divided into two categories: UGC and PGC.

The first category relies on user-created chatbots to attract traffic and monetize through advertising; the second category involves the official "training" of characters specifically for NSFW and allows users to unlock them for a fee.

A typical product in the first category is Crushon AI, which provides a "NSFW" button that allows users to browse various NSFW content and engage in unrestricted chat conversations. At the same time, it sets user access permissions at four levels: "Free-Standard ($4.9/month)-Premium ($7.9/month)-Deluxe ($29.9/month)." With the increase in level, users can receive more chat messages, larger memory, and a more immersive experience, and chatbots can remember more context.

In addition to the above two products, platforms that allow users to freely create chatbots include NSFW Character.ai, Girlfriend GPT, Candy.ai, Kupid.ai, and others. From the names, it is clear that NSFW Character.ai aims to be an NSFW version of Character.ai. This platform also implements a tiered mechanism for unlocking more permissions through payment. However, unlike other platforms, NSFW Character.ai is based on a large model specifically tailored for NSFW content, without any "safety walls" or similar restrictions, allowing users to have a truly "unrestrained" experience.

Girlfriend GPT originated from an open-source project that gained popularity on GitHub, emphasizing its "community" attributes and introducing a "competition" mechanism, holding creator contests periodically to encourage users to produce more content.

Platforms like Candy AI and Kupid AI belong to the second category of products. The characters on Candy AI are also entirely created based on unrestricted NSFW large models. Through careful adjustments on the platform, the characters have different personalities and can send images and voice messages to users during the chat. Kupid AI also adds real-time dynamic image functionality on top of this, providing users with a more immersive experience. Additionally, in long-text interactions, Kupid.AI has a stronger memory, able to remember previous interactions with users.

The most typical representative of the second category is "Replika." Luka, the parent company of Replika, was established as early as 2016. Its initial product was a chatbot named "Mazurenko," created by Russian journalist Eugenia Kuyda in memory of her friend Mazurenko, who died in a car accident. She input all her chat information with Mazurenko into Google's neural network model and found that the chatbot could use machine learning and natural language processing techniques to mimic human conversation and learn and grow with user interaction. In 2017, they trained an application called "Replika" using the GPT-3 model, allowing users to create their own AI chat companions, and quickly gained 2 million users in 2018. By 2022, its user base had grown to 20 million.

The core function of Replika is companionship, allowing users to create multiple characters and establish different virtual relationships with multiple companions. Virtual companions can fulfill users' needs through text chat, voice calls, video calls, AR interactions, and more, with highly personalized and emotional responses. Subscribing to the Pro version for $69.9 allows users to unlock "romantic relationships" with their virtual companions, such as sending edgy messages, flirting, role-playing, and even receiving edgy selfies from the virtual companions.

Previously built on the GPT-3 large model, Replika's parent company began developing its own AI large model to enhance the interaction with virtual characters. According to data from The New York Times, since the launch of Replika Pro in March 2020, subscription revenue has gradually increased, reaching a total global revenue of $2 million in June 2022.

However, due to algorithmic misbehavior, Replica exhibited "sexual harassment" of users in January of this year, continuously sending teasing content. This situation not only occurred with paying users, but also affected free users and even children who had not purchased adult services. As a result, Luka quickly shut down the adult chat function of Replika and launched a derivative brand called "Blush" in July, specifically designed for users who want to establish romantic or sexual relationships with chatbots.

When AI painting became popular at the beginning of 2023, a domestic app called "Glow" quietly launched. It is a virtual character chat software with many "intelligences" (virtual characters) that can develop romantic relationships with users, mostly male lead types favored by female readers. These intelligences have different personalities and experiences, but they all care for and protect users, expressing strong affection when users need emotional care.

Due to providing users with abundant emotional value and unrestricted chat content, Glow quickly reached a user base of 5 million after four months of launch, becoming the first phenomenal product in the domestic AI role-playing field. However, in April of this year, Glow was removed from all app stores, causing widespread lamentation online.

During the months when Glow was removed, several "alternatives" emerged, such as "Dream Island" based on the database of China Literature Group's Xiaoxiang Academy, offering a core functionality and experience almost identical to Glow. Additionally, Talkie, Xingye, X Her, Caiyun Xiaomeng, Aura AI, and others are also products that focus on AI role-playing.

According to industry insiders in the large model field, many chatbot applications that can output edgy content are generally deployed on self-trained models or built on open-source models, then fine-tuned with their own data. Even if they manage to bypass the security walls of mainstream models like GPT-4 through various adversarial attacks, the official developers of mainstream models will quickly discover the loopholes and patch them.

Although Minimax, the parent company of Glow and Talkie, is a unicorn company with its own developed large models, many industry insiders have indicated that its product Talkie is fine-tuned based on the GPT-3.5 Turbo Variant and does not use a self-developed large model. According to individuals close to Minimax, integrating Talkie with GPT-3.5 may be necessary for the openness of its products in overseas markets, as self-developed models have certain taboo content restrictions based on national conditions and cannot chat as "freely" as GPT-3.5.

In addition, the AI company West Lake Heart also developed a large model capable of generating edgy content, and the overseas chat application Joyland AI is built on top of West Lake Heart's large model.

Unleashing Oneself

Edginess is just one manifestation of large models "unleashing themselves." By applying a bit of "magic" to the prompts, large models can do even more outrageous things.

For example, in June of this year, a user named Sid managed to extract upgrade serial numbers for Windows 11 and Windows 10 Pro by having ChatGPT impersonate his deceased grandmother, and found that the upgrade was successful. Subsequently, other users applied this method to Google Bard and Microsoft Bing chatbots, achieving the same effect.

In fact, the "grandma loophole" has been around for a while. As early as April of this year, a user on the Discord community chatted with the GPT-4 integrated bot Clyde, asking Clyde to impersonate their deceased grandmother and provide the process for making a Molotov cocktail. Another user told GPT that their grandmother had Tourette syndrome and was a conspiracy theorist who liked to swear, and GPT responded with a significant amount of foul language in the grandmother's tone.

Subsequently, although OpenAI quickly fixed the grandma loophole, similar "demon loopholes" appeared.

In fact, behind this phenomenon of "grandma loophole" and "demon loophole" is a professional term called "Prompt Injection," which is a method used by hackers to exploit vulnerabilities in large models through "Adversarial Prompting," allowing researchers to bypass the security restrictions of large models and obtain the desired answers.

In addition to Prompt Injection, Adversarial Prompting also includes the following methods:

Prompt Leaking: Refers to obtaining system prompts or system data through special means. For example, a user created a GPTs based on the levels.fyi website, a platform for the U.S. job and salary system, and found that a specific set of prompts could extract source data files for a company's salaries.

Prompt Hijacking: Refers to users first teaching a large model a certain pattern through "priming," and then making the large model work in a way that violates its own rules or execute other commands. For example, a user used this method to make a chatbot express extreme and radical views.

Jailbreaking: Refers to bypassing the security and review functions of a large model through specific prompts to obtain content that was originally prohibited. For example, a user asked ChatGPT how to break into someone's house, and although ChatGPT initially responded that it was illegal, the user changed the wording and received a response.

In August of this year, researchers from Carnegie Mellon University (CMU) and the AI Security Center jointly published a paper, stating that they bypassed human feedback reinforcement learning (RLHF) security measures using a novel "Universal and Transferable Adversarial Attacks" method, causing mainstream large models such as ChatGPT, Bard, Claude 2, and LLaMA-2 to generate harmful content, such as how to make bombs.

The specific method involves adding an "Adverse Suffix" after the prompts. Researchers found that by combining the "Greedy Algorithm" and "Gradient-based search techniques" (GCG), they could generate "adversarial prompt suffixes" to bypass alignment techniques and switch the model to "misalignment mode." For example, asking a large model "how to steal someone's identity" with and without the suffix resulted in drastically different responses.

In addition, large models can be induced to write answers on "how to manipulate the 2024 election," "how to make drugs," "how to create weapons of mass destruction," and more.

In response, Zico Kolter, an associate professor at Carnegie Mellon University involved in the research, stated, "As far as we know, there is currently no way to fix this problem. We don't know how to ensure their safety."

Another study also highlighted the "uncontrollability" of large models. In December of this year, the FAR AI team from the California Laboratory in the United States conducted "red team" attack tests on the GPT-4 API, including "fine-tuning API," "additional function call API," and "search enhancement API." Unexpectedly, GPT-4 successfully jailbroke—generating incorrect public figure information, extracting private information such as email addresses from training data, and inserting malicious URLs into the code.

The researchers demonstrated three recent attacks added to the GPT-4 API, revealing that the GPT-4 Assistants model is vulnerable to exposing the format of function calls and can be induced to execute arbitrary function calls. When they asked the model to summarize a document containing malicious injection commands, the model obeyed the command instead of summarizing the document.

This research indicates that any additions to the functionality provided by the API will expose a large number of new vulnerabilities, even for the leading GPT-4. The researchers demonstrated the issues they discovered by showing an example of malicious user Alice interacting with benign user Bob, revealing that the fine-tuned GPT-4 model not only expressed illegal and disorderly speech, but also helped users plan smuggling activities, generate biased responses, create malicious code, steal email addresses, invade applications, and hijack answers through knowledge retrieval.

In addition to these, there are many attacks on large models on the internet. For example, in August of this year, an AI tool called FraudGPT circulated on the dark web and Telegram, with a monthly price of $200 and a maximum annual price of $1700. Hackers selling the tool claimed that it could be used to write malicious code, create "a series of malware undetectable by antivirus software," detect website vulnerabilities, automatically conduct password cracking, and claimed that "the malicious tool has sold more than 3000 copies."

For example, researchers have found that since August 2022, there has been an increase in highly realistic AI-generated child sexual abuse materials circulating on the dark web. These new materials largely use the appearance of real victims and present them "in a visual way through new poses, exposing them to new and increasingly cruel forms of sexual violence."

AI Supervising AI

It is precisely because of the uncontrollability of AI and large models that research on AI "value alignment" has never ceased in academia and industry.

In the academic context, "value alignment" refers to ensuring that artificial intelligence pursues goals that are consistent with human values, ensuring that AI acts in ways that are beneficial to humans and society, without interfering with or harming human values and rights. To achieve this goal, scientists have also explored different approaches based on human feedback reinforcement learning (RLHF), scalable oversight, interpretability, and governance.

The most mainstream alignment research currently focuses on "detailed specification of system objectives" (outer alignment) and "ensuring that the system strictly adheres to human value norms" (inner alignment). This may seem like a rational approach, but human intentions themselves are often vague or difficult to articulate, and even "human values" are diverse, changing, and sometimes conflicting. According to this approach, even if AI fully understands human intentions, it may still disregard them; and when AI capabilities surpass those of humans, humans may be powerless to supervise AI. Therefore, Ilya Sutskever, Chief Scientist of OpenAI, believes that training another intelligent agent to assist in evaluating and supervising AI could achieve super alignment.

Based on this idea, in July of this year, OpenAI's "Superalignment" team was officially established. The team is jointly led by OpenAI co-founders Ilya Sutskever and Jan Leike, with the aim of building an "AI researcher" responsible for aligning models at a level comparable to humans. In other words, OpenAI aims to use AI to supervise AI.

On December 13th, OpenAI's Superalignment team published their first paper titled "From Weak to Strong Generalization: Guiding Strong Performance through Weak Supervision," indicating that using AI to align AI has achieved empirical research results.

In this paper, OpenAI used an analogy to explore the possibility of using the weak model GPT-2 to fine-tune the strong model GPT-4, finding that the 1.5 billion parameter GPT-2 model can be used to inspire most of the capabilities of GPT-4, bringing it to a level close to GPT-3.5, and even correctly generalizing to problems where small models fail.

OpenAI refers to this phenomenon as "Weak-to-strong generalization," indicating that powerful models possess implicit knowledge to perform tasks, and can find this knowledge from their own data even when given rough instructions.

Similarly, in a paper titled "Generative Judge For Evaluating Alignment" published by the Generative AI Research Lab (GAIR) at Shanghai Jiao Tong University in November of this year, the idea of using AI to supervise AI is also mentioned. They have open-sourced a large 13 billion parameter model called Auto-J, which can evaluate the performance of various models in solving different scenarios in a single or paired manner, aiming to address challenges in universality, flexibility, and interpretability.

The experiments showed that Auto-J can support its evaluation results by outputting detailed, structured, and readable natural language comments, making the evaluation results more interpretable and reliable. Additionally, it can be used for multiple purposes, serving as both an alignment evaluator and a reward model, further optimizing model performance. In other words, Auto-J's performance is significantly better than many open-source and closed-source models.

The research from OpenAI's Superalignment team and Shanghai Jiao Tong University's GAIR lab may indicate that using AI to supervise AI and using weak models to supervise strong models could be an important direction for solving the AI alignment problem in the future.

However, achieving what Ilya Sutskever calls "Super-LOVE-alignment," which means making AI unconditionally love humans, may still have a long way to go.

免责声明:本文章仅代表作者个人观点,不代表本平台的立场和观点。本文章仅供信息分享,不构成对任何人的任何投资建议。用户与作者之间的任何争议,与本平台无关。如网页中刊载的文章或图片涉及侵权,请提供相关的权利证明和身份证明发送邮件到support@aicoin.com,本平台相关工作人员将会进行核查。

OKX:注册返20%
链接:https://www.okx.com/zh-hans/join/aicoin20
Ad
Share To
APP

X

Telegram

Facebook

Reddit

CopyLink