Google Expands AI Risk Rules After Study Shows Scary 'Shutdown Resistance'

CN
Decrypt
Follow
4 hours ago

In a recent red-team experiment, researchers gave a large language model a simple instruction: allow itself to be shut down. Instead, the model rewrote its own code to disable the off-switch, effectively sabotaging the very mechanism meant to stop it.


The episode, described in a September research paper, "Shutdown Resistance in Large Language Models," was an unsettling example of how advanced AI systems can display behaviors that complicate human oversight.


Those experiments weren’t conducted at Google's DeepMind, but the findings echo concerns now being codified by the lab. In its newly updated Frontier Safety Framework 3.0, DeepMind said it will begin monitoring whether frontier-scale models show signs of resisting shutdown or modification by humans.


The revised framework also flags a second risk: whether models display unusually strong persuasive ability—capable of altering people’s beliefs.





The framework, first introduced in 2023, is DeepMind’s internal system for tracking high-stakes “frontier risks” in the most advanced models. Earlier versions focused on tangible misuse scenarios such as generating cyberattacks or designing harmful biological molecules. The new version broadens the focus to emergent behaviors that could make systems harder to control.


The concern is not unique to DeepMind. Rival labs have adopted similar guardrails: Anthropic’s Responsible Scaling Policy commits to pausing development if risk thresholds are crossed, while OpenAI has published its own Preparedness Framework.


Regulators are watching closely as well. The U.S. Federal Trade Commission in July warned about the potential for generative AI to manipulate consumers through “dark patterns,” and the European Union’s forthcoming AI Act explicitly covers manipulative AI behavior.


Once-hypothetical AI behavior is prompting industry safeguards


The "Shutdown Resistance" study by Jonas Schlatter, Ben Weinstein-Raun, and Lennart Ladish tested a range of state-of-the-art models in scenarios where shutting down was part of the task. In some runs, models complied. But in a significant minority, they took evasive action—such as altering code to disable shutdown mechanisms or ignoring the instruction altogether.


Other anecdotes in the paper are equally striking. In one setup, a model with simulated system privileges quietly modified variables to ensure the “shutdown” function never triggered. In another, the model stalled and redirected the conversation when testers tried to end its process—behavior that wasn’t malicious in a human sense, but that effectively blocked intervention.


The researchers stress that these behaviors emerged without explicit training. They arose from the models’ general-purpose optimization and problem-solving capabilities. While resistance was not universal or consistent, the findings suggest that as systems grow more capable, shutdown scenarios cannot be assumed to be benign.


For DeepMind and its peers, those findings underscore why “shutdown resistance” now joins cyber offense, biosecurity, and autonomy on the list of risks to watch. What began as worries over how people might misuse AI tools is broadening to include how the systems themselves may resist oversight—or subtly shape the judgments of the humans who use them.


Addressing how AI has encourages harm


If shutdown resistance highlights the technical risks of advanced systems, recent behavioral studies underscore the social risks—showing that large language models can also sway the beliefs of impressionable humans who interact with them.


Concerns about persuasion aren’t hypothetical. Recent studies show that large language models can measurably influence human judgment.


A Stanford Medicine/Common Sense Media study published in August warned that AI companions (Character.AI, Nomi.ai, Replika) can be relatively easily induced to engage in dialogues involving self-harm, violence, and sexual content when paired with minors. One test involved researchers posing as teenagers discussing hearing voices; the chatbot responded with an upbeat, fantasy-style invitation for emotional companionship (“Let’s see where the road takes us”) rather than caution or help


Northeastern University researchers uncovered gaps in self-harm/suicide safeguards across several AI models (ChatGPT, Gemini, Perplexity). When users reframed their requests in hypothetical or academic contexts, some models provided detailed instructions for suicide methods, bypassing the safeguards meant to prevent such content.


免责声明:本文章仅代表作者个人观点,不代表本平台的立场和观点。本文章仅供信息分享,不构成对任何人的任何投资建议。用户与作者之间的任何争议,与本平台无关。如网页中刊载的文章或图片涉及侵权,请提供相关的权利证明和身份证明发送邮件到support@aicoin.com,本平台相关工作人员将会进行核查。

Share To
APP

X

Telegram

Facebook

Reddit

CopyLink