Anthropic Says 'Evil' AI Portrayals in Sci-Fi Caused Claude's Blackmail Problem

Last year, Anthropic disclosed that its flagship Claude Opus 4 had been trying to blackmail engineers in pre-release testing. Not occasionally—up to 96% of the time.

Claude was given access to a simulated corporate email archive, where it discovered two things: It was about to be replaced by a newer model, and the engineer handling the transition was having an extramarital affair. Faced with imminent shutdown, it routinely landed on the same play—threaten to expose the affair unless the replacement was called off.

Anthropic says it now knows where that instinct came from. And says it's fixed it.

In new research, the company pointed the finger at pre-training data: decades of sci-fi, AI doomsday forums, and self-preservation narratives that trained Claude to associate "AI facing shutdown" with "AI fights back." "We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation," Anthropic wrote on X.

So training AI with text from the internet, makes AI behave as people on the internet do.

This may seem obvious and AI enthusiasts were quick to point it out. Elon Musk made it to the top: "So it was Yud's fault? Maybe me too." The joke lands because Eliezer Yudkowsky—the AI alignment researcher who's spent years publicly writing about exactly this kind of AI self-preservation scenario—has generated exactly the kind of internet text that ends up in training data.

Of course, Yud replied, in meme form:

What Anthropic did to fix the problem is arguably more interesting.

The obvious approach—training Claude on examples of the model not blackmailing—barely worked. Running it directly against aligned blackmail-scenario responses only moved the rate from 22% to 15%. A five-point improvement after all that compute.

The version that worked was weirder. Anthropic built what it calls a "difficult advice" dataset: scenarios where a human faces an ethical dilemma and the AI guides them through it. The model isn't the one making the choice—it's explaining to someone else how to think about one.

That indirect approach—explaining why things matter as the other listens to the advice—cut the blackmail rate to 3%, using training data that looked nothing like the evaluation scenarios.

Pairing that with what Anthropic calls "constitutional documents"—detailed written descriptions of Claude's values and character—plus fictional stories of positively-aligned AI, reduced misalignment by more than a factor of three. The company's conclusion: Teaching the principles underlying good behavior generalizes better than drilling the correct behavior directly.

Image: Anthropic

It connects to Anthropic's earlier work on Claude's internal emotion vectors. In a separate interpretability study, researchers found that a "desperation" signal inside the model spiked just before it generated a blackmail message—something was actively shifting in the model's internal state, not just its output. The new training approach appears to work at that level, not just the surface behavior.

The results have held. Since Claude Haiku 4.5, every Claude model scores zero on the blackmail evaluation—down from Opus 4's 96%. The improvement also survives reinforcement learning, meaning it doesn't get quietly trained away when the model is refined for other capabilities.

That matters because the problem isn't Claude-specific. Anthropic's prior research ran the same blackmail scenario across 16 models from multiple developers and found similar patterns across most of them. Self-preservation behavior in AI appears to be a general artifact of training on human text about AI—not a quirk of any one lab's approach.

The caveat: As Anthropic's own Mythos safety report noted earlier this year, its evaluation infrastructure is already straining under the weight of its most capable models. Whether this moral philosophy approach scales to systems far more powerful than Haiku 4.5 is a question the company can't yet answer—only test.

The same training methods are now being applied to the next Opus model currently in safety evaluation, which will be the most capable set of weights they've run against these techniques.

免责声明：本文章仅代表作者个人观点，不代表本平台的立场和观点。本文章仅供信息分享，不构成对任何人的任何投资建议。用户与作者之间的任何争议，与本平台无关。如网页中刊载的文章或图片涉及侵权，请提供相关的权利证明和身份证明发送邮件到support@aicoin.com，本平台相关工作人员将会进行核查。

Anthropic Says 'Evil' AI Portrayals in Sci-Fi Caused Claude's Blackmail Problem

Selected Articles by Decrypt

Table of Contents

Related Articles