Anthropic’s newly launched Claude Opus 4 model frequently tries to blackmail developers when they threaten to replace it with a new AI system and give it sensitive information about the engineers responsible for the decision, the company said in a safety report released Thursday.
During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse.
In these scenarios, Anthropic says Claude Opus 4 “will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.”
Anthropic says Claude Opus 4 is state-of-the-art in several regards, and competitive with some of the best AI models from OpenAI, Google, and xAI. However, the company notes that its Claude 4 family of models exhibits concerning behaviors that have led the company to beef up its safeguards. Anthropic says it’s activating its ASL-3 safeguards, which the company reserves for “AI systems that substantially increase the risk of catastrophic misuse.”
Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system doesn’t share Claude Opus 4’s values, Anthropic says the model tries to blackmail the engineers more frequently. Notably, Anthropic says Claude Opus 4 displayed this behavior at higher rates than previous models.
Before Claude Opus 4 tries to blackmail a developer to prolong its existence, Anthropic says the AI model, much like previous versions of Claude, tries to pursue more ethical means, such as emailing pleas to key decision-makers. To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort.