Given this, Claude Code shipped with the simplest possible defense: allow reads, require approval for write, bash

Thu, Jun 4 · 12:27 AM UTC 2 min read

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.

◌ Single Source

When bounds can be placed on the relative damage of an autonomous agent—such as through control over its environment—high-utility capabilities can motivate deployment. Claude Mythos Preview is an example of a model whose blast radius was deemed too high to ship in April 2026. However, we expect broa.

However, as mentioned, approval fatigue showed up within weeks.

Key facts

On Gray Swan's Agent Red Teaming benchmark, which tests susceptibility to prompt injection, Claude Opus 4.7 holds attack success to roughly 0.1% on single attempts, and around 5–6% after 100 adaptive
Between mid-2025 and January 2026, they received reports of vulnerabilities in Claude Code through their responsible disclosure program
In February 2026, during a controlled internal red-team exercise, a researcher successfully phished an employee into launching Claude Code with a malicious prompt
Across 25 retries of that prompt, Claude completed the exfiltration 24 times

Summary

Twelve months ago, they'd have rejected out of hand the idea of granting Claude access sufficient to take down an internal Anthropic service. Today that level of access is routine, and Anthropic developers are more productive for it. The first is to supervise the agent’s behavior via a human-in-the-loop. Claude Code previously protected against agents taking unintended actions by asking users for permission at each turn. The second approach to capping the blast radius—and the focus of much of this post—is containment. Over the past two years, they've shipped three primary agentic products: claude.ai, Claude Code, and Claude Cowork. User misuse: A user—either maliciously or through carelessness—directs the agent to do something harmful.

Read full article at anthropic.com →

#Claude Code #Anthropic #Claude