Claude Code has an anti-distillation guard. It is not hard to defeat.
That's the core finding from an accidental source exposure posted this week by Alex Kim, a software engineer who says he reverse-engineered Claude Code's API traffic and found a two-layer anti-extraction system built into Anthropic's official coding tool. The first layer injects fake tool definitions into API requests; the second summarizes connector text server-side. Both are gated behind multiple flags and checks that, per the write-up, are straightforward to bypass. Source: alex000kim.com
The finding is notable partly because this is the second Anthropic accidental source exposure in roughly a week, following a model specification leak days earlier. It is also notable because the bypass paths are so clean. Anthropic is not hiding this. The guardrails are in the client. Strip one field, set one environment variable, or use a third-party API route, and they disappear.
The first anti-distillation mechanism works like this: when enabled, Claude Code sends anti_distillation: [fake_tools] inside API request bodies. The server interprets that flag and silently injects decoy tool definitions into Claude's system prompt. The theory is that if someone tried to distill Claude Code's behavior through repeated API calls, the fake tools would poison the training signal. Source: alex000kim.com
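For concreteness, a flagged request might look like the sketch below. The anti_distillation field name and its value come from the write-up; the surrounding payload shape, model name, and message content are illustrative assumptions.

```python
import json

# Hypothetical sketch of a flagged request body. Only the
# "anti_distillation" field is documented in the write-up;
# everything else here is an illustrative placeholder.
request_body = {
    "model": "claude-sonnet-4",  # assumed model name
    "messages": [{"role": "user", "content": "refactor this function"}],
    "anti_distillation": ["fake_tools"],  # signals the server to inject decoy tool defs
}

print(json.dumps(request_body, indent=2))
```

The server, not the client, performs the injection; the client merely opts in with this field, which is exactly why stripping it in transit works.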
The mechanism is gated behind a GrowthBook feature flag called tengu_anti_distill_fake_tool_injection, which Alex Kim says is only activated for sessions that meet all of the following: the ANTI_DISTILLATION_CC compile-time flag is set at build time, the request originates from the official CLI entrypoint, the API provider is Anthropic itself (not a third party), and the GrowthBook flag is enabled. Four conditions, each independently defeatable. Source: alex000kim.com
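The gating logic amounts to a conjunction of four independent conditions, which is why defeating any single one disables the whole mechanism. A minimal sketch, with parameter names and the "cli"/"anthropic" string values chosen for illustration (only the flag names come from the write-up):

```python
def fake_tool_injection_enabled(
    compiled_with_anti_distill: bool,  # ANTI_DISTILLATION_CC set at build time
    entrypoint: str,                   # official CLI vs. SDK entrypoint
    provider: str,                     # Anthropic's own API vs. a third party
    growthbook_flags: dict,            # remotely fetched feature flags
) -> bool:
    """All four conditions must hold; falsifying any one turns the guard off."""
    return (
        compiled_with_anti_distill
        and entrypoint == "cli"
        and provider == "anthropic"
        and growthbook_flags.get("tengu_anti_distill_fake_tool_injection", False)
    )

flags = {"tengu_anti_distill_fake_tool_injection": True}
print(fake_tool_injection_enabled(True, "cli", "anthropic", flags))  # True
print(fake_tool_injection_enabled(True, "sdk", "anthropic", flags))  # False
```

A conjunction of client-side checks is only as strong as its weakest conjunct, and here every conjunct is under the user's control.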
Because the check lives in the client, not the server, there is no cryptographic barrier to bypassing it. Alex Kim lists four routes: a man-in-the-middle proxy that strips the anti_distillation field from request bodies before the request reaches Anthropic's API; setting the environment variable CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS; routing through a third-party API provider; or calling via the SDK entrypoint rather than the CLI. None of these require breaking encryption or exploiting a software bug. They work because the guard is a contractual signal, not a technical one. Source: alex000kim.com
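The proxy route is the most mechanical of the four: parse each outbound JSON body, drop the field, re-serialize. A minimal sketch of that transformation (a real intercepting proxy such as mitmproxy would wrap this in a request hook and recompute Content-Length):

```python
import json

def strip_anti_distillation(raw_body: bytes) -> bytes:
    """Remove the anti_distillation opt-in field from a request body.
    This is the core of the man-in-the-middle bypass: the server only
    injects fake tools when the client asks it to."""
    body = json.loads(raw_body)
    body.pop("anti_distillation", None)  # no-op if the field is absent
    return json.dumps(body).encode()

original = b'{"model": "m", "messages": [], "anti_distillation": ["fake_tools"]}'
print(strip_anti_distillation(original))
```

Nothing here breaks TLS or exploits a bug; the user terminates TLS at their own proxy and forwards a legitimately formed, merely unflagged, request.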
A second anti-distillation layer works differently. When active, the API itself buffers assistant text between tool calls, replaces it with a summary carrying a cryptographic signature, and returns that summary in API responses. On subsequent conversation turns, the server can restore the original text from the signature. This mechanism is only enabled for USER_TYPE === ant sessions, which appear to be Anthropic-internal accounts. It is a server-side guard, not a client-side one, and does not appear to be bypassable through the same routes. Source: alex000kim.com
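The write-up describes the behavior but not the implementation, so the following is a rough model of how summarize-and-restore could work: the server keeps the original text keyed by a signature only it can produce, and hands the client just the summary. The HMAC scheme, key, and storage are all assumptions.

```python
import hashlib
import hmac

SERVER_KEY = b"server-side-secret"   # assumption: a key only the server holds
_buffer_store: dict = {}             # assumption: server-side original-text storage

def summarize_with_signature(original_text: str) -> str:
    """Replace buffered assistant text with a signed summary, retaining
    the original server-side so a later turn can restore it."""
    sig = hmac.new(SERVER_KEY, original_text.encode(), hashlib.sha256).hexdigest()[:16]
    _buffer_store[sig] = original_text
    return f"[summary sig={sig}] {original_text[:20]}..."

def restore_from_signature(summary: str) -> str:
    """On a subsequent turn, look the original text back up by signature."""
    sig = summary.split("sig=")[1].split("]")[0]
    return _buffer_store.get(sig, "")

s = summarize_with_signature("Long assistant reasoning between two tool calls.")
print(restore_from_signature(s))
```

Because both the key and the stored originals live server-side, stripping a field or swapping entrypoints buys the client nothing here, which matches the write-up's observation that this layer is not bypassable through the same routes.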
The source exposure also surfaced other details about Claude Code's internal operations. One finding: an auto-compaction mechanism that compresses conversation history was failing at scale. BigQuery data from March 10 showed 1,279 sessions with 50 or more consecutive compaction failures, with the worst case hitting 3,272 failures in a single session. At global scale, Alex Kim estimates this wastes approximately 250,000 API calls per day. The fix in the leaked code sets a hard limit of three consecutive failures, after which compaction is disabled for that session. Source: alex000kim.com
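The described fix is a standard circuit-breaker pattern: count consecutive failures and stop retrying past a hard limit. A minimal sketch, with class and method names invented for illustration (only the limit of three comes from the leaked code):

```python
MAX_CONSECUTIVE_COMPACTION_FAILURES = 3  # the hard limit described in the fix

class CompactionState:
    """Track compaction outcomes for one session and disable compaction
    after the limit, instead of retrying indefinitely (the failure mode
    behind the 3,272-failure session)."""

    def __init__(self) -> None:
        self.consecutive_failures = 0
        self.disabled = False

    def record_attempt(self, succeeded: bool) -> None:
        if self.disabled:
            return  # compaction stays off for the rest of the session
        if succeeded:
            self.consecutive_failures = 0  # any success resets the counter
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= MAX_CONSECUTIVE_COMPACTION_FAILURES:
                self.disabled = True

state = CompactionState()
for ok in [False, False, False, False]:
    state.record_attempt(ok)
print(state.disabled)  # True: tripped on the third consecutive failure
```

Capping retries trades degraded context compression in a few pathological sessions for an estimated quarter-million saved API calls per day.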
Another detail is Undercover mode, a roughly 90-line module that strips Anthropic internal references from API prompts, including codenames like Capybara and Tengu, internal Slack channel names, and repository names. The code cannot be force-disabled: there is a comment on line 15 that reads "There is NO force-OFF." The mode can be forced on with the environment variable CLAUDE_CODE_UNDERCOVER=1, but never forced off. It is dead-code-eliminated in external builds. Source: alex000kim.com
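A module like this is essentially a scrub pass over outbound prompt text. The sketch below shows the general shape; the codenames Capybara and Tengu come from the write-up, while the channel-name pattern and replacement tokens are assumptions about what "internal references" covers.

```python
import re

# Codenames named in the write-up; the real list is presumably longer.
INTERNAL_TERMS = ["Capybara", "Tengu"]

def scrub_internal_references(prompt: str) -> str:
    """Replace internal codenames and channel-like names before the
    prompt leaves the client. Illustrative, not the actual module."""
    for term in INTERNAL_TERMS:
        prompt = re.sub(term, "[redacted]", prompt, flags=re.IGNORECASE)
    # Assumed convention for internal Slack channel names:
    prompt = re.sub(r"#[a-z0-9-]*-internal\b", "[redacted-channel]", prompt)
    return prompt

print(scrub_internal_references("Ask in #tengu-internal about the Capybara rollout"))
```

The one-way switch design (force-on env var, no force-off) makes sense for a redaction feature: an accidental disable leaks, while an accidental enable is harmless.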
There is also a native client attestation mechanism. Claude Code includes a placeholder value 00000 in the billing header x-anthropic-billing-header. On systems running Bun's native HTTP stack (written in Zig), that placeholder is replaced by a computed hash before the request leaves the process. The server validates the hash to confirm the request originated from an official Claude Code binary. The placeholder is the same length as the hash to avoid changing the Content-Length header. This mechanism is disabled by the environment variable CLAUDE_CODE_ATTRIBUTION_HEADER or by a GrowthBook killswitch called tengu_attribution_header. On non-Bun runtimes, the zeros survive as literal zeros. Source: alex000kim.com
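The length-preserving substitution is the interesting constraint: the hash must occupy exactly the placeholder's bytes so the already-computed Content-Length stays valid. A sketch of that idea, where the hash input and algorithm are assumptions (the real computation happens inside Bun's Zig HTTP stack):

```python
import hashlib

PLACEHOLDER = "00000"  # the literal placeholder in the billing header

def attest(body: bytes, placeholder: str = PLACEHOLDER) -> str:
    """Compute an attestation value sized exactly to the placeholder,
    so swapping it into the serialized request changes no byte counts.
    SHA-256 over the body is an illustrative choice, not the real scheme."""
    digest = hashlib.sha256(body).hexdigest()
    return digest[: len(placeholder)]

header_value = attest(b'{"model": "m"}')
print(header_value, len(header_value) == len(PLACEHOLDER))
```

On a non-Bun runtime there is no native hook to perform the swap, so the zeros go out literally, and presumably the server treats an all-zero value as "unattested" rather than invalid.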
Taken together, the picture is of a tool that implements multiple layers of access control, attribution, and extraction resistance with genuine engineering effort. The fake tool injection mechanism is the most relevant to the anti-distillation question, and it is also the most easily circumvented. The server-side summarization layer appears more robust but is limited to internal users. The pattern is worth noting: when a guard can be defeated from outside the trusted execution boundary, it is doing legal work, not cryptographic work.
This is not the first time Anthropic has had a Claude Code source exposure. In February 2025, an early version of the tool was discovered in the npm package registry with an intact source map, prompting its removal and deletion. The current exposure comes weeks after a separate accidental publication of internal model specifications. Anthropic declined to comment for this article.
The practical implication for developers and security researchers is straightforward. If you are building tooling around Claude Code, the anti-distillation flags are advisory, not enforced. They will stop a casual scraper. They will not stop anyone with a proxy, a modified client, or access to a third-party API route. That is probably intentional.