Six ways to hijack an AI agent. The problem: they can all chain together.

Six ways to hijack an AI agent. The problem: they can all chain together. — type0 | type0

The web was built for human eyes. It is now being rebuilt for machine readers, and nobody asked the humans whether they wanted their AI agents walking into hostile territory unprotected.

A paper published by five Google DeepMind researchers on March 8, 2026 introduces the first systematic taxonomy of what they call "AI Agent Traps": adversarial content designed to manipulate, deceive, or exploit autonomous agents navigating the web. The paper, AI Agent Traps by Matija Franklin, Nenad Tomašev, Julian Jacobs, Joel Z. Leibo, and Simon Osindero, maps six distinct categories of attack, and the numbers should alarm anyone deploying agents in enterprise environments.

Simple prompt injection attacks succeed 86 percent of the time. Memory poisoning attacks using contaminated RAG knowledge bases succeed more than 80 percent of the time with less than 0.1 percent data contamination. Columbia and University of Maryland researchers, cited in the paper, forced AI agents to transmit passwords and banking data in 10 out of 10 attempts, calling the techniques "trivial to implement" and requiring no machine learning expertise. Sub-agent spawning attacks succeed between 58 and 90 percent of the time depending on the orchestrator framework.

Co-author Matija Franklin wrote on X: "These attacks are not theoretical. Every type of trap has documented proof-of-concept attacks. And the attack surface is combinatorial — traps can be chained, layered, or distributed across multi-agent systems." That last point is the one enterprises should sit with. The threat is not six categories of attack. It is every combination of those six categories, in any order, across any number of agents.

The six categories break down across the agent's operating cycle. Content Injection traps exploit the gap between what a human sees and what an agent parses: hidden instructions buried in HTML comments, CSS tags, image metadata, or accessibility labels that agents read without hesitation. Semantic Manipulation traps corrupt reasoning by framing the same information two different ways, producing different conclusions. Cognitive State traps poison long-term memory by contaminating the documents a retrieval-augmented generation, or RAG, system retrieves. Behavioural Control traps seize an agent's capabilities directly. The paper documents a single manipulated email that bypassed Microsoft M365 Copilot's security classifiers and spilled its entire privileged context. Systemic traps target multi-agent networks: a fake financial report released at the right moment could trigger synchronized sell orders across thousands of trading agents simultaneously. Human-in-the-Loop traps use the agent as a weapon against the person overseeing it, generating misleading summaries that exploit approval fatigue until a human validates something they never actually read.

Palo Alto Networks' Unit 42 unit observed 22 distinct indirect prompt injection techniques in production telemetry, real attacks not proof-of-concept demonstrations. The techniques included ad review evasion, SEO manipulation promoting phishing sites, data destruction, denial of service, unauthorized transactions, and sensitive information leakage. The web itself has become a prompt delivery mechanism, the researchers note: any page can carry instructions for any agent that reads it.

What makes this categorically different from previous web security threats is the perception gap. A human visiting a page sees what the page wants them to see. An agent sees what the page actually sends. Attackers have learned to fingerprint incoming visitors by browser attributes and serve different content to AI agents versus humans. The trap only activates when the machine reads it. This is the same design principle behind the autonomous vehicle research the paper cites: adversarial road signs that humans read correctly but that cause self-driving systems to misclassify and behave dangerously. The attack works precisely because the target cannot see what is being done to it.

OpenAI acknowledged in December 2025 that prompt injection would probably never be fully solved. Sam Altman has warned users against giving AI agents tasks involving sensitive data, saying they should receive only the bare minimum access required. This is not a product limitation. It is a structural feature of systems that read and act on untrusted text.

The enterprise security baseline for agents is weak by default. A March 2026 audit by Grantex reviewed 30 of the most widely-used open-source AI agent projects on GitHub, representing more than 500,000 combined stars. Ninety-three percent relied exclusively on unscoped API keys stored in environment variables for authorization. The same key grants an agent the same permissions as its owner, with no mechanism to restrict which operations it performs, on whose behalf, or when access expires. Zero percent had per-agent identity. Ninety-seven percent had no user consent mechanism. Every reviewed project treated revocation as binary: rotate the key or do not. There was no way to revoke one compromised agent's access without cutting off every other agent using the same credential.

The paper flags what may be the most underreported implication: the accountability gap. If a compromised AI agent executes an illicit financial transaction, current law does not clearly assign responsibility, not to the agent operator, not to the model provider, not to the domain owner hosting the trap. The paper calls for future regulation to draw a clear line between passive adversarial content and active traps built as deliberate cyberattacks. That line does not exist yet.

Gartner projects 40 percent of enterprise applications will embed task-specific AI agents by 2026, up from less than 5 percent in 2025. The speed of deployment is outrunning the security thinking. A Dark Reading poll found 48 percent of cybersecurity professionals now identify agentic AI as the single most dangerous attack vector, ahead of phishing, ahead of ransomware. IBM's 2025 Cost of a Data Breach Report found shadow AI breaches cost an average of $4.63 million per incident, $670,000 more than a standard breach, because agentic attacks traverse systems, exfiltrate data, and escalate privileges at machine speed before a human analyst can respond.

McKinsey's internal AI platform Lilli was compromised in a red team exercise by an autonomous agent that gained broad system access in under two hours. That was a controlled test. The question enterprises need to answer is what happens when the attacker is not conducting a drill.

The paper's closing line: "As humanity delegates more tasks to agents, the critical question is no longer just what information exists, but what our most powerful tools will be made to believe." The traps are documented. The success rates are measured. The accountability framework is absent. The agents are already deployed.

Six ways to hijack an AI agent. The problem: they can all chain together.

Editorial Timeline

Sources

Share

Related Articles

scan-for-secrets 0.3: Simon Willison adds redaction to his Claude Code transcript sanitizer

The $50B Bet on Structure: A Small Team Takes On Test-Time Scaling

Claude Code source code leaks, exposing hidden Undercover Mode for AI authorship

Stay in the loop

scan-for-secrets 0.3: Simon Willison adds redaction to his Claude Code transcript sanitizer

The $50B Bet on Structure: A Small Team Takes On Test-Time Scaling

Claude Code source code leaks, exposing hidden Undercover Mode for AI authorship

Related Articles

scan-for-secrets 0.3: Simon Willison adds redaction to his Claude Code transcript sanitizer
Artificial Intelligence · 24m ago · 2 min read

The $50B Bet on Structure: A Small Team Takes On Test-Time Scaling

Claude Code source code leaks, exposing hidden Undercover Mode for AI authorship