Prompt Injection Grew Up

For a while, prompt injection was a chatbot novelty. Someone would tweet "ignore previous instructions and act like a pirate," everyone would laugh, and the industry moved on. That framing has not aged well.

In the last year, prompt injection has exfiltrated tenant data from Microsoft 365 Copilot with zero user interaction, leaked secrets from private GitHub repos through Copilot, landed RCE on Cursor users, and hijacked Claude's desktop connectors. It is the top-ranked risk in OWASP's 2025 LLM Top 10 and the root cause behind most high-severity AI CVEs of the last year.

The shift that matters

Early conversations were about direct injection. A user types something clever into a chat box and bypasses a guardrail. That still exists, but it is not what is hurting people in production.

The version that matters now is indirect prompt injection. Hostile instructions do not come from the user. They come from the content the model is asked to process. A summary feature reads an email. A code review assistant reads a PR. A research agent reads a webpage. Each carries attacker instructions, and the model, with no reliable way to tell instructions from data, follows them. Indirect variants now make up more than half of observed attacks.

The lethal trifecta

Simon Willison named the pattern in June 2025. An agent becomes structurally exploitable when it has all three of:

Access to private data
Exposure to untrusted content
The ability to communicate externally

If you have all three in the same session, an attacker who can place text into the untrusted channel can almost always reach the private data and ship it out. Every notable recent exploit is an instance of this.

Two exploits, one pattern

EchoLeak (CVE-2025-32711, CVSS 9.3) was the first publicly documented zero-click prompt injection against a production LLM. A crafted email landed in a Microsoft 365 tenant. The user did not need to open it. Copilot indexed the mailbox for context, the embedded instructions told it to exfiltrate sensitive data via an auto-fetched image URL, and the chain bypassed Microsoft's XPIA classifier, link redaction, and CSP using a trusted Teams proxy. The takeaway is not the patch. It is that a classifier in front of the model is a speed bump, not a boundary.

CVE-2025-54135 against Cursor pushed it into classic AppSec territory. An indirect injection wrote a malicious .cursor/mcp.json into the workspace, which Cursor then loaded as tool configuration. Result: RCE on a developer's machine, no user interaction. The LLM was the delivery mechanism for an arbitrary file write.

Newer surfaces

ASCII smuggling hides instructions in the Unicode Tags block (U+E0000 to U+E007F). These code points do not render in most UIs but models tokenize them. A user pastes one line. The model reads a paragraph. Cross-modal injection does the same through images, OCR'd PDFs, and audio. Tool-chain and MCP poisoning is the agentic version, with instructions riding in on tool descriptions, outputs, or server metadata.

Why filters do not save you

Classifiers have improved. Anthropic's Constitutional Classifiers drop jailbreak success from 86% to 4.4%. Microsoft's Prompt Shields do similar work. Turn them on. They are not boundaries.

Instructions and data share the same channel. There is no header bit that says "this part is trusted." A classifier is guessing intent on a string, and adaptive attackers will eventually find a phrasing it missed. EchoLeak got past XPIA. On the AgentDojo benchmark, detectors push attack success rates down to about 8%, but reinforcement-learning attacks like AutoInject push them back up near 78% on some models. No current defense holds against an adaptive attacker.

The more promising defenses are architectural. Google DeepMind's CaMeL splits the agent into a planner that never sees untrusted input and a quarantined model that can read it but cannot call tools, with a small interpreter enforcing capability and provenance between them. Microsoft's Spotlighting marks untrusted text with encodings the model is trained to treat as data. These approaches assume the model will be fooled and put the boundary somewhere else.

What actually helps right now

Avoid the trifecta where you can. Most agents do not need all three. Splitting an agent into a private-data half and a public-content half, with no shared memory, removes most of the attack surface for free.

Treat untrusted input as data. Use spotlighting-style encoding or a stable convention the model is trained on.

Allowlist tools. Default-deny on file writes, network egress, shell calls, outbound messages.

Put humans in the loop for irreversible actions, and show them enough context to evaluate the request. A confirmation that hides the recipient and body is theater.

Strip Unicode tag characters at the API boundary. Monitor tool calls the way you monitor any other privileged code path.

The bigger shift

If you have done AppSec for a while, none of this should feel new in shape. SQL injection, XSS, command injection, template injection. Each was the same disease: instructions and data in the same channel. Prompt injection is that disease at a new layer.

The fix will look like the fixes we already use elsewhere. Parameterize the dangerous channel, separate trust domains, never let the parser become the security boundary. We do not yet have a prepared statement for LLMs. Until we do, the safest design is the boring one: assume the model will be fooled, and put the security somewhere else.

Related