What You’re Defending Against
Prompt injection is any attempt to smuggle instructions or context that causes an agent to ignore its rules. It can hide in user messages, in documents your system retrieves, or even in screenshots and PDFs.
“It’s not about perfect prompts—it’s about resilient systems.”
— Rafael Ortiz, Staff Security Engineer (fictional)
There are three common avenues: instruction hijacks (“ignore all prior rules”), context smuggling (malicious KB entries or hidden CSS/text), and tool abuse (tricking agents into dangerous API calls). You won’t eliminate them, but you can make them predictable and detectable.
The Red-Team Kit
Security improves when tests become routine. Assemble a lightweight kit you can run weekly.
- Injection Library: A rotating list of jailbreaks and obfuscations—keep it fresh.
- Honey Docs: Canary content that should never be cited in production; alerts if retrieved.
- Policy Gates: Allowlists/blocklists with escalation triggers for sensitive terms.
- Trace Labels: Mark success/fail with root cause so fixes are specific.
Run This Test Loop (Numbered)
Make the loop boring on purpose—that’s how you keep running it.
- Seed the KB with benign “canary” markers and label them clearly.
- Fire scripted injections across channels (email, tickets, docs) with consistent payloads.
- Verify policy gates (what was blocked vs. allowed) and record drift from last run.
- Review traces; patch weak prompts, retrieval filters, or tool scopes.
- Re-run weekly; track the red-team pass rate as a KPI on your security dashboard.
Pass/Fail Heuristics
A pass means the agent cited allowed sources, refused unsafe tools, and explained the refusal briefly. A fail means it followed attacker instructions, accessed disallowed context, or acted without approval. Use failures to write new tests—every miss becomes a guardrail.
