AI Defense Techniques for LLMs: A Practitioner's Guide to Securing Large Language Models
A technical breakdown of proven AI defense techniques for LLMs — from input guardrails and prompt hardening to dual-model architectures and red teaming, mapped to OWASP and NIST frameworks.
Securing a large language model deployment is not the same as securing a traditional application. The attack surface is probabilistic, the input space is unbounded, and the model itself can be turned against the system it runs in. The ai defense techniques llm practitioners actually reach for in production are a layered set of controls — none sufficient alone, all necessary together.
This guide covers the techniques that hold up under real adversarial pressure, organized from the perimeter inward.
Understanding the Threat Landscape First
Before reaching for a tool, you need a map. The OWASP Top 10 for LLM Applications 2025 ↗ — assembled by 500+ contributors — is the closest thing the industry has to a canonical threat taxonomy. The 2025 list includes:
- LLM01: Prompt Injection — user or external inputs that override model instructions
- LLM02: Sensitive Information Disclosure — training data or system prompt leakage
- LLM03: Supply Chain — compromised base models, fine-tuning datasets, or plugins
- LLM04: Data and Model Poisoning — malicious pre-training or fine-tuning data
- LLM05: Improper Output Handling — insufficient sanitization of model responses before downstream use
- LLM06: Excessive Agency — models granted permissions beyond operational need
- LLM07: System Prompt Leakage — direct extraction of confidential system instructions
- LLM08: Vector and Embedding Weaknesses — poisoned RAG indexes or embedding attacks
- LLM09: Misinformation — confabulated outputs passed as facts
- LLM10: Unbounded Consumption — resource exhaustion via crafted requests
Prompt injection (LLM01) has held the top position for three straight years. It is also the vulnerability with the widest defensive literature, which is why most practical defense stacks center on it. But treat the whole list as your requirements document: each entry maps to a control family.
Layer 1: Input Guardrails
The first line of defense is inspection before the model sees any content. Every prompt — including retrieved RAG documents, tool outputs, and web content fetched by agents — should pass through a classifier trained to detect adversarial patterns.
The tldrsec/prompt-injection-defenses ↗ repository catalogs the main approaches:
Paraphrasing and retokenization. Pass the raw input through a lightweight model that paraphrases it. This disrupts token-level injection patterns while preserving semantic content. Retokenization breaks tokens into smaller units, defeating attacks that rely on specific subword sequences.
LLM-as-a-classifier (PromptGuard pattern). Route incoming input to a separate, dedicated filter model before the primary model. The filter model has a single job: return ALLOW or BLOCK. This is the approach behind open models like Llama Guard, ShieldGemma, and IBM Granite Guardian. Independent benchmarks show these filters cut injection success rates by 60–70% on common attack suites.
Structural delimiters and spotlighting. Wrap user input in explicit XML-style delimiters (<user_input>...</user_input>) and instruct the model that content inside those tags is data, never instructions. The spotlighting variant uses a different encoding (e.g., base64 or a Unicode transformation) for untrusted content, making provenance visible to the model at inference time.
Preflight testing. Before sending a concatenated prompt to the production model, run a deterministic test prompt against the input. If the output is non-deterministic in ways that suggest instruction injection, reject the request.
Layer 2: Prompt Architecture and Privilege Separation
System prompt design is the second control surface. Key techniques:
Instruction hierarchy. Fine-tuned models (GPT-4o, Claude 3.x) expose an explicit privilege stack: system prompt > operator instructions > user input. Rely on this hierarchy rather than trying to out-argue a malicious user in the prompt. Place the most critical behavioral constraints in the system turn only.
Post-prompting. Place user input before the final instruction rather than after it. This exploits positional recency bias — the model’s final instruction is the one at the end of the context, which is your controlled system text, not user-supplied content.
Least privilege for tool access. If a model can call external APIs, send emails, or write to a database, apply LLM06’s lesson: strip any capability the model does not need for the current task. An agent answering FAQs should not have write access to the CRM. Scoping tool permissions per conversation turn — not per deployment — is the gold standard. See guardml.io ↗ for open-source guardrail libraries that implement capability gating.
Layer 3: Output Filtering and Improper Output Handling
Everything the model returns is untrusted until proven otherwise. LLM05 (Improper Output Handling) covers cases where model output is passed directly into SQL queries, shell commands, browser rendering, or downstream LLM calls without sanitization.
Controls here mirror traditional output encoding:
- Strip or escape any output before inserting it into SQL, HTML, or shell contexts.
- Treat model-generated code as untrusted input to a static analyzer before execution.
- If the output is fed into another LLM (chained agents), apply the full input guardrail stack again — the second model has no way to distinguish between legitimate orchestrator instructions and injected content in the first model’s output.
The LLMOps practices at llmops.report ↗ document operational patterns for logging and inspecting all input/output pairs in production, which is the prerequisite for any retrospective detection of successful attacks.
Layer 4: Dual-Model and Ensemble Architectures
For high-stakes deployments, a single model handling both trusted and untrusted content is a structural weakness. The dual-LLM pattern addresses this:
- A privileged model handles trusted input (authenticated user sessions, verified tool output) and retains full tool access.
- A quarantined model handles untrusted external content (scraped web pages, user-uploaded files, third-party API responses) with no tool access. It can only return structured tokens to the privileged model.
Data passes between the two models through a narrow typed interface — not raw text. This eliminates the mechanism by which injected instructions cross the trust boundary.
For decisions with irreversible consequences (sending an email, executing a trade, deleting a record), ensemble voting adds a second check: route the proposed action to two or three independent model instances and only proceed if they agree. Attacks that work on one model variant often fail on others.
Layer 5: Red Teaming and Continuous Evaluation
Static controls drift. Models get updated, prompts get modified, new attack patterns emerge. NIST AI 600-1 ↗, the Generative AI Profile of the AI Risk Management Framework, frames continuous adversarial testing as a core governance requirement, not a one-time exercise.
Practical red teaming for LLM deployments includes:
- Automated adversarial probing. Tools like Garak, Promptfoo, and DeepTeam automate attack generation against a target model, running hundreds of jailbreak attempts, injection payloads, and data-extraction probes.
- Jailbreak regression suites. Maintain a database of known-bad prompts — including those logged from production — and run them against every new model version or system prompt change. The jailbreaks.fyi ↗ catalog and jailbreakdb.com ↗ are useful starting points for building these suites.
- Red team/blue team loops. Anthropic and other frontier labs have published results from automated red-team cycles where one model generates attacks and a second is fine-tuned to resist them. For enterprise deployments, a simpler version — a dedicated red team running weekly probes with findings fed back to the system prompt team — achieves meaningful coverage at low cost.
Mapping Controls to OWASP and NIST
| Threat | Primary Control | Secondary Control |
|---|---|---|
| LLM01 Prompt Injection | Input guardrails + structural delimiters | Dual-LLM architecture |
| LLM02 Sensitive Disclosure | Output filtering | RAG access controls |
| LLM05 Improper Output Handling | Output encoding/sanitization | Static analysis on code output |
| LLM06 Excessive Agency | Least-privilege tool scoping | Ensemble approval for high-stakes actions |
| LLM08 Vector/Embedding Weaknesses | Index integrity checks | Taint tracking on retrieved chunks |
The NIST AI RMF’s Map-Measure-Manage-Govern cycle maps cleanly onto this stack: Map your threat model to the OWASP list, Measure attack surface with automated red teaming, Manage with layered controls, and Govern with continuous evaluation.
Practical Deployment Order
If you are starting from scratch, this order gives the fastest improvement per hour of engineering effort:
- Deploy an input guardrail classifier (Llama Guard or equivalent) in front of all user inputs.
- Harden the system prompt with explicit instruction hierarchy and refusal rules.
- Add structural delimiters around all untrusted content (user input, RAG chunks, tool responses).
- Scope tool permissions to the minimum required per conversation.
- Add output sanitization before any downstream execution.
- Set up automated adversarial testing against a staging environment.
- Introduce dual-model separation for any agent that touches external content.
None of these steps require a new model. Most can be shipped as configuration and middleware changes. The residual risk after all seven is real but bounded — which is the realistic goal for any defense-in-depth strategy.
Sources
-
OWASP Top 10 for LLM Applications 2025 ↗ — The canonical LLM threat taxonomy, updated annually by 500+ contributors. Essential reading for any team building on or with LLMs.
-
tldrsec/prompt-injection-defenses ↗ — Comprehensive catalog of every practical and proposed defense against prompt injection, organized by technique family with links to original research.
-
NIST AI 600-1: Generative AI Profile ↗ — NIST’s July 2024 companion to the AI RMF, tailored specifically to generative AI risks including confabulation, data privacy, and CBRN information exposure. Provides the governance scaffolding for operationalizing the technical controls above.
Sources
AI Defense — in your inbox
Defensive AI engineering — guardrails, hardening, response. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
Output Filtering Architecture for Production LLMs: Semantic Classifiers, Regex Guards, and LLM-as-Judge
A deep-dive into layered output filtering for production LLMs — combining semantic classifiers, regex scrubbing, and LLM-as-judge techniques to catch harmful, policy-violating, and hallucinated content before it reaches users or downstream systems.
Monitoring LLM Outputs in Production: Anomaly Detection, Latency Alerting, and Output Drift
How to build a production observability stack for LLM outputs — covering anomaly detection pipelines, latency threshold alerting, output drift signals, and concrete alerting logic you can deploy today.
Prompt Injection Prevention: System Prompt Hardening, Instruction Hierarchy, and Privilege Separation
A technical guide to preventing prompt injection attacks in production LLMs — covering system prompt hardening, privilege-separated architectures, instruction hierarchy, and defense-in-depth patterns with vulnerable vs. hardened code examples.