AI Defense
AI defense techniques for LLMs — layered controls diagram
Defense

AI Defense Techniques for LLMs: A Practitioner's Guide to Securing Large Language Models

A technical breakdown of proven AI defense techniques for LLMs — from input guardrails and prompt hardening to dual-model architectures and red teaming, mapped to OWASP and NIST frameworks.

By AI Defense Editorial · · 8 min read

Securing a large language model deployment is not the same as securing a traditional application. The attack surface is probabilistic, the input space is unbounded, and the model itself can be turned against the system it runs in. The ai defense techniques llm practitioners actually reach for in production are a layered set of controls — none sufficient alone, all necessary together.

This guide covers the techniques that hold up under real adversarial pressure, organized from the perimeter inward.

Understanding the Threat Landscape First

Before reaching for a tool, you need a map. The OWASP Top 10 for LLM Applications 2025 — assembled by 500+ contributors — is the closest thing the industry has to a canonical threat taxonomy. The 2025 list includes:

Prompt injection (LLM01) has held the top position for three straight years. It is also the vulnerability with the widest defensive literature, which is why most practical defense stacks center on it. But treat the whole list as your requirements document: each entry maps to a control family.

Layer 1: Input Guardrails

The first line of defense is inspection before the model sees any content. Every prompt — including retrieved RAG documents, tool outputs, and web content fetched by agents — should pass through a classifier trained to detect adversarial patterns.

The tldrsec/prompt-injection-defenses repository catalogs the main approaches:

Paraphrasing and retokenization. Pass the raw input through a lightweight model that paraphrases it. This disrupts token-level injection patterns while preserving semantic content. Retokenization breaks tokens into smaller units, defeating attacks that rely on specific subword sequences.

LLM-as-a-classifier (PromptGuard pattern). Route incoming input to a separate, dedicated filter model before the primary model. The filter model has a single job: return ALLOW or BLOCK. This is the approach behind open models like Llama Guard, ShieldGemma, and IBM Granite Guardian. Independent benchmarks show these filters cut injection success rates by 60–70% on common attack suites.

Structural delimiters and spotlighting. Wrap user input in explicit XML-style delimiters (<user_input>...</user_input>) and instruct the model that content inside those tags is data, never instructions. The spotlighting variant uses a different encoding (e.g., base64 or a Unicode transformation) for untrusted content, making provenance visible to the model at inference time.

Preflight testing. Before sending a concatenated prompt to the production model, run a deterministic test prompt against the input. If the output is non-deterministic in ways that suggest instruction injection, reject the request.

Layer 2: Prompt Architecture and Privilege Separation

System prompt design is the second control surface. Key techniques:

Instruction hierarchy. Fine-tuned models (GPT-4o, Claude 3.x) expose an explicit privilege stack: system prompt > operator instructions > user input. Rely on this hierarchy rather than trying to out-argue a malicious user in the prompt. Place the most critical behavioral constraints in the system turn only.

Post-prompting. Place user input before the final instruction rather than after it. This exploits positional recency bias — the model’s final instruction is the one at the end of the context, which is your controlled system text, not user-supplied content.

Least privilege for tool access. If a model can call external APIs, send emails, or write to a database, apply LLM06’s lesson: strip any capability the model does not need for the current task. An agent answering FAQs should not have write access to the CRM. Scoping tool permissions per conversation turn — not per deployment — is the gold standard. See guardml.io for open-source guardrail libraries that implement capability gating.

Layer 3: Output Filtering and Improper Output Handling

Everything the model returns is untrusted until proven otherwise. LLM05 (Improper Output Handling) covers cases where model output is passed directly into SQL queries, shell commands, browser rendering, or downstream LLM calls without sanitization.

Controls here mirror traditional output encoding:

The LLMOps practices at llmops.report document operational patterns for logging and inspecting all input/output pairs in production, which is the prerequisite for any retrospective detection of successful attacks.

Layer 4: Dual-Model and Ensemble Architectures

For high-stakes deployments, a single model handling both trusted and untrusted content is a structural weakness. The dual-LLM pattern addresses this:

Data passes between the two models through a narrow typed interface — not raw text. This eliminates the mechanism by which injected instructions cross the trust boundary.

For decisions with irreversible consequences (sending an email, executing a trade, deleting a record), ensemble voting adds a second check: route the proposed action to two or three independent model instances and only proceed if they agree. Attacks that work on one model variant often fail on others.

Layer 5: Red Teaming and Continuous Evaluation

Static controls drift. Models get updated, prompts get modified, new attack patterns emerge. NIST AI 600-1, the Generative AI Profile of the AI Risk Management Framework, frames continuous adversarial testing as a core governance requirement, not a one-time exercise.

Practical red teaming for LLM deployments includes:

Mapping Controls to OWASP and NIST

ThreatPrimary ControlSecondary Control
LLM01 Prompt InjectionInput guardrails + structural delimitersDual-LLM architecture
LLM02 Sensitive DisclosureOutput filteringRAG access controls
LLM05 Improper Output HandlingOutput encoding/sanitizationStatic analysis on code output
LLM06 Excessive AgencyLeast-privilege tool scopingEnsemble approval for high-stakes actions
LLM08 Vector/Embedding WeaknessesIndex integrity checksTaint tracking on retrieved chunks

The NIST AI RMF’s Map-Measure-Manage-Govern cycle maps cleanly onto this stack: Map your threat model to the OWASP list, Measure attack surface with automated red teaming, Manage with layered controls, and Govern with continuous evaluation.

Practical Deployment Order

If you are starting from scratch, this order gives the fastest improvement per hour of engineering effort:

  1. Deploy an input guardrail classifier (Llama Guard or equivalent) in front of all user inputs.
  2. Harden the system prompt with explicit instruction hierarchy and refusal rules.
  3. Add structural delimiters around all untrusted content (user input, RAG chunks, tool responses).
  4. Scope tool permissions to the minimum required per conversation.
  5. Add output sanitization before any downstream execution.
  6. Set up automated adversarial testing against a staging environment.
  7. Introduce dual-model separation for any agent that touches external content.

None of these steps require a new model. Most can be shipped as configuration and middleware changes. The residual risk after all seven is real but bounded — which is the realistic goal for any defense-in-depth strategy.


Sources

Sources

  1. OWASP Top 10 for LLM Applications 2025
  2. tldrsec/prompt-injection-defenses — Every practical and proposed defense
  3. NIST AI 600-1: Generative AI Profile of the AI Risk Management Framework
#llm-security #prompt-injection #guardrails #ai-defense #red-teaming
Subscribe

AI Defense — in your inbox

Defensive AI engineering — guardrails, hardening, response. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments