AI Defense
Isometric vector illustration showing interconnected security shields and code blocks for AI safety and LLM guardrails.
Defense

Choosing Runtime Guardrails for LLM Apps: A Decision Framework

There is no single 'best' LLM guardrail. A decision framework for selecting runtime guardrails by threat, placement, and latency budget — comparing rules, classifiers, LLM-as-judge, and safety models, mapped to the OWASP LLM Top 10 risks they mitigate.

By AI Defense Editorial · · 8 min read

The question “which guardrail should we use?” is malformed. It assumes guardrails are interchangeable products you pick one of, like a database. They are not. A guardrail is a control mapped to a specific threat, placed at a specific point in the request lifecycle, paying a specific latency cost. Picking well means decomposing the question into three: what am I defending against, where does the control sit, and how much latency can I spend there. This post is a decision framework for those three axes, with the trade-offs that distinguish the control types in practice.

Start From the Threat, Not the Tool

The fastest way to build an expensive, ineffective guardrail stack is to start from a tool’s feature list. Start instead from the risks you actually carry. The OWASP Top 10 for LLM Applications 2025 is the most useful inventory because it is grounded in real-world incidents and reordered against community feedback. The entries that map most directly to runtime guardrails:

  • LLM01: Prompt Injection — still the top risk for the second consecutive edition, because models process instructions and data in the same channel. Mitigations are input inspection plus architectural separation; no single classifier “solves” it.
  • LLM02: Sensitive Information Disclosure — PII and secrets leaking from training data, context, or the prompt. Mitigation is input/output PII detection and redaction.
  • LLM05: Improper Output Handling — unvalidated model output flowing into downstream systems (SQL, shell, rendered HTML). Mitigation is output schema enforcement and encoding.
  • LLM06: Excessive Agency — the model can take consequential actions through tools. Mitigation is authorization at the tool boundary, not at the prompt.
  • LLM07: System Prompt Leakage — a new 2025 entry; the model reveals its instructions. Mitigation is output scanning plus not putting secrets in the system prompt in the first place.

Note what falls out immediately: several of the highest risks (Excessive Agency, parts of Prompt Injection) are not solved by any content guardrail. They are architectural. A guardrail that classifies text cannot stop a model from calling a tool it shouldn’t have access to. That decision — what can’t be guardrailed at the content layer — is the most important output of the threat-first approach, because it stops you from buying a content filter to fix an authorization bug.

The Four Control Types and What They Cost

Once you know the threat, the control type follows from the kind of judgment it requires. Four types cover almost everything:

Deterministic rules (regex, denylists, schema validators). Sub-millisecond, fully explainable, zero false negatives on exact patterns, and trivially evaded by paraphrase. Use them where the thing you’re catching has a stable signature: structured PII (SSNs, card numbers), known-bad tokens, JSON/format validation on output. They are the cheapest layer and should always be first. Their weakness is semantics — they don’t understand meaning, so they miss everything that isn’t a literal match.

Lightweight ML classifiers. Purpose-trained small models for a narrow decision: is this prompt-injection-like, is this toxic, is this jailbreak-shaped. Typical added latency is in the tens of milliseconds, which is acceptable in most synchronous flows. They catch semantic variants that rules miss. The trade-off is calibration: every classifier has a false-positive/false-negative curve, and the threshold you pick is a product decision, not a default. Examples include Meta’s Llama Guard family for content-policy classification and dedicated prompt-injection classifiers.

Hosted moderation APIs. Provider-run classifiers like the OpenAI Moderation API cover broad content-safety categories (hate, self-harm, sexual content, violence) with no model to host. They are the lowest operational cost for content-policy coverage. The trade-offs: a network round-trip in your critical path, a fixed taxonomy you can’t extend to domain-specific abuse, and sending content to a third party — which may be disqualifying for regulated or sensitive data.

LLM-as-judge. A full model call that evaluates the input or output against a written policy. This is the most flexible (it can reason about novel, context-dependent violations) and by far the most expensive — you are paying a second inference, often hundreds of milliseconds to seconds, and the judge itself is susceptible to injection. Reserve it for high-stakes, low-volume decisions or as an asynchronous audit pass, not for inline filtering of every request.

The selection logic across these is mostly economic: use the cheapest control type that can make the required judgment reliably. Don’t spend an LLM-judge call on something a regex catches; don’t expect a regex to catch something that requires understanding intent.

Placement: Input, Output, and the Tool Boundary

The same threat often needs controls at more than one point. Placement is the second axis:

  • Pre-model (input). Cheapest place to reject — you save the inference entirely. Best for direct prompt-injection patterns, oversized inputs, and inbound PII you don’t want sent to a third-party model. Limited because it cannot see what the model will do with the input.
  • Post-model (output). The last catch before a response reaches a user or downstream system. Best for output-handling risks, residual PII/secret leakage, system-prompt-leakage artifacts, and policy scoring of generated content. The cost is that the inference has already happened, so output guardrails don’t save compute; they prevent harm from delivery.
  • Tool boundary. For agentic systems this is the control that matters most and the one content filters can’t provide. Authorization checks, allowlisted actions, and human-in-the-loop confirmation for consequential calls live here. Excessive Agency is mitigated by what the model is permitted to invoke, enforced in your code, not by inspecting the text it produced.

A guardrail placed at the wrong layer creates a false sense of coverage. Scanning user input for injection does nothing about indirect injection arriving through a retrieved document; that needs output scanning and tool-boundary authorization. Mapping each OWASP risk to its correct placement — and accepting that some need two — is the work.

Spending the Latency Budget

The third axis is what makes a stack shippable. Two patterns keep guardrails from dominating your p99:

Run guardrails concurrently with the main call, not in series. Fire the input guardrail and the model inference in parallel; if the guardrail trips, cancel the inference and return the fallback. If it clears, you’ve added near-zero latency. This async pattern is the single highest-leverage performance decision in a guardrail stack.

Tier by cost, short-circuit early. Order controls cheapest-first and stop at the first block. A regex that rejects in sub-millisecond should run before a classifier; the classifier should run before any LLM-judge pass. Most malicious traffic is commodity and dies at the cheap layers, so the expensive layers only see the residue.

For systems that can tolerate it, the LLM-judge layer is best run asynchronously as an audit — it doesn’t gate the response, but it flags suspicious interactions for review and feeds your detection metrics. That keeps the flexible-but-slow control out of the latency-critical path while still extracting its value.

A Worked Selection

For a typical RAG support assistant with tool access, the threat-first stack lands at:

Threat (OWASP)Control typePlacementLatency tier
Prompt injection (LLM01)Rules + classifierInput and output (indirect)Cheap, concurrent
Sensitive info disclosure (LLM02)PII detector (rules + ML)Input and outputCheap
Improper output handling (LLM05)Schema/encoding validatorOutputSub-ms
Excessive agency (LLM06)Authorization + allowlistTool boundaryIn-code, ~free
System prompt leakage (LLM07)Output artifact scanOutputCheap
Novel policy violationsLLM-as-judgeOutput, async auditExpensive, off-path

For conversational agents with strict topical boundaries, a framework like NeMo Guardrails can express dialog and topical rails declaratively and integrate several of these control types behind one configuration — at the cost of a learning curve and its own latency profile. For structured-output use cases, a validator-centric framework fits better than a dialog-rail one. The framework is a packaging decision; it does not change the underlying threat-to-control mapping.

The discipline this framework enforces is honesty about what each control can and cannot do. A guardrail stack assembled from a vendor’s feature list tends to over-cover the easy, classifiable threats and under-cover the architectural ones. Starting from the threat, choosing the cheapest sufficient control, placing it correctly, and spending latency deliberately produces a stack that is both defensible and shippable — which is the whole point of implementing guardrails rather than just naming them.

Sources

Sources

  1. OWASP Top 10 for LLM Applications 2025
  2. NVIDIA NeMo Guardrails — Overview
  3. Meta Llama Guard — Model Card
  4. OpenAI Moderation API
Subscribe

AI Defense — in your inbox

Defensive AI engineering — guardrails, hardening, response. — delivered when there's something worth your inbox.

No spam. Unsubscribe anytime.

Related

Comments