Choosing Runtime Guardrails for LLM Apps: A Decision Framework
There is no single 'best' LLM guardrail. A decision framework for selecting runtime guardrails by threat, placement, and latency budget — comparing rules, classifiers, LLM-as-judge, and safety models, mapped to the OWASP LLM Top 10 risks they mitigate.
The question “which guardrail should we use?” is malformed. It assumes guardrails are interchangeable products you pick one of, like a database. They are not. A guardrail is a control mapped to a specific threat, placed at a specific point in the request lifecycle, paying a specific latency cost. Picking well means decomposing the question into three: what am I defending against, where does the control sit, and how much latency can I spend there. This post is a decision framework for those three axes, with the trade-offs that distinguish the control types in practice.
Start From the Threat, Not the Tool
The fastest way to build an expensive, ineffective guardrail stack is to start from a tool’s feature list. Start instead from the risks you actually carry. The OWASP Top 10 for LLM Applications 2025 ↗ is the most useful inventory because it is grounded in real-world incidents and reordered against community feedback. The entries that map most directly to runtime guardrails:
- LLM01: Prompt Injection — still the top risk for the second consecutive edition, because models process instructions and data in the same channel. Mitigations are input inspection plus architectural separation; no single classifier “solves” it.
- LLM02: Sensitive Information Disclosure — PII and secrets leaking from training data, context, or the prompt. Mitigation is input/output PII detection and redaction.
- LLM05: Improper Output Handling — unvalidated model output flowing into downstream systems (SQL, shell, rendered HTML). Mitigation is output schema enforcement and encoding.
- LLM06: Excessive Agency — the model can take consequential actions through tools. Mitigation is authorization at the tool boundary, not at the prompt.
- LLM07: System Prompt Leakage — a new 2025 entry; the model reveals its instructions. Mitigation is output scanning plus not putting secrets in the system prompt in the first place.
Note what falls out immediately: several of the highest risks (Excessive Agency, parts of Prompt Injection) are not solved by any content guardrail. They are architectural. A guardrail that classifies text cannot stop a model from calling a tool it shouldn’t have access to. That decision — what can’t be guardrailed at the content layer — is the most important output of the threat-first approach, because it stops you from buying a content filter to fix an authorization bug.
The Four Control Types and What They Cost
Once you know the threat, the control type follows from the kind of judgment it requires. Four types cover almost everything:
Deterministic rules (regex, denylists, schema validators). Sub-millisecond, fully explainable, zero false negatives on exact patterns, and trivially evaded by paraphrase. Use them where the thing you’re catching has a stable signature: structured PII (SSNs, card numbers), known-bad tokens, JSON/format validation on output. They are the cheapest layer and should always be first. Their weakness is semantics — they don’t understand meaning, so they miss everything that isn’t a literal match.
Lightweight ML classifiers. Purpose-trained small models for a narrow decision: is this prompt-injection-like, is this toxic, is this jailbreak-shaped. Typical added latency is in the tens of milliseconds, which is acceptable in most synchronous flows. They catch semantic variants that rules miss. The trade-off is calibration: every classifier has a false-positive/false-negative curve, and the threshold you pick is a product decision, not a default. Examples include Meta’s Llama Guard ↗ family for content-policy classification and dedicated prompt-injection classifiers.
Hosted moderation APIs. Provider-run classifiers like the OpenAI Moderation API ↗ cover broad content-safety categories (hate, self-harm, sexual content, violence) with no model to host. They are the lowest operational cost for content-policy coverage. The trade-offs: a network round-trip in your critical path, a fixed taxonomy you can’t extend to domain-specific abuse, and sending content to a third party — which may be disqualifying for regulated or sensitive data.
LLM-as-judge. A full model call that evaluates the input or output against a written policy. This is the most flexible (it can reason about novel, context-dependent violations) and by far the most expensive — you are paying a second inference, often hundreds of milliseconds to seconds, and the judge itself is susceptible to injection. Reserve it for high-stakes, low-volume decisions or as an asynchronous audit pass, not for inline filtering of every request.
The selection logic across these is mostly economic: use the cheapest control type that can make the required judgment reliably. Don’t spend an LLM-judge call on something a regex catches; don’t expect a regex to catch something that requires understanding intent.
Placement: Input, Output, and the Tool Boundary
The same threat often needs controls at more than one point. Placement is the second axis:
- Pre-model (input). Cheapest place to reject — you save the inference entirely. Best for direct prompt-injection patterns, oversized inputs, and inbound PII you don’t want sent to a third-party model. Limited because it cannot see what the model will do with the input.
- Post-model (output). The last catch before a response reaches a user or downstream system. Best for output-handling risks, residual PII/secret leakage, system-prompt-leakage artifacts, and policy scoring of generated content. The cost is that the inference has already happened, so output guardrails don’t save compute; they prevent harm from delivery.
- Tool boundary. For agentic systems this is the control that matters most and the one content filters can’t provide. Authorization checks, allowlisted actions, and human-in-the-loop confirmation for consequential calls live here. Excessive Agency is mitigated by what the model is permitted to invoke, enforced in your code, not by inspecting the text it produced.
A guardrail placed at the wrong layer creates a false sense of coverage. Scanning user input for injection does nothing about indirect injection arriving through a retrieved document; that needs output scanning and tool-boundary authorization. Mapping each OWASP risk to its correct placement — and accepting that some need two — is the work.
Spending the Latency Budget
The third axis is what makes a stack shippable. Two patterns keep guardrails from dominating your p99:
Run guardrails concurrently with the main call, not in series. Fire the input guardrail and the model inference in parallel; if the guardrail trips, cancel the inference and return the fallback. If it clears, you’ve added near-zero latency. This async pattern is the single highest-leverage performance decision in a guardrail stack.
Tier by cost, short-circuit early. Order controls cheapest-first and stop at the first block. A regex that rejects in sub-millisecond should run before a classifier; the classifier should run before any LLM-judge pass. Most malicious traffic is commodity and dies at the cheap layers, so the expensive layers only see the residue.
For systems that can tolerate it, the LLM-judge layer is best run asynchronously as an audit — it doesn’t gate the response, but it flags suspicious interactions for review and feeds your detection metrics. That keeps the flexible-but-slow control out of the latency-critical path while still extracting its value.
A Worked Selection
For a typical RAG support assistant with tool access, the threat-first stack lands at:
| Threat (OWASP) | Control type | Placement | Latency tier |
|---|---|---|---|
| Prompt injection (LLM01) | Rules + classifier | Input and output (indirect) | Cheap, concurrent |
| Sensitive info disclosure (LLM02) | PII detector (rules + ML) | Input and output | Cheap |
| Improper output handling (LLM05) | Schema/encoding validator | Output | Sub-ms |
| Excessive agency (LLM06) | Authorization + allowlist | Tool boundary | In-code, ~free |
| System prompt leakage (LLM07) | Output artifact scan | Output | Cheap |
| Novel policy violations | LLM-as-judge | Output, async audit | Expensive, off-path |
For conversational agents with strict topical boundaries, a framework like NeMo Guardrails ↗ can express dialog and topical rails declaratively and integrate several of these control types behind one configuration — at the cost of a learning curve and its own latency profile. For structured-output use cases, a validator-centric framework fits better than a dialog-rail one. The framework is a packaging decision; it does not change the underlying threat-to-control mapping.
The discipline this framework enforces is honesty about what each control can and cannot do. A guardrail stack assembled from a vendor’s feature list tends to over-cover the easy, classifiable threats and under-cover the architectural ones. Starting from the threat, choosing the cheapest sufficient control, placing it correctly, and spending latency deliberately produces a stack that is both defensible and shippable — which is the whole point of implementing guardrails rather than just naming them.
Sources
- OWASP Top 10 for LLM Applications 2025 ↗ — The risk inventory used to map threats to controls; defines LLM01–LLM10 including the 2025 additions (System Prompt Leakage, Vector and Embedding Weaknesses).
- NVIDIA NeMo Guardrails — Overview ↗ — Declarative dialog/topical rails framework; useful where conversational boundaries are well defined.
- Meta Llama Guard 3 — Model Card ↗ — A fine-tuned content-policy classifier that can be self-hosted as an input/output guardrail.
- OpenAI Moderation API ↗ — Hosted content-safety classifier covering standard harm categories with no model to operate.
Sources
AI Defense — in your inbox
Defensive AI engineering — guardrails, hardening, response. — delivered when there's something worth your inbox.
No spam. Unsubscribe anytime.
Related
LLM Guardrails Implementation: A Guide to Production Controls
How to implement LLM guardrails across input validation, output filtering, and runtime enforcement — with concrete patterns, tooling comparisons, and latency trade-offs for production deployments.
How LLM Guardrails Work: Architecture, Detection, and Trade-offs
A technical breakdown of how LLM guardrails work — the six pipeline layers, classifier mechanics, latency costs, and the residual risks that no single control eliminates.
Securing the ML Model Supply Chain: Provenance, Signing, and Verification
Model weights are unauthenticated binaries that execute code on load. This is a practical guide to securing the ML supply chain with model signing, Sigstore, SLSA provenance, and load-time verification — with the failure modes that make scanning insufficient on its own.