What this site is for
AI Defense covers defensive AI engineering — guardrails, content filters, and shipping AI features without shipping liability.
AI Defense exists for the engineers shipping LLM features who got handed a “make it safe” requirement with no playbook.
What we publish:
Guardrails that actually hold. Input filtering, output filtering, structured-output enforcement, refusal training, classifier-on-output patterns. What works in production, what breaks under adversarial pressure, what regresses silently when you upgrade the model.
Content moderation pipelines. Multi-stage filtering, prompt-classifier ensembles, the Llama Guard / NeMo Guardrails / OpenAI moderation API tradeoffs, building your own classifiers for domain-specific abuse patterns.
Defenses against the attacks the offensive side writes up. When a new prompt injection technique or jailbreak goes public, we publish the corresponding defensive pattern. The two angles pair intentionally.
Safety/utility tradeoffs. Refusal rate vs helpfulness. False positive cost vs liability. Where the line goes when you can’t have both. We are honest about the tradeoff rather than pretending there isn’t one.
What we don’t publish:
- “AI safety is everyone’s responsibility” thinkpieces
- Vendor announcements as news
- Anything that pretends defense is solved
Pseudonymous bylines. Tips, corrections, and “this guardrail bypass works on prod” reports go to the editor.
Real content starts shortly.