Guardrails: I’m afraid I can’t do that
- Jürg Stuker
- 25. Apr.
- 1 Min. Lesezeit
Aktualisiert: 26. Apr.
With «A practical guide to building agents» OpenAI provides us with an interesting weekend lecture. Let me focus on just one aspect: Guardrails.
Their paper shows a fairly simple architecture of agents working together, but their approach to controlling risk is quite sophisticated. While they propose a single agent that validates output (preventing output that could damage the integrity of your brand), they propose more than six (one additional per tool) on the input side.
Here the overview of the input agents responsible for the guardrail:
Relevance classifier: Ensure to stay within the intended scope by flagging off-topic queries.
Safety classifier: Detects unsafe inputs (jailbreaks or prompt injections).
PII filter: Prevents exposure of personally identifiable information.
Moderation: Flags harmful or inappropriate inputs (hate speech, harassment, violence).
Tool safeguards: Assess the risk of each tool available to your agent by assigning a rating—low, medium, or high—based on factors like read-only vs. write access, reversibility, required account permissions, and financial impact.
Rules-based protections: Simple deterministic measures (blocklists, input length limits, regex filters) to prevent known threats like prohibited terms or SQL injections.
So it is likely that there will be more guardrail agents than tools. At least if guardrails are a critical part of your deployment. The flow they propose is again very simple again: Flag if suspicious and reply with “cannot process, try again”.

My change request to OpenAI. Change the unsafe output reply in the documentation to: “I'm sorry, Dave. I’m afraid I can’t do that”. So we all feel at home.