How do I add circuit breakers that automatically disable a misbehaving AI agent?

Question

Accepted Answer

To automatically disable a misbehaving AI agent, implement runtime controls that provide verification, enforcement, and dynamic intervention. These controls act as a last line of defense, catching issues that pre-deployment evaluations miss.

Implement an LLM Gateway and Tool Broker: Place an LLM gateway in front of every model invocation to enforce authentication, apply content policies, detect and redact PII, rate-limit requests, and emit telemetry. Similarly, route all tool invocations through a tool broker that validates each call against the agent's identity, active intent, and policy. This ensures complete mediation: every action passes through a verification point.

Utilize Sandboxing for Code Execution: Any tool that executes generated code or processes untrusted data should run in a properly isolated sandbox, such as a container with strict resource limits, no outbound network access except through the broker, an ephemeral filesystem, and no access to the agent's credentials. This addresses the risk of "Container escape from sandboxed code execution" (L4).

Enforce Rate Limits and Resource Exhaustion Controls: Implement per-task and per-agent budgets, circuit breakers, and timeout enforcement to prevent "Rate limit and resource exhaustion" (L3, L4). A circuit breaker trips once failures or policy violations cross a threshold, then blocks the agent's further calls until a cooldown elapses or a human resets it.

Implement Output Schema Validation and Intent Re-verification: Validate that each tool call produces structured output in the expected schema, since malformed responses can indicate prompt injection. Before any consequential action, re-verify the agent's current intent against the originally attested intent to mitigate "Goal misalignment cascades" (L3 → L7).

Provide Human Oversight and Override Mechanisms: Design real-time override mechanisms, or "stop buttons," that let authorized humans halt an agent's execution reliably and promptly while leaving the system in a coherent state. Implement deadman switches that pause agent fleets when communication is lost, so that agents default to a safe state. These controls support the human oversight that the NIST AI RMF calls for.

Establish Escalation Policies: Route specific situations to humans based on predefined policies, such as escalating medical diagnoses to a physician, or financial advice above a certain threshold to a licensed advisor. These routes are fixed in the architecture rather than decided by the agent at runtime.
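
The circuit breaker, per-agent budget, output validation, and stop-button controls described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the names CircuitBreaker, ToolBroker, and AgentDisabled are hypothetical, the thresholds are placeholders, and timeouts, authentication, and intent re-verification are left out for brevity.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


class AgentDisabled(Exception):
    """Raised when the broker refuses a call because the agent is disabled."""


@dataclass
class CircuitBreaker:
    # All thresholds below are illustrative assumptions, not recommended values.
    max_failures: int = 3            # consecutive failures before tripping
    max_calls: int = 100             # per-agent call budget
    cooldown_s: float = 60.0         # how long the breaker stays open
    clock: Callable[[], float] = time.monotonic
    failures: int = 0
    calls: int = 0
    opened_at: Optional[float] = None
    halted: bool = False             # human "stop button"; never resets on its own

    def allow(self) -> bool:
        if self.halted or self.calls >= self.max_calls:
            return False
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: half-open. One more failure trips the breaker again.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, ok: bool) -> None:
        self.calls += 1
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


class ToolBroker:
    """Hypothetical broker: every tool call passes through one checkpoint."""

    def __init__(
        self,
        tools: Dict[str, Callable[..., Any]],
        validate: Callable[[Any], bool],
        breaker_factory: Callable[[], CircuitBreaker] = CircuitBreaker,
    ) -> None:
        self.tools = tools
        self.validate = validate          # output schema check supplied by the caller
        self.breaker_factory = breaker_factory
        self.breakers: Dict[str, CircuitBreaker] = {}

    def breaker_for(self, agent_id: str) -> CircuitBreaker:
        if agent_id not in self.breakers:
            self.breakers[agent_id] = self.breaker_factory()
        return self.breakers[agent_id]

    def halt(self, agent_id: str) -> None:
        # Stop button: takes effect on the agent's next brokered call.
        self.breaker_for(agent_id).halted = True

    def call(self, agent_id: str, tool: str, **kwargs: Any) -> Any:
        breaker = self.breaker_for(agent_id)
        if not breaker.allow():
            raise AgentDisabled(f"agent {agent_id!r} is disabled")
        try:
            result = self.tools[tool](**kwargs)
            ok = self.validate(result)
        except Exception:
            # Tool errors and unknown tools both count against the agent.
            breaker.record(ok=False)
            raise
        breaker.record(ok)
        if not ok:
            raise ValueError(f"tool {tool!r} returned malformed output")
        return result
```

Because the breaker lives in the broker rather than in the agent, a compromised or confused agent cannot reset it. Each agent gets its own breaker, so disabling one agent leaves the rest of the fleet running.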


