Circuit Breakers in AI Agent Systems — Reliability at Scale

An AI agent system that handles thousands of requests per day will encounter agent failures. Models time out. API rate limits trigger. Downstream services become temporarily unavailable. An agent that was working an hour ago may be degraded or unresponsive now.

The question isn't whether failures happen — they do, always. The question is whether the system continues to function when they do.

Nova OS implements circuit breakers on every agent in the mesh. Circuit breakers detect failure patterns, automatically stop sending requests to failing agents before the failures cascade, and restore service once the agent recovers. This article explains what circuit breakers are, how they work in Nova OS specifically, and why they matter for AI deployments at enterprise scale.


What a Circuit Breaker Does

The circuit breaker pattern comes from electrical engineering. A physical circuit breaker monitors current flow and opens the circuit when it detects a fault — protecting the system from damage and preventing the fault from spreading.

In software, a circuit breaker wraps a service call and monitors the call's success rate. When failures exceed a threshold, the circuit breaker "opens" — it stops forwarding calls to the failing service entirely. Callers get a fast failure response instead of waiting for the service to time out. When enough time has passed, the circuit breaker enters a "half-open" state and allows one test call through to check if the service has recovered.

For AI agent systems, this pattern is especially important. A single agent failure in a multi-step workflow can block downstream agents waiting for its output, cause cascading timeouts throughout the execution graph, and consume request budget on calls that will fail anyway.

A circuit breaker prevents all of this by detecting the failure pattern early and rerouting before the damage spreads.


Three States

Every agent in Nova OS has a circuit breaker that operates in one of three states:

CLOSED — Normal operation

All requests pass through to the agent. The circuit breaker monitors the agent's error rate in a rolling window. As long as the error rate stays below the configured threshold, the circuit remains closed and traffic flows normally.

This is the steady state for a healthy agent.

OPEN — Failure detected, agent bypassed

When the error rate crosses the threshold, the circuit opens. The agent is marked as unavailable and the router stops sending requests to it immediately. No further calls are made to the agent — they fail fast with a circuit-open response.

The FallbackChain activates. Requests that would have gone to this agent are routed to the next available agent in its configured fallback sequence. From the requester's perspective, the workflow continues — the circuit transition is not a user-visible failure.

HALF_OPEN — Recovery testing

After the circuit has been open for a configured timeout period, it transitions to HALF_OPEN. In this state, the circuit breaker allows one probe request through to the agent. Two outcomes:

  • If the probe succeeds, the circuit closes. The agent returns to normal operation and begins receiving traffic again.
  • If the probe fails, the circuit opens again and the timeout resets. The agent remains bypassed until the next recovery attempt.

The HALF_OPEN state prevents premature restoration of a still-failing agent. It also allows automatic recovery without operator intervention — the system detects recovery by testing it, not by waiting for a manual restart signal.
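The three states above can be sketched as a small state machine. This is an illustrative sketch, not Nova OS source: the threshold, window, and timeout values are placeholder assumptions, and it uses a single-probe HALF_OPEN variant.

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    """Minimal three-state circuit breaker (illustrative only)."""

    def __init__(self, error_threshold=0.5, window=20, open_timeout=30.0):
        self.error_threshold = error_threshold  # error ratio that opens the circuit
        self.window = window                    # rolling window size, in calls
        self.open_timeout = open_timeout        # seconds before OPEN -> HALF_OPEN
        self.state = State.CLOSED
        self.results = []                       # rolling record: True = success
        self.opened_at = None

    def allow_request(self):
        if self.state == State.OPEN:
            if time.monotonic() - self.opened_at >= self.open_timeout:
                self.state = State.HALF_OPEN    # let one probe through
                return True
            return False                        # fail fast while OPEN
        return True                             # CLOSED, or the HALF_OPEN probe

    def record(self, success):
        if self.state == State.HALF_OPEN:
            if success:
                self.state = State.CLOSED       # probe succeeded: restore traffic
                self.results = []
            else:
                self._open()                    # probe failed: reset the timeout
            return
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) >= self.window and \
                failures / len(self.results) >= self.error_threshold:
            self._open()

    def _open(self):
        self.state = State.OPEN
        self.opened_at = time.monotonic()
```

A router would call `allow_request()` before dispatch and `record()` after each response; everything else follows from those two hooks.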


What Triggers a Circuit to Open

Nova OS classifies agent errors into two categories, and circuit breakers only track one of them.

Transient errors — failures that are expected to resolve on their own:

  • Rate limit responses (the API quota resets in a minute)
  • Timeouts (the model inference was slow this request)
  • Connection errors (momentary network interruption)

Transient errors trigger the retry policy, not the circuit breaker. The retry policy uses exponential backoff with jitter — base delay of 100–500ms, increasing on each attempt. The jitter prevents thundering-herd effects when many retries fire simultaneously after a rate limit.
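A backoff-with-jitter schedule like the one described can be sketched as follows. The "full jitter" variant (uniform draw over the exponential ceiling) and the cap value are assumptions for illustration; only the 100 ms base comes from the range quoted above.

```python
import random

def backoff_delay(attempt, base_ms=100, cap_ms=30_000):
    """Exponential backoff with full jitter: the ceiling doubles each
    attempt, and the actual delay is a uniform draw below it, which
    spreads out simultaneous retries after a shared rate limit."""
    ceiling = min(cap_ms, base_ms * (2 ** attempt))
    return random.uniform(0, ceiling)  # delay in milliseconds
```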

Non-transient errors — failures that indicate a systemic problem:

  • Authentication failures (invalid or expired credentials)
  • Configuration errors (the agent is misconfigured)
  • Persistent model unavailability (the model endpoint is down)

Non-transient errors fail immediately without retry and count toward the circuit breaker's error rate calculation. A pattern of non-transient errors is a signal that something is structurally wrong with the agent — continuing to send requests is wasteful and delays recovery.

The distinction matters. An agent hitting rate limits occasionally is not failing — it's being throttled. The retry policy handles this cleanly. An agent with an authentication failure on every call is failing — the circuit breaker should open and stop sending traffic before those failures accumulate into a long queue of stalled requests.
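The routing decision above reduces to a small classification step. The category names below mirror the two lists in this section; the function name and the conservative treatment of unknown errors are assumptions of this sketch.

```python
# Hypothetical error taxonomy mirroring the transient / non-transient split.
TRANSIENT = {"rate_limited", "timeout", "connection_error"}
NON_TRANSIENT = {"auth_failure", "config_error", "model_unavailable"}

def classify(error_kind):
    """Decide which mechanism handles an error: the retry policy for
    transient failures, the circuit breaker for everything else."""
    if error_kind in TRANSIENT:
        return "retry"
    return "circuit"  # non-transient and unknown errors feed the breaker
```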


Circuit Breakers and the Fallback Chain

When a circuit opens, the fallback chain determines where traffic goes. Every agent in the mesh can have a configured fallback sequence — an ordered list of alternative agents to try when the primary is unavailable.

For the Legal pack's Compliance Checker:

Primary:    legal.compliance_checker (circuit OPEN)
Fallback 1: legal.compliance_checker_b (available → receives traffic)
Fallback 2: legal.general_advisor (available, lower capability match)
Fallback 3: Dead Letter Queue (if all fallbacks exhausted)

The fallback chain is tried in order. The first available agent receives the request. If the fallback agent is also experiencing issues, its circuit breaker handles it independently — the fallback chain continues to the next option.

This design means a single agent failure doesn't require operator intervention. The system reroutes automatically and continues processing. Operators see the circuit transition in the monitoring dashboard and can investigate the underlying cause at their own pace — the system is already recovering.
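The chain walk itself is simple, which is part of why it is reliable. A minimal sketch, assuming an `is_available` predicate that consults each agent's circuit breaker (all names here are illustrative, not Nova OS APIs):

```python
def route(request, chain, is_available, dead_letter):
    """Try each agent in the configured fallback order; the first
    available one receives the request. If the whole chain is
    exhausted, the request lands in the dead letter queue."""
    for agent in chain:
        if is_available(agent):
            return agent
    dead_letter.append(request)
    return None
```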


Circuit Breakers During DAG Execution

When a complex task is running as a DAG and an agent fails mid-execution, the circuit breaker interacts with the plan repair system.

Two scenarios:

The circuit opens before the task starts. The DAGExecutor checks agent availability before dispatching each task. If the required agent's circuit is OPEN, the executor doesn't wait for it to fail — it routes to the fallback agent immediately and the DAG continues on the modified path.

The circuit opens during task execution. If an agent fails while executing a task that's part of a running DAG, PlanRepair activates. PlanRepair re-plans the remaining subtasks — it identifies which downstream tasks depended on the failed task's output, finds alternative agents for them, and resumes the DAG from the failure point. Work completed before the failure is preserved. Only the failed step and its downstream dependencies are re-planned.

The result: a DAG workflow that hits an agent failure does not fail entirely. It repairs, reroutes, and completes — with the circuit breaker preventing the same agent from being tried again until it's confirmed healthy.
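The re-planning scope described above, the failed task plus its transitive dependents, can be computed with a simple closure over the dependency graph. The dict-of-sets DAG representation is a hypothetical one chosen for this sketch, not the Nova OS data model.

```python
def tasks_to_replan(dag, failed):
    """Return the failed task plus every downstream task that
    (transitively) depends on its output. Completed upstream work
    is untouched. `dag` maps task -> set of dependency tasks."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for task, deps in dag.items():
            if task not in affected and deps & affected:
                affected.add(task)
                changed = True
    return affected
```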


The Dead Letter Queue

When all options are exhausted — the primary agent's circuit is open, every fallback agent is unavailable or has failed, and the retry budget is spent — the request goes to the dead letter queue.

The dead letter queue is not a discard bin. It is a durable storage of undeliverable requests with their full context: the original request, the agent selection history, the failure reasons at each step, and the timestamp.

Operators can inspect the dead letter queue, diagnose why the request couldn't be completed, and take action — whether that's restoring a failed agent, adjusting fallback configuration, or manually replaying the request once conditions improve.

Nothing is silently dropped. Every failure has a record.
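The shape of a dead letter record follows directly from the context listed above. The field names here are illustrative, not the Nova OS schema:

```python
import time
from dataclasses import dataclass, field

@dataclass
class DeadLetterEntry:
    """One undeliverable request with its full failure context."""
    request: dict            # the original request payload
    agents_tried: list       # agent selection history, in order
    failure_reasons: list    # one failure reason per attempted agent
    timestamp: float = field(default_factory=time.time)
```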


Trust Score Interaction

Circuit breaker events feed into the trust score system. An agent whose circuit opens registers a spike in error rate, and its trust score decreases. When the circuit closes and the agent returns to normal operation, successful completions rebuild the success rate component.

This means the trust score reflects actual reliability history, including circuit trips. An agent that has tripped its circuit breaker three times in the past week will carry a lower trust score than an equivalent agent that has been stable. The semantic matcher uses trust score as one of its ranking factors — the unstable agent gets lower priority in future routing decisions, independent of whether its circuit is currently open.

The trust score acts as a persistent memory of reliability. The circuit breaker acts as a real-time response to failure. Both operate simultaneously, at different timescales.
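One plausible way to implement a success rate component with this "persistent memory" property is an exponentially weighted average, which decays on failures and rebuilds gradually on successes. This is an assumption of the sketch, not the documented Nova OS formula.

```python
def update_trust(score, success, alpha=0.05):
    """Exponentially weighted success rate: each outcome nudges the
    score toward 1.0 (success) or 0.0 (failure) by a factor alpha,
    so a burst of failures leaves a lasting dent that only sustained
    success repairs."""
    return (1 - alpha) * score + alpha * (1.0 if success else 0.0)
```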


What Circuit Breakers Enable for Enterprise Deployments

Regulated enterprise environments have requirements that single-model AI systems can't meet without circuit-breaker-style resilience:

Availability SLAs. An AI system processing HR, legal, or financial requests needs to function even when individual model endpoints are degraded. Circuit breakers mean one unavailable model doesn't take down the entire system.

Audit trails. Every circuit trip, fallback activation, and dead letter queue entry is logged. Compliance teams can trace exactly what happened when a task failed to complete — which agent was bypassed, what fallback handled it, what the outcome was.

Graceful degradation. A system with circuit breakers degrades gracefully — it processes what it can with available agents, queues what it can't, and recovers automatically. A system without circuit breakers either fails hard or processes requests slowly through cascading timeouts until nothing is working.

Operator confidence. When operators know the system self-repairs around agent failures, they can deploy changes and updates with less fear of cascading incidents. The circuit breaker is a safety net that makes the overall system more operable, not just more resilient.


The Configuration Surface

Circuit breakers in Nova OS have a minimal configuration surface:

  • Error rate threshold — the failure percentage that triggers the circuit to open (expressed as a ratio over a rolling window)
  • Minimum request volume — the circuit won't open based on a small number of requests, preventing spurious trips on low-traffic agents
  • Open timeout — how long the circuit stays open before transitioning to HALF_OPEN
  • Half-open probe count — how many successful probes are required before closing the circuit

These parameters are per-agent, allowing high-traffic agents and low-traffic agents to have appropriate sensitivity. A high-volume agent might have a tighter error rate threshold because the signal is cleaner. A low-volume agent might have a higher threshold to avoid false positives from statistical noise.

The defaults work for most deployments. Operators tune them when they have operational data that justifies it.
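The four knobs above fit naturally in a small per-agent config object. The default values shown are illustrative placeholders, not Nova OS defaults:

```python
from dataclasses import dataclass

@dataclass
class BreakerConfig:
    """Per-agent circuit breaker tuning (placeholder defaults)."""
    error_rate_threshold: float = 0.5   # error ratio over the rolling window
    min_request_volume: int = 20        # don't trip on low-traffic noise
    open_timeout_s: float = 30.0        # OPEN -> HALF_OPEN delay, seconds
    half_open_probe_count: int = 1      # successful probes required to close
```

A low-volume agent might override only what it needs, e.g. `BreakerConfig(error_rate_threshold=0.7)`, keeping the rest at defaults.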


Reliability Is Architectural

Circuit breakers are not a feature that makes a fragile system seem more reliable. They are an architectural component that makes a reliable system demonstrably resilient to a specific class of failures: agent-level degradation at runtime.

An AI system processing thousands of enterprise requests per day will encounter agent failures. The question is whether those failures are visible to users and require operator intervention to resolve, or whether they're handled automatically by a system designed to route around them.

Nova OS's circuit breakers, fallback chains, plan repair, and dead letter queue work as a layered resilience stack. Each layer handles what the layer above it couldn't. Together they produce a system where individual agent failures are operational events — not incidents.

Learn more about Nova OS →

Stay Connected

💻 Website: meganova.ai

📖 Docs: docs.meganova.ai

✍️ Blog: Read our Blog

🐦 Twitter: @meganovaai

🎮 Discord: Join our Discord