The 3-Tier Routing Cascade: Rule-Based → Semantic → LLM
The most common architectural mistake in multi-agent AI systems is treating routing as a single-step problem. Route everything through a language model call and you pay high latency on every request; route everything through embedding similarity and you get poor accuracy on ambiguous ones.
Nova OS uses a three-tier cascade that avoids this tradeoff. It tries the cheapest routing method first, escalates to a more expensive one only when the cheap method can't produce a confident match, and exits as soon as any tier reaches a 0.8 confidence threshold. The result: most requests route in milliseconds, and hard requests route accurately — without paying for accuracy on requests that don't need it.
This article explains each tier, what it does internally, when it triggers, and why the cascade architecture outperforms any single-method approach.
The Core Principle: Confidence-Gated Escalation
The cascade is not a pipeline where every request passes through all three tiers. It is an escalation chain where a request moves to the next tier only if the current tier fails to produce sufficient confidence.
```
Request → [Tier 1: Condition Router] → confidence ≥ 0.8? → Route
               ↓ no
           [Tier 2: Semantic Matcher] → confidence ≥ 0.8? → Route
               ↓ no
           [Tier 3: LLM Router] → Route
```
Every tier produces a RouteDecision with three fields: agent_id, confidence, and method. The cascade checks the confidence score against the threshold (default 0.8, operator-configurable). If it passes, routing completes. If not, the next tier runs.
This means the LLM router — the accurate but slow tier — only fires on requests that genuinely require it. In a well-configured deployment, the majority of requests exit at Tier 1 or Tier 2. The LLM call is the last resort, not the default.
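The cascade's control flow can be sketched in a few lines of Python. The `RouteDecision` fields follow the article; the tier functions themselves are hypothetical placeholders, not Nova OS internals:

```python
from dataclasses import dataclass

@dataclass
class RouteDecision:
    agent_id: str
    confidence: float
    method: str  # "condition", "semantic", or "llm"

CONFIDENCE_THRESHOLD = 0.8  # default; operator-configurable

def route(request, tiers) -> RouteDecision:
    """Try each tier in order; exit as soon as one clears the threshold.
    The final tier's decision is used even if its confidence is low."""
    decision = None
    for tier in tiers:
        decision = tier(request)
        if decision.confidence >= CONFIDENCE_THRESHOLD:
            return decision
    return decision  # Tier 3 always produces a result
```

Note that the expensive tiers never even run when an earlier tier is confident: the loop returns before evaluating them.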
Tier 1: The Condition Router (5–20ms)
The first tier is entirely rule-based. It evaluates a set of routing rules against the incoming request and returns a match if any rule's predicates are satisfied.
What a routing rule looks like:
Each rule is a set of predicates evaluated against the request:
- Keyword presence — does the request contain this word or phrase?
- Entity type — does the input include a contract, invoice, date, dollar amount?
- Channel source — is this request coming from the Chatwoot integration, the API, a specific user group?
- Pattern match — does the text match a regex?
Rules include a target agent reference and a confidence value that gets returned in the RouteDecision. A rule that matches on an exact phrase like "redline this contract" can return confidence 1.0 — that's unambiguous.
LRU cache on pattern matches:
The condition router maintains an LRU (least-recently-used) cache of recent predicate evaluations. If the same or similar pattern was evaluated in a recent request, the cached result is returned without re-evaluating the predicates. On a cache hit, latency drops to microseconds.
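A minimal sketch of a condition rule and the cached evaluation path. The rule shape, the example rule, and the use of `functools.lru_cache` are illustrative assumptions, not Nova OS internals:

```python
import re
from functools import lru_cache

# One illustrative rule: an exact phrase routes to a contract agent
# with confidence 1.0, since the match is unambiguous.
RULES = [
    {
        "pattern": re.compile(r"\bredline this contract\b", re.IGNORECASE),
        "agent_id": "legal.contract_redliner",
        "confidence": 1.0,
    },
]

@lru_cache(maxsize=1024)  # cache hit: microseconds, no re-evaluation
def condition_route(text: str):
    for rule in RULES:
        if rule["pattern"].search(text):
            return (rule["agent_id"], rule["confidence"], "condition")
    # No match: return low confidence so the cascade escalates to Tier 2.
    return (None, 0.0, "condition")
```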
What makes a good Tier 1 rule:
Tier 1 rules work best for requests that follow predictable patterns — structured commands, form-like inputs, channel-specific request types. Enterprise deployments in regulated industries often have consistent request vocabularies: "analyze this policy document," "check clause 4.2 for compliance," "generate the Q3 financial summary." These map cleanly onto condition rules.
Tier 1 failure mode:
The condition router fails gracefully. If no rule matches with sufficient confidence, it returns a low-confidence RouteDecision and the cascade continues to Tier 2. Nothing breaks — the cascade was designed for exactly this case.
Tier 2: The Semantic Matcher (20–50ms)
When the condition router doesn't produce a confident match, the semantic matcher runs. Instead of rules, it uses multi-criteria ranking against all registered agent profiles.
How semantic matching works:
Each agent in the system has a profile with a description, a set of capability tags, and an embedding. When a request arrives, the semantic matcher:
- Embeds the request text
- Computes cosine similarity against all agent profile embeddings
- Applies multi-criteria ranking using four weighted criteria
| Criterion | Routing weight |
|---|---|
| Description similarity (semantic) | 0.40 |
| Capability match | 0.30 |
| Tag overlap | 0.15 |
| Trust score | 0.15 |
The description similarity uses the embedding comparison — how semantically close the request is to the agent's described purpose. Capability match checks whether the required capabilities of the task (extracted by the InputContextAnalyzer) are listed in the agent's capability set. Tag overlap is a lightweight exact-match boost for domain tags.
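The first-pass criteria scoring from the table above could look like the following. The function signature and the normalization of capability and tag overlap are illustrative choices, not Nova OS internals; inputs are assumed to be in [0, 1]:

```python
CRITERIA_WEIGHTS = {"description": 0.40, "capability": 0.30, "tags": 0.15, "trust": 0.15}

def criteria_score(similarity, required_caps, agent_caps, req_tags, agent_tags, trust):
    """First-pass weighted score for one agent against one request."""
    # Fraction of required capabilities this agent provides.
    cap_match = len(required_caps & agent_caps) / max(len(required_caps), 1)
    # Fraction of the request's domain tags the agent shares (exact match).
    tag_overlap = len(req_tags & agent_tags) / max(len(req_tags), 1)
    return (CRITERIA_WEIGHTS["description"] * similarity
            + CRITERIA_WEIGHTS["capability"] * cap_match
            + CRITERIA_WEIGHTS["tags"] * tag_overlap
            + CRITERIA_WEIGHTS["trust"] * trust)
```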
The final ranking formula (used to select among candidates that passed the criteria scoring) weighs:
| Factor | Final selection weight |
|---|---|
| Semantic similarity | 0.60 |
| Credibility (trust) | 0.20 |
| Availability | 0.10 |
| Cost | 0.10 |
This two-pass approach — first score candidates on criteria, then rank by the selection formula — ensures that the agent selected is both relevant (semantic similarity) and reliable (credibility), while accounting for current load and model cost.
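The second pass can be sketched as a weighted argmax over the surviving candidates. The candidate field names and the inversion of cost (cheaper scores higher) are illustrative assumptions:

```python
SELECTION_WEIGHTS = {"semantic": 0.60, "credibility": 0.20, "availability": 0.10, "cost": 0.10}

def select_agent(candidates):
    """Rank candidates that passed criteria scoring; all fields in [0, 1]."""
    def final_score(c):
        return (SELECTION_WEIGHTS["semantic"] * c["semantic"]
                + SELECTION_WEIGHTS["credibility"] * c["credibility"]
                + SELECTION_WEIGHTS["availability"] * c["availability"]
                + SELECTION_WEIGHTS["cost"] * (1.0 - c["cost"]))  # cheaper is better
    return max(candidates, key=final_score)
```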
The trust score in detail:
Trust is not static. It is a composite of each agent's running performance history:
| Component | Weight |
|---|---|
| Success rate | 40% |
| Latency | 20% |
| Error rate | 25% |
| Data freshness | 15% |
An agent that handles its routed requests successfully and quickly earns a higher trust score over time. An agent with a recent spike in errors or timeouts earns a lower trust score and gets ranked below alternatives. This feedback loop means routing quality improves as the system accumulates production history — newly configured agents with no history start at neutral scores; agents with track records are ranked accordingly.
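The composite from the table above can be expressed directly. How latency and freshness are normalized into [0, 1] is an illustrative assumption here; the error-rate term is inverted so fewer errors score higher:

```python
TRUST_WEIGHTS = {"success_rate": 0.40, "latency": 0.20, "error_rate": 0.25, "freshness": 0.15}

def trust_score(success_rate, latency_score, error_rate, freshness):
    """Composite trust from an agent's running history.
    latency_score and freshness are assumed pre-normalized to [0, 1],
    higher is better; error_rate in [0, 1] is inverted."""
    return (TRUST_WEIGHTS["success_rate"] * success_rate
            + TRUST_WEIGHTS["latency"] * latency_score
            + TRUST_WEIGHTS["error_rate"] * (1.0 - error_rate)
            + TRUST_WEIGHTS["freshness"] * freshness)
```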
Agent availability and publish state:
The semantic matcher only considers agents whose published field is set to true. Unpublished agents — those under development, or temporarily disabled — are excluded from routing entirely, even if their semantic profile would match. This gives operators full control over which agents participate in routing without needing to delete or modify their profiles.
Tier 2 confidence:
The semantic matcher produces a confidence score based on how clearly the top-ranked agent separates from the next-best candidate. A request that maps cleanly to one agent's profile returns high confidence. A request where two or three agents score similarly returns lower confidence — indicating genuine ambiguity — and the cascade escalates to Tier 3.
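One way to derive confidence from that separation is to map the relative gap between the top two scores into a confidence value. The article does not specify the mapping; the linear form below is an illustrative sketch:

```python
def separation_confidence(scores):
    """Confidence from how clearly the top agent separates from the runner-up.
    `scores` are final ranking scores; the margin-to-confidence mapping
    is an illustrative choice, not the Nova OS formula."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2 or ranked[0] == 0:
        return 1.0 if ranked else 0.0
    margin = (ranked[0] - ranked[1]) / ranked[0]  # relative gap in [0, 1]
    return min(1.0, 0.5 + margin)  # tight race → ~0.5, clear winner → up to 1.0
```

A clear winner clears the 0.8 threshold and routing completes; near-ties fall below it and escalate to Tier 3.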
Tier 3: The LLM Router (500–2000ms)
The third tier is an LLM-based intent classification call. It runs only when neither the condition router nor the semantic matcher reaches the confidence threshold — typically on requests that are ambiguous across domains, novel in phrasing, or multi-part in nature.
What the LLM router does:
The router sends the request to a small, fast language model with a structured classification prompt. The model returns a JSON response:
```json
{
  "agent_id": "legal.compliance_checker",
  "confidence": 0.91,
  "reasoning": "Request asks for regulatory compliance review of a contract provision."
}
```
The agent_id directly maps to a registered agent. The confidence is used by the cascade to produce the RouteDecision. The reasoning field is stored in the call log for observability.
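Because a model response is untrusted input, the router needs to validate it before producing a RouteDecision. A minimal parsing sketch, with a hypothetical registry set standing in for the real agent registry:

```python
import json

# Illustrative stand-in for the agent registry lookup.
KNOWN_AGENTS = {"legal.compliance_checker", "finance.report_generator"}

def parse_llm_route(raw: str):
    """Validate the classifier's JSON: agent_id must map to a registered
    agent and confidence must be a number in [0, 1]."""
    data = json.loads(raw)
    agent_id = data["agent_id"]
    confidence = float(data["confidence"])
    if agent_id not in KNOWN_AGENTS or not 0.0 <= confidence <= 1.0:
        raise ValueError(f"invalid route decision: {raw!r}")
    return agent_id, confidence, data.get("reasoning", "")
```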
Why a small model:
The LLM router is not the same model that executes the task. It is a small, fast model whose only job is classification — determining which agent should handle the request. Using a smaller, faster model for this purpose keeps Tier 3 latency in the 500–2000ms range rather than the full inference cost of a production-grade model.
Tier 3 always produces a result:
Unlike Tiers 1 and 2, which may return insufficient confidence and trigger escalation, Tier 3 always returns an agent selection. It is the final tier — there is no Tier 4. If the LLM router produces a low-confidence result, the cascade uses it anyway and records the low confidence in the routing decision log. Operators can audit low-confidence Tier 3 decisions and use them to improve Tier 1 rules or Tier 2 agent profiles.
How the Input Analyzer Feeds the Cascade
Before any tier runs, the InputContextAnalyzer processes the request and produces signals that all three tiers use:
Complexity score:
```
(tokens / 100) × 0.3 + (entities / 5) × 0.2 + (tools / 3) × 0.3 + (depth / 3) × 0.2
```
Requests below 1.0 are classified as simple. Requests at or above 1.0 are complex. Complex requests bypass the small LLM router fast path and go directly into semantic matching, because they are more likely to require nuanced agent selection.
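Translating the formula and the 1.0 cutoff into code:

```python
def complexity_score(tokens, entities, tools, depth):
    """Complexity score from the InputContextAnalyzer formula."""
    return (tokens / 100) * 0.3 + (entities / 5) * 0.2 + (tools / 3) * 0.3 + (depth / 3) * 0.2

def is_complex(tokens, entities, tools, depth):
    """At or above 1.0 → complex; below → simple."""
    return complexity_score(tokens, entities, tools, depth) >= 1.0
```

A request with 100 tokens, 5 entities, 3 tool references, and depth 3 scores exactly 1.0 and is treated as complex.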
Intent and entities:
The extracted intent and entities feed directly into the Tier 1 predicate matching (entity type checks) and the Tier 2 capability matching (required capabilities derived from intent). The analysis runs once and its output is shared across all three tiers — there is no redundant classification at each stage.
The Fallback Chain After Routing
After the cascade selects an agent, the discovery layer and resilience layer extend the routing decision:
If the selected agent is BUSY or OFFLINE, the FallbackChain activates — an ordered list of alternative agents configured for each primary agent. Fallbacks are tried in order until an available agent is found.
If the selected agent fails mid-execution (error, timeout, or circuit breaker open), PlanRepair re-plans the remaining task steps through the next available agent.
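The FallbackChain's ordered traversal can be sketched as follows; the status values and function name are illustrative, not the Nova OS API:

```python
def resolve_with_fallbacks(primary_id, fallback_chain, status):
    """Try the primary agent, then each configured fallback in order,
    until one is AVAILABLE. `status` maps agent_id → state string."""
    for agent_id in [primary_id] + fallback_chain:
        if status.get(agent_id) == "AVAILABLE":
            return agent_id
    return None  # nothing available; caller escalates (e.g. PlanRepair or error)
```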
Every routing decision — primary selection, confidence score, method used (condition/semantic/llm), fallback activations — is recorded in the call log. These records feed back into the trust score calculation, completing the improvement loop.
Why Three Tiers, Not One
The alternative to a cascade is a single-method router. The tradeoffs are straightforward:
Rules only (Tier 1): Fastest, but fails on novel phrasing and multi-domain requests. Rules become brittle as the request vocabulary expands. Production systems accumulate hundreds of edge cases that rules don't cover.
Semantic only (Tier 2): More flexible than rules, but semantic similarity doesn't distinguish well between agents with overlapping domains. A request about "financial compliance in a contract" lands in both Finance and Legal; embeddings alone don't resolve that.
LLM only (Tier 3): Most accurate, handles ambiguity, but adds 500–2000ms to every request regardless of complexity. For a system processing thousands of requests per day, this latency cost is not acceptable.
The cascade architecture gets the speed of rules for requests that fit patterns, the flexibility of semantics for requests that don't, and the accuracy of an LLM for the subset that genuinely requires it. The 96% routing accuracy achieved on evaluation sets reflects the combination — no single tier would reach that number alone.
Configuring the Cascade
The confidence threshold (default 0.8) is operator-configurable. Lowering it increases speed — more requests exit at cheaper tiers — but reduces accuracy. Raising it forces more requests through to Tier 3 for higher accuracy at higher latency cost.
Condition routing rules are defined at deployment time and updated through the operator dashboard. Well-configured Tier 1 rules are the most impactful optimization: every request that exits at Tier 1 avoids the embedding computation of Tier 2 and the LLM call of Tier 3.
Agent profiles — the descriptions, capability tags, and embeddings that Tier 2 uses — are managed through the agent registry. Operators who write clear, specific agent descriptions and accurate capability tags directly improve Tier 2 routing quality.
The trust score adjusts automatically. No operator action is required — routing quality improves as agents process more requests.
Stay Connected
💻 Website: meganova.ai
📖 Docs: docs.meganova.ai
✍️ Blog: Read our Blog
🐦 Twitter: @meganovaai
🎮 Discord: Join our Discord