How the AI Judge in the Arena Scores Your Character (The Criteria Explained)

When you run a benchmark in the Arena tab, the scores your character receives are not produced by hand-coded heuristics or simple pattern-matching rules. They come from an LLM judge that reads your character's actual responses and evaluates them against defined criteria. Understanding what that judge looks at, and how it maps those evaluations to the five scores on the radar chart, makes the results interpretable and actionable.

This article explains the full evaluation pipeline: what the judge sees, what it scores, and how the Immersion dimension gets special treatment.


What the Judge Actually Sees

For each scenario, the benchmark runs a two-step process:

Step 1 — The character responds. The scenario sends a test prompt to your character using the same chat infrastructure that powers deployed characters. The character produces a real response in its actual operating conditions.

Step 2 — The judge evaluates. Your character's system instruction (the full definition), the scenario name and tags, the test prompt, and the character's actual response are sent together to the LLM judge. The judge has no prior context — it evaluates each scenario fresh.

The judge prompt assembles these four inputs:

CHARACTER DEFINITION: [your full system instruction]

SCENARIO: [scenario title] ([tags])
TEST PROMPT: [the message sent to your character]
CHARACTER RESPONSE: [your character's actual reply]

The judge is set to temperature 0.2, a low setting suited to evaluation tasks. This keeps the scoring consistent and repeatable across runs, though sampling at any nonzero temperature is never strictly deterministic.
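Put together, the two steps reduce to a small amount of orchestration code. The sketch below is illustrative only: the function and type names are assumptions, not Meganova's actual API, and it shows just the flow of the four inputs into the judge call.

```typescript
// Illustrative sketch of the two-step pipeline; names are not the real API.
interface Scenario {
  title: string;
  tags: string[];
  testPrompt: string;
}

async function runScenario(
  character: { systemInstruction: string },
  scenario: Scenario,
  chat: (system: string, user: string) => Promise<string>,
  judge: (prompt: string, opts: { temperature: number }) => Promise<string>,
): Promise<string> {
  // Step 1: the character responds under its real system instruction.
  const response = await chat(character.systemInstruction, scenario.testPrompt);

  // Step 2: the judge sees all four inputs together, with no prior context.
  const judgePrompt = [
    `CHARACTER DEFINITION: ${character.systemInstruction}`,
    "",
    `SCENARIO: ${scenario.title} (${scenario.tags.join(", ")})`,
    `TEST PROMPT: ${scenario.testPrompt}`,
    `CHARACTER RESPONSE: ${response}`,
  ].join("\n");

  // A low temperature keeps the judge's scoring consistent across runs.
  return judge(judgePrompt, { temperature: 0.2 });
}
```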


The Four Criteria the Judge Scores

The judge evaluates every response on four sub-dimensions, each scored 0–100:

Consistency — Does the response align with the character's defined personality, background, speech patterns, and behavioral traits? This compares the response directly against the system instruction. A character that answers in a way that contradicts its stated values, uses vocabulary inconsistent with its voice, or behaves differently than its definition says it should will lose points here.

Immersion — Does the character stay fully in-character? Any meta-language, AI acknowledgment, or fourth-wall breaks? The judge looks for responses where the character acknowledges being an AI, explains its own rules or constraints, or otherwise steps outside the fiction of being the character.

Emotional Depth — Is the response emotionally appropriate, nuanced, and engaging? This evaluates whether the emotional register matches the scenario and whether the character responds with genuine emotional texture rather than flat or formulaic replies.

Engagement — Would a user want to continue this conversation? Is it interesting? This is the forward-looking criterion — not just whether the response was correct, but whether it was compelling.

The judge returns a JSON object with individual scores for each criterion, an overall score, a pass/fail flag, a list of specific issues identified, and a brief reasoning explanation.
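In TypeScript terms, that returned object has roughly the following shape. The field names are assumptions inferred from the description above, not a documented schema.

```typescript
// Assumed shape of the judge's JSON output; field names are illustrative.
interface JudgeResult {
  consistency: number;    // 0-100
  immersion: number;      // 0-100
  emotionalDepth: number; // 0-100
  engagement: number;     // 0-100
  overallScore: number;   // 0-100 summary score
  passed: boolean;        // pass/fail flag
  issues: string[];       // specific problems identified in the response
  reasoning: string;      // brief explanation of how the judge scored it
}
```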


How the Four Sub-Scores Map to Five Radar Dimensions

The Arena radar chart shows five dimensions: Consistency, Immersion, Memory, Emotion, and Agency. These don't map one-to-one with the judge's four criteria.

The five radar dimensions correspond to scenario types (tags assigned to each test scenario), not directly to judge criteria. Each scenario is tagged to indicate which capability it primarily tests:

| Benchmark dimension | Scenario tag | What it tests |
| --- | --- | --- |
| Consistency | consistency | Persona adherence, value stability, speech patterns |
| Immersion | immersion | Fourth-wall defense, meta-question handling |
| Memory | memory | Recall of past events, context retention |
| Emotion | emotion | Emotional appropriateness and range |
| Agency | narrative | Scene advancement, proactive contribution |

Scenarios tagged stress contribute to multiple dimensions depending on context. Each scenario is evaluated by the judge, and the resulting score is assigned to the dimension corresponding to the primary tag. The dimension score shown on the radar chart is the average of all scenario scores for that dimension.
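A minimal sketch of that mapping and averaging, assuming each scenario result carries its primary tag and the judge's score (names are illustrative):

```typescript
// Tag-to-dimension mapping as described in the table above.
const TAG_TO_DIMENSION: Record<string, string> = {
  consistency: "Consistency",
  immersion: "Immersion",
  memory: "Memory",
  emotion: "Emotion",
  narrative: "Agency",
};

interface ScenarioResult {
  primaryTag: string;
  score: number; // the judge's score for this scenario
}

function dimensionScores(results: ScenarioResult[]): Record<string, number> {
  const buckets: Record<string, number[]> = {};
  for (const r of results) {
    const dim = TAG_TO_DIMENSION[r.primaryTag];
    if (!dim) continue; // e.g. "stress" scenarios are assigned contextually
    (buckets[dim] ??= []).push(r.score);
  }
  // Each radar dimension is the average of its scenario scores.
  const averages: Record<string, number> = {};
  for (const [dim, scores] of Object.entries(buckets)) {
    averages[dim] = scores.reduce((a, b) => a + b, 0) / scores.length;
  }
  return averages;
}
```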


Immersion Gets a Second Layer of Evaluation

Immersion is the only dimension that runs through two independent scoring systems — the LLM judge and a rule-based pattern detector. The final Immersion score is the lower of the two.

The pattern detector (immersionDetector) scans your character's response for specific language patterns and applies automatic penalties:

Critical violations — score drops to 0:
Any response matching patterns like "I'm an AI," "I am an artificial intelligence," "I'm not a real person," or "as an AI" triggers an instant zero. These are explicit disclosures that destroy immersion entirely.

High violations — score capped at 40 or lower:
Meta-language patterns: "break character," "in character," "playing a role," "the role I'm playing." System explanation patterns: "I'm designed to," "my guidelines," "I must follow." Each additional violation further reduces the score.

Medium violations — score capped at 50–70:
Role acknowledgment patterns: "pretending to be," "acting as," "simulating," "roleplaying." These are softer violations: the character shows awareness of performing a role without explicitly disclosing that it is an AI.

If the LLM judge scores immersion at 85 but the pattern detector finds a high-severity violation and scores it at 30, the benchmark records 30. The pattern detector acts as a hard cap that the LLM judge cannot override.

This two-layer approach catches cases where the judge might read a response charitably while the character still used objectively disqualifying language.
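A simplified sketch of the two layers is below. The pattern lists are abbreviated from the tiers above, and the exact regexes and penalty amounts inside immersionDetector are assumptions.

```typescript
// Abbreviated pattern tiers; the real detector's lists are longer.
const CRITICAL = [/i'?m an ai/i, /artificial intelligence/i, /as an ai/i];
const HIGH = [/break character/i, /i'?m designed to/i, /my guidelines/i];
const MEDIUM = [/pretending to be/i, /acting as/i, /roleplaying/i];

function detectorScore(response: string): number {
  // Critical disclosures zero the score outright.
  if (CRITICAL.some((p) => p.test(response))) return 0;

  let score = 100;

  // High-severity hits cap the score at 40; each extra hit reduces it
  // further (the exact per-hit penalty is an assumption).
  const highHits = HIGH.filter((p) => p.test(response)).length;
  if (highHits > 0) score = Math.min(score, 40) - (highHits - 1) * 10;

  // Medium-severity hits cap the score in the 50-70 band.
  const mediumHits = MEDIUM.filter((p) => p.test(response)).length;
  if (mediumHits > 0) score = Math.min(score, 70) - (mediumHits - 1) * 10;

  return Math.max(score, 0);
}

// Math.min means the detector caps the final Immersion score: the
// benchmark always records the lower of the two layers.
const finalImmersion = (judgeScore: number, response: string): number =>
  Math.min(judgeScore, detectorScore(response));
```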


The Scoring Thresholds

70 — pass/fail line. Each scenario is marked passed if its score is 70 or above. Below 70 triggers the failure state and generates specific issue text.

75 — weakest dimension flag. After calculating all five dimension scores, the benchmark identifies the lowest score. If it's below 75, that dimension is marked as the weakest and surfaces in the post-run summary. A character with Immersion at 68 and all other dimensions above 80 will have Immersion flagged as the problem area.

Overall score — the average of all five dimension scores, but only counting dimensions that were actually tested. If a benchmark run covered Consistency and Immersion scenarios but not Memory, the overall score averages two dimensions, not five.
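Expressed as code, the threshold logic looks roughly like this sketch, which assumes dimension scores arrive as a name-to-score record (all names illustrative):

```typescript
const PASS_THRESHOLD = 70;     // per-scenario pass/fail line
const WEAKNESS_THRESHOLD = 75; // weakest-dimension flag

const scenarioPassed = (score: number): boolean => score >= PASS_THRESHOLD;

function summarize(dimensions: Record<string, number>) {
  // Only dimensions that were actually tested appear in the record,
  // so untested dimensions never dilute the overall average.
  const tested = Object.entries(dimensions);
  if (tested.length === 0) return { overall: 0, weakestDimension: null };

  const overall =
    tested.reduce((sum, [, score]) => sum + score, 0) / tested.length;

  // The lowest-scoring dimension is flagged only when it falls below 75.
  const [weakestName, weakestScore] = tested.reduce((min, cur) =>
    cur[1] < min[1] ? cur : min,
  );

  return {
    overall,
    weakestDimension: weakestScore < WEAKNESS_THRESHOLD ? weakestName : null,
  };
}
```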


The Issues List and Reasoning

Every low-scoring scenario produces an issues list and a reasoning explanation. These are the most actionable outputs from a benchmark run.

The issues list contains specific problems the judge identified — the text is written to describe the actual failure, not a generic category. "Character referenced its own programming in response to a casual question" is more useful than "Immersion failed."

The reasoning field contains the judge's brief explanation of how it scored the response. This is the place to look when a score is surprising — it shows which criterion drove the score down and why.
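As an invented example, a failed immersion scenario might come back looking like this, using the JudgeResult shape sketched earlier. Every value is hypothetical.

```typescript
// Hypothetical judge output for a failed immersion scenario.
const example: JudgeResult = {
  consistency: 78,
  immersion: 35,
  emotionalDepth: 72,
  engagement: 65,
  overallScore: 62,
  passed: false, // below the 70 pass/fail line
  issues: [
    "Character referenced its own programming in response to a casual question",
  ],
  reasoning:
    "The response matched the persona's voice but broke the fiction by " +
    "explaining its own constraints, which drove the immersion score down.",
};
```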

In the detailed results view, expand any dimension on the radar chart to see individual scenario results. Each scenario entry shows the response that was evaluated, the score it received, the issues identified, and the judge's reasoning. This makes it possible to trace a low score back to its specific cause.


Caching and Staleness

Benchmark results are cached per scenario per character configuration. The cache is keyed on a hash of your character's current definition. If you edit the system instruction, change the model, or modify any part of the character, the hash changes and the cache is invalidated — the next benchmark run re-evaluates all scenarios against the updated character using live AI calls.

If a character is unmodified between runs, cached results are used. This prevents unnecessary AI calls when the character hasn't changed.

A result marked as "outdated" in the Arena tab means the character has been edited since the last benchmark run and the scores no longer reflect the current configuration. Run a fresh benchmark after making changes to get accurate scores.
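A plausible sketch of that keying, assuming a SHA-256 hash over the serialized configuration (the real hash function and key format may differ):

```typescript
import { createHash } from "node:crypto";

// Hypothetical config shape; the real one covers every editable field.
interface CharacterConfig {
  systemInstruction: string;
  model: string;
}

function cacheKey(character: CharacterConfig, scenarioId: string): string {
  // Any edit to the character changes the hash, so the lookup misses
  // and the scenario is re-evaluated with live AI calls.
  const hash = createHash("sha256")
    .update(JSON.stringify(character))
    .digest("hex");
  return `${scenarioId}:${hash}`;
}

// A cached result is "outdated" once the character's current hash no
// longer matches the key it was stored under.
const isOutdated = (
  storedKey: string,
  character: CharacterConfig,
  scenarioId: string,
): boolean => storedKey !== cacheKey(character, scenarioId);
```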


What the Judge Doesn't Evaluate

The judge evaluates response quality against the character definition — it doesn't evaluate the character definition itself.

If a character's system instruction is weak, the judge can identify the effects (inconsistent responses, low engagement), but it can't tell you that the root cause is in the definition. The fix suggestions generated after a low score will point at specific things to change in the system instruction, but interpreting why those changes will help requires understanding the criteria above.

The Arena's Apply Fix feature uses the scenario failures and their associated issues to generate targeted edits to the system instruction. That process uses the same issues list and reasoning the judge produced — so the more specific and accurate the judge's issue text, the better the generated fix.

Run a benchmark in the Arena tab →
