Special Judge / Output Protocol Workflow

This page is for batch tasks where exact stdout diffing is the wrong trust model.

Trigger: special judge, custom validator, predicate-checked batch tasks, or many-valid-answers problems whose legality contract is already clear enough to implement a checker
Inputs needed: one candidate solution plus a checker, validator, predicate script, or score-aware harness
Output artifact: one reproducible local validation loop with at least one negative case and one accepted-looking positive case
Stop condition: legality and output protocol are separated cleanly from the solving logic
Pair with: Many-Valid-Answers / Validator-First Workflow, Local judge workflow, Anti-Hack Workflow

Use it when:

the task accepts many valid outputs
correctness depends on a predicate, not one reference output
the judge computes legality or score through custom logic
simple stdin/stdout replay does not tell you whether the output is actually valid
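To make the difference between the two trust models concrete, here is a minimal sketch. It assumes a hypothetical task ("print any two distinct 1-based indices i, j with a[i] + a[j] == target"); the task and the function names `diff_accepts` and `predicate_accepts` are illustrative, not taken from any real judge.

```python
def diff_accepts(candidate: str, reference: str) -> bool:
    """Exact-match trust model: compare whitespace-separated tokens."""
    return candidate.split() == reference.split()


def predicate_accepts(a: list[int], target: int, candidate: str) -> bool:
    """Predicate trust model: accept any output that satisfies the rule.

    Hypothetical rule: two distinct 1-based indices whose values sum to target.
    """
    tokens = candidate.split()
    if len(tokens) != 2:
        return False
    try:
        i, j = (int(t) for t in tokens)
    except ValueError:
        return False
    n = len(a)
    if not (1 <= i <= n and 1 <= j <= n) or i == j:
        return False
    return a[i - 1] + a[j - 1] == target
```

With `a = [1, 4, 3, 2]` and `target = 5`, both `1 2` and `3 4` are legal answers, but a diff against a reference of `1 2` rejects `3 4`. Only the predicate reflects what the task actually accepts.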

If the problem is still at the stage of "what exactly counts as a legal witness?", start one step earlier with Many-Valid-Answers / Validator-First Workflow.

Do not use this page if:

the task is fully interactive; use Local judge workflow
the task is an ordinary unique-answer batch task; use Stress testing workflow
the real issue is adversarial batch hacking after the idea is known; use Anti-Hack Workflow

Which Workflow To Use Right Now

ordinary unique-answer batch task -> Stress testing workflow
interactive simulator or transcript problem -> Local judge workflow
hack-sensitive constructive task -> start here, then pair with Anti-Hack Workflow
many-valid-answers task and the legality contract is still fuzzy -> Many-Valid-Answers / Validator-First Workflow
predicate-checked batch output or special judge with a clear contract -> this page

Separate these roles clearly:

If the same binary or script is implicitly doing all three jobs, debugging becomes too emotional and too noisy.

For most special-judge tasks, keep this split:

The goal is not to mimic the official judge perfectly. The goal is to reproduce the legality contract locally.
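One way to picture the split, as a minimal sketch: it assumes the three jobs are input validation, solving, and output checking, and it uses a hypothetical task (print any permutation of 1..n with no fixed point). The task, the limits, and the names `validate_input`, `solve`, and `check` are illustrative assumptions, not the official judge's interface.

```python
def validate_input(raw: str) -> int:
    """Input validator: is the test itself well-formed? (assumed limits)"""
    tokens = raw.split()
    if len(tokens) != 1:
        raise ValueError("expected exactly one integer")
    n = int(tokens[0])
    if not 2 <= n <= 10**5:
        raise ValueError("n out of range")
    return n


def solve(n: int) -> str:
    """Candidate solution: shift every value by one position (cyclic)."""
    return " ".join(str(i % n + 1) for i in range(1, n + 1))


def check(n: int, output: str) -> tuple[bool, str]:
    """Checker: knows the legality contract, knows nothing about solve()."""
    tokens = output.split()
    if len(tokens) != n:
        return False, f"expected {n} numbers, got {len(tokens)}"
    try:
        p = [int(t) for t in tokens]
    except ValueError:
        return False, "non-integer token"
    if sorted(p) != list(range(1, n + 1)):
        return False, "not a permutation of 1..n"
    for pos, value in enumerate(p, start=1):
        if value == pos:
            return False, f"fixed point at position {pos}"
    return True, "ok"
```

Because `check` never calls `solve`, a rejection points at one of two places: the output is illegal, or the checker misreads the contract. That is a much smaller search space than "something is wrong somewhere."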

Before you trust the solution, fill this card:

Check	Your answer
exact output format
output domain
per-step legality rule
final acceptance predicate
if scored, what score means locally

If any row is blank, you are still comparing vibes, not contracts.
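If it helps to make the card mechanical, the rows can be kept as a small record next to the checker. This is only a sketch: the class name `ContractCard`, its field names, and the helper `blank_rows` are hypothetical and simply mirror the rows of the card.

```python
from dataclasses import dataclass, fields


@dataclass
class ContractCard:
    # One field per row of the card; an empty string means "not answered yet".
    exact_output_format: str = ""
    output_domain: str = ""
    per_step_legality_rule: str = ""
    final_acceptance_predicate: str = ""
    local_score_meaning: str = ""  # write "n/a" when the task is not scored


def blank_rows(card: ContractCard) -> list[str]:
    """Return the names of rows that are still empty, in card order."""
    return [f.name for f in fields(card) if not getattr(card, f.name).strip()]
```

A non-empty result from `blank_rows` is a signal to go back to the statement before writing more checker code.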

Default loop:

This is the shortest route to avoiding "samples passed, but the custom judge hated it."
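A minimal sketch of one such loop, under stated assumptions: the task is hypothetical ("given n >= 3, print two distinct positive integers that sum to n"), the solutions are plain Python callables rather than separate binaries, and the names `check`, `good_solution`, `broken_solution`, and `first_failure` are illustrative. It includes a deliberately broken solution so the loop is known to reject something, not only to accept.

```python
import random


def check(n: int, output: str) -> bool:
    """Legality predicate for the hypothetical task."""
    tokens = output.split()
    if len(tokens) != 2:
        return False
    try:
        a, b = int(tokens[0]), int(tokens[1])
    except ValueError:
        return False
    return a >= 1 and b >= 1 and a != b and a + b == n


def good_solution(n: int) -> str:
    return f"1 {n - 1}"


def broken_solution(n: int) -> str:
    # Negative case: produces two equal halves whenever n is even.
    return f"{n // 2} {n - n // 2}"


def first_failure(solution, trials: int = 200, seed: int = 0):
    """Run small random inputs; return (n, output) for the first rejection."""
    rng = random.Random(seed)  # fixed seed keeps the loop reproducible
    for _ in range(trials):
        n = rng.randint(3, 50)
        out = solution(n)
        if not check(n, out):
            return n, out
    return None
```

Here `first_failure(good_solution)` should return `None`, while `first_failure(broken_solution)` should return a small even `n` together with the offending output. In a real setup the callables would be replaced by runs of the actual solution, with the checker kept separate.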

When you do not know what to attack first, try:

If your local scorer is only a partial reconstruction of the official one:

Do not overfit to a fake local score model and call that correctness.
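One way to keep that boundary visible in code is to report legality and score as separate fields and to name the score as approximate. The sketch below assumes a hypothetical scored task (choose distinct 1-based item indices whose total weight fits a capacity; the local score is the summed value). The `Verdict` type, the `judge` function, and the scoring rule are assumptions for illustration; the official scoring formula may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    legal: bool                  # hard requirement: derived from the statement
    reason: str
    approx_score: Optional[int]  # advisory only: a local guess at the score


def judge(weights: list[int], values: list[int], capacity: int,
          output: str) -> Verdict:
    try:
        chosen = [int(t) for t in output.split()]
    except ValueError:
        return Verdict(False, "non-integer token", None)
    if len(set(chosen)) != len(chosen):
        return Verdict(False, "duplicate index", None)
    if any(not 1 <= i <= len(weights) for i in chosen):
        return Verdict(False, "index out of range", None)
    if sum(weights[i - 1] for i in chosen) > capacity:
        return Verdict(False, "over capacity", None)
    # Legality is settled above; the number below is only a local estimate.
    return Verdict(True, "ok", sum(values[i - 1] for i in chosen))
```

Treat `legal` as the pass/fail signal and `approx_score` as a hint for comparing two legal outputs locally, never as evidence of the official score.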

For many special-judge tasks, it helps to separate two checks:

This split catches a lot of "the answer idea is fine, but the serialized output is not."
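A minimal sketch of that separation, assuming the two checks are a strict format parse and a semantic legality test. The task is hypothetical (first line k, second line k vertices forming a path from vertex 1 to vertex n in an undirected graph), and `FormatError`, `parse_output`, and `is_legal_path` are illustrative names.

```python
class FormatError(Exception):
    """The serialized output does not match the expected layout."""


def parse_output(text: str) -> list[int]:
    """Check 1: format only. Knows nothing about the graph."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) != 2:
        raise FormatError("expected exactly two non-empty lines")
    try:
        k = int(lines[0])
        path = [int(t) for t in lines[1].split()]
    except ValueError as exc:
        raise FormatError(f"non-integer token: {exc}") from None
    if k != len(path):
        raise FormatError(f"declared {k} vertices, found {len(path)}")
    return path


def is_legal_path(n: int, edges: set[tuple[int, int]], path: list[int]) -> bool:
    """Check 2: semantics only. Assumes the output already parsed cleanly."""
    if not path or path[0] != 1 or path[-1] != n:
        return False
    if any(not 1 <= v <= n for v in path):
        return False
    # Every consecutive pair must be an edge (in either direction).
    return all((u, v) in edges or (v, u) in edges
               for u, v in zip(path, path[1:]))
```

With `edges = {(1, 2), (2, 3)}` and `n = 3`, the output `"3\n1 2\n"` fails the first check (count mismatch) and `"2\n1 3\n"` passes the first check but fails the second (no such edge). Reporting which check failed tells you whether to fix the printing code or the algorithm.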

Research snapshot refreshed on 2026-04-25.
