Special Judge / Output Protocol Workflow

This page is for batch tasks where exact stdout diffing is the wrong trust model.

  • Trigger: special judge, custom validator, predicate-checked batch tasks, or many-valid-answers problems whose legality contract is already clear enough to implement a checker
  • Inputs needed: one candidate solution plus a checker, validator, predicate script, or score-aware harness
  • Output artifact: one reproducible local validation loop with at least one negative case and one accepted-looking positive case
  • Stop condition: legality and output protocol are separated cleanly from the solving logic
  • Pair with: Many-Valid-Answers / Validator-First Workflow, Local judge workflow, Anti-Hack Workflow

Use it when:

  • the task accepts many valid outputs
  • correctness depends on a predicate, not one reference output
  • the judge computes legality or score through custom logic
  • simple stdin/stdout replay does not tell you whether the output is actually valid

If the problem is still at the stage of "what exactly counts as a legal witness?", start one step earlier with Many-Valid-Answers / Validator-First Workflow.

Do not use this page if:

Which Workflow To Use Right Now

Core Goal

Separate these roles clearly:

  1. solution
  2. validator / checker / scorer
  3. instance source

If the same binary or script is implicitly doing all three jobs, debugging gets noisy: a failure could come from the solver, the checker, or the instance, and you cannot tell which.

Minimum Setup

For most special-judge tasks, keep this split:

  • sol.cpp
  • check.py or check.cpp
  • optional gen.cpp
  • optional oracle.cpp if a small exact model exists
  • one saved failing input or seed

The goal is not to mimic the official judge perfectly. The goal is to reproduce the legality contract locally.
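A minimal sketch of this split in one harness, assuming a toy contract that is not taken from any real task: the input is one integer n, and any two positive integers a b with a + b = n are accepted. The names run_solution, check, and validate are hypothetical; in practice the command would point at the binary built from sol.cpp.

```python
import subprocess
import sys


def run_solution(cmd, input_text, timeout=5.0):
    """Run the candidate solution as a separate process and return its stdout."""
    result = subprocess.run(
        cmd, input=input_text, capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        raise RuntimeError(f"solution exited with code {result.returncode}")
    return result.stdout


def check(input_text, output_text):
    """Assumed toy contract: output two positive integers a b with a + b == n."""
    n = int(input_text.split()[0])
    tokens = output_text.split()
    if len(tokens) != 2:
        return False, "expected exactly two tokens"
    try:
        a, b = int(tokens[0]), int(tokens[1])
    except ValueError:
        return False, "tokens are not integers"
    if a < 1 or b < 1:
        return False, "values must be positive"
    if a + b != n:
        return False, "a + b != n"
    return True, "ok"


def validate(cmd, input_text):
    """Instance in, solution output through the checker, verdict out."""
    return check(input_text, run_solution(cmd, input_text))


# Stand-in for ./sol so the sketch runs without a compiler (illustrative only).
STAND_IN_SOLUTION = [sys.executable, "-c", "n = int(input()); print(1, n - 1)"]
```

Calling validate(STAND_IN_SOLUTION, "10\n") exercises all three roles without any of them knowing about the others. The checker never sees a reference output, only the instance and the candidate output.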

Output Contract Card

Before you trust the solution, fill this card:

Check                                  | Your answer
exact output format                    |
output domain                          |
per-step legality rule                 |
final acceptance predicate             |
if scored, what score means locally    |

If any row is blank, you are still comparing vibes, not contracts.

Validator-First Loop

Default loop:

  1. write the legality checker first
  2. produce one tiny invalid output by hand
  3. make sure the checker rejects it
  4. run the candidate solution on a tiny valid case
  5. only then scale into generated or larger tests

This is the shortest route to avoiding "samples passed, but the custom judge hated it."
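The loop can be sketched on a hypothetical task (an assumption for illustration): print any permutation of 1..n, with n >= 4, in which no two adjacent values differ by exactly 1. Both check and solve are illustrative names, and solve stands in for whatever the real candidate does.

```python
def check(n, output_text):
    """Step 1: the legality checker, written before the solution is trusted."""
    tokens = output_text.split()
    if len(tokens) != n:
        return False, f"expected {n} tokens, got {len(tokens)}"
    try:
        perm = [int(t) for t in tokens]
    except ValueError:
        return False, "non-integer token"
    if sorted(perm) != list(range(1, n + 1)):
        return False, "not a permutation of 1..n"
    for x, y in zip(perm, perm[1:]):
        if abs(x - y) == 1:
            return False, f"adjacent values {x} and {y} differ by 1"
    return True, "ok"


def solve(n):
    """Candidate solution: all even values first, then all odd values."""
    evens = list(range(2, n + 1, 2))
    odds = list(range(1, n + 1, 2))
    return " ".join(map(str, evens + odds))


# Steps 2 and 3: hand-written invalid outputs must be rejected.
assert check(4, "1 2 3 4")[0] is False   # right format, illegal adjacency
assert check(4, "2 4 1 1")[0] is False   # duplicated value

# Step 4: the candidate on a tiny valid case.
assert check(4, solve(4)) == (True, "ok")

# Step 5: only now scale up.
for n in range(4, 200):
    ok, reason = check(n, solve(n))
    assert ok, (n, reason)
```

If the assertion in steps 2 and 3 ever passes an invalid output, the checker is the bug, and nothing learned from steps 4 and 5 can be trusted yet.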

Negative Test Families

When you do not know what to attack first, try:

  • correct format, wrong values
  • legal values, illegal move sequence
  • duplicated item in a purported set or permutation
  • missing item in a purported covering or partition
  • empty or one-element boundary
  • one-step-away-from-legal witness
  • output whose score looks the same but that actually violates the predicate
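Several of these families can be produced mechanically from one known-valid witness. The sketch below assumes the witness is a permutation of 1..n and uses a deliberately simple checker; negative_variants and surviving_variants are hypothetical helpers, not part of any judge toolkit.

```python
def is_valid_permutation(n, values):
    """Checker under test: exactly the values 1..n, each once."""
    return len(values) == n and sorted(values) == list(range(1, n + 1))


def negative_variants(valid):
    """Derive deliberately broken outputs from one valid permutation."""
    n = len(valid)
    if n < 2:
        raise ValueError("need at least two elements to build these variants")
    return {
        "duplicated item": valid[:-1] + [valid[0]],
        "missing item": valid[:-1],
        "out-of-range value": valid[:-1] + [n + 1],
        "extra trailing item": valid + [valid[0]],
        "empty output": [],
    }


def surviving_variants(valid):
    """Names of broken outputs the checker wrongly accepts (should be empty)."""
    n = len(valid)
    return [
        name
        for name, bad in negative_variants(valid).items()
        if is_valid_permutation(n, bad)
    ]
```

A non-empty result from surviving_variants([3, 1, 2]) would name the exact family the checker fails to reject, which is a better starting point than a failed submission.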

Score-Aware Caution

If your local scorer is only a partial reconstruction of the official one:

  • trust legality first
  • treat score as directional only
  • save the exact local assumption the scorer is making

Do not overfit to a fake local score model and call that correctness.
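One way to keep legality and score apart is to return them as separate fields and carry the scorer's assumption along with every verdict. This is a sketch on an assumed knapsack-style task (choose distinct 0-based item indices within a weight capacity); Verdict, judge, and the assumption text are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional

# The exact local assumption, saved next to the scorer instead of in memory.
SCORE_ASSUMPTION = (
    "local score = sum of chosen values; "
    "official scaling and tie-breaks are not modeled"
)


@dataclass
class Verdict:
    legal: bool
    score: Optional[int]  # None whenever the output is illegal
    reason: str
    assumption: str = SCORE_ASSUMPTION


def judge(weights, values, capacity, chosen):
    """Decide legality first; compute a score only for legal outputs."""
    if len(set(chosen)) != len(chosen):
        return Verdict(False, None, "duplicate index")
    if any(i < 0 or i >= len(weights) for i in chosen):
        return Verdict(False, None, "index out of range")
    if sum(weights[i] for i in chosen) > capacity:
        return Verdict(False, None, "capacity exceeded")
    return Verdict(True, sum(values[i] for i in chosen), "ok")
```

Because an illegal output never receives a number, a high-looking score cannot mask a legality failure, and the stored assumption makes it obvious what the local number does and does not claim.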

Batch Output Protocol Split

For many special-judge tasks, it helps to separate two checks:

  1. protocol check: did the output shape and token count follow the statement?
  2. meaning check: does the output actually satisfy the predicate?

This split catches a lot of "the answer idea is fine, but the serialized output is not."
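The two checks can live in separate functions that fail in visibly different ways. The sketch assumes a hypothetical output format: the first line holds k (k >= 1), the second line holds k 1-based indices, and the chosen input values must sum to a target. ProtocolError, parse_output, and check_meaning are illustrative names.

```python
class ProtocolError(Exception):
    """The output shape or token count does not match the statement."""


def parse_output(output_text):
    """Protocol check: two lines, an integer k, then exactly k integers."""
    lines = output_text.strip().splitlines()
    if len(lines) != 2:
        raise ProtocolError(f"expected 2 lines, got {len(lines)}")
    try:
        k = int(lines[0])
        indices = [int(t) for t in lines[1].split()]
    except ValueError as exc:
        raise ProtocolError("malformed integer token") from exc
    if len(indices) != k:
        raise ProtocolError(f"declared k={k} but found {len(indices)} indices")
    return indices


def check_meaning(values, target, indices):
    """Meaning check: distinct in-range indices whose values sum to target."""
    n = len(values)
    if any(i < 1 or i > n for i in indices):
        return False, "index out of range"
    if len(set(indices)) != len(indices):
        return False, "repeated index"
    if sum(values[i - 1] for i in indices) != target:
        return False, "chosen values do not sum to target"
    return True, "ok"
```

A ProtocolError points at serialization, while a False from check_meaning points at the idea, so the two kinds of failure no longer look the same in the log.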

Done When

  • you can rerun one stable checker command without guessing
  • the checker rejects at least one deliberate invalid output
  • the candidate output passes the checker on at least one nontrivial case
  • legality is no longer being inferred from a raw diff against one reference file

Good Pairings

References And Repo Anchors

Research snapshot refreshed on 2026-04-25.

Official / primary:

Repo anchors: