Benchmark Schema Reference

Manifest

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | yes | Benchmark pack identifier |
| version | integer | yes | Pack version (not semver; integer counter) |
| defaults.family | string | no | Default row family |
| defaults.environment | object | no | Merged into each row |
| asset_roots.public | string | no | Default assets/ |
| asset_roots.eval | string | no | Default hidden/ |
| asset_defaults.read_only | boolean | no | Default true |

Asset root paths must be relative POSIX paths without .. or backslashes.
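The path rule above can be sketched as a small check. This is a minimal illustration, not the pack loader's actual validator; the function name is_valid_asset_root is hypothetical, and rejecting the empty string is an assumption.

```python
from pathlib import PurePosixPath


def is_valid_asset_root(path: str) -> bool:
    """Return True if `path` is a relative POSIX path with no '..' or backslashes."""
    # Assumption: an empty path is rejected; backslashes are rejected outright.
    if not path or "\\" in path:
        return False
    parsed = PurePosixPath(path)
    if parsed.is_absolute():
        return False
    # PurePosixPath does not collapse '..', so it shows up as its own part.
    return ".." not in parsed.parts
```

Under this sketch the documented defaults assets/ and hidden/ pass, while ../secrets, a/../b, /abs, and a\b are rejected.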

JSONL row

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | yes | Task ID |
| family | string | conditional | Required if not in manifest defaults |
| input | object | no | Public fields |
| eval | object | no | Scoring fields |
| assets | object[] | no | Public file assets |
| environment | object | no | Overrides manifest environment |
| metadata | object | no | Opaque metadata |

Each JSONL line must be a single JSON object. Blank lines are skipped.
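The two line-level rules can be illustrated with a short reader. This is a sketch only: parse_jsonl is a hypothetical name, and ValueError stands in for whatever error the real loader raises (the error table in this reference suggests ConfigError for row parsing).

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into row objects, skipping blank lines."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped
        value = json.loads(line)
        if not isinstance(value, dict):
            # Arrays, strings, and numbers are valid JSON but not valid rows.
            raise ValueError(
                f"line {line_number}: expected a JSON object, "
                f"got {type(value).__name__}"
            )
        rows.append(value)
    return rows
```

Line numbers count blank lines too, so an error message points at the physical line in the file.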

Asset object (public assets[])

Common fields used in packs:

| Field | Type | Description |
| --- | --- | --- |
| path | string | Relative to asset_roots.public or eval root |
| mount | string | Workspace destination path |
| read_only | boolean | Override default read-only behavior |
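How a per-asset read_only might combine with the manifest's asset_defaults.read_only can be sketched as follows. The function name and the representation of the manifest as a nested dict are assumptions made for illustration.

```python
def resolve_read_only(asset: dict, manifest: dict) -> bool:
    """Pick the effective read-only flag for one asset."""
    # asset_defaults.read_only defaults to true when the manifest omits it.
    default = manifest.get("asset_defaults", {}).get("read_only", True)
    # An explicit read_only on the asset overrides the default.
    return asset.get("read_only", default)
```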

Planned: shared asset-object validation for type, mode, etc. (docs/next.md).

Environment object

| Field | Type | Families | Description |
| --- | --- | --- | --- |
| image | string | container-backed | Docker image reference |
| workdir | string | terminal_task, repo_patch | Container working directory |
| timeout_seconds | number | all | Positive timeout |
| materialize_workdir_from_image | boolean | terminal_task | Copy image workdir to host workspace before harness |

Family row schemas

Validators reject unknown fields in input and eval for active families.

multiple_choice

input: question (required), choices (required non-empty string array)

eval: answer (required string, number, or string/number array)
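As an illustration of the rules above, including the rejection of unknown fields, a multiple_choice row check might look like the sketch below. The function name, the error messages, and the choice to collect errors in a list are all assumptions; whether an empty answer array is accepted is not stated, and this sketch allows it.

```python
def _is_scalar(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (str, int, float)) and not isinstance(value, bool)


def validate_multiple_choice(row: dict) -> list[str]:
    """Return a list of problems found in a multiple_choice row."""
    errors = []
    input_fields = row.get("input", {})
    eval_fields = row.get("eval", {})

    # Unknown fields in input and eval are rejected.
    for section, fields, allowed in (
        ("input", input_fields, {"question", "choices"}),
        ("eval", eval_fields, {"answer"}),
    ):
        for key in sorted(set(fields) - allowed):
            errors.append(f"unknown field: {section}.{key}")

    if "question" not in input_fields:
        errors.append("input.question is required")

    choices = input_fields.get("choices")
    if not (isinstance(choices, list) and choices
            and all(isinstance(c, str) for c in choices)):
        errors.append("input.choices must be a non-empty string array")

    answer = eval_fields.get("answer")
    answer_ok = _is_scalar(answer) or (
        isinstance(answer, list) and all(_is_scalar(a) for a in answer)
    )
    if not answer_ok:
        errors.append("eval.answer must be a string, number, or array of those")
    return errors
```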

short_answer

input: question (required), answer_format, context (string or object)

eval: accepted_answers (required non-empty array), tolerance (non-negative number)

free_response

input: prompt (required), context (string or object)

eval: rubric (required string or object at row validation), reference_answer

At verify time, the verifier requires rubric to be an object with type: "contains_any", a non-empty accepted_answers, optional rejected_answers, and optional min_token_f1 (between 0 and 1).
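Because a string rubric passes row validation but not the verifier, it can help to check the stricter shape early. The sketch below checks only the rubric's shape as listed above; it does not reproduce the verifier's scoring, and the function name and messages are hypothetical.

```python
def check_contains_any_rubric(rubric) -> list[str]:
    """Return problems that would stop a rubric from meeting the verifier's shape."""
    if not isinstance(rubric, dict):
        return ["rubric must be an object at verify time"]

    errors = []
    if rubric.get("type") != "contains_any":
        errors.append('rubric.type must be "contains_any"')

    accepted = rubric.get("accepted_answers")
    if not (isinstance(accepted, list) and accepted):
        errors.append("rubric.accepted_answers must be a non-empty array")

    min_f1 = rubric.get("min_token_f1")
    if min_f1 is not None:
        is_number = isinstance(min_f1, (int, float)) and not isinstance(min_f1, bool)
        if not is_number or not 0 <= min_f1 <= 1:
            errors.append("rubric.min_token_f1 must be between 0 and 1")
    return errors
```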

code_completion

input: prompt (required), language, starter_code

eval: tests (required object with source: "inline" and code at verify time), reference_solution, canonical_solution

repo_patch

input: repo, base_commit, instructions (required), hints

eval: tests (required command object), gold_patch

tests object: source must be "command"; requires command; optional workdir, timeout_seconds, setup_patch, test_patch, candidate_policy

patch object: source: inline, patch: string

candidate_policy: allow_paths, allow_sensitive_paths (string arrays)
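To show how these pieces might nest, here is a hypothetical repo_patch row written as a Python dict, with a small check of the two hard requirements on tests. Every value is a placeholder. Treating command as a single string and using the patch object shape for test_patch and gold_patch are assumptions, not confirmed by this reference.

```python
# Hypothetical row; all values are placeholders for illustration only.
example_row = {
    "id": "example-001",
    "family": "repo_patch",
    "input": {"instructions": "Make the failing test pass."},
    "eval": {
        "tests": {
            "source": "command",
            "command": "pytest -q",  # assumption: command given as one string
            "timeout_seconds": 600,
            "test_patch": {"source": "inline", "patch": "..."},
            "candidate_policy": {"allow_paths": ["src/"]},
        },
        "gold_patch": {"source": "inline", "patch": "..."},
    },
}


def check_tests_object(tests: dict) -> list[str]:
    """Check the two documented requirements: source and command."""
    problems = []
    if tests.get("source") != "command":
        problems.append('tests.source must be "command"')
    if not tests.get("command"):
        problems.append("tests.command is required")
    return problems
```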

terminal_task

input: instructions (required), context

eval: checker (required object), run_tests, test_files, expected_state, needed_commands

checker: command (string or string array), optional workdir, timeout_seconds

needed_commands: array of known dangerous command names (chroot)

Compiled resource visibility

See Tasks & Evaluators for the full EVAL_VISIBILITY table.

Internal task spec (advanced)

Compiled tasks can also be built via task_from_spec() for tests:

{
  "id": "...",
  "benchmark_id": "...",
  "task_type": "...",
  "metadata": {},
  "resources": {
    "question": {"value": "...", "visibility": "public"}
  }
}

The removed field expose on resources raises ValueError if present.
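The behavior described above can be imitated with a standalone check. This is not the implementation of task_from_spec(); reject_expose is a hypothetical helper that only mirrors the documented outcome of raising ValueError when a resource still carries expose.

```python
def reject_expose(spec: dict) -> None:
    """Raise ValueError if any resource in the spec uses the removed `expose` field."""
    for name, resource in spec.get("resources", {}).items():
        if isinstance(resource, dict) and "expose" in resource:
            raise ValueError(
                f"resource {name!r}: 'expose' was removed; use 'visibility' instead"
            )
```

The hint to use visibility in the message is based on the visibility key shown in the spec example, and is a guess about the intended replacement.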

Error types

| Error | When |
| --- | --- |
| ConfigError | Manifest/row/tester parsing |
| MaterializationError | Unsafe materialization plans |
| PathPolicyError | Path policy violations |