Benchmark Schema Reference
Manifest
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Benchmark pack identifier |
| version | integer | yes | Pack version (not semver; integer counter) |
| defaults.family | string | no | Default row family |
| defaults.environment | object | no | Merged into each row |
| asset_roots.public | string | no | Default assets/ |
| asset_roots.eval | string | no | Default hidden/ |
| asset_defaults.read_only | boolean | no | Default true |
Asset root paths must be relative POSIX paths without .. or backslashes.
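The path rule above can be sketched as a small check. This is a minimal illustration, not the pack loader's actual code; the function name `is_safe_asset_root` is hypothetical, and treating an empty string as invalid is an assumption.

```python
from pathlib import PurePosixPath


def is_safe_asset_root(path: str) -> bool:
    """Return True for a relative POSIX path with no '..' and no backslashes.

    Hypothetical helper: rejecting the empty string is an assumption.
    """
    if not path or "\\" in path:
        return False
    parsed = PurePosixPath(path)
    if parsed.is_absolute():
        return False
    # Reject any parent-directory component, wherever it appears.
    return ".." not in parsed.parts
```

For example, `is_safe_asset_root("assets/")` is accepted, while `"../assets"`, `"/assets"`, and `"assets\\public"` are rejected.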
JSONL row
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Task ID |
| family | string | conditional | Required if not in manifest defaults |
| input | object | no | Public fields |
| eval | object | no | Scoring fields |
| assets | object[] | no | Public file assets |
| environment | object | no | Overrides manifest environment |
| metadata | object | no | Opaque metadata |
Each JSONL line must be a single JSON object. Blank lines are skipped.
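A reader that follows these two rules might look like the sketch below. The function name `load_rows` is hypothetical, and raising `ValueError` is an assumption made to keep the example self-contained; the real loader presumably reports such problems as a `ConfigError`.

```python
import json


def load_rows(text: str) -> list[dict]:
    """Parse JSONL text, skipping blank lines and requiring one object per line."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped
        value = json.loads(line)
        if not isinstance(value, dict):
            # Assumption: ValueError stands in for the library's own error type.
            raise ValueError(f"line {line_number}: expected a JSON object")
        rows.append(value)
    return rows
```

Line numbers in the error message count blank lines too, so they match what an editor shows.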
Asset object (public assets[])
Common fields used in packs:
| Field | Type | Description |
|---|---|---|
| path | string | Relative to asset_roots.public or eval root |
| mount | string | Workspace destination path |
| read_only | boolean | Override default read-only behavior |
Planned: shared asset-object validation for type, mode, etc. (docs/next.md).
Environment object
| Field | Type | Families | Description |
|---|---|---|---|
| image | string | container-backed | Docker image reference |
| workdir | string | terminal_task, repo_patch | Container working directory |
| timeout_seconds | number | all | Positive timeout |
| materialize_workdir_from_image | boolean | terminal_task | Copy image workdir to host workspace before harness |
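To illustrate how a row's environment relates to the manifest's defaults.environment, here is a minimal sketch. It assumes a shallow, key-by-key merge in which row values win; the reference does not say whether nested objects are merged deeply. The function name `resolve_environment` is hypothetical.

```python
def resolve_environment(manifest_env: dict, row_env: dict) -> dict:
    """Merge manifest defaults with row overrides (assumed shallow; row wins)."""
    merged = {**manifest_env, **row_env}
    timeout = merged.get("timeout_seconds")
    if timeout is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(timeout, (int, float)) and not isinstance(timeout, bool)
        if not is_number or timeout <= 0:
            raise ValueError("timeout_seconds must be a positive number")
    return merged
```

With manifest defaults `{"image": "python:3.12", "timeout_seconds": 60}` and a row environment `{"timeout_seconds": 120}`, the result keeps the image and uses the row's timeout. The image name is a placeholder.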
Family row schemas
Validators reject unknown fields in input and eval for active families.
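The unknown-field rule can be pictured as a set difference against each family's allowed keys. The sketch below is illustrative only: the `ALLOWED` table and the `unknown_fields` helper are hypothetical, and the field sets are copied from the multiple_choice and short_answer entries in this reference.

```python
# Hypothetical allow-list, transcribed from this reference for two families.
ALLOWED = {
    "multiple_choice": {
        "input": {"question", "choices"},
        "eval": {"answer"},
    },
    "short_answer": {
        "input": {"question", "answer_format", "context"},
        "eval": {"accepted_answers", "tolerance"},
    },
}


def unknown_fields(row: dict) -> list[str]:
    """List dotted names of input/eval keys the row's family does not define."""
    allowed = ALLOWED[row["family"]]
    found = []
    for section in ("input", "eval"):
        extra = set(row.get(section, {})) - allowed[section]
        found.extend(f"{section}.{name}" for name in sorted(extra))
    return found
```

A multiple_choice row whose input carries an extra `hint` key would yield `["input.hint"]`, which a validator could then turn into a rejection.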
multiple_choice
input: question (required), choices (required non-empty string array)
eval: answer (required string, number, or string/number array)
short_answer
input: question (required), answer_format, context (string or object)
eval: accepted_answers (required non-empty array), tolerance (non-negative number)
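As a rough picture of how accepted_answers and tolerance could interact, the sketch below compares answers exactly first and then numerically. The matching semantics are an assumption (absolute tolerance, applied only when both values parse as numbers); the reference only states that tolerance is a non-negative number. The function name `matches_short_answer` is hypothetical.

```python
def matches_short_answer(candidate: str, accepted_answers: list, tolerance: float = 0.0) -> bool:
    """Assumed semantics: exact string match, else numeric match within tolerance."""
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    for accepted in accepted_answers:
        if str(accepted).strip() == candidate.strip():
            return True
        try:
            # Assumption: tolerance is an absolute difference.
            if abs(float(candidate) - float(accepted)) <= tolerance:
                return True
        except (TypeError, ValueError):
            continue  # not numeric; only the exact comparison applies
    return False
```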
free_response
input: prompt (required), context (string or object)
eval: rubric (required string or object at row validation), reference_answer
Verifier requires rubric object with type: "contains_any", non-empty accepted_answers, optional rejected_answers, optional min_token_f1 (0–1)
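The verifier-side rubric requirements can be checked with a short shape test. This sketch only validates structure as described above; it does not attempt to reproduce how the verifier scores a response. The function name `check_rubric` is hypothetical.

```python
def check_rubric(rubric: dict) -> list[str]:
    """Return a list of problems with a contains_any rubric; empty means OK."""
    problems = []
    if rubric.get("type") != "contains_any":
        problems.append('type must be "contains_any"')
    accepted = rubric.get("accepted_answers")
    if not isinstance(accepted, list) or not accepted:
        problems.append("accepted_answers must be a non-empty array")
    rejected = rubric.get("rejected_answers")
    if rejected is not None and not isinstance(rejected, list):
        problems.append("rejected_answers must be an array when present")
    f1 = rubric.get("min_token_f1")
    if f1 is not None:
        is_number = isinstance(f1, (int, float)) and not isinstance(f1, bool)
        if not is_number or not 0 <= f1 <= 1:
            problems.append("min_token_f1 must be a number between 0 and 1")
    return problems
```

Collecting every problem, rather than stopping at the first, makes it easier to fix a row in one pass.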
code_completion
input: prompt (required), language, starter_code
eval: tests (required object; at verify time it must have source: "inline" and code), reference_solution, canonical_solution
repo_patch
input: repo, base_commit, instructions (required), hints
eval: tests (required command object), gold_patch
tests object: source must be "command"; requires command; optional workdir, timeout_seconds, setup_patch, test_patch, candidate_policy
patch object: source: inline, patch: string
candidate_policy: allow_paths, allow_sensitive_paths (string arrays)
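Putting the repo_patch fields together, a row might be assembled as in the sketch below. Every value is a placeholder, and two details are assumptions the reference does not confirm: that command is given as a single string, and that allow_paths entries are directory prefixes.

```python
import json

# Hypothetical repo_patch row; all values are placeholders.
row = {
    "id": "example-001",
    "family": "repo_patch",
    "input": {"instructions": "Fix the failing unit test."},
    "eval": {
        "tests": {
            "source": "command",       # must be "command" for repo_patch
            "command": "pytest -q",    # assumption: a single string
            "timeout_seconds": 300,
            "test_patch": {"source": "inline", "patch": "..."},
            "candidate_policy": {"allow_paths": ["src/"]},  # assumption: prefixes
        }
    },
}

# One row per line in the JSONL file.
line = json.dumps(row)
```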
terminal_task
input: instructions (required), context
eval: checker (required object), run_tests, test_files, expected_state, needed_commands
checker: command (string or string array), optional workdir, timeout_seconds
needed_commands: array of known dangerous command names (chroot)
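The checker shape can be validated along the lines below. The helper name `check_checker` is hypothetical, and two rules are assumptions rather than documented behavior: that an empty command is rejected, and that the checker's timeout_seconds must be positive like the environment-level one.

```python
def check_checker(checker: dict) -> list[str]:
    """Return a list of problems with a terminal_task checker; empty means OK."""
    problems = []
    command = checker.get("command")
    is_string = isinstance(command, str) and command.strip() != ""
    is_argv = (
        isinstance(command, list)
        and len(command) > 0
        and all(isinstance(part, str) for part in command)
    )
    if not (is_string or is_argv):
        problems.append("command must be a string or an array of strings")
    workdir = checker.get("workdir")
    if workdir is not None and not isinstance(workdir, str):
        problems.append("workdir must be a string when present")
    timeout = checker.get("timeout_seconds")
    if timeout is not None:
        is_number = isinstance(timeout, (int, float)) and not isinstance(timeout, bool)
        if not is_number or timeout <= 0:
            problems.append("timeout_seconds must be a positive number")
    return problems
```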
Compiled resource visibility
See Tasks & Evaluators for the full EVAL_VISIBILITY table.
Internal task spec (advanced)
Compiled tasks can also be built via task_from_spec() for tests:
{
  "id": "...",
  "benchmark_id": "...",
  "task_type": "...",
  "metadata": {},
  "resources": {
    "question": {"value": "...", "visibility": "public"}
  }
}

The removed field expose on resources raises ValueError if present.
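The behavior around the removed expose field can be mimicked in a test helper. The sketch below is not the body of task_from_spec(); `reject_removed_expose` is a hypothetical stand-in that shows the kind of check being described.

```python
def reject_removed_expose(spec: dict) -> None:
    """Raise ValueError if any resource still carries the removed 'expose' key."""
    for name, resource in spec.get("resources", {}).items():
        if isinstance(resource, dict) and "expose" in resource:
            raise ValueError(f"resource {name!r}: 'expose' is no longer supported")


# A spec shaped like the example above passes the check.
spec = {
    "id": "...",
    "benchmark_id": "...",
    "task_type": "...",
    "metadata": {},
    "resources": {"question": {"value": "...", "visibility": "public"}},
}
reject_removed_expose(spec)
```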
Error types
| Error | When |
|---|---|
| ConfigError | Manifest/row/tester parsing |
| MaterializationError | Unsafe materialization plans |
| PathPolicyError | Path policy violations |