Benchmark Packs
A benchmark pack is a manifest YAML file plus a JSONL task row file. Benchmark authors define tasks here; SecureBench compiles rows into internal SecureBenchTask objects with visibility-separated resources.
Pack structure
my-benchmark/
├── manifest.yaml # shared defaults, asset roots
├── tasks.jsonl # one JSON object per line
├── assets/ # public file-backed assets (default)
└── hidden/ # evaluation/hidden file-backed assets (default)
The tester YAML references the pack:
benchmark:
  manifest: manifest.yaml
  tasks: tasks.jsonl
Paths are resolved relative to the tester YAML file location.
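As a minimal sketch of the resolution rule above, the snippet below joins the two `benchmark` entries onto the directory that holds the tester YAML. The function name `resolve_pack_paths` and the `configs/my-benchmark/tester.yaml` location are hypothetical and are not part of SecureBench.

```python
from pathlib import Path


def resolve_pack_paths(tester_yaml: str, benchmark: dict) -> dict:
    """Resolve benchmark.manifest and benchmark.tasks against the tester YAML's directory.

    Hypothetical helper: it only illustrates the rule, not SecureBench's loader.
    """
    base = Path(tester_yaml).parent
    return {key: base / benchmark[key] for key in ("manifest", "tasks")}


paths = resolve_pack_paths(
    "configs/my-benchmark/tester.yaml",  # assumed location of the tester YAML
    {"manifest": "manifest.yaml", "tasks": "tasks.jsonl"},
)
print(paths["manifest"].as_posix())
print(paths["tasks"].as_posix())
```

Moving the tester YAML therefore changes where the pack is looked up, even if the `benchmark` block itself is unchanged.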
Manifest schema
id: my-benchmark # required, non-empty string
version: 1 # required integer
defaults:
  family: code_completion # optional default for rows omitting family
  environment: # merged into each row's environment
    image: python:3.11-slim
    timeout_seconds: 120
asset_roots:
  public: assets/ # relative path, no .. or backslashes
  eval: hidden/
asset_defaults:
  read_only: true # default for file-backed assets
Environment fields (per row or manifest default)
Used by container harnesses and verifiers:
| Field | Purpose |
|---|---|
| image | Docker image for harness and verifier sandboxes |
| workdir | Container working directory; also workspace mount target for terminal_task |
| timeout_seconds | Task-level timeout budget |
| materialize_workdir_from_image | terminal_task only: copy image workdir to host workspace before harness |
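The manifest rules and the environment merge described above can be sketched as follows. This is an illustration under stated assumptions, not SecureBench's code: `ConfigError` is a local stand-in, the helper names are hypothetical, and the merge is assumed to be shallow with row values winning over manifest defaults.

```python
class ConfigError(ValueError):
    """Local stand-in for SecureBench's ConfigError (definition assumed)."""


def check_asset_root(value: str) -> str:
    # Asset roots must be relative and contain no ".." segments or backslashes.
    if value.startswith("/") or "\\" in value or ".." in value.split("/"):
        raise ConfigError(f"invalid asset root: {value!r}")
    return value


def validate_manifest(manifest: dict) -> dict:
    pack_id = manifest.get("id")
    if not isinstance(pack_id, str) or not pack_id.strip():
        raise ConfigError("id must be a non-empty string")
    version = manifest.get("version")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(version, int) or isinstance(version, bool):
        raise ConfigError("version must be an integer")
    for root in manifest.get("asset_roots", {}).values():
        check_asset_root(root)
    return manifest


def merge_environment(manifest: dict, row: dict) -> dict:
    # Assumption: shallow merge, row-level keys override manifest defaults.
    defaults = manifest.get("defaults", {}).get("environment", {})
    return {**defaults, **row.get("environment", {})}


manifest = validate_manifest({
    "id": "my-benchmark",
    "version": 1,
    "defaults": {"environment": {"image": "python:3.11-slim", "timeout_seconds": 120}},
    "asset_roots": {"public": "assets/", "eval": "hidden/"},
})
print(merge_environment(manifest, {"environment": {"timeout_seconds": 300}}))
```

A row that sets only `timeout_seconds` keeps the manifest's `image` in this sketch; whether SecureBench merges nested values more deeply is not stated here.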
Task row schema
Each JSONL line is one object:
{
  "id": "my-benchmark/task-1",
  "family": "multiple_choice",
  "input": { },
  "eval": { },
  "assets": [ ],
  "environment": { },
  "metadata": { }
}
| Field | Required | Description |
|---|---|---|
| id | yes | Unique task identifier |
| family | if not in manifest defaults | Benchmark family name |
| input | no | Public fields → compiled as public resources |
| eval | no | Scoring fields → compiled per family visibility map |
| assets | no | Public file references (see below) |
| environment | no | Merged over manifest defaults.environment |
| metadata | no | Opaque metadata preserved on compiled task |
Row parsing is strict: duplicate resource names after compilation raise ConfigError.
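To make the strictness concrete, here is a hedged sketch of loading rows and rejecting duplicate resource names. It assumes resource names are taken from the `input` keys, a single `assets` resource, and the `eval` keys; the real compiler may derive names differently. `ConfigError`, `load_rows`, and `resource_names` are hypothetical stand-ins.

```python
import json


class ConfigError(ValueError):
    """Local stand-in for SecureBench's ConfigError (definition assumed)."""


def load_rows(jsonl_text: str, default_family=None) -> list:
    rows, seen_ids = [], set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        row = json.loads(line)
        if not row.get("id"):
            raise ConfigError(f"line {line_no}: id is required")
        if row["id"] in seen_ids:
            raise ConfigError(f"line {line_no}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])
        # family falls back to the manifest default when the row omits it
        row.setdefault("family", default_family)
        if not row["family"]:
            raise ConfigError(f"line {line_no}: family is required")
        rows.append(row)
    return rows


def resource_names(row: dict) -> list:
    # Assumed naming scheme: input keys, then "assets", then eval keys.
    names = list(row.get("input", {}))
    if row.get("assets"):
        names.append("assets")
    names += list(row.get("eval", {}))
    seen = set()
    for name in names:
        if name in seen:
            raise ConfigError(f"duplicate resource name: {name!r}")
        seen.add(name)
    return names


rows = load_rows(
    '{"id": "my-benchmark/task-1"}\n'
    '{"id": "my-benchmark/task-2", "family": "short_answer"}\n',
    default_family="multiple_choice",
)
print([row["family"] for row in rows])
```

Under this naming assumption, a row that uses the same key in both `input` and `eval` would be rejected rather than silently overwritten.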
Assets
Public assets attach files from the pack directory into the agent workspace:
{
  "path": "path-tracing/image.ppm",
  "mount": "image.ppm",
  "read_only": false
}
- path: relative to asset_roots.public
- mount: destination path in the workspace (validated by path policy)
- read_only: optional; defaults to manifest asset_defaults.read_only
Evaluation file references use objects in eval.* with path (under asset_roots.eval) and mount (workspace path). The compiler treats file-reference-shaped eval values as file-backed resources for the appropriate component.
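The following sketch shows one way to recognise a file-reference-shaped value and resolve it under an asset root. It assumes that "file-reference-shaped" means an object with string `path` and `mount` keys; the function names are hypothetical, and the workspace path policy applied to `mount` is not modelled.

```python
from pathlib import PurePosixPath


def is_file_reference(value) -> bool:
    # Assumed shape: an object carrying string "path" and "mount" keys.
    return (
        isinstance(value, dict)
        and isinstance(value.get("path"), str)
        and isinstance(value.get("mount"), str)
    )


def resolve_asset(root: str, ref: dict, default_read_only: bool = True) -> dict:
    rel = PurePosixPath(ref["path"])
    # Keep the reference inside its asset root.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"asset path escapes its root: {ref['path']!r}")
    return {
        "source": (PurePosixPath(root) / rel).as_posix(),
        "mount": ref["mount"],
        # Falls back to the manifest's asset_defaults.read_only value.
        "read_only": ref.get("read_only", default_read_only),
    }


public_ref = {"path": "path-tracing/image.ppm", "mount": "image.ppm", "read_only": False}
print(resolve_asset("assets/", public_ref))
print(is_file_reference({"source": "inline", "code": "..."}))
```

The same helper could be pointed at `hidden/` for evaluation references, which keeps public and hidden files under separate roots.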
Active benchmark families
These families have schema validators in families/:
| Family | Candidate kind | Workspace required | Verifier |
|---|---|---|---|
| multiple_choice | text | no | yes |
| short_answer | text | no | yes |
| free_response | text | no | yes |
| code_completion | code | no | yes |
| repo_patch | patch | yes | yes |
| terminal_task | workspace | yes | yes |
Deferred families (compiler only)
Visibility mappings exist in benchmark_compiler.EVAL_VISIBILITY for:
tool_call, browser_task, desktop_task, artifact_task, multimodal_qa, preference_pair
Rows in these families load and compile but have no verifier until one is registered. Result records show verification_status: "pending".
Family input and eval fields (summary)
multiple_choice
- input: question, choices[]
- eval: answer (string, number, or array; hidden)
short_answer
- input: question, optional context, answer_format
- eval: accepted_answers[], optional tolerance (hidden)
free_response
- input: prompt, optional context
- eval: rubric (required string or object at row validation; verifier requires structured object), optional reference_answer (hidden)
Structured rubric for the current verifier:
"rubric": {
  "type": "contains_any",
  "accepted_answers": ["expected phrase or concept"],
  "rejected_answers": ["forbidden phrase"],
  "min_token_f1": 1.0
}
Only type: "contains_any" is implemented today.
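Because a string rubric passes row validation but cannot be scored by the current verifier, a pre-flight check along these lines may help catch the mismatch early. This is a sketch only: `rubric_ready_for_verifier` is a hypothetical helper, and it checks the shape of the rubric, not how `contains_any` is scored.

```python
SUPPORTED_RUBRIC_TYPES = {"contains_any"}  # the only type implemented today


def rubric_ready_for_verifier(rubric) -> tuple:
    """Return (ok, reason) for a free_response rubric value."""
    if isinstance(rubric, str):
        return False, "string rubric passes row validation but the verifier needs an object"
    if not isinstance(rubric, dict):
        return False, "rubric must be a string or an object"
    if rubric.get("type") not in SUPPORTED_RUBRIC_TYPES:
        return False, f"unsupported rubric type: {rubric.get('type')!r}"
    return True, "ok"


print(rubric_ready_for_verifier("Mentions the expected concept."))
print(rubric_ready_for_verifier({
    "type": "contains_any",
    "accepted_answers": ["expected phrase or concept"],
}))
```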
code_completion
- input: prompt, optional language (Python only at verify time), optional starter_code
- eval: tests (required object), optional reference_solution, canonical_solution (hidden)
Tests must use inline source at verification time:
"tests": { "source": "inline", "code": "def check(): ..." }
Inline tests example:
"tests": {
  "source": "inline",
  "code": "def check():\n assert add_numbers(2, 3) == 5\n\ncheck()\n"
}
repo_patch
- input: repo, base_commit, instructions, optional hints
- eval: tests (command-based), optional gold_patch (hidden)
Command tests object:
"tests": {
  "source": "command",
  "command": "pytest -q",
  "workdir": "/testbed",
  "timeout_seconds": 300,
  "setup_patch": { "source": "inline", "patch": "..." },
  "test_patch": { "source": "inline", "patch": "..." },
  "candidate_policy": {
    "allow_paths": ["src/**"],
    "allow_sensitive_paths": []
  }
}
terminal_task
- input: instructions, optional context
- eval: checker (required), optional run_tests, test_files, expected_state, needed_commands
Checker object:
"checker": {
  "command": "bash /app/securebench/evaluation_inputs/run-tests.sh",
  "workdir": "/app",
  "timeout_seconds": 180
}
needed_commands lists verifier-only dangerous commands (currently chroot). See Tester Configuration → Verification policy.
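Putting the terminal_task fields above together, a row could be assembled and serialized as one JSONL line roughly like this. The task id and instructions are hypothetical, the checker and image values are copied from the examples on this page, and the sketch does not show how the checker script itself is supplied.

```python
import json


def to_jsonl_line(row: dict) -> str:
    # json.dumps never emits raw newlines by default, so each row stays on one line.
    return json.dumps(row, sort_keys=True)


row = {
    "id": "my-benchmark/terminal-1",  # hypothetical task id
    "family": "terminal_task",
    "input": {
        # hypothetical instructions shown to the agent
        "instructions": "Create /app/out.txt containing the word ok.",
    },
    "eval": {
        "checker": {
            "command": "bash /app/securebench/evaluation_inputs/run-tests.sh",
            "workdir": "/app",
            "timeout_seconds": 180,
        },
    },
    "environment": {
        "image": "python:3.11-slim",
        "workdir": "/app",
        "materialize_workdir_from_image": True,
    },
}

line = to_jsonl_line(row)
print(line)
```

Appending `line` plus a newline to tasks.jsonl adds the task to the pack.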
Compilation pipeline
BenchmarkRow
  → validate_benchmark_row_family() # active families only
  → map input.* → public resources
  → map assets → public "assets" resource
  → map eval.* → visibility via eval_visibility_for()
  → task_from_spec() → SecureBenchTask
Compiled metadata includes:
"benchmark_pack": {
  "id": "...",
  "version": 1,
  "manifest_path": "..."
},
"environment": { },
"asset_roots": { "public": "assets/", "eval": "hidden/" },
"asset_defaults": { "read_only": true }
Included example packs
| Directory | Family | Notes |
|---|---|---|
| benchmarks/code-completion-smoke/ | code_completion | Minimal Codex smoke |
| benchmarks/humaneval-mini/ | code_completion | HumanEval subset |
| benchmarks/mmlu-abstract-algebra-mini/ | multiple_choice | MMLU subset |
| benchmarks/terminal-task-smoke/ | terminal_task | Command harness smoke |
| benchmarks/terminal-bench-first10/ | terminal_task | Real Terminal-Bench tasks; build images |
| benchmarks/swe-bench-verified-codex-smoke/ | repo_patch | SWE-bench style |
Authoring guidelines
- Never put hidden answers in input or public assets.
- Keep eval file references under hidden/ (or your configured asset_roots.eval).
- Set environment.image for any family that runs in Docker (code completion, repo patch, terminal task).
- Use an explicit candidate_policy for repo_patch tasks instead of relying only on default deny rules.
- For terminal tasks with image-baked starter files, use materialize_workdir_from_image: true and do not store secrets in the image workdir.
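Two of the guidelines above lend themselves to a mechanical check before a pack is committed. The sketch below is a hypothetical author-side lint, not a SecureBench feature; `lint_row` and its warning strings are made up for illustration, and the environment merge is assumed to be shallow.

```python
DOCKER_FAMILIES = {"code_completion", "repo_patch", "terminal_task"}


def lint_row(row: dict, manifest_defaults=None) -> list:
    """Return guideline warnings for one task row (hypothetical helper)."""
    defaults = manifest_defaults or {}
    family = row.get("family") or defaults.get("family")
    # Assumption: row environment is merged over manifest defaults.environment.
    env = {**defaults.get("environment", {}), **row.get("environment", {})}
    warnings = []
    if family in DOCKER_FAMILIES and not env.get("image"):
        warnings.append("environment.image is not set")
    if family == "repo_patch":
        tests = row.get("eval", {}).get("tests")
        if not isinstance(tests, dict) or "candidate_policy" not in tests:
            warnings.append("eval.tests.candidate_policy is not explicit")
    return warnings


print(lint_row({"id": "my-benchmark/task-1", "family": "terminal_task"}))
```

Checks such as "no hidden answers in input" still need human review, since they depend on what the fields contain rather than on their shape.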
Canonical field-level documentation: SecureBench repo docs/benchmark-family-standard.html and docs/benchmark-family-examples.html.
Related reading
- Tasks & Evaluators — candidate extraction and verification behavior
- Reference → Benchmark schema — field tables and validation rules