Benchmark Packs

A benchmark pack consists of a manifest YAML file and a JSONL file of task rows. Benchmark authors define tasks in these two files; SecureBench compiles each row into an internal SecureBenchTask object with visibility-separated resources.

Pack structure

my-benchmark/
├── manifest.yaml   # shared defaults, asset roots
├── tasks.jsonl     # one JSON object per line
├── assets/         # public file-backed assets (default)
└── hidden/         # evaluation/hidden file-backed assets (default)

Tester YAML references the pack:

benchmark:
  manifest: manifest.yaml
  tasks: tasks.jsonl

Paths are resolved relative to the tester YAML file location.
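
The resolution rule can be sketched as follows. This is a minimal illustration, not SecureBench's loader: `resolve_pack_paths` and the `configs/my-tester.yaml` location are hypothetical names chosen for the example.

```python
from pathlib import Path


def resolve_pack_paths(tester_yaml: str, benchmark: dict) -> dict:
    """Resolve pack paths against the directory that holds the tester YAML.

    Hypothetical helper for illustration; the real loader may differ.
    """
    base = Path(tester_yaml).parent
    return {key: base / benchmark[key] for key in ("manifest", "tasks")}


paths = resolve_pack_paths(
    "configs/my-tester.yaml",  # hypothetical tester YAML location
    {"manifest": "manifest.yaml", "tasks": "tasks.jsonl"},
)
print(paths["manifest"].as_posix())  # configs/manifest.yaml
```

Moving the tester YAML therefore changes where the pack is looked up, so it is usually simplest to keep the tester file next to the pack or to reference the pack with a relative subdirectory.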

Manifest schema

id: my-benchmark            # required, non-empty string
version: 1                  # required integer
defaults:
  family: code_completion   # optional default for rows omitting family
  environment:              # merged into each row's environment
    image: python:3.11-slim
    timeout_seconds: 120
asset_roots:
  public: assets/           # relative path, no .. or backslashes
  eval: hidden/
asset_defaults:
  read_only: true           # default for file-backed assets
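
The constraints noted in the manifest comments can be checked before a pack is handed to SecureBench. The sketch below is a hypothetical pre-flight check, not the library's validator: `check_manifest` is an invented name, and it assumes `asset_roots` is a top-level manifest key and that the manifest has already been parsed into a dict.

```python
def check_manifest(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    pack_id = manifest.get("id")
    if not isinstance(pack_id, str) or not pack_id.strip():
        problems.append("id must be a non-empty string")

    version = manifest.get("version")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(version, int) or isinstance(version, bool):
        problems.append("version must be an integer")

    for name, root in manifest.get("asset_roots", {}).items():
        bad = (
            not isinstance(root, str)
            or root.startswith("/")      # must be relative
            or "\\" in root              # no backslashes
            or ".." in root.split("/")   # no parent-directory segments
        )
        if bad:
            problems.append(
                f"asset_roots.{name} must be a relative path "
                "without '..' or backslashes"
            )
    return problems


print(check_manifest({
    "id": "my-benchmark",
    "version": 1,
    "asset_roots": {"public": "assets/", "eval": "hidden/"},
}))  # []
```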

Environment fields (per row or manifest default)

Used by container harnesses and verifiers:

| Field | Purpose |
| --- | --- |
| image | Docker image for harness and verifier sandboxes |
| workdir | Container working directory; also workspace mount target for terminal_task |
| timeout_seconds | Task-level timeout budget |
| materialize_workdir_from_image | terminal_task only: copy image workdir to host workspace before harness |

Task row schema

Each JSONL line is one object:

{
  "id": "my-benchmark/task-1",
  "family": "multiple_choice",
  "input": { },
  "eval": { },
  "assets": [ ],
  "environment": { },
  "metadata": { }
}

| Field | Required | Description |
| --- | --- | --- |
| id | yes | Unique task identifier |
| family | if not in manifest defaults | Benchmark family name |
| input | no | Public fields → compiled as public resources |
| eval | no | Scoring fields → compiled per family visibility map |
| assets | no | Public file references (see below) |
| environment | no | Merged over manifest defaults.environment |
| metadata | no | Opaque metadata preserved on compiled task |

Row parsing is strict: duplicate resource names after compilation raise ConfigError.
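
How rows pick up manifest defaults can be sketched as below. This is an illustration under stated assumptions, not SecureBench's loader: `load_rows` is a hypothetical helper, the environment merge is assumed to be shallow (row keys override default keys), and the duplicate-id check is an assumption that follows from `id` being described as unique.

```python
import json


def load_rows(jsonl_text: str, defaults: dict) -> list[dict]:
    """Parse JSONL task rows and apply manifest defaults (sketch only)."""
    rows, seen_ids = [], set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # this sketch tolerates blank lines
        row = json.loads(line)
        if row["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {row['id']!r}")
        seen_ids.add(row["id"])

        # family falls back to the manifest default when the row omits it.
        row.setdefault("family", defaults.get("family"))
        if row["family"] is None:
            raise ValueError(f"line {line_no}: no family and no manifest default")

        # Assumed shallow merge: row values win over manifest defaults.
        row["environment"] = {
            **defaults.get("environment", {}),
            **row.get("environment", {}),
        }
        rows.append(row)
    return rows


defaults = {
    "family": "code_completion",
    "environment": {"image": "python:3.11-slim", "timeout_seconds": 120},
}
text = (
    '{"id": "my-benchmark/task-1", "environment": {"timeout_seconds": 300}}\n'
    '{"id": "my-benchmark/task-2", "family": "multiple_choice"}\n'
)
rows = load_rows(text, defaults)
print(rows[0]["family"], rows[0]["environment"]["timeout_seconds"])  # code_completion 300
```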

Assets

Public assets attach files from the pack directory into the agent workspace:

{ "path": "path-tracing/image.ppm", "mount": "image.ppm", "read_only": false }
  • path — relative to asset_roots.public
  • mount — destination path in the workspace (validated by path policy)
  • read_only — optional; defaults to manifest asset_defaults.read_only

Evaluation file references use objects in eval.* with path (under asset_roots.eval) and mount (workspace path). The compiler treats file-reference-shaped eval values as file-backed resources for the appropriate component.
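
The asset fields above can be illustrated with a small sketch. `resolve_asset` is a hypothetical helper, not a SecureBench function; it only shows how `path` is joined to `asset_roots.public` and how `read_only` falls back to `asset_defaults.read_only`. Mount-path policy checks are not modeled.

```python
from pathlib import PurePosixPath


def resolve_asset(asset: dict, asset_roots: dict, asset_defaults: dict) -> dict:
    """Expand one public asset entry (illustrative only)."""
    source = PurePosixPath(asset_roots["public"]) / asset["path"]
    return {
        "source": source.as_posix(),
        "mount": asset["mount"],
        # The row-level value wins; otherwise use the manifest default.
        "read_only": asset.get("read_only", asset_defaults["read_only"]),
    }


roots = {"public": "assets/", "eval": "hidden/"}
asset_defaults = {"read_only": True}

writable = resolve_asset(
    {"path": "path-tracing/image.ppm", "mount": "image.ppm", "read_only": False},
    roots,
    asset_defaults,
)
defaulted = resolve_asset(
    {"path": "path-tracing/image.ppm", "mount": "image.ppm"},
    roots,
    asset_defaults,
)
print(writable["source"], writable["read_only"], defaulted["read_only"])
# assets/path-tracing/image.ppm False True
```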

Active benchmark families

These families have schema validators in families/:

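| Family | Candidate kind | Workspace required | Verifier |
| --- | --- | --- | --- |
| multiple_choice | text | no | yes |
| short_answer | text | no | yes |
| free_response | text | no | yes |
| code_completion | code | no | yes |
| repo_patch | patch | yes | yes |
| terminal_task | workspace | yes | yes |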

Deferred families (compiler only)

Visibility mappings exist in benchmark_compiler.EVAL_VISIBILITY for:

tool_call, browser_task, desktop_task, artifact_task, multimodal_qa, preference_pair

Rows in these families load and compile but have no verifier until one is registered. Result records show verification_status: "pending".

Family input and eval fields (summary)

multiple_choice

  • input: question, choices[]
  • eval: answer (string, number, or array — hidden)

short_answer

  • input: question, optional context, answer_format
  • eval: accepted_answers[], optional tolerance (hidden)

free_response

  • input: prompt, optional context
  • eval: rubric (required string or object at row validation; verifier requires structured object), optional reference_answer (hidden)

Structured rubric for the current verifier:

"rubric": {
  "type": "contains_any",
  "accepted_answers": ["expected phrase or concept"],
  "rejected_answers": ["forbidden phrase"],
  "min_token_f1": 1.0
}

Only type: "contains_any" is implemented today.

code_completion

  • input: prompt, optional language (Python only at verify time), optional starter_code
  • eval: tests (required object), optional reference_solution, canonical_solution (hidden)

Tests must use inline source at verification time:

"tests": { "source": "inline", "code": "def check(): ..." }

Inline tests example:

"tests": { "source": "inline", "code": "def check():\n assert add_numbers(2, 3) == 5\n\ncheck()\n" }

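To see what an inline test asserts, the sketch below runs candidate code and the inline test code in one shared namespace. It is a local illustration only: `run_inline_tests` is a hypothetical helper, and in-process `exec` is not how untrusted candidates should be run. The verifier uses a Docker sandbox for that.

```python
def run_inline_tests(candidate_code: str, tests: dict) -> bool:
    """Return True if the inline test code runs without raising.

    Illustration only: never exec untrusted code outside a sandbox.
    """
    if tests.get("source") != "inline":
        raise ValueError("this sketch only handles inline tests")
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # defines the candidate's functions
        exec(tests["code"], namespace)   # runs the assertions against them
    except Exception:
        return False
    return True


tests = {
    "source": "inline",
    "code": "def check():\n assert add_numbers(2, 3) == 5\n\ncheck()\n",
}
good = "def add_numbers(a, b):\n    return a + b\n"
bad = "def add_numbers(a, b):\n    return a - b\n"

print(run_inline_tests(good, tests))  # True
print(run_inline_tests(bad, tests))   # False
```

Note that the test code must call its own entry point (`check()` here); defining the function without calling it would pass trivially in this sketch.
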
repo_patch

  • input: repo, base_commit, instructions, optional hints
  • eval: tests (command-based), optional gold_patch (hidden)

Command tests object:

"tests": {
  "source": "command",
  "command": "pytest -q",
  "workdir": "/testbed",
  "timeout_seconds": 300,
  "setup_patch": { "source": "inline", "patch": "..." },
  "test_patch": { "source": "inline", "patch": "..." },
  "candidate_policy": {
    "allow_paths": ["src/**"],
    "allow_sensitive_paths": []
  }
}

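The effect of `candidate_policy.allow_paths` in the command tests object can be approximated as below. This is a sketch under assumptions, not the verifier's matching logic: `rejected_paths` is a hypothetical helper, and it uses Python's `fnmatch`, whose `*` also matches `/`, as a stand-in for whatever glob rules SecureBench applies. `allow_sensitive_paths` is not modeled.

```python
from fnmatch import fnmatchcase


def rejected_paths(changed_paths: list[str], candidate_policy: dict) -> list[str]:
    """Return the changed paths that match no allow_paths pattern."""
    patterns = candidate_policy.get("allow_paths", [])
    return [
        path
        for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


policy = {"allow_paths": ["src/**"], "allow_sensitive_paths": []}
print(rejected_paths(["src/pkg/mod.py", "tests/test_mod.py"], policy))
# ['tests/test_mod.py']
```

A narrow allow list like this keeps a candidate patch from editing the tests it is scored against.
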
terminal_task

  • input: instructions, optional context
  • eval: checker (required), optional run_tests, test_files, expected_state, needed_commands

Checker object:

"checker": {
  "command": "bash /app/securebench/evaluation_inputs/run-tests.sh",
  "workdir": "/app",
  "timeout_seconds": 180
}

needed_commands lists verifier-only dangerous commands (currently chroot). See Tester Configuration → Verification policy.

Compilation pipeline

BenchmarkRow
  → validate_benchmark_row_family()   # active families only
  → map input.* → public resources
  → map assets → public "assets" resource
  → map eval.* → visibility via eval_visibility_for()
  → task_from_spec() → SecureBenchTask
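
The mapping steps of the pipeline can be sketched in a few lines. Everything here is a simplified stand-in: the `ConfigError` class, the `EVAL_VISIBILITY` table, and `compile_row` are hypothetical, resource names are assumed to equal field names, and family validation and `task_from_spec()` are omitted.

```python
class ConfigError(Exception):
    """Stand-in for SecureBench's ConfigError."""


# Hypothetical visibility map: family -> eval field -> visibility label.
EVAL_VISIBILITY = {"multiple_choice": {"answer": "hidden"}}


def compile_row(row: dict) -> dict:
    """Map a row's input, assets, and eval fields to named resources."""
    resources: dict = {}

    def add(name: str, value, visibility: str) -> None:
        if name in resources:
            raise ConfigError(f"duplicate resource name: {name}")
        resources[name] = {"value": value, "visibility": visibility}

    for name, value in row.get("input", {}).items():
        add(name, value, "public")
    if row.get("assets"):
        add("assets", row["assets"], "public")
    visibility = EVAL_VISIBILITY[row["family"]]
    for name, value in row.get("eval", {}).items():
        add(name, value, visibility[name])
    return resources


row = {
    "id": "my-benchmark/task-1",
    "family": "multiple_choice",
    "input": {"question": "2 + 2 = ?", "choices": ["3", "4"]},
    "eval": {"answer": "4"},
}
compiled = compile_row(row)
print(compiled["question"]["visibility"], compiled["answer"]["visibility"])
# public hidden
```

In this sketch a name collision, such as an `answer` key in both `input` and `eval`, raises `ConfigError`, which mirrors the strict duplicate-name rule for compiled resources.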

Compiled metadata includes:

"benchmark_pack": { "id": "...", "version": 1, "manifest_path": "..." },
"environment": { },
"asset_roots": { "public": "assets/", "eval": "hidden/" },
"asset_defaults": { "read_only": true }

Included example packs

| Directory | Family | Notes |
| --- | --- | --- |
| benchmarks/code-completion-smoke/ | code_completion | Minimal Codex smoke |
| benchmarks/humaneval-mini/ | code_completion | HumanEval subset |
| benchmarks/mmlu-abstract-algebra-mini/ | multiple_choice | MMLU subset |
| benchmarks/terminal-task-smoke/ | terminal_task | Command harness smoke |
| benchmarks/terminal-bench-first10/ | terminal_task | Real Terminal-Bench tasks; build images |
| benchmarks/swe-bench-verified-codex-smoke/ | repo_patch | SWE-bench style |

Authoring guidelines

  1. Never put hidden answers in input or public assets.
  2. Keep eval file references under hidden/ (or your configured asset_roots.eval).
  3. Set environment.image for any family that runs in Docker (code completion, repo patch, terminal task).
  4. Use explicit candidate_policy for repo_patch tasks instead of relying only on default deny rules.
  5. For terminal tasks with image-baked starter files, use materialize_workdir_from_image: true and do not store secrets in the image workdir.
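
Guidelines 3 and 4 lend themselves to an automated check. The sketch below is a hypothetical author-side lint, not part of SecureBench; `lint_row` is an invented name, and it assumes each row's `environment` has already been merged with the manifest defaults.

```python
DOCKER_FAMILIES = {"code_completion", "repo_patch", "terminal_task"}


def lint_row(row: dict) -> list[str]:
    """Return warnings for a row that ignores guideline 3 or 4."""
    warnings = []
    family = row.get("family")
    if family in DOCKER_FAMILIES and not row.get("environment", {}).get("image"):
        warnings.append(f"{row['id']}: environment.image is not set")
    if family == "repo_patch":
        tests = row.get("eval", {}).get("tests", {})
        if "candidate_policy" not in tests:
            warnings.append(f"{row['id']}: tests.candidate_policy is not explicit")
    return warnings


print(lint_row({
    "id": "my-benchmark/task-1",
    "family": "repo_patch",
    "environment": {},
    "eval": {"tests": {"source": "command", "command": "pytest -q"}},
}))
# ['my-benchmark/task-1: environment.image is not set',
#  'my-benchmark/task-1: tests.candidate_policy is not explicit']
```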

Canonical field-level documentation: SecureBench repo docs/benchmark-family-standard.html and docs/benchmark-family-examples.html.
