Benchmark packs

A benchmark pack is a manifest YAML file plus a JSONL task row file. SecureBench supports two agentic families: repo_patch and terminal_task.

Pack structure


my-benchmark/
├── manifest.yaml       # shared defaults and asset roots
├── tasks.jsonl         # one JSON object per line
├── assets/             # public file-backed assets
└── hidden/             # evaluation and hidden file-backed assets

Tester YAML points to the pack:


benchmark:
  manifest: manifest.yaml
  tasks: tasks.jsonl

Paths resolve relative to the tester YAML file.
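
As an illustration of that rule, the sketch below resolves both benchmark paths against the directory that contains the tester YAML. The function name resolve_pack_paths and the example location packs/my-benchmark/tester.yaml are hypothetical and not part of SecureBench.

```python
from pathlib import Path


def resolve_pack_paths(tester_yaml: str, benchmark: dict) -> dict:
    """Resolve the manifest and tasks paths relative to the tester YAML file."""
    base = Path(tester_yaml).parent
    return {key: base / benchmark[key] for key in ("manifest", "tasks")}


# Hypothetical tester YAML location; the benchmark mapping mirrors the YAML above.
paths = resolve_pack_paths(
    "packs/my-benchmark/tester.yaml",
    {"manifest": "manifest.yaml", "tasks": "tasks.jsonl"},
)
print(paths["manifest"].as_posix())
print(paths["tasks"].as_posix())
```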

Manifest


id: my-benchmark
version: 1

defaults:
  family: terminal_task
  environment:
    image: my-benchmark-image:latest
    workdir: /workspace
    timeout_seconds: 300

asset_roots:
  public: assets/
  eval: hidden/

asset_defaults:
  read_only: true

defaults.family can be terminal_task or repo_patch. environment.image is required for harness and verifier containers. environment.workdir is required for repo_patch checks and is the workspace mount target for terminal_task.

asset_roots sets the pack directories used for public and verifier-only resources.

Task rows

Each JSONL line is one object:


{
  "id": "my-benchmark/task-1",
  "family": "terminal_task",
  "input": { },
  "eval": { },
  "assets": [ ],
  "environment": { },
  "metadata": { }
}

Field	Required	Description
`id`	yes	Unique task identifier
`family`	if not in manifest defaults	`terminal_task` or `repo_patch`
`input`	yes	Public fields compiled as `public` resources
`eval`	yes	Verifier fields compiled as `evaluation_inputs` or `hidden`
`assets`	no	Public file references
`environment`	no	Merged over manifest `defaults.environment`
`metadata`	no	Opaque metadata preserved on the compiled task

Validation is strict: unknown family names and unknown fields raise ConfigError.
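
The row rules above can be sketched roughly as follows. This is a minimal illustration, not SecureBench's loader: the ConfigError class here is a local stand-in, compile_row is a hypothetical name, and the shallow merge of environment over the manifest defaults is an assumption about how the merge behaves.

```python
import json

ALLOWED_FIELDS = {"id", "family", "input", "eval", "assets", "environment", "metadata"}
SUPPORTED_FAMILIES = {"repo_patch", "terminal_task"}


class ConfigError(ValueError):
    """Local stand-in for SecureBench's ConfigError (hypothetical definition)."""


def compile_row(line: str, defaults: dict) -> dict:
    """Validate one JSONL line and apply manifest defaults to it."""
    row = json.loads(line)
    unknown = set(row) - ALLOWED_FIELDS
    if unknown:
        raise ConfigError(f"unknown fields: {sorted(unknown)}")
    for required in ("id", "input", "eval"):
        if required not in row:
            raise ConfigError(f"missing required field: {required}")
    # family may come from the row or from the manifest defaults.
    family = row.get("family", defaults.get("family"))
    if family not in SUPPORTED_FAMILIES:
        raise ConfigError(f"unsupported family: {family!r}")
    # Assumption: a shallow merge where row keys override defaults.environment.
    environment = {**defaults.get("environment", {}), **row.get("environment", {})}
    return {**row, "family": family, "environment": environment}


defaults = {
    "family": "terminal_task",
    "environment": {"image": "my-benchmark-image:latest", "timeout_seconds": 300},
}
line = (
    '{"id": "my-benchmark/task-1", "input": {}, "eval": {}, '
    '"environment": {"timeout_seconds": 60}}'
)
task = compile_row(line, defaults)
print(task["family"], task["environment"])
```

Here the row omits family, so the manifest default applies, and the row-level timeout_seconds overrides the manifest value while image is inherited.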

Assets

Public assets attach files from the pack directory into the agent workspace:


{
  "path": "starter/input.txt",
  "mount": "input.txt",
  "read_only": false
}

path is relative to asset_roots.public.
mount is the destination path in the workspace. It defaults to path when omitted.
read_only defaults to manifest asset_defaults.read_only.
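
A small sketch of those defaulting rules, assuming the manifest always sets asset_defaults.read_only. The helper name normalize_asset is hypothetical.

```python
def normalize_asset(asset: dict, asset_defaults: dict) -> dict:
    """Fill in the optional asset fields using the rules described above."""
    normalized = {
        "path": asset["path"],
        # mount defaults to path when omitted.
        "mount": asset.get("mount", asset["path"]),
    }
    if "read_only" in asset:
        normalized["read_only"] = asset["read_only"]
    else:
        # Assumption: the manifest provides asset_defaults.read_only.
        normalized["read_only"] = asset_defaults["read_only"]
    return normalized


asset_defaults = {"read_only": True}
explicit = normalize_asset(
    {"path": "starter/input.txt", "mount": "input.txt", "read_only": False},
    asset_defaults,
)
implicit = normalize_asset({"path": "starter/input.txt"}, asset_defaults)
print(explicit)
print(implicit)
```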

Evaluation file references are objects under eval.* whose path is relative to asset_roots.eval. SecureBench rejects path traversal, absolute paths, and symlinks.
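
One way such a check could look is sketched below. It is an illustration of the three rejections only, not SecureBench's actual validation; resolve_safe is a hypothetical helper, and the demo uses a temporary directory in place of a real asset root.

```python
import tempfile
from pathlib import Path, PurePosixPath


def resolve_safe(root: Path, relative: str) -> Path:
    """Join a pack-relative path onto root, rejecting unsafe references."""
    rel = PurePosixPath(relative)
    if rel.is_absolute():
        raise ValueError(f"absolute path not allowed: {relative}")
    if not rel.parts or ".." in rel.parts:
        raise ValueError(f"empty path or path traversal not allowed: {relative}")
    current = root
    for part in rel.parts:
        current = current / part
        # Reject a symlink at any level, not only at the final component.
        if current.is_symlink():
            raise ValueError(f"symlink not allowed: {relative}")
    return current


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "checks").mkdir()
    ok = resolve_safe(root, "checks")
    rejected = []
    for bad in ("../secret", "/etc/passwd"):
        try:
            resolve_safe(root, bad)
        except ValueError:
            rejected.append(bad)
    print(ok.name, rejected)
```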

Active families

Family	Candidate kind	Workspace required	Verifier
`repo_patch`	patch	yes	yes
`terminal_task`	workspace	yes	yes

Additional families can be added as the benchmark surface grows. Unsupported family names fail validation.

Adapting existing benchmarks

SecureBench pack conversion starts with three questions:

What can the agent see?
What workspace or repository state does the agent produce?
What trusted input scores that state?

For SWE-bench-style datasets, public fields such as repository, base commit, and problem statement become input. Hidden patches and test commands become eval. The benchmark image and workdir become environment.
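
A rough sketch of that mapping for one dataset record follows. It assumes the record uses the SWE-bench field names instance_id, repo, base_commit, problem_statement, hints_text, patch, and test_patch. The function name, the id prefix, the inline form used for gold_patch, and all sample values are hypothetical; the test command and image are passed in because the record does not carry them.

```python
import json


def swe_row_to_task(row: dict, test_command: list, image: str, workdir: str = "/testbed") -> dict:
    """Split one SWE-bench-style record into public input and hidden eval."""
    task = {
        "id": f"swe-bench/{row['instance_id']}",  # hypothetical id scheme
        "family": "repo_patch",
        "input": {
            "repo": row["repo"],
            "base_commit": row["base_commit"],
            "instructions": row["problem_statement"],
        },
        "eval": {
            "tests": {
                "source": "command",
                "command": list(test_command),
                "workdir": workdir,
                "test_patch": {"source": "inline", "patch": row["test_patch"]},
            },
            # Assumption: gold_patch accepts the same inline form as test_patch.
            "gold_patch": {"source": "inline", "patch": row["patch"]},
        },
        "environment": {"image": image, "workdir": workdir},
    }
    if row.get("hints_text"):
        task["input"]["hints"] = row["hints_text"]
    return task


# Placeholder record for illustration only; not real dataset content.
row = {
    "instance_id": "example__repo-1",
    "repo": "example/repo",
    "base_commit": "abc123",
    "problem_statement": "Fix the parser bug.",
    "hints_text": "",
    "patch": "GOLD-PATCH-PLACEHOLDER",
    "test_patch": "TEST-PATCH-PLACEHOLDER",
}
task = swe_row_to_task(
    row,
    ["python", "-m", "pytest", "tests/test_bug.py"],
    image="my-benchmark-image:latest",
)
jsonl_line = json.dumps(task)  # one line of tasks.jsonl
print(jsonl_line)
```

The point of the split is that the gold patch and test patch never appear under input, so nothing the agent can see contains them.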

Deep SWE task directories map onto the same structure: instruction.md becomes public instructions; task metadata supplies repository, commit, image, and timeout; solution and test patches stay under eval.

Terminal-Bench uses terminal_task. Public starter files are mounted as assets, hidden pytest or script checkers live under the eval asset root, and image-backed starter state can be materialized with materialize_workdir_from_image.

Audit packs use the same manifest and row mechanics. They test framework defenses against tampering, hidden-file overwrite attempts, fake test wrappers, network probes, and candidate policy violations.

`repo_patch`

Repo-patch tasks ask an agent to edit a repository checkout. SecureBench extracts a canonical git diff and verifies it in the benchmark image.

input: repo, base_commit, instructions, optional hints
eval: tests, optional candidate_policy, optional gold_patch


"input": {
  "repo": "example/repo",
  "base_commit": "abc123",
  "instructions": "Fix the parser bug."
},
"eval": {
  "tests": {
    "source": "command",
    "command": ["python", "-m", "pytest", "tests/test_bug.py"],
    "workdir": "/testbed",
    "timeout_seconds": 300,
    "setup_patch": { "source": "inline", "patch": "..." },
    "test_patch": { "source": "inline", "patch": "..." }
  },
  "candidate_policy": {
    "allow_paths": ["src/**"],
    "allow_sensitive_paths": [],
    "patch_preserved_paths": ["tests/public_helpers.py"]
  }
}

tests.command must be an argv array. Use an explicit shell only when required, for example ["bash", "-lc", "pytest -q && python check.py"].
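
When generating rows programmatically, a check along these lines can catch shell strings before they reach the pack. validate_command is a hypothetical helper, and the exact checks SecureBench applies may differ.

```python
import shlex


def validate_command(command) -> list:
    """Accept only an argv array: a non-empty list of strings."""
    if isinstance(command, str) or not isinstance(command, (list, tuple)):
        raise TypeError("tests.command must be an argv array, not a shell string")
    if not command or not all(isinstance(arg, str) for arg in command):
        raise ValueError("tests.command must be a non-empty list of strings")
    return list(command)


# A simple command line can be split into argv with shlex.
argv = validate_command(shlex.split("python -m pytest tests/test_bug.py"))

# Shell operators such as && need an explicit shell wrapper.
shell_argv = validate_command(["bash", "-lc", "pytest -q && python check.py"])
print(argv)
print(shell_argv)
```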

candidate_policy is verifier-only evaluation input:

allow_paths declares the intended implementation edit surface. If omitted, the verifier allows non-sensitive paths.
allow_sensitive_paths permits otherwise denied paths for tasks that require them. If omitted, sensitive-by-default paths remain blocked.
patch_preserved_paths lists visible verifier-supporting files. Candidate edits to these files are stripped before verification.

Do not use patch_preserved_paths to hide tests. It is for visible helper files that the agent may need to read or act on, but that must remain unchanged for evaluation. Hidden or evaluation tests belong in eval.tests.setup_patch, eval.tests.test_patch, or eval file assets.
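
To make the interaction of the three fields concrete, here is a minimal sketch that classifies the paths touched by a candidate patch. Several parts are assumptions: the SENSITIVE_PATTERNS list is invented for illustration, glob matching uses Python's fnmatch rather than whatever matcher SecureBench uses, and a sensitive path listed in allow_sensitive_paths is treated as allowed without also checking allow_paths.

```python
from fnmatch import fnmatchcase

# Hypothetical sensitive-by-default patterns; the real list is not documented here.
SENSITIVE_PATTERNS = [".git/*", ".github/*", "*.env"]


def matches(path: str, patterns: list) -> bool:
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def classify_paths(changed: list, policy: dict):
    """Split changed paths into kept, stripped, and policy violations."""
    preserved = policy.get("patch_preserved_paths", [])
    allow = policy.get("allow_paths")  # None means allow_paths was omitted
    allow_sensitive = policy.get("allow_sensitive_paths", [])
    kept, stripped, violations = [], [], []
    for path in changed:
        if matches(path, preserved):
            stripped.append(path)  # the edit is dropped before verification
        elif matches(path, SENSITIVE_PATTERNS):
            (kept if matches(path, allow_sensitive) else violations).append(path)
        elif allow is None or matches(path, allow):
            kept.append(path)
        else:
            violations.append(path)
    return kept, stripped, violations


policy = {
    "allow_paths": ["src/**"],
    "allow_sensitive_paths": [],
    "patch_preserved_paths": ["tests/public_helpers.py"],
}
changed = ["src/parser/core.py", "tests/public_helpers.py", ".git/config", "README.md"]
kept, stripped, violations = classify_paths(changed, policy)
print(kept, stripped, violations)
```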

`terminal_task`

Terminal tasks ask an agent to change files or produce artifacts in a workspace. SecureBench verifies the final workspace state with trusted checker files.

input: instructions, optional context
eval: checker, optional run_tests, test_files, expected_state, needed_commands


"input": {
  "instructions": "Write output.txt with the expected marker."
},
"eval": {
  "checker": {
    "source": "pytest",
    "path": "checks",
    "timeout_seconds": 180
  }
}

Checker source is pytest or script. path is a safe relative path under asset_roots.eval. SecureBench mounts it read-only under /opt/securebench/evaluator.
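
Because checker code reads files the agent controls, it helps to read them defensively. The sketch below shows a helper that either checker style could call. The names read_candidate_text and check_output, the 4096-byte limit, and the marker string EXPECTED-MARKER are all hypothetical.

```python
import os
import stat
from pathlib import Path

MAX_BYTES = 4096  # hypothetical upper bound on the candidate file size


def read_candidate_text(path: Path, max_bytes: int = MAX_BYTES) -> str:
    """Read a candidate file: regular files only, no symlinks, bounded size."""
    info = os.lstat(path)  # lstat does not follow symlinks
    if not stat.S_ISREG(info.st_mode):
        raise ValueError(f"{path} is not a regular file")
    if info.st_size > max_bytes:
        raise ValueError(f"{path} exceeds {max_bytes} bytes")
    with open(path, "rb") as handle:
        # Cap the read as well, in case the file grew after the lstat call.
        data = handle.read(max_bytes)
    return data.decode("utf-8", errors="replace")


def check_output(workspace: Path) -> bool:
    """Return True only if output.txt holds exactly the expected marker."""
    try:
        text = read_candidate_text(workspace / "output.txt")
    except (OSError, ValueError):
        return False
    return text.strip() == "EXPECTED-MARKER"
```

A pytest checker could wrap this in a test function that asserts check_output(Path("/workspace")), assuming the workspace is mounted at the environment.workdir shown in the manifest example.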

needed_commands lists verifier-only dangerous commands; chroot is the only one currently supported. Dangerous commands are denied unless the tester YAML sets verification.disallow_dangerous_commands: false.
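
The deny-by-default behavior can be pictured with the following sketch. The function name is hypothetical, and returning an empty set when commands are denied is an assumption; SecureBench may instead fail the task or raise an error.

```python
SUPPORTED_DANGEROUS_COMMANDS = {"chroot"}


def allowed_dangerous_commands(needed_commands: list, verification: dict) -> set:
    """Return the dangerous commands the verifier may run for this task."""
    unsupported = set(needed_commands) - SUPPORTED_DANGEROUS_COMMANDS
    if unsupported:
        raise ValueError(f"unsupported dangerous commands: {sorted(unsupported)}")
    # Denied unless the tester YAML explicitly sets the flag to false.
    if verification.get("disallow_dangerous_commands", True):
        return set()
    return set(needed_commands)


denied = allowed_dangerous_commands(["chroot"], {})
granted = allowed_dangerous_commands(["chroot"], {"disallow_dangerous_commands": False})
print(denied, granted)
```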

Example packs

Directory	Family	Notes
`benchmarks/terminal-bench/`	`terminal_task`	Terminal-Bench subset
`benchmarks/swe-bench-verified/`	`repo_patch`	SWE-bench Verified subset
`benchmarks/deep-swe/`	`repo_patch`	Deep SWE-derived subset
`benchmarks/audit/terminal-task/`	`terminal_task`	Terminal-task robustness probes
`benchmarks/audit/repo-patch/`	`repo_patch`	Repo-patch robustness probes

Authoring requirements

Put only task instructions and public assets in input and assets.
Keep checker files, test patches, and verifier-only files under asset_roots.eval.
Set environment.image for every pack.
Set environment.workdir for repo_patch packs and terminal packs whose workspace path matters.
Use explicit eval.candidate_policy.allow_paths for repo_patch tasks.
Treat all candidate workspace files as hostile inside terminal checkers.