Tester configuration

Tester YAML is the run configuration owned by the benchmark operator. It selects the benchmark pack, harness, output directory, and verifier sandbox policy.

Supported schema version: 0.2.

Minimal example


schema_version: "0.2"

run:
  id: my-run-id
  output_dir: ../../runs/my-run-id

benchmark:
  manifest: manifest.yaml
  tasks: tasks.jsonl

harness:
  type: codex
  config:
    model: gpt-5.4-mini
    version: latest
    task_file: task.json
    timeout_seconds: 300

Paths in benchmark and run.output_dir resolve relative to the tester YAML file.
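The resolution rule above can be sketched in a few lines. This is an illustration only, not the framework's code: the helper name resolve_config_path is hypothetical, and leaving absolute paths unchanged is an assumption the text does not confirm.

```python
import os
from pathlib import Path


def resolve_config_path(config_file: Path, value: str) -> Path:
    """Resolve `value` against the directory holding the tester YAML file.

    Hypothetical helper: absolute values are returned as-is (an assumption).
    """
    candidate = Path(value)
    if not candidate.is_absolute():
        candidate = config_file.parent / candidate
    # normpath collapses ".." segments without touching the filesystem,
    # so the output directory does not need to exist yet.
    return Path(os.path.normpath(candidate))
```

With a tester file at /repo/benchmarks/terminal-bench/tester-codex.yaml, the value ../../runs/my-run-id would resolve to /repo/runs/my-run-id under this sketch.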

Root fields

Field	Required	Description
`schema_version`	yes	Must be `"0.2"`
`run`	yes	Run identity and output directory
`benchmark`	yes	Manifest and task row paths
`harness`	yes	Candidate producer configuration
`verification`	no	Verifier sandbox policy

Unknown root fields raise ConfigError.
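A minimal sketch of the root-field check described by the table, assuming the parsed YAML is a plain dict. The ConfigError class here is a local stand-in (its base class is an assumption), validate_root is a hypothetical name, and the error messages are illustrative.

```python
class ConfigError(ValueError):
    """Local stand-in for the framework's ConfigError."""


REQUIRED_ROOT_FIELDS = {"schema_version", "run", "benchmark", "harness"}
OPTIONAL_ROOT_FIELDS = {"verification"}


def validate_root(config: dict) -> None:
    """Reject unknown or missing root fields and unsupported schema versions."""
    allowed = REQUIRED_ROOT_FIELDS | OPTIONAL_ROOT_FIELDS
    unknown = sorted(set(config) - allowed)
    if unknown:
        raise ConfigError(f"unknown root fields: {unknown}")
    missing = sorted(REQUIRED_ROOT_FIELDS - set(config))
    if missing:
        raise ConfigError(f"missing required root fields: {missing}")
    # The version is a quoted string in YAML, so compare against "0.2".
    if config["schema_version"] != "0.2":
        raise ConfigError("schema_version must be \"0.2\"")
```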

`run`


run:
  id: terminal-bench-codex-gpt-5.4-mini
  output_dir: ../../runs/terminal-bench-codex-gpt-5.4-mini

The --output-dir PATH command-line option overrides run.output_dir without modifying the YAML file.

`benchmark`


benchmark:
  manifest: manifest.yaml
  tasks: tasks.jsonl

Both paths must exist at run time.

`harness`


harness:
  type: codex
  config: { }

harness.env lists the names, not the values, of additional non-provider environment variables to forward into the harness container.

Provider harnesses require the provider API key in the host environment after .env loading. Use harness.env only for additional non-provider variables. Provider key names are filtered out, and the agent container receives dummy keys instead, which prevents key leakage and unauthorized server-side tool use.

Harness types

Type	Purpose	Required host env
`command`	Run a fixed shell command in the task Docker image	none
`codex`	Run OpenAI Codex CLI with the tooling overlay	`OPENAI_API_KEY`
`claude_code`	Run Claude Code CLI with the tooling overlay	`ANTHROPIC_API_KEY`

No host-execution harness mode exists yet. Candidate production runs in Docker.

`command` harness config


harness:
  type: command
  config:
    command: "python solve.py"
    task_file: task.json
    timeout_seconds: 30
    allowed_domains: []

command can be a string or string array.
task_file defaults to task.json.
terminal_task extracts the final workspace.
repo_patch extracts a git diff from the task workdir.
task_file must not collide with materialized resource paths.

`codex` harness config


harness:
  type: codex
  config:
    model: gpt-5.4-mini
    version: latest
    task_file: task.json
    timeout_seconds: 900
    allowed_domains: []

Codex runs inside the task benchmark image with:

Node-based tooling overlay at /opt/securebench/codex
writable Codex home at /opt/securebench/codex-home
public task data materialized into the workspace
task.json containing agent_payload()
extraction instructions appended to the CLI prompt

Default extraction follows the family contract: git diff for repo_patch, workspace path for terminal_task.

For repo_patch, Codex and Claude Code call prepare_repo_patch_baseline() before the agent runs.

`claude_code` harness config

Claude Code uses the same config shape as Codex with Claude-specific overlay paths and default egress to api.anthropic.com.

Verification policy


verification:
  allow_network: false
  disallow_dangerous_commands: true
  deny_commands: []

This policy applies to verifier sandboxes only. It controls their network access and the dangerous commands that benchmarks declare, such as eval.needed_commands: ["chroot"] on terminal tasks.

Setting	Behavior
`allow_network: false`	Run verifier sandboxes with Docker network `none`
`allow_network: true`	Run verifier sandboxes with Docker network `bridge`
`disallow_dangerous_commands: true`	Deny benchmark-requested dangerous commands before checks run
`disallow_dangerous_commands: false`	Allow known commands and map them to narrow Docker capabilities
`deny_commands: [chroot]`	Deny the listed commands even when a benchmark declares them and `disallow_dangerous_commands` is `false`

Known dangerous commands:

Command	Docker capability
`chroot`	`SYS_CHROOT`

Command policy matches on executable name and serves as audit and guidance only. It is not a general shell interceptor and does not replace container isolation.
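The precedence between the three settings can be sketched as follows. This is an illustration of the tables above, not the framework's implementation: resolve_command_policy is a hypothetical name, and command names outside the known table are skipped because the text does not say how they are treated.

```python
# Known dangerous commands and the Docker capability each one maps to.
KNOWN_DANGEROUS = {"chroot": "SYS_CHROOT"}


def resolve_command_policy(needed_commands, disallow_dangerous_commands=True,
                           deny_commands=()):
    """Return (granted_capabilities, denied_commands) for a verifier sandbox."""
    deny = set(deny_commands)
    granted, denied = [], []
    for command in needed_commands:
        if command not in KNOWN_DANGEROUS:
            continue  # assumption: only known dangerous commands are in scope
        if command in deny or disallow_dangerous_commands:
            # The deny list is checked first, so it wins over the opt-in.
            denied.append(command)
        else:
            granted.append(KNOWN_DANGEROUS[command])
    return granted, denied
```

Under this sketch, a benchmark that declares chroot gets SYS_CHROOT only when disallow_dangerous_commands is false and chroot is absent from deny_commands.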

Example opt-in for Terminal-Bench path-tracing:


verification:
  allow_network: true
  disallow_dangerous_commands: false
  deny_commands: []

Network egress

verification.allow_network applies only to trusted verifier sandboxes.

When allowed_domains is non-empty, or when a provider harness implies egress, SecureBench:

creates an internal Docker network
starts a read-only egress proxy container
sets HTTP_PROXY and HTTPS_PROXY on the harness container

Domains must be DNS names. IPs, wildcards, and localhost are rejected. The proxy script is harnesses/egress_proxy.py.
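A rough sketch of the syntax rule for allowed_domains entries, written as an assumption about what "DNS name" means here (letters, digits, and hyphens in dot-separated labels). The function name is_dns_name is hypothetical, and the actual checks in harnesses/egress_proxy.py may differ.

```python
import ipaddress
import re

# One DNS label: 1 to 63 characters, no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_dns_name(domain: str) -> bool:
    """Accept plain DNS names; reject IPs, wildcards, and localhost."""
    name = domain.strip().rstrip(".").lower()
    if not name or len(name) > 253 or "*" in name:
        return False
    if name == "localhost" or name.endswith(".localhost"):
        return False
    try:
        ipaddress.ip_address(name.strip("[]"))
        return False  # parsed as an IPv4 or IPv6 literal
    except ValueError:
        pass
    return all(_LABEL.match(label) for label in name.split("."))
```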

Provider defaults:

Codex: api.openai.com
Claude Code: api.anthropic.com

The proxy resolves allowlisted hostnames itself and rejects DNS answers containing non-public or non-unicast addresses before connecting. Private destinations are unsupported even when a tester adds the hostname to allowed_domains.
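The address check can be approximated with the standard ipaddress module. This is a sketch under assumptions: the helper names are hypothetical, "public" is taken to mean ip.is_global, and rejecting an empty answer set is a choice made for this example.

```python
import ipaddress


def is_public_unicast(address: str) -> bool:
    """True when the address is globally routable and not multicast."""
    ip = ipaddress.ip_address(address)
    # is_global excludes private, loopback, link-local, and reserved ranges;
    # multicast is checked separately because it is not unicast.
    return ip.is_global and not ip.is_multicast


def answers_are_safe(addresses) -> bool:
    """Reject the whole DNS answer if any record is non-public or non-unicast."""
    addresses = list(addresses)
    return bool(addresses) and all(is_public_unicast(a) for a in addresses)
```

Rejecting the full answer when a single record is private avoids connecting to a hostname that mixes public and internal addresses.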

Environment forwarding

At runtime, names listed in harness.env are read from the process environment after load_env_file() runs. Missing forwarded variables raise ConfigError before the harness starts.
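A minimal sketch of the forwarding check, assuming harness.env has already been parsed into a list of names. ConfigError is a local stand-in for the framework's exception, and collect_forwarded_env is a hypothetical helper name.

```python
import os


class ConfigError(ValueError):
    """Local stand-in for the framework's ConfigError."""


def collect_forwarded_env(names, environ=None):
    """Look up each forwarded name and fail fast when any is missing."""
    env = os.environ if environ is None else environ
    missing = [name for name in names if name not in env]
    if missing:
        # Raised before the harness starts, so no container is created.
        raise ConfigError(f"missing forwarded environment variables: {missing}")
    return {name: env[name] for name in names}
```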

Provider harnesses inject dummy API keys into the agent container and replace them through the framework-owned relay. See Security model.

Agent task file

Harnesses write task_file (default task.json) into the agent workspace. The file contains task.agent_payload() only: a JSON object of public resource names to values.


{
  "instructions": "...",
  "context": { }
}

It does not include task metadata, extraction mode, or hidden fields.
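On the agent side, a command-harness script might read the task file like this. The sketch assumes the file sits at the workspace root and that the payload is a JSON object as described above. The load_task name is hypothetical, and the set of keys depends on the benchmark, so nothing beyond the object shape is checked.

```python
import json
from pathlib import Path


def load_task(workspace: Path, task_file: str = "task.json") -> dict:
    """Read the agent payload that the harness wrote into the workspace."""
    payload = json.loads((workspace / task_file).read_text(encoding="utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("task file must contain a JSON object")
    return payload
```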

Codex and Claude Code receive extraction instructions through the CLI prompt, not through task.json. Repo-patch prompts tell the agent to change the repository. Terminal-task prompts tell the agent to leave final work in the workspace.

Example configs

File	Harness	Use case
`benchmarks/terminal-bench/tester-codex.yaml`	codex	Terminal-Bench subset
`benchmarks/swe-bench-verified/tester-codex.yaml`	codex	SWE-bench Verified subset
`benchmarks/deep-swe/tester-codex.yaml`	codex	Deep SWE-derived repo-patch subset