Tester Configuration

Tester YAML is the run configuration owned by the benchmark operator. It selects a benchmark pack, a candidate-producing harness, an output location, and a verifier sandbox policy.

Supported schema version: 0.2

Minimal example

schema_version: "0.2"
run:
  id: my-run-id
  output_dir: ../../runs/my-run-id
benchmark:
  manifest: manifest.yaml
  tasks: tasks.jsonl
harness:
  type: codex
  env:
    - OPENAI_API_KEY
  config:
    model: gpt-5.4-mini
    version: latest
    task_file: task.json
    timeout_seconds: 300

Paths in benchmark and run.output_dir resolve relative to the tester YAML file.
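
The resolution rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's loader: the function name resolve_config_path is hypothetical, and it assumes that "relative to the tester YAML file" means relative to the directory containing that file and that absolute paths are left unchanged.

```python
from pathlib import Path


def resolve_config_path(tester_yaml: Path, value: str) -> Path:
    """Resolve a path from the tester YAML against the YAML file's directory.

    Hypothetical helper; the real loader may behave differently.
    """
    candidate = Path(value)
    if candidate.is_absolute():
        # Assumption: absolute paths are used as written.
        return candidate
    return (tester_yaml.parent / candidate).resolve()
```

With a tester file at benchmarks/pack/tester.yaml, the value ../../runs/my-run-id would resolve to runs/my-run-id next to the benchmarks directory under this sketch.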

Root fields

| Field | Required | Description |
| --- | --- | --- |
| schema_version | yes | Must be "0.2" |
| run | yes | Run identity and output directory |
| benchmark | yes | Manifest and tasks paths |
| harness | yes | Candidate producer configuration |
| verification | no | Verifier sandbox policy (defaults are restrictive) |

Unknown root fields are rejected with ConfigError.
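
As a rough sketch of how the root-field rules could be enforced, the snippet below checks a parsed config dictionary. The ConfigError class here is a local stand-in, and validate_root and the constant names are hypothetical. Only the rejection of unknown fields with ConfigError is stated in this document; raising the same error for missing fields or a wrong schema version is an assumption.

```python
class ConfigError(ValueError):
    """Stand-in for the project's ConfigError; the real class may differ."""


REQUIRED_ROOT_FIELDS = {"schema_version", "run", "benchmark", "harness"}
OPTIONAL_ROOT_FIELDS = {"verification"}
SUPPORTED_SCHEMA_VERSION = "0.2"


def validate_root(config: dict) -> None:
    """Raise ConfigError if the root mapping violates the documented rules."""
    unknown = set(config) - REQUIRED_ROOT_FIELDS - OPTIONAL_ROOT_FIELDS
    if unknown:
        raise ConfigError(f"unknown root fields: {sorted(unknown)}")
    # Assumption: missing fields and version mismatches also raise ConfigError.
    missing = REQUIRED_ROOT_FIELDS - set(config)
    if missing:
        raise ConfigError(f"missing required root fields: {sorted(missing)}")
    if config["schema_version"] != SUPPORTED_SCHEMA_VERSION:
        raise ConfigError(
            f"schema_version must be {SUPPORTED_SCHEMA_VERSION!r}, "
            f"got {config['schema_version']!r}"
        )
```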

run section

run:
  id: code-completion-smoke-codex   # non-empty string
  output_dir: ../../runs/...        # created if missing

CLI override: --output-dir PATH replaces output_dir without editing the file.

benchmark section

benchmark:
  manifest: manifest.yaml
  tasks: tasks.jsonl

Both paths must exist at run time.

harness section

harness:
  type: codex          # codex | claude_code | command
  env:                 # optional list of env var NAMES (not values)
    - OPENAI_API_KEY
  config: { }          # harness-specific

Harness types

| Type | Purpose | Requires env |
| --- | --- | --- |
| command | Run a fixed shell command in the task Docker image | optional |
| codex | Run OpenAI Codex CLI with tooling overlay | OPENAI_API_KEY |
| claude_code | Run Claude Code CLI with tooling overlay | ANTHROPIC_API_KEY |

Design note: There is no host-execution harness mode. Untrusted candidate production always runs in Docker.

command harness config

harness:
  type: command
  config:
    command: "python solve.py"     # string or string array
    artifact_path: candidate.txt   # optional file extraction path
    task_file: task.json           # default: task.json
    timeout_seconds: 30
    allowed_domains: []            # optional egress allowlist

  • Without artifact_path, text families extract from stdout; terminal_task uses workspace extraction.
  • task_file must not collide with materialized resource paths (reject_task_file_collision).

codex harness config

harness:
  type: codex
  env:
    - OPENAI_API_KEY
  config:
    model: gpt-5.4-mini      # required
    version: latest          # default: latest
    task_file: task.json     # default: task.json
    timeout_seconds: 900     # default: 900
    allowed_domains: []      # added to api.openai.com
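
To make the defaults in the example above concrete, here is a minimal sketch of how a codex config block might be normalized. The names normalize_codex_config, CODEX_DEFAULTS, and PROVIDER_DOMAIN are hypothetical, ConfigError is a local stand-in, and the ordering and de-duplication of domains are assumptions rather than documented behavior.

```python
class ConfigError(ValueError):
    """Stand-in for the project's ConfigError; the real class may differ."""


# Defaults taken from the comments in the codex example above.
CODEX_DEFAULTS = {
    "version": "latest",
    "task_file": "task.json",
    "timeout_seconds": 900,
}
PROVIDER_DOMAIN = "api.openai.com"


def normalize_codex_config(config: dict) -> dict:
    """Fill in defaults and prepend the provider domain to allowed_domains."""
    if not config.get("model"):
        raise ConfigError("harness.config.model is required for the codex harness")
    merged = {**CODEX_DEFAULTS, **config}
    extra = config.get("allowed_domains") or []
    # Assumption: provider domain first, then user domains, duplicates dropped.
    merged["allowed_domains"] = list(dict.fromkeys([PROVIDER_DOMAIN, *extra]))
    return merged
```

Under this sketch, the minimal example's timeout_seconds: 300 overrides the 900-second default, while version and task_file could be omitted entirely.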

Codex runs inside the task’s benchmark image with:

  • A prebuilt Node-based tooling overlay at /opt/securebench/codex
  • Separate writable Codex home at /opt/securebench/codex-home
  • Public task data materialized into the workspace
  • task.json containing agent_payload() plus extraction instructions

Default candidate extraction follows the family contract (e.g. candidate.py for code, git diff for patches).

For repo_patch tasks, Codex and Claude Code call prepare_repo_patch_baseline() to initialize the git workspace before the agent runs.

claude_code harness config

Same shape as Codex with Claude-specific overlay paths and default egress to api.anthropic.com.

Note: A dedicated smoke tester YAML for claude_code is listed as planned work in the repo’s docs/next.md.

Verification policy

verification:
  disallow_dangerous_commands: true   # default: true
  deny_commands: []                   # optional override list

Controls verifier-only dangerous commands declared by benchmarks (e.g. eval.needed_commands: ["chroot"] on terminal tasks).

| Setting | Behavior |
| --- | --- |
| disallow_dangerous_commands: true (default) | Benchmark-requested dangerous commands are denied; verifier fails before running checks |
| disallow_dangerous_commands: false | Allows known commands; maps to narrow Docker capabilities (e.g. chroot → SYS_CHROOT) |
| deny_commands: [chroot] | Explicit deny wins over benchmark declarations and opt-in |

Known dangerous commands (dangerous_commands.KNOWN_DANGEROUS_COMMANDS):

| Command | Docker capability |
| --- | --- |
| chroot | SYS_CHROOT |
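
The decision rules in this section can be summarized as a small function. This is an illustrative sketch only: resolve_capabilities and PolicyError are hypothetical names, the dictionary mirrors the single known command listed above, and rejecting commands that are not in the known list is an assumption.

```python
# Mirrors the one documented entry; the real table lives in dangerous_commands.
KNOWN_DANGEROUS_COMMANDS = {"chroot": "SYS_CHROOT"}


class PolicyError(Exception):
    """Hypothetical error meaning the verifier must fail before running checks."""


def resolve_capabilities(requested, disallow_dangerous_commands=True, deny_commands=()):
    """Return the Docker capabilities to grant for benchmark-requested commands."""
    capabilities = []
    for command in requested:
        if command in deny_commands:
            # An explicit deny wins over benchmark declarations and opt-in.
            raise PolicyError(f"{command!r} is explicitly denied")
        if disallow_dangerous_commands:
            raise PolicyError(f"{command!r} requires disallow_dangerous_commands: false")
        if command not in KNOWN_DANGEROUS_COMMANDS:
            # Assumption: only known commands can be enabled.
            raise PolicyError(f"{command!r} is not a known dangerous command")
        capabilities.append(KNOWN_DANGEROUS_COMMANDS[command])
    return capabilities
```

A task that declares no dangerous commands yields an empty capability list under either setting, so the default policy only affects benchmarks that request one.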

Important: Command policy is guidance and audit logging for executable names — not a general shell interceptor and not a substitute for container isolation.

Example opt-in for Terminal-Bench path-tracing:

verification:
  disallow_dangerous_commands: false
  deny_commands: []

Network egress

When allowed_domains is non-empty (or implied by provider harness type), SecureBench:

  1. Creates an internal Docker network
  2. Starts a read-only egress proxy container
  3. Sets HTTP_PROXY / HTTPS_PROXY on the harness container

Domains must be valid DNS names (no IPs, wildcards, or localhost). The proxy script lives at harnesses/egress_proxy.py.
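
The domain rule above (valid DNS names only; no IPs, wildcards, or localhost) might be checked along these lines. This is a sketch under stated assumptions, not the validation SecureBench performs: is_allowed_domain is a hypothetical name, and requiring at least two labels and a non-numeric final label are extra assumptions.

```python
import ipaddress
import re

# One DNS label: 1-63 letters, digits, or hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def is_allowed_domain(domain: str) -> bool:
    """Return True if the value looks like a plain DNS name suitable for an allowlist."""
    name = domain.strip().rstrip(".").lower()
    if not name or len(name) > 253 or name == "localhost" or "*" in name:
        return False
    try:
        ipaddress.ip_address(name)
    except ValueError:
        pass  # Not an IP literal, keep checking.
    else:
        return False
    labels = name.split(".")
    # Assumption: require a dotted name whose last label is not purely numeric.
    if len(labels) < 2 or labels[-1].isdigit():
        return False
    return all(_LABEL.fullmatch(label) for label in labels)
```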

Provider defaults:

  • Codex: api.openai.com
  • Claude Code: api.anthropic.com

Environment variable forwarding

harness.env lists names only. At runtime, values are read from the process environment (after load_env_file()). Missing required vars raise ConfigError before the harness starts.
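
A minimal sketch of the forwarding step, assuming the environment has already been populated (for example by load_env_file()). The function collect_harness_env is hypothetical, ConfigError is a local stand-in, and treating an empty value as missing is an assumption.

```python
import os
from typing import Iterable, Mapping, Optional


class ConfigError(ValueError):
    """Stand-in for the project's ConfigError; the real class may differ."""


def collect_harness_env(
    names: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> dict:
    """Look up each listed variable name and fail early if any value is missing."""
    source = os.environ if environ is None else environ
    names = list(names)
    # Assumption: an empty string counts as missing.
    missing = [name for name in names if not source.get(name)]
    if missing:
        raise ConfigError(f"missing required environment variables: {missing}")
    return {name: source[name] for name in names}
```

Only the listed names are copied, so unrelated variables in the operator's shell never reach the harness container in this sketch.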

Security warning: API keys are currently injected into the same container where agent shell commands run. See Security Model.

Task file written for agents

Harnesses write task_file (default task.json) into the agent workspace. The file content is task.agent_payload() only — a JSON object of public resource names to values (for example question, choices, prompt, or instructions).

{ "instructions": "...", "prompt": "..." }

It does not include task metadata, extraction mode, or hidden fields.
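
For illustration, writing the task file could look like the sketch below. The helper write_task_file is hypothetical, and the payload argument stands in for whatever task.agent_payload() returns; the JSON formatting choices are assumptions.

```python
import json
from pathlib import Path


def write_task_file(workspace: Path, payload: dict, task_file: str = "task.json") -> Path:
    """Write only the public payload into the agent workspace and return the file path."""
    target = workspace / task_file
    # Only the payload is serialized: no metadata, extraction mode, or hidden fields.
    target.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return target
```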

For Codex and Claude Code harnesses, extraction instructions are appended to the CLI prompt (codex_prompt / claude_code_prompt), not embedded in task.json. Example prompt tail: “Write the final candidate to candidate.py…”

Example configs in the repo

| File | Harness | Use case |
| --- | --- | --- |
| benchmarks/code-completion-smoke/tester-codex.yaml | codex | Code completion smoke |
| benchmarks/terminal-task-smoke/tester-command.yaml | command | Terminal task smoke |
| benchmarks/terminal-bench-first10/tester-codex.yaml | codex | Full terminal bench subset |

Canonical reference: SecureBench repo docs/tester-yaml-standard.html.
