Tester Configuration
Tester YAML is the run configuration owned by the benchmark operator. It selects a benchmark pack, a candidate-producing harness, an output location, and a verifier sandbox policy.
Supported schema version: 0.2
Minimal example
schema_version: "0.2"
run:
  id: my-run-id
  output_dir: ../../runs/my-run-id
benchmark:
  manifest: manifest.yaml
  tasks: tasks.jsonl
harness:
  type: codex
  env:
    - OPENAI_API_KEY
  config:
    model: gpt-5.4-mini
    version: latest
    task_file: task.json
    timeout_seconds: 300

Paths in benchmark and run.output_dir resolve relative to the tester YAML file.
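The path rule above can be illustrated with a minimal Python sketch. The helper name `resolve_config_path` is hypothetical, and passing absolute paths through unchanged is an assumption rather than documented behavior.

```python
from pathlib import Path


def resolve_config_path(tester_yaml: Path, value: str) -> Path:
    """Resolve a path from the tester YAML against the YAML file's directory.

    Hypothetical helper: the real loader may differ. Absolute paths are
    returned unchanged, which is an assumption.
    """
    candidate = Path(value)
    if candidate.is_absolute():
        return candidate
    # Anchor at the directory that contains the tester YAML, not the process cwd.
    return (tester_yaml.parent / candidate).resolve()
```

With a tester file at `benchmarks/smoke/tester.yaml`, the value `../../runs/my-run-id` would land two directories above `benchmarks/smoke`, regardless of where the CLI is invoked from.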
Root fields
| Field | Required | Description |
|---|---|---|
| schema_version | yes | Must be "0.2" |
| run | yes | Run identity and output directory |
| benchmark | yes | Manifest and tasks paths |
| harness | yes | Candidate producer configuration |
| verification | no | Verifier sandbox policy (defaults are restrictive) |
Unknown root fields are rejected with ConfigError.
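As a rough illustration of the root-field rules in the table, the following sketch validates a parsed config dictionary. `ConfigError` here is a local stand-in for the exception named above, and `validate_root` is a hypothetical function; the order of checks and the error messages are assumptions.

```python
class ConfigError(ValueError):
    """Local stand-in for the ConfigError mentioned in the text."""


REQUIRED_ROOT_FIELDS = {"schema_version", "run", "benchmark", "harness"}
OPTIONAL_ROOT_FIELDS = {"verification"}


def validate_root(config: dict) -> None:
    """Reject unknown or missing root fields and unsupported schema versions."""
    allowed = REQUIRED_ROOT_FIELDS | OPTIONAL_ROOT_FIELDS
    unknown = sorted(set(config) - allowed)
    if unknown:
        raise ConfigError(f"unknown root fields: {', '.join(unknown)}")
    missing = sorted(REQUIRED_ROOT_FIELDS - set(config))
    if missing:
        raise ConfigError(f"missing required root fields: {', '.join(missing)}")
    if config["schema_version"] != "0.2":
        raise ConfigError('schema_version must be "0.2"')
```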
run section
run:
  id: code-completion-smoke-codex # non-empty string
  output_dir: ../../runs/...      # created if missing

CLI override: --output-dir PATH replaces output_dir without editing the file.
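The override precedence can be sketched as follows. Only the `--output-dir` flag comes from the text; the function name `effective_output_dir` and the use of `argparse` are assumptions made for illustration.

```python
import argparse


def effective_output_dir(argv: list[str], configured: str) -> str:
    """Return the CLI value of --output-dir if given, else the YAML value.

    Hypothetical helper; the real CLI may parse its arguments differently.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--output-dir", default=None)
    # parse_known_args ignores any other flags the real CLI might accept.
    args, _unknown = parser.parse_known_args(argv)
    return args.output_dir if args.output_dir is not None else configured
```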
benchmark section
benchmark:
  manifest: manifest.yaml
  tasks: tasks.jsonl

Both paths must exist at run time.
harness section
harness:
  type: codex      # codex | claude_code | command
  env:             # optional list of env var NAMES (not values)
    - OPENAI_API_KEY
  config: { }      # harness-specific

Harness types
| Type | Purpose | Requires env |
|---|---|---|
| command | Run a fixed shell command in the task Docker image | optional |
| codex | Run OpenAI Codex CLI with tooling overlay | OPENAI_API_KEY |
| claude_code | Run Claude Code CLI with tooling overlay | ANTHROPIC_API_KEY |
Design note: There is no host-execution harness mode. Untrusted candidate production always runs in Docker.
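One way to read the harness table is as a lookup from harness type to required environment variable names. The sketch below assumes that the required name must appear in `harness.env`; `check_harness` and the local `ConfigError` class are hypothetical.

```python
class ConfigError(ValueError):
    """Local stand-in for the ConfigError mentioned in the text."""


# Required env var names per harness type, taken from the table above.
REQUIRED_ENV = {
    "command": set(),
    "codex": {"OPENAI_API_KEY"},
    "claude_code": {"ANTHROPIC_API_KEY"},
}


def check_harness(harness: dict) -> None:
    """Validate the harness type and that required env names are declared."""
    harness_type = harness.get("type")
    if harness_type not in REQUIRED_ENV:
        raise ConfigError(f"unsupported harness type: {harness_type!r}")
    declared = set(harness.get("env") or [])
    missing = sorted(REQUIRED_ENV[harness_type] - declared)
    if missing:
        raise ConfigError(f"{harness_type} harness requires env: {', '.join(missing)}")
```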
command harness config
harness:
  type: command
  config:
    command: "python solve.py"   # string or string array
    artifact_path: candidate.txt # optional file extraction path
    task_file: task.json         # default: task.json
    timeout_seconds: 30
    allowed_domains: []          # optional egress allowlist

- Without artifact_path, text families extract from stdout; terminal_task uses workspace extraction.
- command must not collide with materialized resource paths (reject_task_file_collision).
codex harness config
harness:
  type: codex
  env:
    - OPENAI_API_KEY
  config:
    model: gpt-5.4-mini  # required
    version: latest      # default: latest
    task_file: task.json # default: task.json
    timeout_seconds: 900 # default: 900
    allowed_domains: []  # added to api.openai.com

Codex runs inside the task’s benchmark image with:
- A prebuilt Node-based tooling overlay at /opt/securebench/codex
- A separate writable Codex home at /opt/securebench/codex-home
- Public task data materialized into the workspace
- task.json containing agent_payload(); extraction instructions are appended to the CLI prompt rather than embedded in the file
Default candidate extraction follows the family contract (e.g. candidate.py for code, git diff for patches).
For repo_patch tasks, Codex and Claude Code call prepare_repo_patch_baseline() to initialize the git workspace before the agent runs.
claude_code harness config
Same shape as Codex with Claude-specific overlay paths and default egress to api.anthropic.com.
Note: A dedicated smoke tester YAML for claude_code is listed as planned work in the repo’s docs/next.md.
Verification policy
verification:
  disallow_dangerous_commands: true # default: true
  deny_commands: []                 # optional override list

Controls verifier-only dangerous commands declared by benchmarks (e.g. eval.needed_commands: ["chroot"] on terminal tasks).
| Setting | Behavior |
|---|---|
| disallow_dangerous_commands: true (default) | Benchmark-requested dangerous commands are denied; verifier fails before running checks |
| disallow_dangerous_commands: false | Allows known commands; maps to narrow Docker capabilities (e.g. chroot → SYS_CHROOT) |
| deny_commands: [chroot] | Explicit deny wins over benchmark declarations and opt-in |
Known dangerous commands (dangerous_commands.KNOWN_DANGEROUS_COMMANDS):
| Command | Docker capability |
|---|---|
| chroot | SYS_CHROOT |
Important: Command policy is guidance and audit logging for executable names — not a general shell interceptor and not a substitute for container isolation.
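The precedence described in the tables above (explicit deny first, then the opt-in flag, then the capability mapping) might be sketched like this. `PolicyError` and `resolve_capabilities` are hypothetical names, and rejecting commands that are not in the known list is an assumption drawn from the phrase "allows known commands".

```python
class PolicyError(Exception):
    """Hypothetical error for a verifier that must fail before running checks."""


# Mirrors the known dangerous commands table above.
KNOWN_DANGEROUS_COMMANDS = {"chroot": "SYS_CHROOT"}


def resolve_capabilities(
    needed_commands: list[str],
    disallow_dangerous_commands: bool = True,
    deny_commands: tuple[str, ...] = (),
) -> list[str]:
    """Map benchmark-requested commands to Docker capabilities, or refuse."""
    capabilities = []
    for command in needed_commands:
        if command in deny_commands:
            # An explicit deny wins over benchmark declarations and opt-in.
            raise PolicyError(f"{command} is explicitly denied")
        if command not in KNOWN_DANGEROUS_COMMANDS:
            # Assumption: only known commands can ever be allowed.
            raise PolicyError(f"{command} is not a known dangerous command")
        if disallow_dangerous_commands:
            raise PolicyError(f"{command} requires disallow_dangerous_commands: false")
        capabilities.append(KNOWN_DANGEROUS_COMMANDS[command])
    return capabilities
```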
Example opt-in for Terminal-Bench path-tracing:
verification:
  disallow_dangerous_commands: false
  deny_commands: []

Network egress
When allowed_domains is non-empty (or implied by provider harness type), SecureBench:
- Creates an internal Docker network
- Starts a read-only egress proxy container
- Sets HTTP_PROXY/HTTPS_PROXY on the harness container
Domains must be valid DNS names (no IPs, wildcards, or localhost). The proxy script lives at harnesses/egress_proxy.py.
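A minimal sketch of the domain rule, assuming a conventional hostname grammar: the function `is_allowed_domain` is hypothetical, and the exact checks in SecureBench may be stricter or looser (for example, requiring at least one dot is an assumption).

```python
import ipaddress
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def is_allowed_domain(domain: str) -> bool:
    """Return True for plain DNS names; reject IPs, wildcards, and localhost."""
    name = domain.strip().lower().rstrip(".")
    if not name or len(name) > 253 or name == "localhost":
        return False
    try:
        ipaddress.ip_address(name)
        return False  # IPv4 and IPv6 literals are not DNS names
    except ValueError:
        pass
    labels = name.split(".")
    if len(labels) < 2:  # assumption: single-label names are rejected
        return False
    # A wildcard label such as "*" fails the label pattern.
    return all(_LABEL.fullmatch(label) for label in labels)
```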
Provider defaults:
- Codex: api.openai.com
- Claude Code: api.anthropic.com
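Combining the provider defaults with a configured allowed_domains list could look like the sketch below. The function names and the proxy URL argument are hypothetical; only the default domains and the HTTP_PROXY/HTTPS_PROXY variable names come from the text.

```python
# Provider defaults from the list above; the command harness has none.
PROVIDER_DEFAULT_DOMAINS = {
    "codex": ["api.openai.com"],
    "claude_code": ["api.anthropic.com"],
    "command": [],
}


def effective_allowed_domains(harness_type: str, allowed_domains: list[str]) -> list[str]:
    """Provider default first, then configured domains, without duplicates."""
    merged = list(PROVIDER_DEFAULT_DOMAINS.get(harness_type, []))
    for domain in allowed_domains:
        if domain not in merged:
            merged.append(domain)
    return merged


def proxy_environment(harness_type: str, allowed_domains: list[str], proxy_url: str) -> dict:
    """Return proxy env vars for the harness container, or {} if no egress is allowed.

    proxy_url is a hypothetical address of the egress proxy container.
    """
    if not effective_allowed_domains(harness_type, allowed_domains):
        return {}
    return {"HTTP_PROXY": proxy_url, "HTTPS_PROXY": proxy_url}
```

Under these assumptions, a command harness with an empty allowed_domains list gets no proxy variables at all, while a codex harness always gets them because its provider default makes the list non-empty.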
Environment variable forwarding
harness.env lists names only. At runtime, values are read from the process environment (after load_env_file()). Missing required vars raise ConfigError before the harness starts.
Security warning: API keys are currently injected into the same container where agent shell commands run. See Security Model.
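The name-only forwarding rule can be illustrated with a short sketch. `collect_harness_env` and the local `ConfigError` are hypothetical, and treating an empty value as missing is an assumption.

```python
import os
from typing import Mapping, Optional


class ConfigError(ValueError):
    """Local stand-in for the ConfigError mentioned in the text."""


def collect_harness_env(
    names: list[str], environ: Optional[Mapping[str, str]] = None
) -> dict[str, str]:
    """Look up values for the listed names; fail before any harness starts."""
    source = os.environ if environ is None else environ
    # Assumption: an unset or empty variable counts as missing.
    missing = [name for name in names if not source.get(name)]
    if missing:
        raise ConfigError(f"missing environment variables: {', '.join(missing)}")
    return {name: source[name] for name in names}
```

Because only names are stored in the YAML, the file itself can be committed without leaking secrets; the values exist only in the process environment at run time.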
Task file written for agents
Harnesses write task_file (default task.json) into the agent workspace. The file content is task.agent_payload() only — a JSON object of public resource names to values (for example question, choices, prompt, or instructions).
{
  "instructions": "...",
  "prompt": "..."
}

It does not include task metadata, extraction mode, or hidden fields.
For Codex and Claude Code harnesses, extraction instructions are appended to the CLI prompt (codex_prompt / claude_code_prompt), not embedded in task.json. Example prompt tail: “Write the final candidate to candidate.py…”
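The separation between the task file and the prompt could be sketched as follows. Both function names are hypothetical, and the payload is passed in as a plain dictionary standing in for the result of task.agent_payload().

```python
import json
from pathlib import Path


def write_task_file(workspace: Path, payload: dict, task_file: str = "task.json") -> Path:
    """Write only the public payload into the agent workspace."""
    path = workspace / task_file
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def build_prompt(base_prompt: str, extraction_instructions: str) -> str:
    """Append extraction instructions to the CLI prompt, not to the task file."""
    return f"{base_prompt.rstrip()}\n\n{extraction_instructions}"
```

Keeping extraction instructions out of the file means the task file stays a plain mirror of the public resources, and the harness can change how it asks for the candidate without altering task data.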
Example configs in the repo
| File | Harness | Use case |
|---|---|---|
| benchmarks/code-completion-smoke/tester-codex.yaml | codex | Code completion smoke |
| benchmarks/terminal-task-smoke/tester-command.yaml | command | Terminal task smoke |
| benchmarks/terminal-bench-first10/tester-codex.yaml | codex | Full terminal bench subset |
Canonical reference: SecureBench repo docs/tester-yaml-standard.html.
Related reading
- Getting Started — CLI invocation
- Architecture — lifecycle and materialization
- Reference → Tester YAML — complete field reference