Getting started

Install SecureBench, run a benchmark pack, and inspect the result files.

Prerequisites

Python 3.11+
Docker for harness runs and verifier sandboxes
Provider API keys in a dotenv file when using Codex or Claude Code harnesses

Install

Run from the SecureBench repository root:


python3 -m venv .venv
.venv/bin/python -m pip install -e .

Check the CLI:


.venv/bin/securebench run --help

Configure environment variables

The CLI loads .env by default.

Codex harness:


OPENAI_API_KEY=sk-...

Claude Code harness:


ANTHROPIC_API_KEY=sk-ant-...

Use --env-file to load a different dotenv file.
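
Before starting a run, it can help to confirm that the dotenv file actually sets the key your harness needs. The following is a minimal sketch, not SecureBench's own loader: `read_dotenv`, `has_key`, and the harness labels `codex` and `claude-code` are hypothetical names chosen for this example, and the parser only handles plain `KEY=VALUE` lines.

```python
from pathlib import Path

# Key names come from this guide; the harness labels are hypothetical.
REQUIRED_KEYS = {
    "codex": "OPENAI_API_KEY",
    "claude-code": "ANTHROPIC_API_KEY",
}


def read_dotenv(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_key(path, harness):
    """Return True if the file sets a non-empty key for the given harness."""
    return bool(read_dotenv(path).get(REQUIRED_KEYS[harness]))
```

For example, `has_key(".env", "codex")` returns `False` when `OPENAI_API_KEY` is missing or empty, which is cheaper to discover than a failed harness run.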

Run a benchmark

Terminal task

This runs a Terminal-Bench task through the Codex harness and verifies the final workspace state.


.venv/bin/securebench run \
  --config benchmarks/terminal-bench/tester-codex.yaml \
  --limit 1

Repo patch

This runs an agentic repository-editing pack, extracts a git diff, and verifies it in the benchmark image.


.venv/bin/securebench run \
  --config benchmarks/deep-swe/tester-codex.yaml \
  --limit 1

Included packs

Pack	Family	Harness	Notes
`benchmarks/terminal-bench/`	`terminal_task`	codex	Terminal-Bench subset
`benchmarks/swe-bench-verified/`	`repo_patch`	codex	SWE-bench Verified subset
`benchmarks/deep-swe/`	`repo_patch`	codex	Deep SWE-derived subset
`benchmarks/audit/terminal-task/`	`terminal_task`	audit tooling	Framework robustness probes
`benchmarks/audit/repo-patch/`	`repo_patch`	audit tooling	Repository-tamper probes

Each pack includes image build steps and pack-specific configuration.

CLI shape


securebench run --config PATH [options]

Option	Description
`--config`	Tester YAML path. Required.
`--env-file`	Dotenv file to load. Default: `.env`.
`--limit N`	Process the first `N` task rows.
`--output-dir PATH`	Override `run.output_dir` from tester YAML.
`--resume`	Skip task IDs already present in `candidates.jsonl`.
`--quiet`	Disable progress logs on stderr.
`--show-command-output`	Include sandbox stdout/stderr snippets in progress output.
`--show-agent-output`	Stream provider CLI output and write `agent-trace.log`.
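
When driving the CLI from a script, the options above can be assembled into an argument list. This is a sketch under assumptions: `build_run_command` is a hypothetical helper, not part of SecureBench, and it covers only a subset of the flags listed in the table.

```python
def build_run_command(config, *, limit=None, output_dir=None, env_file=None,
                      resume=False, quiet=False,
                      executable=".venv/bin/securebench"):
    """Build an argv list for `securebench run` from documented options."""
    cmd = [executable, "run", "--config", str(config)]
    if env_file is not None:
        cmd += ["--env-file", str(env_file)]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    if resume:
        cmd.append("--resume")
    if quiet:
        cmd.append("--quiet")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a shell string avoids quoting problems with paths that contain spaces.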

Successful runs print:


securebench: run_id=... total=1 verification=complete verified=1 passed=1 output=...
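
If a wrapper script needs the numbers from this line, the `key=value` pairs can be split out as shown in the sketch below. `parse_summary` is a hypothetical helper, and it assumes that no value contains whitespace, so an output path with spaces would not parse correctly.

```python
def parse_summary(line):
    """Split a `securebench: key=value ...` summary line into a dict."""
    prefix = "securebench:"
    if not line.startswith(prefix):
        raise ValueError("not a SecureBench summary line")
    fields = {}
    for token in line[len(prefix):].split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens without '='
            fields[key] = value
    # Convert the count fields to integers when they look numeric.
    for name in ("total", "verified", "passed"):
        if name in fields and fields[name].isdigit():
            fields[name] = int(fields[name])
    return fields
```

A script could then compare `fields["passed"]` with `fields["total"]` to decide its own exit status.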

Output layout

Each run writes to run.output_dir:


runs/<run-id>/
├── candidates.jsonl
├── agent-trace.log          # optional, with --show-agent-output
└── workspaces/
    └── <task-dir>/

`candidates.jsonl`

Each line is one JSON object.

Field	Meaning
`run_id`	Tester YAML `run.id`
`task_id`	Benchmark row ID
`benchmark_id`	Manifest `id`
`task_type`	`repo_patch` or `terminal_task`
`verification_status`	Per task: `pending`, `passed`, or `failed`
`passed`	Boolean; set once verification has run
`score`	Usually `0.0` or `1.0`
`candidate_patch` / `candidate_workspace`	Extracted artifact
`producer_stdout`, `producer_stderr`	Harness output
`producer_metadata`	Harness and extraction metadata
`verifier_stdout`, `verifier_stderr`	Verifier sandbox output
`verifier_metadata`	Scoring details with hidden values redacted
`resource_summary`	Public values plus redacted non-public resource names
`hidden_values`	Always `"<redacted>"`

The CLI summary reports aggregate verification as complete, partial, or pending.
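
Because each line is a standalone JSON object, the file can be inspected with the standard library alone. The sketch below tallies `verification_status` across records. `load_candidates` and `status_counts` are hypothetical helper names, and the `"missing"` label for records without a status is an assumption of this example, not a SecureBench value.

```python
import json
from collections import Counter
from pathlib import Path


def load_candidates(path):
    """Read candidates.jsonl into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


def status_counts(records):
    """Count records per verification_status value."""
    return Counter(
        record.get("verification_status", "missing") for record in records
    )
```

For example, `status_counts(load_candidates("runs/<run-id>/candidates.jsonl"))` gives a per-status count that can be compared against the CLI summary line.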

Resume behavior

--resume reads the existing candidates.jsonl, keeps valid records, and skips completed task_id values. Resumed records are trusted as-is: SecureBench does not yet verify benchmark pack digests or sign result records.
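
The skip logic described above can be approximated as follows. This is an illustrative sketch, not SecureBench's implementation: the guide does not define what makes a record valid, so the example assumes that a valid record is a JSON object with a `task_id`, and it treats every task with such a record as done. `completed_task_ids` and `pending_tasks` are hypothetical names.

```python
import json
from pathlib import Path


def completed_task_ids(path):
    """Collect task_id values from records that parse as JSON objects."""
    path = Path(path)
    if not path.exists():
        return set()
    done = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: unparseable lines are not valid records
        if isinstance(record, dict) and record.get("task_id") is not None:
            done.add(record["task_id"])
    return done


def pending_tasks(task_ids, path):
    """Return the task IDs that still need to run, in their original order."""
    done = completed_task_ids(path)
    return [task_id for task_id in task_ids if task_id not in done]
```

Nothing here checks who wrote the file or whether it matches the current benchmark pack, which mirrors the trust caveat stated above.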

Run tests

Run from the SecureBench repository root:


.venv/bin/python -m pytest -q

Programmatic API


from securebench import run_tester_config
from securebench.tester_config import load_tester_config

# Load the tester YAML and run it with limit=1, as in the CLI examples above.
config = load_tester_config("benchmarks/terminal-bench/tester-codex.yaml")
summary = run_tester_config(config, limit=1)

See Reference: Python API.