Getting Started
This guide walks through installing SecureBench, running the included smoke benchmarks, and reading the outputs.
Prerequisites
- Python 3.11+
- Docker: required for all harness runs and for sandbox-backed verifiers (code_completion, repo_patch, terminal_task). Text-family verifiers run in-process on the host.
- For Codex or Claude Code harnesses: provider API keys in a dotenv file
Install
From the SecureBench repository root:
python3 -m venv .venv
.venv/bin/python -m pip install -e .
This installs the securebench CLI entry point and the PyYAML dependency.
Verify the CLI:
.venv/bin/securebench run --help
Configure environment variables
The CLI loads a dotenv file before each run (default: .env in the current working directory).
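Before a long run it can help to confirm that the dotenv file defines the key the chosen harness needs. The following is a minimal sketch, not SecureBench's own loader: `parse_dotenv` and `missing_keys` are hypothetical helpers, and the parser is assumed to need only plain `KEY=value` lines, `#` comments, and optional surrounding quotes.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments.

    Hypothetical helper for illustration only; the real loader may
    accept more syntax (export prefixes, multi-line values, and so on).
    """
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(text: str, required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty in the dotenv text."""
    values = parse_dotenv(text)
    return [key for key in required if not values.get(key)]
```

For a Codex harness run, `missing_keys(Path(".env").read_text(), ["OPENAI_API_KEY"])` (with `Path` from `pathlib`) should return an empty list when the key is set.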
For Codex harness runs:
OPENAI_API_KEY=sk-...
For Claude Code harness runs:
ANTHROPIC_API_KEY=sk-ant-...
Use --env-file to point at a different file:
.venv/bin/securebench run \
--config benchmarks/code-completion-smoke/tester-codex.yaml \
--env-file path/to/.env
Run a smoke benchmark
Code completion (Codex harness)
Runs candidate production through the Codex CLI inside a Docker container, extracts candidate.py, and verifies with hidden inline tests.
.venv/bin/securebench run \
--config benchmarks/code-completion-smoke/tester-codex.yaml \
--limit 1
Terminal task (command harness)
Runs a tester-provided shell command in the task image, then verifies the final workspace state.
.venv/bin/securebench run \
--config benchmarks/terminal-task-smoke/tester-command.yaml
Other included packs
| Pack | Harness | Notes |
|---|---|---|
| benchmarks/humaneval-mini/ | codex | Python code completion |
| benchmarks/mmlu-abstract-algebra-mini/ | codex | Multiple choice |
| benchmarks/squad-v1-mini/ | codex | Short answer |
| benchmarks/truthfulqa-generation-mini/ | codex | Free response |
| benchmarks/swe-bench-verified-codex-smoke/ | codex | Repo patch (build images; gold patches not in public assets) |
| benchmarks/terminal-bench-first10/ | codex | Terminal tasks (build Docker images first) |
See each pack’s README.md for image build steps and pack-specific configuration.
CLI options
securebench run --config PATH [options]
| Option | Description |
|---|---|
| --config | Path to tester YAML (required) |
| --env-file | Dotenv file to load (default: .env) |
| --limit N | Process only the first N task rows |
| --output-dir PATH | Override run.output_dir from tester YAML |
| --resume | Skip task IDs already present in candidates.jsonl |
| --quiet | Disable progress logs on stderr |
| --show-command-output | Include sandbox stdout/stderr snippets in progress |
| --show-agent-output | Stream container stdout/stderr during Docker commands (used for Codex JSONL); writes agent-trace.log |
On success, the CLI prints a one-line summary:
securebench: run_id=... total=1 verification=complete verified=1 passed=1 output=...
Output layout
Each run writes to the configured run.output_dir (paths in tester YAML are relative to the config file):
runs/code-completion-smoke-codex/
├── candidates.jsonl # one JSON record per task
├── agent-trace.log # optional, with --show-agent-output
└── workspaces/
    └── <task-dir>/ # sanitized task id (or id + hash suffix); cleared before each re-run
candidates.jsonl record fields
Each line is a JSON object. Common fields:
| Field | Meaning |
|---|---|
| run_id | From tester YAML run.id |
| task_id | Benchmark row ID |
| benchmark_id | Manifest id |
| task_type | Family name (e.g. code_completion) |
| verification_status | Per task: pending (no verifier), passed, or failed |
| passed | Boolean when verified |
| score | Typically 0.0 or 1.0 for binary families |
| candidate_text / candidate_patch / candidate_workspace | Extracted artifact |
| producer_stdout, producer_stderr | Harness output |
| producer_metadata | Harness-specific metadata |
| verifier_stdout, verifier_stderr | Verifier sandbox output |
| verifier_metadata | Scoring details (hidden values redacted) |
| resource_summary | Public values plus redacted non-public resource names |
| hidden_values | Always "<redacted>" |
When no verifier exists for a family, verification_status is "pending" and pass/score fields are omitted.
The CLI summary line uses a different, batch-level field, verification=complete|partial|pending, which reflects how many tasks had verifiers run across the whole batch (see Architecture).
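The per-task fields can be tallied with a few lines of Python. The sketch below is illustrative only: `summarize_candidates` is a hypothetical helper, and its mapping from per-task statuses to complete, partial, or pending is an assumption about how the batch-level value might be derived, not the framework's documented rule.

```python
import json
from collections import Counter


def summarize_candidates(lines):
    """Count per-task verification outcomes from candidates.jsonl lines."""
    statuses = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        # Records without the field are counted as pending (assumption).
        statuses[record.get("verification_status", "pending")] += 1

    total = sum(statuses.values())
    verified = statuses["passed"] + statuses["failed"]

    # ASSUMPTION: all tasks verified -> complete, some -> partial,
    # none -> pending. The real aggregation rule may differ.
    if total and verified == total:
        aggregate = "complete"
    elif verified:
        aggregate = "partial"
    else:
        aggregate = "pending"

    return {
        "total": total,
        "verified": verified,
        "passed": statuses["passed"],
        "verification": aggregate,
    }
```

An open file can be passed directly, for example `summarize_candidates(open("runs/code-completion-smoke-codex/candidates.jsonl"))`, since iterating a text file yields one line per record.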
Resume behavior
--resume re-reads existing candidates.jsonl, keeps valid records, and skips completed task_id values. Important: resumed records are trusted as-is; the framework does not yet verify benchmark pack digests or sign results (see Security Model).
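The skip logic can be pictured roughly as follows. This is a sketch, not the framework's code: `completed_task_ids` and `remaining_tasks` are hypothetical names, and a "valid" record is assumed here to mean a parseable JSON object with a string `task_id`, which the source does not spell out.

```python
import json


def completed_task_ids(lines):
    """Collect task_id values from parseable candidates.jsonl records."""
    done = set()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a blank or partially written final line
        # ASSUMPTION: a valid record is a JSON object with a string task_id.
        if isinstance(record, dict) and isinstance(record.get("task_id"), str):
            done.add(record["task_id"])
    return done


def remaining_tasks(task_ids, existing_lines):
    """Return the task IDs that still need to run, in their original order."""
    done = completed_task_ids(existing_lines)
    return [task_id for task_id in task_ids if task_id not in done]
```

Because nothing in this scheme checks where a record came from, an edited candidates.jsonl would be accepted unchanged, which is the trust caveat noted above.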
Run tests (development)
From the SecureBench repo:
.venv/bin/python -m pytest -q
Programmatic API
The public Python surface is intentionally small:
from securebench import run_tester_config
from securebench.tester_config import load_tester_config
config = load_tester_config("benchmarks/code-completion-smoke/tester-codex.yaml")
summary = run_tester_config(config, limit=1)
See Reference → Python API for exports and types.
Next steps
- Architecture — how components fit together
- Benchmark Packs — author task rows
- Tester Configuration — configure harnesses and verification policy