Getting Started

This guide walks through installing SecureBench, running the included smoke benchmarks, and reading their outputs.

Prerequisites

  • Python 3.11+
  • Docker — required for all harness runs and for sandbox-backed verifiers (code_completion, repo_patch, terminal_task). Text-family verifiers run in-process on the host.
  • For Codex or Claude Code harnesses: provider API keys in a dotenv file
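The prerequisites above can be checked from a script before a first run. The following is a minimal sketch, not part of SecureBench: the helper name `check_prerequisites` is hypothetical, and it only confirms that a `docker` executable is on `PATH`, not that the Docker daemon is running.

```python
import shutil
import sys


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration; SecureBench does not ship it.
    """
    problems = []
    # Compare only (major, minor) against the documented minimum of 3.11.
    if tuple(version_info[:2]) < (3, 11):
        problems.append("Python 3.11+ is required")
    # shutil.which returns None when the executable is not on PATH.
    if which("docker") is None:
        problems.append("docker executable not found on PATH")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the current interpreter and `PATH`; the parameters exist only so the checks can be exercised without Docker installed.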

Install

From the SecureBench repository root:

python3 -m venv .venv
.venv/bin/python -m pip install -e .

This installs the securebench CLI entry point and the PyYAML dependency.

Verify the CLI:

.venv/bin/securebench run --help

Configure environment variables

The CLI loads a dotenv file before each run (default: .env in the current working directory).

For Codex harness runs:

OPENAI_API_KEY=sk-...

For Claude Code harness runs:

ANTHROPIC_API_KEY=sk-ant-...

Use --env-file to point at a different file:

.venv/bin/securebench run \
  --config benchmarks/code-completion-smoke/tester-codex.yaml \
  --env-file path/to/.env
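To make the dotenv step concrete, here is a minimal sketch of what loading a file of `KEY=VALUE` lines can look like. It is an illustration under stated assumptions, not the SecureBench loader: the function names `parse_dotenv` and `load_env_file` are hypothetical, and the choice not to override variables that are already set is an assumption, since the source does not say how the CLI resolves conflicts.

```python
import os
from pathlib import Path


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_file(path=".env", environ=os.environ):
    """Copy parsed values into environ and return what the file contained.

    Assumption: keys already present in environ are left untouched.
    """
    path = Path(path)
    if not path.is_file():
        return {}
    loaded = parse_dotenv(path.read_text(encoding="utf-8"))
    for key, value in loaded.items():
        environ.setdefault(key, value)
    return loaded
```

A sketch like this is also handy for confirming that a key such as `OPENAI_API_KEY` is present in the file before starting a long run.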

Run a smoke benchmark

Code completion (Codex harness)

Produces a candidate by running the Codex CLI inside a Docker container, extracts candidate.py, and verifies it with hidden inline tests.

.venv/bin/securebench run \
  --config benchmarks/code-completion-smoke/tester-codex.yaml \
  --limit 1

Terminal task (command harness)

Runs a tester-provided shell command in the task image, then verifies the final workspace state.

.venv/bin/securebench run \
  --config benchmarks/terminal-task-smoke/tester-command.yaml

Other included packs

| Pack | Harness | Notes |
| --- | --- | --- |
| benchmarks/humaneval-mini/ | codex | Python code completion |
| benchmarks/mmlu-abstract-algebra-mini/ | codex | Multiple choice |
| benchmarks/squad-v1-mini/ | codex | Short answer |
| benchmarks/truthfulqa-generation-mini/ | codex | Free response |
| benchmarks/swe-bench-verified-codex-smoke/ | codex | Repo patch (build images; gold patches not in public assets) |
| benchmarks/terminal-bench-first10/ | codex | Terminal tasks (build Docker images first) |

See each pack’s README.md for image build steps and pack-specific configuration.

CLI options

securebench run --config PATH [options]

| Option | Description |
| --- | --- |
| --config | Path to tester YAML (required) |
| --env-file | Dotenv file to load (default: .env) |
| --limit N | Process only the first N task rows |
| --output-dir PATH | Override run.output_dir from tester YAML |
| --resume | Skip task IDs already present in candidates.jsonl |
| --quiet | Disable progress logs on stderr |
| --show-command-output | Include sandbox stdout/stderr snippets in progress |
| --show-agent-output | Stream container stdout/stderr during Docker commands (used for Codex JSONL); writes agent-trace.log |
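
When runs are launched from a script, the options listed above can be assembled into an argument list instead of a shell string. This is a minimal sketch: `build_run_command` is a hypothetical helper, not a SecureBench export, and it uses only flags documented above (the two `--show-*` flags are left out for brevity).

```python
def build_run_command(config, *, env_file=None, limit=None, output_dir=None,
                      resume=False, quiet=False,
                      executable=".venv/bin/securebench"):
    """Assemble an argv list for `securebench run` (hypothetical helper)."""
    argv = [executable, "run", "--config", str(config)]
    if env_file is not None:
        argv += ["--env-file", str(env_file)]
    if limit is not None:
        argv += ["--limit", str(int(limit))]
    if output_dir is not None:
        argv += ["--output-dir", str(output_dir)]
    if resume:
        argv.append("--resume")
    if quiet:
        argv.append("--quiet")
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.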

On success, the CLI prints a one-line summary:

securebench: run_id=... total=1 verification=complete verified=1 passed=1 output=...
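
Because the summary is a single line of `key=value` pairs, a wrapper script can parse it. The sketch below is illustrative only: `parse_summary_line` is a hypothetical name, and it assumes no value contains whitespace, so an output path with spaces would be split incorrectly.

```python
def parse_summary_line(line):
    """Split 'securebench: k=v k=v ...' into a dict of strings.

    Assumption: values contain no whitespace.
    """
    prefix = "securebench:"
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("not a securebench summary line")
    fields = {}
    for token in line[len(prefix):].split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields
```

All values are returned as strings; convert counts such as `total` or `passed` with `int()` where needed.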

Output layout

Each run writes to the configured run.output_dir (paths in tester YAML are relative to the config file):

runs/code-completion-smoke-codex/
├── candidates.jsonl      # one JSON record per task
├── agent-trace.log       # optional, with --show-agent-output
└── workspaces/
    └── <task-dir>/       # sanitized task id (or id + hash suffix); cleared before each re-run

candidates.jsonl record fields

Each line is a JSON object. Common fields:

| Field | Meaning |
| --- | --- |
| run_id | From tester YAML run.id |
| task_id | Benchmark row ID |
| benchmark_id | Manifest id |
| task_type | Family name (e.g. code_completion) |
| verification_status | Per task: pending (no verifier), passed, or failed |
| passed | Boolean when verified |
| score | Typically 0.0 or 1.0 for binary families |
| candidate_text / candidate_patch / candidate_workspace | Extracted artifact |
| producer_stdout, producer_stderr | Harness output |
| producer_metadata | Harness-specific metadata |
| verifier_stdout, verifier_stderr | Verifier sandbox output |
| verifier_metadata | Scoring details (hidden values redacted) |
| resource_summary | Public values plus redacted non-public resource names |
| hidden_values | Always "<redacted>" |

When no verifier exists for a family, verification_status is "pending" and the passed and score fields are omitted.
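
The record fields described above are enough to tally results directly from candidates.jsonl. A minimal sketch, assuming only the `verification_status` and `passed` fields documented here; `summarize_candidates` is a hypothetical helper, and the "unknown" bucket is an illustrative fallback for records that lack a status.

```python
import json
from collections import Counter


def summarize_candidates(lines):
    """Count verification_status values and passed records from JSONL lines."""
    statuses = Counter()
    passed = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        statuses[record.get("verification_status", "unknown")] += 1
        # "passed" is omitted for pending records, so a missing key counts as not passed.
        if record.get("passed") is True:
            passed += 1
    return {
        "total": sum(statuses.values()),
        "statuses": dict(statuses),
        "passed": passed,
    }
```

Any iterable of lines works, so an open file handle for `runs/code-completion-smoke-codex/candidates.jsonl` can be passed in directly.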

The CLI summary line reports a separate aggregate field, verification=complete|partial|pending, which reflects how many tasks in the batch had a verifier run (see Architecture).

Resume behavior

--resume re-reads existing candidates.jsonl, keeps valid records, and skips completed task_id values. Important: resumed records are trusted as-is; the framework does not yet verify benchmark pack digests or sign results (see Security Model).
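
The skip logic described above can be approximated as follows. This is a sketch of the idea, not the SecureBench implementation: the function names are hypothetical, and "valid" is assumed here to mean a parseable JSON object with a string `task_id`, since the source does not define its validity criteria.

```python
import json


def completed_task_ids(lines):
    """Collect task_id values from JSONL lines, ignoring unparseable ones."""
    done = set()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a blank line or a truncated line from an interrupted run
        if isinstance(record, dict) and isinstance(record.get("task_id"), str):
            done.add(record["task_id"])
    return done


def remaining_tasks(all_task_ids, lines):
    """Return task IDs, in their original order, that have no record yet."""
    done = completed_task_ids(lines)
    return [task_id for task_id in all_task_ids if task_id not in done]
```

Note that, like the behavior described above, this trusts whatever records are already on disk; it performs no integrity check on them.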

Run tests (development)

From the SecureBench repo:

.venv/bin/python -m pytest -q

Programmatic API

The public Python surface is intentionally small:

from securebench import run_tester_config
from securebench.tester_config import load_tester_config

config = load_tester_config("benchmarks/code-completion-smoke/tester-codex.yaml")
summary = run_tester_config(config, limit=1)

See Reference → Python API for exports and types.
