Troubleshooting

Use this page to map common SecureBench failures to the file or setting that usually causes them.

Installation

`securebench: command not found`

Install the package in editable mode from the repo root:


python3 -m pip install -e .

Or invoke the module:


python -m securebench.cli run --config ...

Python version errors

SecureBench requires Python 3.11 or later. Confirm the supported version range in pyproject.toml.
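
To confirm which interpreter you are actually running, a minimal sketch such as the following may help. The `meets_minimum` helper and the `MINIMUM` constant are illustrative names, not part of SecureBench, and the 3.11 floor is taken from this page rather than read from pyproject.toml.

```python
import sys

# Minimum version stated on this page; adjust if pyproject.toml says otherwise.
MINIMUM = (3, 11)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM):
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


if not meets_minimum():
    print(f"Python {sys.version.split()[0]} is older than {MINIMUM[0]}.{MINIMUM[1]}")
```

Run it with the same interpreter you use for `pip install`, since a virtual environment may point at a different Python than the system default.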

Configuration errors

`Unsupported tester schema_version`

Tester YAML must use:


schema_version: "0.2"

`benchmark row line N: invalid JSON`

Each tasks.jsonl line must be one complete JSON object. Check trailing commas and unescaped newlines.
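
To locate the offending line before rerunning, you can parse the file line by line. The sketch below is a standalone check, not SecureBench's own loader; `find_bad_rows` is a hypothetical helper, and skipping blank lines is an assumption about how the file should be treated.

```python
import json


def find_bad_rows(lines):
    """Return (line_number, message) pairs for lines that are not one JSON object."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped, not reported
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not a JSON object"))
    return problems


sample = [
    '{"task_id": "a"}',
    '{"task_id": "b",}',  # trailing comma
    '["not", "an", "object"]',
]
for number, message in find_bad_rows(sample):
    print(f"line {number}: {message}")
```

For a real pack, pass an open file instead of the sample list, for example `find_bad_rows(open("tasks.jsonl", encoding="utf-8"))`.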

`{context}.family is required`

Set defaults.family in manifest.yaml or set family on each row.
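
The rule above can be checked offline once the manifest defaults and rows are loaded as dictionaries. This is a sketch under stated assumptions: `rows_missing_family` is a hypothetical helper, and it only mirrors the requirement described here, not the compiler's actual validation.

```python
def rows_missing_family(rows, defaults):
    """Return 1-based row numbers that have no family and no default to fall back on."""
    if defaults.get("family"):
        return []  # a manifest-level default covers every row
    return [n for n, row in enumerate(rows, start=1) if not row.get("family")]


# Example: the second row has no family and the manifest sets no default.
print(rows_missing_family([{"family": "repo_patch"}, {}], {}))
```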

`container harness mode requires benchmark environment.image`

Add an image in manifest defaults or the row:


environment:
  image: python:3.11-slim

The image is required for command, codex, and claude_code harnesses and for most verifiers.

Missing API key

Codex harness:


ConfigError: ... OPENAI_API_KEY ...

Check that .env exists and that the provider API key is present in the host environment once .env has been loaded. Use --env-file to point at a non-default dotenv file. Do not put provider keys in harness.env: provider key names are filtered out there, and the agent container receives dummy keys in their place.
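
A quick way to see whether the key is visible to the process that launches SecureBench is to inspect the host environment directly. The sketch below assumes nothing about SecureBench internals; `missing_keys` is an illustrative helper, and it does not load .env itself, so export the file's values first if you rely on it.

```python
import os


def missing_keys(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# OPENAI_API_KEY is the key named in the Codex error above.
# Only the names of missing keys are printed, never their values.
print(missing_keys(["OPENAI_API_KEY"]))
```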

Docker failures

`Cannot connect to the Docker daemon`

Start Docker Desktop or the Docker service. Active harnesses require Docker.

Image not found

Build task images for packs such as terminal-bench:


for d in benchmarks/terminal-bench/docker/*; do
  task=$(basename "$d")
  docker build -t "securebench-terminal-bench-${task}:latest" "$d"
done

Image names must match environment.image in task rows.
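
To compare the names that task rows expect with the tags you have built, something like the following sketch may be useful. Both function names are hypothetical, and the sketch only looks at `environment.image` on each row, so images supplied through manifest defaults are not covered.

```python
import json


def images_in_rows(lines):
    """Collect every environment.image value found in JSONL task rows."""
    images = set()
    for line in lines:
        if not line.strip():
            continue
        row = json.loads(line)
        image = (row.get("environment") or {}).get("image")
        if image:
            images.add(image)
    return images


def unbuilt_images(lines, local_tags):
    """Return images named in rows that are absent from the local tag list."""
    return sorted(images_in_rows(lines) - set(local_tags))
```

The `local_tags` argument can be filled from the output of `docker image ls --format '{{.Repository}}:{{.Tag}}'`, one tag per line.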

`Failed to create egress network` or proxy errors

Egress setup runs for provider harnesses and for configs with allowed_domains. Check Docker network permissions and port conflicts.

The egress proxy connects only to allowlisted DNS names on supported ports. It rejects DNS answers that resolve to private, loopback, multicast, or otherwise non-public addresses. Use public provider or API domains. Private service access is unsupported.
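
If you suspect a domain resolves to a non-public address, you can classify the resolved IPs yourself. The sketch below approximates the categories listed above with Python's `ipaddress` module; it is not the proxy's actual rule set, and `is_public_address` is an illustrative name.

```python
import ipaddress


def is_public_address(text):
    """Return True when `text` is an IP literal outside the non-public ranges."""
    addr = ipaddress.ip_address(text)
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_multicast
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_unspecified
    )


for candidate in ["8.8.8.8", "10.0.0.5", "127.0.0.1", "224.0.0.1"]:
    print(candidate, is_public_address(candidate))
```

Resolve the hostname first (for example with `socket.getaddrinfo`) and pass each returned address to the helper. A domain that yields any non-public address is a likely cause of the proxy rejection.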

Harness and extraction failures

Repo-patch extraction produced an empty patch

repo_patch tasks collect a git diff from the task workdir. Confirm that:

- the harness changed files under environment.workdir
- repo and base_commit point at the checkout in the benchmark image
- the modified paths are not outside the repository

`candidate production timed out`

Increase harness.config.timeout_seconds or the row's environment.timeout_seconds.

Timeout records have failure_reason: producer_timeout and verification_status: failed.
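
To list which tasks timed out, you can filter the run's JSONL records on that field. This sketch assumes the records live in candidates.jsonl and that `task_id` and `failure_reason` are top-level keys; `timed_out_tasks` is a hypothetical helper.

```python
import json


def timed_out_tasks(lines):
    """Return task_id values for records whose failure_reason is producer_timeout."""
    ids = []
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("failure_reason") == "producer_timeout":
            ids.append(record.get("task_id"))
    return ids
```

For example, `timed_out_tasks(open("runs/my-run/candidates.jsonl", encoding="utf-8"))` returns the IDs to retry with a longer timeout.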

`harness command failed before producing a workspace candidate`

For terminal_task, workspace extraction requires harness exit code 0. Check producer_stderr in candidates.jsonl.

Verifier failures

`Unknown benchmark family`

The active runtime supports only repo_patch and terminal_task. Removed families such as multiple_choice, short_answer, free_response, and code_completion fail validation.

Dangerous command denied


dangerous verifier commands are disabled by tester policy

Terminal tasks with eval.needed_commands require:


verification:
  disallow_dangerous_commands: false

Remove needed_commands from the benchmark if the checker does not require the command.

`terminal_task task ... requires a produced workspace directory`

The terminal verifier needs a directory path from workspace extraction. Ensure the harness completes successfully and the task family is terminal_task.

Repo patch path policy denial

The candidate patch touched a denied path such as tests, lockfiles, CI, or SecureBench-owned files. Narrow edits to implementation paths or set eval.candidate_policy.allow_paths and allow_sensitive_paths in the row.

`repo_patch task ... requires tests.command as a non-empty string array`

Repo-patch commands must be argv arrays. Replace shell strings like "pytest -q" with:


["python", "-m", "pytest", "-q"]

If a shell is required, make it explicit:


["bash", "-lc", "pytest -q && python check.py"]
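
When converting many rows, the standard library's `shlex.split` can do the tokenizing. The sketch below is a heuristic, not SecureBench tooling: `to_argv` and `SHELL_OPERATORS` are illustrative names, and operators are only detected when they are separated by whitespace.

```python
import shlex

# Tokens that only have meaning inside a shell.
SHELL_OPERATORS = {"&&", "||", "|", ";", ">", ">>", "<"}


def to_argv(command):
    """Convert a shell-style string to an argv list, wrapping in bash when needed."""
    tokens = shlex.split(command)
    if any(token in SHELL_OPERATORS for token in tokens):
        # Operators only work inside a shell, so keep the string intact.
        return ["bash", "-lc", command]
    return tokens


print(to_argv("pytest -q"))
print(to_argv("pytest -q && python check.py"))
```

Review the result before committing it. The helper keeps the executable name as written, so `pytest -q` becomes `["pytest", "-q"]` rather than the `python -m pytest` form shown above.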

Repo patch base commit mismatch

The repo-patch verifier checks that HEAD in the benchmark image matches input.base_commit. Rebuild the image from the expected checkout, or update the task row to the commit baked into the image.

Resume and output

Duplicate or partial results with `--resume`

Resume trusts existing JSONL lines with valid task_id strings. It does not validate records against pack digests.

Delete candidates.jsonl or the output directory for a clean rerun.
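
Before deleting anything, you may want to see which task IDs are duplicated. The following sketch scans JSONL lines and skips any that fail to parse, which is how a partially written line would show up; `duplicate_task_ids` is a hypothetical helper and assumes `task_id` is a top-level string field.

```python
import json
from collections import Counter


def duplicate_task_ids(lines):
    """Return task_id values that appear on more than one parseable JSONL line."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank or partial line, e.g. from an interrupted write
        task_id = record.get("task_id") if isinstance(record, dict) else None
        if isinstance(task_id, str):
            counts[task_id] += 1
    return sorted(t for t, n in counts.items() if n > 1)
```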

Empty `candidates.jsonl`

Check for --limit 0, which is invalid; use a positive limit or omit the flag. Also check for compile errors raised before the task loop starts.

Progress and debugging

Goal	Flag
See sandbox I/O	`--show-command-output`
See Codex agent stream	`--show-agent-output`
Silence stderr progress	`--quiet`

Inspect full records:


jq . runs/my-run/candidates.jsonl

Test matching code paths

Run targeted tests that match the failing area:


pytest tests/test_terminal_task_verifier.py -q
pytest tests/test_benchmark_compiler.py -q
pytest tests/test_smoke_benchmark_packs.py -q

Known unresolved areas

See Security model for:

- result signing and pack digests
- expanded malicious audit packs for repository and terminal workflows

Before filing a bug, check whether the behavior is documented as implemented or only as planned.
