Troubleshooting

Common failures when installing, configuring, and running SecureBench.

Installation

securebench: command not found

Install the package in editable mode from the repo root:

python3 -m pip install -e .

Or invoke via module:

python -m securebench.cli run --config ...

Python version errors

SecureBench requires Python 3.11+ (pyproject.toml).
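To confirm which interpreter you are actually running before installing, a check along these lines can help. This is a minimal sketch: `meets_minimum` is a hypothetical helper written for this page, not part of SecureBench.

```python
import sys

MINIMUM = (3, 11)  # the minimum version stated above


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if a (major, minor, ...) version tuple satisfies the minimum."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version[:2]) >= tuple(minimum)


print(
    f"Python {sys.version_info.major}.{sys.version_info.minor}:",
    "ok" if meets_minimum() else "too old for SecureBench",
)
```

Run it with the same interpreter you pass to `-m pip`, since `python` and `python3` can point at different installations.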

Configuration errors

Unsupported tester schema_version

Tester YAML must use schema_version: "0.2".

benchmark row line N: invalid JSON

Each line in tasks.jsonl must be one complete JSON object. Check trailing commas and unescaped newlines.
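To locate the offending rows quickly, you can scan the file yourself. The sketch below uses only the standard library; `find_bad_rows` is a hypothetical helper, and treating blank lines as errors is an assumption rather than documented SecureBench behavior.

```python
import json


def find_bad_rows(lines):
    """Return (line_number, reason) pairs for rows that are not one JSON object."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            # Assumption: a blank row is reported rather than silently skipped.
            problems.append((number, "blank line"))
            continue
        try:
            row = json.loads(text)
        except json.JSONDecodeError as exc:
            # Trailing commas and truncated objects both end up here.
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "not a JSON object"))
    return problems
```

Pass it an open file handle, for example `find_bad_rows(open("tasks.jsonl", encoding="utf-8"))`, and fix the reported line numbers one at a time.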

{context}.family is required

Set defaults.family in manifest.yaml or family on each row.

container harness mode requires benchmark environment.image

Add to manifest defaults or row:

environment:
  image: python:3.11-slim

Required for command, codex, and claude_code harnesses and most verifiers.

Missing API key

Codex harness:

ConfigError: ... OPENAI_API_KEY ...

Ensure that .env exists, that the variable is set in it, and that the variable name is listed in harness.env. Use --env-file if needed.
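A quick way to see which variables have no value is sketched below. It assumes a simple `KEY=VALUE` file format; how SecureBench itself parses .env is not specified here, and both helper names are hypothetical. It does not check harness.env in the tester YAML, so verify that separately.

```python
import os


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(required, env_file_text="", environ=None):
    """Return names from `required` that are unset or empty in both sources."""
    environ = os.environ if environ is None else environ
    file_values = parse_env_file(env_file_text)
    return [
        name for name in required
        if not (environ.get(name) or file_values.get(name))
    ]
```

For example, `missing_keys(["OPENAI_API_KEY"], open(".env").read())` returns an empty list when the key has a value in either the file or the current environment.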

Docker failures

Cannot connect to the Docker daemon

Start Docker Desktop or the Docker service. All active harnesses require Docker.
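If you want to check daemon reachability from a script before starting a run, something like the following should work. It is a sketch that shells out to `docker info` and assumes a non-zero exit code means the daemon could not be reached; `docker_daemon_reachable` is a hypothetical helper.

```python
import subprocess


def docker_daemon_reachable(command=("docker", "info"), timeout=15):
    """Return True if the command exits 0, i.e. the CLI reached the daemon."""
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The docker CLI is missing, not executable, or the daemon is hanging.
        return False
    return result.returncode == 0
```

A `False` result does not distinguish a missing CLI from a stopped daemon; run `docker info` by hand to see the actual error message.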

Image not found

Build task images for packs like terminal-bench-first10:

for d in benchmarks/terminal-bench-first10/docker/*; do
  task=$(basename "$d")
  docker build -t "securebench-terminal-bench-first10-${task}:latest" "$d"
done

Image names must match environment.image in task rows.
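To find which images a pack expects but you have not built, you can compare the task rows against your local tags. The sketch below assumes each row stores the image under a nested `environment` object with an `image` key and that a manifest default may apply; both helper names are hypothetical.

```python
import json


def required_images(jsonl_lines, default_image=None):
    """Collect environment.image per row, falling back to a manifest default."""
    images = set()
    for line in jsonl_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        # Assumption: the image lives at row["environment"]["image"].
        image = (row.get("environment") or {}).get("image") or default_image
        if image:
            images.add(image)
    return images


def missing_images(required, local_tags):
    """Return required image names that are not among the locally built tags."""
    return sorted(set(required) - set(local_tags))
```

The local tags can be gathered with `docker image ls --format '{{.Repository}}:{{.Tag}}'`, one tag per output line.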

Failed to create egress network / proxy errors

Egress setup runs when provider harnesses or allowed_domains are configured. Check that you have permission to create Docker networks and that no port conflicts exist.

Harness and extraction failures

harness did not produce expected candidate file: candidate.py

Code completion tasks expect the agent to write candidate.py in the workspace (default extraction). The Codex harness tells the agent to do this through the extraction instructions in task.json.

Fix: verify that the model follows the file output instructions, or use the command harness with explicit file creation.

candidate production timed out

Increase harness.config.timeout_seconds or row environment.timeout_seconds.

Timeout records have failure_reason: producer_timeout and verification_status: failed.
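To see how many tasks timed out and which ones, you can tally the `failure_reason` field across the results file. This is a sketch using the field names mentioned above; `summarize_failures` is a hypothetical helper, and any other fields your records contain are ignored.

```python
import json
from collections import Counter


def summarize_failures(jsonl_lines):
    """Return (counts per failure_reason, task_ids that hit producer_timeout)."""
    counts = Counter()
    timed_out = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        reason = record.get("failure_reason")
        if reason:
            counts[reason] += 1
        if reason == "producer_timeout":
            timed_out.append(record.get("task_id"))
    return counts, timed_out
```

If the same few task IDs time out on every run, raising the row-level timeout for those tasks is usually less wasteful than raising the harness-wide value.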

harness command failed before producing a workspace candidate

For terminal_task, workspace extraction requires harness exit code 0. Check producer_stderr in JSONL.

Verifier failures

verification_status: pending

No verifier is registered for the task family. Implement and register a verifier, or use an active family.

Deferred families (tool_call, browser_task, etc.) always remain pending until they are implemented.

Dangerous command denied

dangerous verifier commands are disabled by tester policy

Terminal tasks with eval.needed_commands require:

verification:
  disallow_dangerous_commands: false

Or remove needed_commands from the benchmark if possible.

terminal_task task ... requires a produced workspace directory

The terminal verifier needs a directory path from workspace extraction. Ensure the harness completes successfully and the task family is terminal_task.

Repo patch path policy denial

Candidate patch touched a denied path (tests, lockfiles, etc.). Narrow edits to allowed implementation paths or set tests.candidate_policy.allow_paths / allow_sensitive_paths in the row.

See repo docs/SECURITY.md repo_patch section.
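When it is unclear which file triggered the denial, listing the paths a patch touches can narrow it down. The sketch below reads the `---`/`+++` headers of a git-style unified diff; the helper names and the example patterns are hypothetical, and the real deny rules are whatever docs/SECURITY.md and your row policy define.

```python
from fnmatch import fnmatch


def touched_paths(patch_text):
    """List file paths named in the ---/+++ headers of a git-style unified diff."""
    paths = []
    for line in patch_text.splitlines():
        # Limitation: a removed line whose content starts with "-- " also
        # renders as "--- ..." and would be misread as a header.
        if line.startswith(("--- ", "+++ ")):
            path = line[4:].split("\t", 1)[0].strip()
            if path == "/dev/null":
                continue  # marks an added or deleted file, not a real path
            if path.startswith(("a/", "b/")):
                path = path[2:]
            if path not in paths:
                paths.append(path)
    return paths


def matching(paths, patterns):
    """Return paths that match any of the caller-supplied glob patterns."""
    return [p for p in paths if any(fnmatch(p, pat) for pat in patterns)]
```

For instance, `matching(touched_paths(patch), ["tests/*", "*.lock"])` lists touched files under tests/ or ending in .lock, using illustrative patterns rather than SecureBench's actual policy.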

Code completion import failures

Check verifier_stderr in JSONL. Candidate code runs under a supervisor with restored builtins; some attacks may still pass. Treat this as a known limitation.

Resume and output

Duplicate or partial results with --resume

Resume trusts existing JSONL lines with valid task_id strings. Corrupted or manually edited records are not validated against pack digests.

Delete candidates.jsonl or the output dir for a clean rerun.
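Before deciding whether to delete the file, you can check it for repeated task IDs and for lines that resume would not recognize. This is a sketch based on the behavior described above (resume keys on a string `task_id`); `resume_report` is a hypothetical helper and does not validate anything against pack digests.

```python
import json
from collections import Counter


def resume_report(jsonl_lines):
    """Return (duplicate_task_ids, suspect_line_numbers) for a results file."""
    seen = Counter()
    suspect = []
    for number, line in enumerate(jsonl_lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            suspect.append(number)  # e.g. a partial write from an interrupted run
            continue
        task_id = record.get("task_id") if isinstance(record, dict) else None
        if isinstance(task_id, str) and task_id:
            seen[task_id] += 1
        else:
            suspect.append(number)  # missing or non-string task_id
    duplicates = sorted(t for t, n in seen.items() if n > 1)
    return duplicates, suspect
```

If either list is non-empty, a clean rerun is the safer choice over hand-editing the file.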

Empty candidates.jsonl

Check whether --limit 0 was passed; it is not valid, so use a positive limit or omit the flag. Also check for compile errors raised before the task loop.

Progress and debugging

Goal                      Flag
See sandbox I/O           --show-command-output
See Codex agent stream    --show-agent-output
Silence stderr progress   --quiet

Inspect full records:

jq . runs/my-run/candidates.jsonl

Getting help from tests

Run targeted tests mirroring your scenario:

pytest tests/test_terminal_task_verifier.py -q
pytest tests/test_benchmark_compiler.py -q
pytest tests/test_smoke_benchmark_packs.py -q

Known unresolved areas

See SecureBench repo docs/next.md and Security Model for:

  • Provider API key isolation
  • Result signing and pack digests
  • Stronger free-response scoring
  • Repo patch post-apply integrity checks

If you hit an issue not covered here, check whether behavior is documented as planned vs implemented before filing a bug.
