Troubleshooting
Common failures when installing, configuring, and running SecureBench.
Installation
securebench: command not found
Install the package in editable mode from the repo root:
python3 -m pip install -e .
Or invoke via module:
python -m securebench.cli run --config ...
Python version errors
SecureBench requires Python 3.11+ (pyproject.toml).
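To confirm which interpreter is actually running, a small check like the following can help. This is a minimal sketch; the helper name `meets_minimum` is hypothetical and not part of SecureBench.

```python
import sys
from typing import Sequence, Tuple


def meets_minimum(version: Sequence[int] = sys.version_info,
                  minimum: Tuple[int, int] = (3, 11)) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version[:2]) >= minimum


# Usage: report the running interpreter and whether it satisfies 3.11+.
print(sys.version.split()[0], "ok" if meets_minimum() else "too old")
```

Run it with the same `python` or `python3` binary used for installation, since the two can point at different interpreters.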
Configuration errors
Unsupported tester schema_version
Tester YAML must use schema_version: "0.2".
benchmark row line N: invalid JSON
Each line in tasks.jsonl must be one complete JSON object. Check trailing commas and unescaped newlines.
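To locate the offending rows before running SecureBench, the file can be scanned line by line. The sketch below is not SecureBench's own parser; the function name is hypothetical, and treating blank lines as errors is an assumption that may be stricter than the real compiler.

```python
import json
from typing import Iterable, List, Tuple


def find_invalid_jsonl_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, reason) for each line that is not one JSON object."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            # Assumption: blank lines are reported rather than skipped.
            problems.append((number, "blank line"))
            continue
        try:
            value = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg} (column {exc.colno})"))
            continue
        if not isinstance(value, dict):
            problems.append((number, "valid JSON, but not an object"))
    return problems


# Usage (path is an example):
# with open("tasks.jsonl", encoding="utf-8") as fh:
#     for number, reason in find_invalid_jsonl_lines(fh):
#         print(f"line {number}: {reason}")
```

An unescaped newline inside a string splits one row across two physical lines, so it typically shows up as two consecutive invalid lines.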
{context}.family is required
Set defaults.family in manifest.yaml or family on each row.
container harness mode requires benchmark environment.image
Add to manifest defaults or row:
environment:
  image: python:3.11-slim
Required for command, codex, and claude_code harnesses and most verifiers.
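Both the family and the image errors come down to a value that is set neither on the row nor in the manifest defaults. The sketch below checks already-parsed rows for that condition. It assumes a row value takes precedence over the default and that defaults mirror the row layout (`family`, `environment.image`); neither detail is confirmed here, and the function names are hypothetical. The image check only matters in container harness mode.

```python
from typing import Any, Dict, List, Optional


def resolve(row: Dict[str, Any], defaults: Dict[str, Any], *path: str) -> Optional[Any]:
    """Look up a nested key on the row first, then in the manifest defaults."""
    for source in (row, defaults):
        value: Any = source
        for key in path:
            if not isinstance(value, dict) or key not in value:
                value = None
                break
            value = value[key]
        if value is not None:
            return value
    return None


def missing_required(rows: List[Dict[str, Any]], defaults: Dict[str, Any]) -> List[str]:
    """List rows that resolve to no family or no environment.image."""
    problems = []
    for number, row in enumerate(rows, start=1):
        if not resolve(row, defaults, "family"):
            problems.append(f"row {number}: family is missing")
        if not resolve(row, defaults, "environment", "image"):
            problems.append(f"row {number}: environment.image is missing")
    return problems
```

Pass the parsed rows from tasks.jsonl and the `defaults` mapping loaded from manifest.yaml; an empty result means every row resolves both values.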
Missing API key
Codex harness:
ConfigError: ... OPENAI_API_KEY ...
Ensure that .env exists, that the variable is set, and that its name appears in harness.env. Use --env-file if needed.
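The three conditions above can be checked in one pass. This is a rough sketch under several assumptions: the env file uses plain `KEY=VALUE` lines, `harness.env` is a list of variable names, and a value in the process environment is as good as one in the file. How SecureBench itself loads these is not described here, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def api_key_problems(name: str, dotenv_text: str,
                     harness_env: Iterable[str],
                     environ: Mapping[str, str]) -> List[str]:
    """Return human-readable reasons why `name` may not reach the harness."""
    problems = []
    value = environ.get(name) or parse_dotenv(dotenv_text).get(name)
    if not value:
        problems.append(f"{name} is not set in the environment or the env file")
    if name not in set(harness_env):
        problems.append(f"{name} is not listed in harness.env")
    return problems
```

For a real check, pass `os.environ`, the contents of the env file, and the `harness.env` entries from the tester YAML.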
Docker failures
Cannot connect to the Docker daemon
Start Docker Desktop or the Docker service. All active harnesses require Docker.
Image not found
Build task images for packs like terminal-bench-first10:
for d in benchmarks/terminal-bench-first10/docker/*; do
  task=$(basename "$d")
  docker build -t "securebench-terminal-bench-first10-${task}:latest" "$d"
done
Image names must match environment.image in task rows.
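A mismatch between built tags and task rows can be spotted without starting a run. The sketch below rebuilds the tag pattern from the loop above (`securebench-<pack>-<task>:latest`) and compares it with each row's `environment.image`. It assumes rows carry the image at that key and does not consult manifest defaults or the local Docker image list; the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Optional, Set, Tuple


def expected_image_tags(task_names: Iterable[str], pack: str) -> Set[str]:
    """Tags the build loop would produce for the given docker subdirectories."""
    return {f"securebench-{pack}-{task}:latest" for task in task_names}


def unmatched_rows(rows: Iterable[Dict[str, Any]],
                   built_tags: Set[str]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (task_id, image) for rows whose image is not among the built tags."""
    unmatched = []
    for row in rows:
        image = (row.get("environment") or {}).get("image")
        if image not in built_tags:
            unmatched.append((row.get("task_id"), image))
    return unmatched
```

Feed `expected_image_tags` the directory names under the pack's docker folder, then pass the parsed task rows to `unmatched_rows`; any pair it returns points at a row whose image was never built under that name.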
Failed to create egress network / proxy errors
Egress setup runs when provider harnesses or allowed_domains are configured. Check Docker network permissions and confirm that there are no port conflicts.
Harness and extraction failures
harness did not produce expected candidate file: candidate.py
Code completion tasks expect the agent to write candidate.py in the workspace (default extraction). The Codex harness instructs the agent to do this through the extraction instructions in task.json.
Fix: verify that the model follows the file output instructions, or use the command harness with explicit file creation.
candidate production timed out
Increase harness.config.timeout_seconds or row environment.timeout_seconds.
Timeout records have failure_reason: producer_timeout and verification_status: failed.
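To see how many records failed and why, the output JSONL can be tallied by failure_reason. This sketch assumes each record is one JSON object per line with failure_reason and verification_status as top-level keys, as the timeout records described above suggest; the function name and the "unknown" bucket are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_failures(lines: Iterable[str]) -> Counter:
    """Count failed records by failure_reason."""
    counts: Counter = Counter()
    for raw in lines:
        text = raw.strip()
        if not text:
            continue
        record = json.loads(text)
        if record.get("verification_status") == "failed":
            # Records without a reason are grouped under "unknown".
            counts[record.get("failure_reason") or "unknown"] += 1
    return counts


# Usage (path is an example):
# with open("runs/my-run/candidates.jsonl", encoding="utf-8") as fh:
#     print(summarize_failures(fh).most_common())
```

A large producer_timeout count points at the timeout settings rather than at individual tasks.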
harness command failed before producing a workspace candidate
For terminal_task, workspace extraction requires harness exit code 0. Check producer_stderr in JSONL.
Verifier failures
verification_status: pending
No verifier is registered for the task family. Implement and register a verifier, or use an active family.
Deferred families (tool_call, browser_task, etc.) always remain pending until they are implemented.
Dangerous command denied
dangerous verifier commands are disabled by tester policy
Terminal tasks with eval.needed_commands require:
verification:
  disallow_dangerous_commands: false
Or remove needed_commands from the benchmark if possible.
terminal_task task ... requires a produced workspace directory
The terminal verifier needs a directory path from workspace extraction. Ensure that the harness completes successfully and that the task family is terminal_task.
Repo patch path policy denial
Candidate patch touched a denied path (tests, lockfiles, etc.). Narrow edits to allowed implementation paths or set tests.candidate_policy.allow_paths / allow_sensitive_paths in the row.
See the repo_patch section of docs/SECURITY.md in the repo.
Code completion import failures
Check verifier_stderr in JSONL. Candidate code runs under a supervisor with restored builtins; some attacks may still pass. Treat this as a known limitation.
Resume and output
Duplicate or partial results with --resume
Resume trusts existing JSONL lines with valid task_id strings. Corrupted or manually edited records are not validated against pack digests.
Delete candidates.jsonl or the output dir for a clean rerun.
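Before deleting anything, it can be useful to see whether the file really contains duplicates or truncated lines. The sketch below reports both. It is not how --resume itself reads the file; the function name is hypothetical, and it assumes task_id is a top-level string key on each record.

```python
import json
from collections import Counter
from typing import Iterable, List, Tuple


def inspect_results(lines: Iterable[str]) -> Tuple[List[str], List[int]]:
    """Return (duplicated task_ids, line numbers that are not usable records)."""
    counts: Counter = Counter()
    bad_lines = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            # Typically a partial line left by an interrupted run.
            bad_lines.append(number)
            continue
        task_id = record.get("task_id") if isinstance(record, dict) else None
        if isinstance(task_id, str):
            counts[task_id] += 1
        else:
            bad_lines.append(number)
    duplicates = sorted(t for t, n in counts.items() if n > 1)
    return duplicates, bad_lines
```

If either list is non-empty, a clean rerun is the safer choice, since resume does not validate existing records.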
Empty candidates.jsonl
Check whether --limit 0 was passed; it is not valid, so use a positive limit or omit the flag. Also check for compile errors raised before the task loop.
Progress and debugging
| Goal | Flag |
|---|---|
| See sandbox I/O | --show-command-output |
| See Codex agent stream | --show-agent-output |
| Silence stderr progress | --quiet |
Inspect full records:
jq . runs/my-run/candidates.jsonl
Getting help from tests
Run targeted tests mirroring your scenario:
pytest tests/test_terminal_task_verifier.py -q
pytest tests/test_benchmark_compiler.py -q
pytest tests/test_smoke_benchmark_packs.py -q
Known unresolved areas
See docs/next.md in the SecureBench repo and the Security Model for:
- Provider API key isolation
- Result signing and pack digests
- Stronger free-response scoring
- Repo patch post-apply integrity checks
If you hit an issue not covered here, check whether the behavior is documented as planned or as implemented before filing a bug.