Architecture
SecureBench is organized around a benchmark pack execution pipeline: load rows, compile tasks with visibility lanes, produce candidates in a harness sandbox, verify in a separate trusted phase, and write JSONL results.
Major components
┌─────────────────────────────────────────────────────────────────┐
│ Tester YAML │
│ (run id, output dir, benchmark paths, harness, verification) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ benchmark_pack.py Load manifest.yaml + tasks.jsonl │
│ families/registry.py Validate active family row schemas │
│ benchmark_compiler.py Compile rows → SecureBenchTask │
└────────────────────────────┬────────────────────────────────────┘
│
┌──────────────┴──────────────┐
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ Harness (producer) │ │ Verifier (post-hoc) │
│ harnesses/* │ │ verifiers/* │
│ candidates/extraction │ │ sandboxes/* │
│ workspaces/materialization│ │ workspaces/materialization│
└────────────┬─────────────┘ └────────────┬─────────────┘
│ │
▼ ▼
CandidateArtifact VerificationResult
│ │
└──────────────┬────────────────┘
▼
tester_run.py → candidates.jsonl
Package layout (active code)
| Package / module | Responsibility |
|---|---|
| benchmark_pack.py | Manifest and JSONL row loading |
| benchmark_compiler.py | Row → SecureBenchTask; eval field visibility mapping |
| tasks.py | Normalized task model; agent_payload(), component views |
| resources.py | Resource, ResourceBundle, visibility and redaction |
| tester_config.py | Parse tester YAML schema 0.2 |
| tester_run.py | End-to-end orchestration, JSONL serialization |
| cli.py | securebench run command |
| families/ | Family contracts and row schema validators |
| harnesses/ | Candidate producers (command, codex, claude_code) |
| candidates/ | CandidateArtifact, extraction from stdout/files/git/workspace |
| verifiers/ | Family verifiers and registry |
| sandboxes/ | HostSandbox, DockerSandbox, PolicySandbox |
| workspaces/ | Resource materialization and path policy |
| dangerous_commands.py | Verifier-only dangerous command policy |
| progress.py | CLI progress events |
Resource visibility model
Every compiled task carries a ResourceBundle. Each resource has a visibility:
| Visibility | Agent | Test sandbox | Evaluator | Result summary |
|---|---|---|---|---|
| public | ✓ | ✓ | ✓ | value shown |
| evaluation_inputs | ✗ | ✓ | ✗ | redacted |
| hidden | ✗ | ✗ | ✓ | redacted |
Note: the evaluator component view includes public and hidden only — not evaluation_inputs. Verifiers that need tests or checkers on disk materialize with component test_sandbox (which includes evaluation_inputs). Methods on SecureBenchTask:
- agent_payload() → agent view
- evaluation_payload() → test_sandbox view
- hidden_payload() → evaluator view (no evaluation_inputs)
Agent rule: harnesses must use task.agent_payload() (alias of public_payload()), not the full task object.
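The component views in the table can be sketched as a simple lane filter. This is a minimal illustration, not the code in resources.py or tasks.py: the Resource dataclass, COMPONENT_LANES, and component_view() below are hypothetical stand-ins, and only the lane-to-component mapping is taken from the table.

```python
from dataclasses import dataclass

# Visibility lanes each component may read, copied from the table above.
COMPONENT_LANES = {
    "agent": {"public"},
    "test_sandbox": {"public", "evaluation_inputs"},
    "evaluator": {"public", "hidden"},
}


@dataclass(frozen=True)
class Resource:
    """Simplified stand-in for a resource (hypothetical, not the real class)."""
    name: str
    visibility: str
    value: object


def component_view(resources, component):
    """Return {name: value} for the resources a component is allowed to see."""
    lanes = COMPONENT_LANES[component]
    return {r.name: r.value for r in resources if r.visibility in lanes}


resources = [
    Resource("prompt", "public", "Fix the bug"),
    Resource("tests", "evaluation_inputs", "test_fix.py"),
    Resource("expected_answer", "hidden", "42"),
]
agent_view = component_view(resources, "agent")
sandbox_view = component_view(resources, "test_sandbox")
evaluator_view = component_view(resources, "evaluator")
```

In this sketch the agent view contains only the prompt, which is the property the agent rule is meant to protect.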
Compiler rule: benchmark_compiler.EVAL_VISIBILITY maps each family’s eval.* fields to a visibility lane. Unknown eval keys default to hidden.
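The fail-closed default can be illustrated with a small lookup. The family name, the keys, and the helper lane_for_eval_key() are invented for this sketch; the actual contents of EVAL_VISIBILITY are not shown in this document. Treating an unknown family as hidden is also an assumption here.

```python
# Hypothetical mapping for one made-up family. The real table lives in
# benchmark_compiler.EVAL_VISIBILITY and may look different.
EXAMPLE_EVAL_VISIBILITY = {
    "example_family": {
        "tests": "evaluation_inputs",
        "expected_answer": "hidden",
    }
}


def lane_for_eval_key(family, key, table=EXAMPLE_EVAL_VISIBILITY):
    """Look up the lane for eval.<key>; anything unmapped falls back to hidden."""
    return table.get(family, {}).get(key, "hidden")


known = lane_for_eval_key("example_family", "tests")
unknown = lane_for_eval_key("example_family", "new_field")
```

Defaulting to the most restrictive lane means a newly added eval field cannot reach the agent or the test sandbox until someone maps it explicitly.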
End-to-end task lifecycle
- Load config: load_tester_config() parses tester YAML; load_env_file() loads secrets.
- Load pack: load_benchmark_pack(manifest, tasks) reads manifest defaults and JSONL rows.
- Compile tasks: compile_benchmark_pack() validates family schemas and builds SecureBenchTask objects.
- Build harness: build_harness_producer() selects CommandHarnessProducer, CodexHarnessProducer, or ClaudeCodeHarnessProducer.
- For each task (respecting --limit and --resume):
  - Emit progress: task_start
  - Reset workspace: delete workspaces/<task-dir>/ from prior runs
  - Produce candidate
    - Materialize agent-visible resources into host workspace (VisibilityAwareMaterializer, component agent)
    - Write task.json with public payload
    - Run harness command in DockerSandbox (task image, optional egress proxy)
    - Extract candidate via family-specific mode (stdout, file, git diff, workspace)
  - Verify candidate (if verifier registered)
    - Pass verification_policy from tester YAML
    - Verifier materializes trusted inputs and runs checks in a fresh sandbox
  - Write record: append one JSON line to candidates.jsonl
- Return summary: TesterRunSummary with counts and aggregate verification status
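The per-task loop with --limit, --resume, and append-only JSONL output could look roughly like the sketch below. It is not the tester_run.py implementation: the task_id field name, the produce callback, and the choice to apply the limit after skipping completed tasks are all assumptions made for illustration.

```python
import json
from pathlib import Path


def completed_task_ids(results_path):
    """Collect task ids already written to the JSONL results file."""
    path = Path(results_path)
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as fh:
        # "task_id" is an assumed field name for this sketch.
        return {json.loads(line)["task_id"] for line in fh if line.strip()}


def run_tasks(task_ids, results_path, produce, limit=None, resume=False):
    """Process tasks in order, appending one JSON line per task."""
    done = completed_task_ids(results_path) if resume else set()
    pending = [t for t in task_ids if t not in done]
    if limit is not None:
        pending = pending[:limit]
    with Path(results_path).open("a", encoding="utf-8") as fh:
        for task_id in pending:
            # produce() stands in for the produce + verify steps.
            record = {"task_id": task_id, **produce(task_id)}
            fh.write(json.dumps(record) + "\n")
    return len(pending)
```

Appending one line per task keeps a partially finished run usable: a resumed run only needs to read the ids that are already on disk.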
Verification status: records vs run summary
Per JSONL record (verification_status field):
| Value | Meaning |
|---|---|
| pending | No verifier registered for this family |
| passed | Verifier ran and scored success |
| failed | Verifier ran and scored failure, or producer timed out |
Run summary (TesterRunSummary.verification_status, printed by CLI):
| Condition | Value |
|---|---|
| Zero tasks | complete |
| No tasks verified | pending |
| All tasks verified | complete |
| Some verified | partial |
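The run-level aggregation in the table can be written as a short function. This sketch assumes that "verified" means a record whose status is passed or failed (that is, anything other than pending); summarize_verification() is a hypothetical name, not the method used by TesterRunSummary.

```python
def summarize_verification(statuses):
    """Aggregate per-record verification_status values into a run-level value."""
    # Assumption: a record counts as verified when a verifier produced a verdict.
    verified = sum(1 for s in statuses if s in ("passed", "failed"))
    if not statuses or verified == len(statuses):
        return "complete"
    if verified == 0:
        return "pending"
    return "partial"
```

Note that complete describes coverage, not success: a run where every task failed verification still summarizes as complete.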
Sandboxing layer
DockerSandbox defaults
When not overridden by callers:
- network="none" (provider CLI harnesses override via egress policy)
- cap_drop=("ALL",) with optional cap_add for verifier dangerous commands
- read_only=True on container root filesystem
- tmpfs=("/tmp",)
- mem_limit="1g", pids_limit=256
- security_opt=("no-new-privileges:true",)
Persistent containers use docker run + docker exec; ephemeral mode uses docker run --rm per command.
Harness note: Codex and Claude Code harnesses set read_only=False on the agent container so the workspace is writable. Verifiers typically keep stricter defaults (network="none", read-only root where applicable).
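To make the defaults concrete, the sketch below maps them onto standard docker run flags for the ephemeral mode. SandboxDefaults and docker_run_args() are hypothetical names for this illustration; how DockerSandbox actually assembles its command line is not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxDefaults:
    """Illustrative container options mirroring the defaults listed above."""
    network: str = "none"
    cap_drop: tuple = ("ALL",)
    cap_add: tuple = ()
    read_only: bool = True
    tmpfs: tuple = ("/tmp",)
    mem_limit: str = "1g"
    pids_limit: int = 256
    security_opt: tuple = ("no-new-privileges:true",)


def docker_run_args(image, command, opts=SandboxDefaults()):
    """Build an ephemeral `docker run --rm` argv from the options."""
    args = ["docker", "run", "--rm", "--network", opts.network]
    for cap in opts.cap_drop:
        args += ["--cap-drop", cap]
    for cap in opts.cap_add:
        args += ["--cap-add", cap]
    if opts.read_only:
        args.append("--read-only")
    for mount in opts.tmpfs:
        args += ["--tmpfs", mount]
    args += ["--memory", opts.mem_limit, "--pids-limit", str(opts.pids_limit)]
    for opt in opts.security_opt:
        args += ["--security-opt", opt]
    return args + [image, *command]


argv = docker_run_args("python:3.12-slim", ["python", "-V"])
```

A harness that needs a writable root would pass a copy of the options with read_only=False, which simply omits the --read-only flag.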
HostSandbox
Used for staging files on the host before bind-mounting into Docker. Path resolution in sandboxes/base.resolve_sandbox_host_path() rejects .., absolute escapes, and symlink-based escapes.
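The three rejection rules can be sketched with pathlib. This is an illustration of the idea under stated assumptions, not the body of resolve_sandbox_host_path(): resolve_inside() is a hypothetical helper, and it requires Python 3.9 or later for Path.is_relative_to().

```python
from pathlib import Path


def resolve_inside(root, relative):
    """Resolve `relative` under `root`, refusing anything that lands outside."""
    candidate = Path(relative)
    if candidate.is_absolute():
        raise ValueError(f"absolute path not allowed: {relative}")
    if ".." in candidate.parts:
        raise ValueError(f"parent traversal not allowed: {relative}")
    root_real = Path(root).resolve()
    # resolve() follows symlinks, so a link pointing outside root is caught here.
    resolved = (root_real / candidate).resolve()
    if not resolved.is_relative_to(root_real):
        raise ValueError(f"path escapes sandbox root: {relative}")
    return resolved
```

Comparing resolved paths rather than raw strings is what catches the symlink case, since a path such as link/secret contains no ".." and is not absolute.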
PolicySandbox
Wraps any sandbox with CommandPolicy allow/deny lists by executable name. Used for command auditing; not a substitute for container isolation.
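An allow/deny check by executable name might look like the following. ExampleCommandPolicy is a hypothetical class written for this sketch; the real CommandPolicy fields and matching rules may differ, and the choice that an empty allow list means "no allow filter" is an assumption.

```python
import os
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleCommandPolicy:
    allow: frozenset = frozenset()  # empty allow list means "no allow filter"
    deny: frozenset = frozenset()

    def permits(self, command):
        """Check the executable name (argv[0] basename) against the lists."""
        argv = shlex.split(command) if isinstance(command, str) else list(command)
        if not argv:
            return False
        exe = os.path.basename(argv[0])
        if exe in self.deny:
            return False
        return not self.allow or exe in self.allow


policy = ExampleCommandPolicy(
    allow=frozenset({"python", "pytest"}),
    deny=frozenset({"curl"}),
)
```

The sketch also shows why name-based policy is only an audit aid: it inspects argv[0] and nothing else, so an allowed interpreter or shell can still launch arbitrary programs.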
Materialization and path policy
workspaces/materialization.py plans file placement for each component:
- Value-backed resources serialize as JSON under component-specific roots
- Public assets[] entries can bind files from the benchmark pack directory
- File references in eval.* resolve under asset_roots.eval (default hidden/)
workspaces/path_policy.py enforces:
- Mandatory denied paths (/ground_truth, /eval, etc.)
- Component allowed roots (securebench/public, securebench/evaluation_inputs, securebench/evaluator)
- No parent/child mount collisions
- Asset files must resolve under pack asset roots and must not be symlinks
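The parent/child collision rule can be checked pairwise over the planned mount targets. The helper find_mount_collisions() and the /workspace prefix are hypothetical, and treating two identical targets as a collision is an assumption of this sketch rather than documented behavior of path_policy.py.

```python
from itertools import combinations
from pathlib import PurePosixPath


def find_mount_collisions(mount_points):
    """Return pairs where one mount target equals or contains another."""
    paths = [PurePosixPath(p) for p in mount_points]
    collisions = []
    for a, b in combinations(paths, 2):
        # PurePosixPath.parents lists every ancestor of a path.
        if a == b or a in b.parents or b in a.parents:
            collisions.append((str(a), str(b)))
    return collisions
```

Rejecting nested mounts avoids the case where a broad mount (for example a whole securebench/ root) would expose a more restricted child directory to the wrong component.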
Harness network egress
Provider CLI harnesses (codex, claude_code) need outbound HTTPS to model APIs. SecureBench creates:
- An internal Docker network
- A framework-owned HTTP(S) proxy container with domain allowlisting
- Proxy env vars injected into the harness container
Default allowed domains include provider API hosts plus any harness.config.allowed_domains. Command harnesses use the same mechanism when domains are configured; otherwise network stays none.
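A domain allowlist check and the injected proxy variables could be sketched as follows. Both helpers are hypothetical: the matching rule (exact host or subdomain of an allowlisted entry) and the set of environment variable names are assumptions, and the actual proxy container may be stricter.

```python
def is_domain_allowed(host, allowed_domains):
    """Allow exact matches and subdomains of an allowlisted domain."""
    host = host.lower().rstrip(".")
    for domain in allowed_domains:
        domain = domain.lower().rstrip(".")
        # The leading dot prevents "evilapi.example.com" matching "api.example.com".
        if host == domain or host.endswith("." + domain):
            return True
    return False


def proxy_env(proxy_url):
    """Conventional proxy variables, set in both cases for broad tool support."""
    env = {}
    for name in ("http_proxy", "https_proxy"):
        env[name] = proxy_url
        env[name.upper()] = proxy_url
    return env
```

Because the harness container sits on an internal network, the environment variables are a convenience for well-behaved clients; enforcement comes from the proxy being the only route out.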
Audit and progress
progress.py emits structured events (run_start, task_start, producer_start, sandbox_command, verifier_done, etc.). With --show-agent-output, Codex JSONL events are mirrored to agent-trace.log.
Result records redact hidden resource values in resource_summary and verifier_metadata (exact matches and fields like expected_answer).
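The two redaction rules (sensitive field names and exact value matches) can be sketched as a recursive walk over a metadata dictionary. The placeholder string, the SENSITIVE_FIELDS set, and redact() itself are hypothetical; only expected_answer is named in the text above.

```python
REDACTED = "[REDACTED]"
SENSITIVE_FIELDS = {"expected_answer"}  # hypothetical list of field names


def redact(data, hidden_values, sensitive_fields=SENSITIVE_FIELDS):
    """Recursively replace sensitive fields and exact hidden-value matches."""
    if isinstance(data, dict):
        return {
            key: REDACTED if key in sensitive_fields
            else redact(value, hidden_values, sensitive_fields)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [redact(item, hidden_values, sensitive_fields) for item in data]
    # Scalars are redacted only on an exact match with a hidden value.
    return REDACTED if data in hidden_values else data
```

Exact matching will not catch a hidden value embedded inside a longer string, so a scheme like this limits accidental leakage rather than guaranteeing none.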
What is intentionally out of scope today
- Signed or digest-verified result records
- Provider credential isolation from agent containers (keys are passed via env today)
- Host-based candidate production (all harnesses use Docker)
- Full verifier coverage for deferred families
- Automatic benchmark pack hashing in results
See Security Model and the repo’s docs/next.md for the engineering backlog.
Related reading
- Benchmark Packs — input format before compilation
- Tasks & Evaluators — family contracts and verifier phases
- Tester Configuration — orchestration config