Architecture

SecureBench runs benchmark packs through a fixed pipeline: load rows, compile tasks with visibility lanes, produce candidates in a harness sandbox, verify in a trusted phase, and write JSONL records.

Major components

Figure: five-stage pipeline and the agent versus verifier trust boundary.

Pipeline stages:

- Tester YAML: run id, benchmark paths, harness, verification policy
- Compile pack: load manifest and rows, validate schemas, build task resources
- Produce candidate: materialize public inputs, run harness, extract artifact
- Verify candidate: stage evaluation inputs, run verifier in a fresh sandbox
- Write results: append redacted candidate and verification record

Trust boundary:

- Sandbox 1, candidate production (sees public): the harness receives the agent payload only. No verifier code or evaluation inputs are present.
- Sandbox 2, trusted verification (sees evaluation_inputs): a fresh sandbox per check with evaluation inputs. The family verifier computes the score.

Candidate production and verification are separate phases. Harnesses receive only the agent-visible payload. Verifiers run later with trusted evaluation inputs or hidden resources, depending on the family.

Package layout

Package / module	Responsibility
`benchmark_pack.py`	Manifest and JSONL row loading
`benchmark_compiler.py`	Row-to-`SecureBenchTask` compilation and eval visibility mapping
`tasks.py`	Normalized task model, component views, and payload helpers
`resources.py`	`Resource`, `ResourceBundle`, visibility, and redaction
`tester_config.py`	Tester YAML schema 0.2 parsing
`tester_run.py`	Orchestration and JSONL serialization
`cli.py`	`securebench run` command
`families/`	Family contracts and row schema validators
`harnesses/`	Candidate producers: `command`, `codex`, `claude_code`
`candidates/`	`CandidateArtifact` and extraction logic
`verifiers/`	Family verifiers and registry
`sandboxes/`	`HostSandbox`, `DockerSandbox`, `PolicySandbox`
`workspaces/`	Resource materialization and path policy
`dangerous_commands.py`	Verifier-only dangerous command policy
`progress.py`	CLI progress events

Resource visibility

Every compiled task carries a ResourceBundle. Each resource has exactly one visibility.

Visibility	Agent	Test sandbox	Evaluator	Result summary
`public`	yes	yes	yes	value shown
`evaluation_inputs`	no	yes	no	redacted
`hidden`	no	no	yes	redacted

The evaluator component view includes public and hidden, not evaluation_inputs. Verifiers that need tests or checkers on disk materialize with component test_sandbox, which includes evaluation_inputs.

SecureBenchTask exposes:

- agent_payload() for the agent view
- evaluation_payload() for the test sandbox view
- hidden_payload() for the evaluator view

Harnesses must use task.agent_payload(), which is an alias of public_payload(). They must not receive the full task object.
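The visibility table and component views above can be sketched as a simple filter. This is a minimal illustration, not the actual `resources.py` code: the `Resource` fields, the `COMPONENT_LANES` mapping, and `component_view()` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any

# Lanes each component may read, mirroring the visibility table.
# Hypothetical mapping; the real definition lives in the SecureBench source.
COMPONENT_LANES = {
    "agent": {"public"},
    "test_sandbox": {"public", "evaluation_inputs"},
    "evaluator": {"public", "hidden"},
}


@dataclass(frozen=True)
class Resource:
    name: str
    value: Any
    visibility: str  # "public", "evaluation_inputs", or "hidden"


def component_view(resources: list[Resource], component: str) -> dict[str, Any]:
    """Return only the resources whose lane the component may read."""
    lanes = COMPONENT_LANES[component]
    return {r.name: r.value for r in resources if r.visibility in lanes}


bundle = [
    Resource("prompt", "Fix the bug", "public"),
    Resource("tests", "test_fix.py", "evaluation_inputs"),
    Resource("expected_answer", "42", "hidden"),
]
agent_view = component_view(bundle, "agent")
```

Filtering before the harness is invoked means the agent never holds a reference to non-public values, which is the reason harnesses receive a payload rather than the full task object.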

benchmark_compiler.EVAL_VISIBILITY maps each family’s eval.* fields to a lane. Active family validators reject unknown eval.* keys before compilation. The compiler’s hidden fallback is an internal defensive default for unmapped families or fields.
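The hidden fallback can be pictured as a nested lookup with a default. The shape of `EVAL_VISIBILITY` shown here (family name to field name to lane) is an assumption, and `example_family`, its fields, and `lane_for()` are hypothetical.

```python
# Assumed shape: family -> eval field -> lane. Names are illustrative only.
EVAL_VISIBILITY = {
    "example_family": {
        "tests": "evaluation_inputs",
        "expected_answer": "hidden",
    },
}


def lane_for(family: str, field: str) -> str:
    """Look up the lane for an eval.* field, falling back to hidden."""
    return EVAL_VISIBILITY.get(family, {}).get(field, "hidden")
```

Defaulting to hidden means a mapping gap keeps the value away from both the agent and the test sandbox instead of exposing it.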

Task lifecycle

1. load_tester_config() parses tester YAML. load_env_file() loads secrets.
2. load_benchmark_pack(manifest, tasks) reads manifest defaults and JSONL rows.
3. compile_benchmark_pack() validates family schemas and builds SecureBenchTask objects.
4. build_harness_producer() selects CommandHarnessProducer, CodexHarnessProducer, or ClaudeCodeHarnessProducer.
5. For each task, respecting --limit and --resume:
   - Emit task_start.
   - Delete the per-task workspace directory from prior runs.
   - Materialize agent-visible resources into the host workspace with VisibilityAwareMaterializer(component=agent).
   - Write task.json with the public payload.
   - Run the harness command in DockerSandbox.
   - Extract the candidate through the family-specific mode: git_diff or workspace.
   - Pass verification_policy to the registered verifier.
   - Run verification in a fresh sandbox when a verifier exists.
   - Append one JSON object to candidates.jsonl.
6. Return TesterRunSummary with counts and aggregate verification status.
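The resume and append steps above might look like the following. The source does not show how --resume detects finished tasks; this sketch assumes it reads task ids back from candidates.jsonl. The `task_id` field name and both helper names are hypothetical.

```python
import json
from pathlib import Path


def completed_task_ids(results_path: Path) -> set[str]:
    """Collect task ids already recorded in a JSONL results file."""
    if not results_path.exists():
        return set()
    done = set()
    with results_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                done.add(json.loads(line)["task_id"])  # assumed field name
    return done


def append_record(results_path: Path, record: dict) -> None:
    """Append exactly one JSON object per line."""
    with results_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Appending one line per task keeps the file readable after an interrupted run, since every completed line is a full JSON object.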

Verification status

Per JSONL record:

Value	Meaning
`pending`	No verifier registered for this family
`passed`	Verifier ran and scored success
`failed`	Verifier ran and scored failure, or producer timed out

Run summary:

Condition	Value
Zero tasks	`complete`
No tasks verified	`pending`
All tasks verified	`complete`
Some tasks verified	`partial`
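The run summary table reduces to a small aggregation over per-record values. This sketch assumes "verified" means a record whose status is passed or failed; `summarize()` is a hypothetical helper, not a SecureBench function.

```python
def summarize(statuses: list[str]) -> str:
    """Aggregate per-record statuses into the run-level value."""
    # Assumption: a task counts as verified when a verifier produced a verdict.
    verified = [s for s in statuses if s in ("passed", "failed")]
    if not statuses or len(verified) == len(statuses):
        return "complete"
    if not verified:
        return "pending"
    return "partial"
```

Note that `complete` describes verification coverage, not success: a run in which every task failed still summarizes as complete.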

Sandboxing

`DockerSandbox`

Defaults when callers do not override them:

- network="none"
- cap_drop=("ALL",) with optional cap_add for verifier dangerous commands
- read_only=True on the container root filesystem
- tmpfs=("/tmp",)
- mem_limit="1g" and pids_limit=256
- security_opt=("no-new-privileges:true",)

Persistent containers use docker run and docker exec. Ephemeral mode uses docker run --rm per command.
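As a rough illustration, the defaults above map onto standard `docker run` flags as follows. `DockerDefaults` and `ephemeral_argv()` are hypothetical names, and the real DockerSandbox may assemble its command differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DockerDefaults:
    # Field names follow the documented defaults; the class itself is illustrative.
    network: str = "none"
    cap_drop: tuple[str, ...] = ("ALL",)
    cap_add: tuple[str, ...] = ()
    read_only: bool = True
    tmpfs: tuple[str, ...] = ("/tmp",)
    mem_limit: str = "1g"
    pids_limit: int = 256
    security_opt: tuple[str, ...] = ("no-new-privileges:true",)


def ephemeral_argv(
    image: str, command: list[str], opts: DockerDefaults = DockerDefaults()
) -> list[str]:
    """Build a `docker run --rm` argument vector for one command."""
    argv = ["docker", "run", "--rm", "--network", opts.network]
    for cap in opts.cap_drop:
        argv += ["--cap-drop", cap]
    for cap in opts.cap_add:
        argv += ["--cap-add", cap]
    if opts.read_only:
        argv.append("--read-only")
    for mount in opts.tmpfs:
        argv += ["--tmpfs", mount]
    argv += ["--memory", opts.mem_limit, "--pids-limit", str(opts.pids_limit)]
    for opt in opts.security_opt:
        argv += ["--security-opt", opt]
    return argv + [image, *command]
```

Building an argument list rather than a shell string avoids quoting problems when the command contains spaces or shell metacharacters.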

Codex and Claude Code harnesses set read_only=False on the agent container so the workspace is writable. Verifiers normally keep stricter defaults: no network and a read-only root where the check allows it.

`HostSandbox`

HostSandbox stages files on the host before bind-mounting them into Docker. sandboxes/base.resolve_sandbox_host_path() rejects .., absolute escapes, and symlink escapes.
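A minimal sketch of this kind of check, using only `pathlib`, is shown below. It is not the body of resolve_sandbox_host_path(); `resolve_under_root()` is a hypothetical stand-in that demonstrates the three rejections.

```python
from pathlib import Path


def resolve_under_root(root: Path, relative: str) -> Path:
    """Resolve `relative` under `root`, raising ValueError on any escape."""
    rel = Path(relative)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"unsafe path: {relative!r}")
    root_real = root.resolve()
    # resolve() follows symlinks, so a link that points outside root is caught here.
    candidate = (root_real / rel).resolve()
    if not candidate.is_relative_to(root_real):
        raise ValueError(f"path escapes sandbox root: {relative!r}")
    return candidate
```

The lexical check alone is not enough, because a symlink inside the root can point anywhere; comparing fully resolved paths closes that gap. `Path.is_relative_to()` requires Python 3.9 or later.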

`PolicySandbox`

PolicySandbox wraps a sandbox with CommandPolicy allow and deny lists keyed by executable name. It provides command auditing but is not a container isolation boundary.
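To make the name-based check concrete, here is a small sketch. The `CommandPolicy` fields and the `permits()` method are assumptions for illustration, including the choice that an empty allow list imposes no allow restriction.

```python
import os
import shlex
from dataclasses import dataclass, field


@dataclass
class CommandPolicy:
    # Assumption: an empty allow set means "no allow restriction".
    allow: frozenset[str] = frozenset()
    deny: frozenset[str] = frozenset()
    audit_log: list[tuple[str, bool]] = field(default_factory=list)

    def permits(self, command: str) -> bool:
        """Decide by executable basename and record the decision."""
        argv = shlex.split(command)
        exe = os.path.basename(argv[0]) if argv else ""
        ok = bool(exe) and exe not in self.deny and (not self.allow or exe in self.allow)
        self.audit_log.append((command, ok))
        return ok
```

Matching on the executable name is easy to sidestep (an allowed interpreter can spawn any other program), which is why a check like this serves auditing and not isolation.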

Materialization and path policy

workspaces/materialization.py plans file placement for each component:

- Value-backed resources serialize as JSON under component-specific roots.
- Public assets[] entries bind files from the benchmark pack directory.
- File references in eval.* resolve under asset_roots.eval, which defaults to hidden/.

workspaces/path_policy.py enforces:

- denied paths such as /ground_truth and /eval
- component allowed roots such as securebench/public, securebench/evaluation_inputs, and securebench/evaluator
- no parent/child mount collisions
- asset files that resolve under pack asset roots and are not symlinks
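The first three rules above can be sketched with pure path arithmetic. `check_mounts()` and its arguments are hypothetical, and the real path_policy.py may structure these checks differently.

```python
from pathlib import PurePosixPath

# Denied prefixes taken from the examples in the text.
DENIED = (PurePosixPath("/ground_truth"), PurePosixPath("/eval"))


def _within(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True when path equals root or lies beneath it."""
    return path == root or root in path.parents


def check_mounts(targets: list[str], allowed_roots: list[str]) -> None:
    """Raise ValueError if any mount target violates the sketch policy."""
    paths = [PurePosixPath(t) for t in targets]
    roots = [PurePosixPath(r) for r in allowed_roots]
    for p in paths:
        if any(_within(p, d) for d in DENIED):
            raise ValueError(f"denied path: {p}")
        if not any(_within(p, r) for r in roots):
            raise ValueError(f"outside allowed roots: {p}")
    # Two targets collide when one is the other or a parent of the other.
    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            if _within(a, b) or _within(b, a):
                raise ValueError(f"mount collision: {a} and {b}")
```

Rejecting parent/child pairs matters because a later bind mount would shadow part of an earlier one, which could hide or replace files that another component expects to see.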

Harness network egress

Provider CLI harnesses (codex, claude_code) need outbound HTTPS to model APIs. SecureBench creates:

- an internal Docker network
- a framework-owned HTTP(S) proxy container with domain allowlisting
- proxy environment variables in the harness container

The default allowlist contains the provider API hosts plus any entries in harness.config.allowed_domains. Command harnesses use the same mechanism when domains are configured; otherwise their network stays none.

The proxy resolves allowlisted hostnames inside the proxy container, rejects DNS answers with non-public or non-unicast addresses, and connects only to validated numeric addresses. As a result, an allowlisted public hostname that resolves to a private address cannot route the harness to private infrastructure.
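The address validation step can be approximated with the standard `ipaddress` module. Both function names are hypothetical, and rejecting the whole answer set when any record is unsafe is an assumption of this sketch, not a documented proxy behavior.

```python
import ipaddress


def is_safe_destination(address: str) -> bool:
    """Accept only globally routable unicast addresses."""
    ip = ipaddress.ip_address(address)
    # is_global excludes private, loopback, and link-local ranges;
    # multicast is checked separately.
    return ip.is_global and not ip.is_multicast


def pick_validated_address(dns_answers: list[str]) -> str:
    """Return a numeric address to connect to, or raise ValueError."""
    # Assumption: one unsafe record invalidates the whole answer set.
    if not dns_answers or not all(is_safe_destination(a) for a in dns_answers):
        raise ValueError("DNS answer contains a non-public or non-unicast address")
    return dns_answers[0]
```

Connecting to the validated numeric address, not the hostname, prevents a second DNS lookup from returning a different answer between the check and the connection.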

Audit and progress

progress.py emits structured events such as run_start, task_start, producer_start, sandbox_command, and verifier_done. With --show-agent-output, Codex JSONL events are mirrored to agent-trace.log.

Result records redact hidden resource values in resource_summary and verifier_metadata, including exact matches and fields such as expected_answer.
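Redaction of this kind can be sketched as a recursive walk over the record. The placeholder string, the `SENSITIVE_KEYS` set, and `redact()` are assumptions for illustration; the source names only expected_answer as an example field.

```python
from typing import Any

REDACTED = "[REDACTED]"  # assumed placeholder
SENSITIVE_KEYS = {"expected_answer"}  # assumed; the full list is not shown in the text


def redact(obj: Any, hidden_values: set[str]) -> Any:
    """Replace sensitive fields and exact matches of hidden values."""
    if isinstance(obj, dict):
        return {
            k: REDACTED if k in SENSITIVE_KEYS else redact(v, hidden_values)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(v, hidden_values) for v in obj]
    if isinstance(obj, str) and obj in hidden_values:
        return REDACTED
    return obj
```

Combining key-based and value-based redaction covers the case where a verifier echoes a hidden value under an unexpected field name.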