Architecture

SecureBench is organized around a benchmark pack execution pipeline: load rows, compile tasks with visibility lanes, produce candidates in a harness sandbox, verify in a separate trusted phase, and write JSONL results.

Major components

┌─────────────────────────────────────────────────────────────────┐
│ Tester YAML                                                     │
│ (run id, output dir, benchmark paths, harness, verification)    │
└────────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ benchmark_pack.py      Load manifest.yaml + tasks.jsonl         │
│ families/registry.py   Validate active family row schemas       │
│ benchmark_compiler.py  Compile rows → SecureBenchTask           │
└────────────────────────────┬────────────────────────────────────┘
              ┌──────────────┴──────────────────────┐
              ▼                                     ▼
┌────────────────────────────┐       ┌────────────────────────────┐
│ Harness (producer)         │       │ Verifier (post-hoc)        │
│ harnesses/*                │       │ verifiers/*                │
│ candidates/extraction      │       │ sandboxes/*                │
│ workspaces/materialization │       │ workspaces/materialization │
└─────────────┬──────────────┘       └──────────────┬─────────────┘
              │                                     │
              ▼                                     ▼
      CandidateArtifact                    VerificationResult
              │                                     │
              └──────────────────┬──────────────────┘
                 tester_run.py → candidates.jsonl

Package layout (active code)

| Package / module | Responsibility |
| --- | --- |
| benchmark_pack.py | Manifest and JSONL row loading |
| benchmark_compiler.py | Row → SecureBenchTask; eval field visibility mapping |
| tasks.py | Normalized task model; agent_payload(), component views |
| resources.py | Resource, ResourceBundle, visibility and redaction |
| tester_config.py | Parse tester YAML schema 0.2 |
| tester_run.py | End-to-end orchestration, JSONL serialization |
| cli.py | securebench run command |
| families/ | Family contracts and row schema validators |
| harnesses/ | Candidate producers (command, codex, claude_code) |
| candidates/ | CandidateArtifact, extraction from stdout/files/git/workspace |
| verifiers/ | Family verifiers and registry |
| sandboxes/ | HostSandbox, DockerSandbox, PolicySandbox |
| workspaces/ | Resource materialization and path policy |
| dangerous_commands.py | Verifier-only dangerous command policy |
| progress.py | CLI progress events |

Resource visibility model

Every compiled task carries a ResourceBundle. Each resource has a visibility:

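| Visibility | Agent | Test sandbox | Evaluator | Result summary |
| --- | --- | --- | --- | --- |
| public | yes | | yes | value shown |
| evaluation_inputs | no | yes | no | redacted |
| hidden | no | | yes | redacted |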

Note: the evaluator component view includes public and hidden only — not evaluation_inputs. Verifiers that need tests or checkers on disk materialize with component test_sandbox (which includes evaluation_inputs). Methods on SecureBenchTask:

  • agent_payload() → agent view
  • evaluation_payload() → test_sandbox view
  • hidden_payload() → evaluator view (no evaluation_inputs)
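The component views above can be sketched as a filter over visibility lanes. This is a minimal illustration, not the code in tasks.py or resources.py: the `Resource` dataclass, `COMPONENT_LANES`, and `component_view()` are hypothetical names, and the assumption that the test_sandbox view also contains public resources is not stated in the text.

```python
from dataclasses import dataclass

# Lanes readable by each component. The "agent" and "evaluator" entries follow
# the note above; including "public" for test_sandbox is an assumption.
COMPONENT_LANES = {
    "agent": {"public"},
    "test_sandbox": {"public", "evaluation_inputs"},
    "evaluator": {"public", "hidden"},
}


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a resource in a ResourceBundle."""
    name: str
    visibility: str
    value: object


def component_view(resources, component):
    """Return only the resources whose lane the component may read."""
    allowed = COMPONENT_LANES[component]
    return {r.name: r.value for r in resources if r.visibility in allowed}


bundle = [
    Resource("prompt", "public", "Fix the bug"),
    Resource("tests", "evaluation_inputs", "test_fix.py"),
    Resource("expected_answer", "hidden", "42"),
]
agent_view = component_view(bundle, "agent")
evaluator_view = component_view(bundle, "evaluator")
```

In this sketch a verifier that needs `tests` on disk has to ask for the test_sandbox view, because the evaluator view deliberately omits evaluation_inputs.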

Agent rule: harnesses must use task.agent_payload() (alias of public_payload()), not the full task object.

Compiler rule: benchmark_compiler.EVAL_VISIBILITY maps each family’s eval.* fields to a visibility lane. Unknown eval keys default to hidden.
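The default-to-hidden behavior can be illustrated with a nested lookup. The mapping below is a made-up example (`example_family` and its fields are hypothetical); the real contents and shape of benchmark_compiler.EVAL_VISIBILITY are not shown in this document.

```python
# Hypothetical example mapping; the real EVAL_VISIBILITY may be shaped differently.
EXAMPLE_EVAL_VISIBILITY = {
    "example_family": {
        "tests": "evaluation_inputs",
        "expected_answer": "hidden",
    },
}


def lane_for(family: str, eval_key: str) -> str:
    """Look up the lane for eval.<eval_key>; anything unmapped stays hidden."""
    return EXAMPLE_EVAL_VISIBILITY.get(family, {}).get(eval_key, "hidden")
```

Failing closed means a newly added eval field cannot reach the agent or the test sandbox until someone maps it explicitly.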

End-to-end task lifecycle

  1. Load config: load_tester_config() parses tester YAML; load_env_file() loads secrets.
  2. Load pack: load_benchmark_pack(manifest, tasks) reads manifest defaults and JSONL rows.
  3. Compile tasks: compile_benchmark_pack() validates family schemas and builds SecureBenchTask objects.
  4. Build harness: build_harness_producer() selects CommandHarnessProducer, CodexHarnessProducer, or ClaudeCodeHarnessProducer.
  5. For each task (respecting --limit and --resume):
    • Emit progress: task_start
    • Reset workspace — delete workspaces/<task-dir>/ from prior runs
    • Produce candidate
      • Materialize agent-visible resources into host workspace (VisibilityAwareMaterializer, component agent)
      • Write task.json with public payload
      • Run harness command in DockerSandbox (task image, optional egress proxy)
      • Extract candidate via family-specific mode (stdout, file, git diff, workspace)
    • Verify candidate (if verifier registered)
      • Pass verification_policy from tester YAML
      • Verifier materializes trusted inputs and runs checks in a fresh sandbox
    • Write record — append one JSON line to candidates.jsonl
  6. Return summary: TesterRunSummary with counts and aggregate verification status
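Step 5 can be sketched as a loop that appends one JSON line per task. This is an illustration under stated assumptions rather than the logic in tester_run.py: `run_tasks()`, the `produce` and `verify` callables, the record field `task_id`, and the way resume skips already-recorded tasks are all hypothetical. Only the three status values come from this page.

```python
import json
from pathlib import Path


def run_tasks(tasks, out_path: Path, produce, verify=None, limit=None, resume=False):
    """Produce, verify, and record each task as one JSONL line."""
    done = set()
    if resume and out_path.exists():
        # Assumption: resume skips task ids that already have a record.
        with out_path.open(encoding="utf-8") as fh:
            done = {json.loads(line)["task_id"] for line in fh if line.strip()}

    selected = tasks if limit is None else tasks[:limit]
    records = []
    with out_path.open("a", encoding="utf-8") as fh:
        for task in selected:
            if task["id"] in done:
                continue
            try:
                candidate = produce(task)
                if verify is None:
                    status = "pending"  # no verifier registered
                else:
                    status = "passed" if verify(task, candidate) else "failed"
            except TimeoutError:
                # A producer timeout is recorded as a failure.
                candidate, status = None, "failed"
            record = {
                "task_id": task["id"],
                "candidate": candidate,
                "verification_status": status,
            }
            fh.write(json.dumps(record) + "\n")
            records.append(record)
    return records
```

Appending a line per task keeps partial results on disk if a run is interrupted, which is what makes a resume pass possible in this sketch.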

Verification status: records vs run summary

Per JSONL record (verification_status field):

| Value | Meaning |
| --- | --- |
| pending | No verifier registered for this family |
| passed | Verifier ran and scored success |
| failed | Verifier ran and scored failure, or producer timed out |

Run summary (TesterRunSummary.verification_status, printed by CLI):

| Condition | Value |
| --- | --- |
| Zero tasks | complete |
| No tasks verified | pending |
| All tasks verified | complete |
| Some verified | partial |
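The aggregation rules can be written as a small function over per-record statuses. This is a sketch, not the TesterRunSummary code; it assumes a record counts as "verified" when its status is passed or failed, which the page does not state explicitly.

```python
def summarize(statuses):
    """Map per-record verification_status values to the run-level status."""
    # Assumption: "verified" means the verifier produced a passed/failed verdict.
    verified = sum(1 for s in statuses if s in ("passed", "failed"))
    if not statuses or verified == len(statuses):
        return "complete"  # zero tasks, or every task verified
    if verified == 0:
        return "pending"
    return "partial"
```

Note that complete says nothing about success: a run where every task failed verification is still complete.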

Sandboxing layer

DockerSandbox defaults

When not overridden by callers:

  • network="none" (provider CLI harnesses override via egress policy)
  • cap_drop=("ALL",) with optional cap_add for verifier dangerous commands
  • read_only=True on container root filesystem
  • tmpfs=("/tmp",)
  • mem_limit="1g", pids_limit=256
  • security_opt=("no-new-privileges:true",)

Persistent containers use docker run + docker exec; ephemeral mode uses docker run --rm per command.
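To make the defaults concrete, the sketch below translates them into `docker run` arguments. `DockerDefaults` and `docker_run_args()` are hypothetical names and do not reflect DockerSandbox's real interface; the flags themselves (`--network`, `--cap-drop`, `--cap-add`, `--read-only`, `--tmpfs`, `--memory`, `--pids-limit`, `--security-opt`, `--rm`) are standard Docker CLI options.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DockerDefaults:
    """Hypothetical container options mirroring the defaults listed above."""
    network: str = "none"
    cap_drop: tuple = ("ALL",)
    cap_add: tuple = ()
    read_only: bool = True
    tmpfs: tuple = ("/tmp",)
    mem_limit: str = "1g"
    pids_limit: int = 256
    security_opt: tuple = ("no-new-privileges:true",)


def docker_run_args(image, command, opts=DockerDefaults(), ephemeral=True):
    """Build an argv list for `docker run` from the given options."""
    args = ["docker", "run"]
    if ephemeral:
        args.append("--rm")  # one throwaway container per command
    args += ["--network", opts.network]
    for cap in opts.cap_drop:
        args += ["--cap-drop", cap]
    for cap in opts.cap_add:
        args += ["--cap-add", cap]
    if opts.read_only:
        args.append("--read-only")
    for path in opts.tmpfs:
        args += ["--tmpfs", path]
    args += ["--memory", opts.mem_limit, "--pids-limit", str(opts.pids_limit)]
    for opt in opts.security_opt:
        args += ["--security-opt", opt]
    return args + [image, *command]


# Example: an agent container with a writable root filesystem.
agent_args = docker_run_args(
    "task-image", ["sh", "-c", "true"], replace(DockerDefaults(), read_only=False)
)
```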

Harness note: Codex and Claude Code harnesses set read_only=False on the agent container so the workspace is writable. Verifiers typically keep stricter defaults (network="none", read-only root where applicable).

HostSandbox

Used for staging files on the host before bind-mounting into Docker. Path resolution in sandboxes/base.resolve_sandbox_host_path() rejects .., absolute escapes, and symlink-based escapes.
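The three rejection cases can be sketched with pathlib. This is an illustration of the general technique, not the body of resolve_sandbox_host_path(): `resolve_under_root()` is a hypothetical helper, and it relies on `Path.is_relative_to()`, which requires Python 3.9 or later.

```python
from pathlib import Path, PurePosixPath


def resolve_under_root(root: Path, relative: str) -> Path:
    """Resolve `relative` inside `root`, rejecting anything that escapes it."""
    rel = PurePosixPath(relative)
    if rel.is_absolute():
        raise ValueError(f"absolute path not allowed: {relative}")
    if ".." in rel.parts:
        raise ValueError(f"parent traversal not allowed: {relative}")

    root_real = root.resolve()
    # resolve() follows symlinks, so a link pointing outside the root shows up
    # here as a path that is no longer under root_real.
    candidate = (root_real / rel).resolve()
    if not candidate.is_relative_to(root_real):
        raise ValueError(f"path escapes sandbox root: {relative}")
    return candidate
```

Checking the resolved path, not only the textual one, is what catches symlink escapes: `link/file` looks harmless as a string but may resolve outside the root.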

PolicySandbox

Wraps any sandbox with CommandPolicy allow/deny lists by executable name. Used for command auditing; not a substitute for container isolation.
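A name-based policy check might look like the following. The class shape, the `permits()` method, and the rule that deny entries take precedence over allow entries are assumptions for illustration; they are not taken from the SecureBench source.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class CommandPolicy:
    """Hypothetical allow/deny policy keyed on the executable's basename."""
    allow: Optional[FrozenSet[str]] = None  # None means no allow list
    deny: FrozenSet[str] = frozenset()

    def permits(self, argv: Sequence[str]) -> bool:
        if not argv:
            return False
        name = PurePosixPath(argv[0]).name  # "/usr/bin/curl" -> "curl"
        if name in self.deny:
            return False  # assumption: deny wins over allow
        return self.allow is None or name in self.allow


policy = CommandPolicy(allow=frozenset({"sh", "python"}), deny=frozenset({"curl"}))
```

The sketch also shows why this is only an audit aid: `policy.permits(["sh", "-c", "curl example.com"])` is true because only `sh` is inspected, so a shell can still launch a denied program.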

Materialization and path policy

workspaces/materialization.py plans file placement for each component:

  • Value-backed resources serialize as JSON under component-specific roots
  • Public assets[] entries can bind files from the benchmark pack directory
  • File references in eval.* resolve under asset_roots.eval (default hidden/)

workspaces/path_policy.py enforces:

  • Mandatory denied paths (/ground_truth, /eval, etc.)
  • Component allowed roots (securebench/public, securebench/evaluation_inputs, securebench/evaluator)
  • No parent/child mount collisions
  • Asset files must resolve under pack asset roots and must not be symlinks
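The parent/child collision rule is the least obvious of these checks, so here is a minimal sketch of it. `find_mount_collisions()` is a hypothetical helper and not the API of workspaces/path_policy.py; it only compares container-side target paths.

```python
from pathlib import PurePosixPath


def find_mount_collisions(targets):
    """Return pairs of mount targets where one equals or contains the other."""
    paths = [PurePosixPath(t) for t in targets]
    collisions = []
    for i, a in enumerate(paths):
        for b in paths[i + 1:]:
            if a == b or a in b.parents or b in a.parents:
                collisions.append((str(a), str(b)))
    return collisions
```

Comparing path components instead of string prefixes avoids false positives such as `securebench/public` versus `securebench/public2`.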

Harness network egress

Provider CLI harnesses (codex, claude_code) need outbound HTTPS to model APIs. SecureBench creates:

  1. An internal Docker network
  2. A framework-owned HTTP(S) proxy container with domain allowlisting
  3. Proxy env vars injected into the harness container

Default allowed domains include provider API hosts plus any harness.config.allowed_domains. Command harnesses use the same mechanism when domains are configured; otherwise network stays none.
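The allowlist decision and the injected environment can be sketched as below. Both helpers are hypothetical: the page does not say whether the proxy matches subdomains of an allowed domain (assumed here), nor exactly which variables are injected (the conventional `HTTP_PROXY`/`HTTPS_PROXY` pair in both cases is assumed). The host names are placeholders.

```python
def is_allowed(host: str, allowed_domains) -> bool:
    """True if host equals an allowed domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    for domain in allowed_domains:
        domain = domain.lower().rstrip(".")
        # Require a dot boundary so "evilexample.com" does not match "example.com".
        if host == domain or host.endswith("." + domain):
            return True
    return False


def proxy_env(proxy_url: str) -> dict:
    """Environment variables pointing a container's HTTP clients at the proxy."""
    return {
        "HTTP_PROXY": proxy_url,
        "HTTPS_PROXY": proxy_url,
        "http_proxy": proxy_url,
        "https_proxy": proxy_url,
    }
```

Because the harness container sits on an internal network, a client that ignores these variables does not bypass the allowlist; it simply has no route out.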

Audit and progress

progress.py emits structured events (run_start, task_start, producer_start, sandbox_command, verifier_done, etc.). With --show-agent-output, Codex JSONL events are mirrored to agent-trace.log.

Result records redact hidden resource values in resource_summary and verifier_metadata (exact matches and fields like expected_answer).
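A recursive redaction pass along these lines could look like the sketch below. The `redact()` function, the `[REDACTED]` placeholder, and the contents of `SENSITIVE_FIELDS` beyond `expected_answer` are assumptions; the real redaction logic lives in the SecureBench source and may differ.

```python
REDACTED = "[REDACTED]"           # hypothetical placeholder text
SENSITIVE_FIELDS = {"expected_answer"}


def redact(obj, hidden_values):
    """Return a copy of obj with sensitive fields and exact hidden values masked."""
    if isinstance(obj, dict):
        return {
            key: REDACTED if key in SENSITIVE_FIELDS else redact(value, hidden_values)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [redact(item, hidden_values) for item in obj]
    # Exact-match rule: mask a leaf only when it equals a hidden value outright.
    if any(obj == hidden for hidden in hidden_values):
        return REDACTED
    return obj


metadata = {
    "expected_answer": "42",
    "details": {"observed": "42", "log": ["answer was 42", "41"]},
}
clean = redact(metadata, hidden_values=["42"])
```

Exact matching leaves `"answer was 42"` untouched in this sketch, which shows the limit of the approach: a hidden value embedded in a longer string is not caught.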

What is intentionally out of scope today

  • Signed or digest-verified result records
  • Provider credential isolation from agent containers (keys are passed via env today)
  • Host-based candidate production (all harnesses use Docker)
  • Full verifier coverage for deferred families
  • Automatic benchmark pack hashing in results

See Security Model and the repo’s docs/next.md for the engineering backlog.
