Security Model
SecureBench exists so benchmark makers can evaluate agents without exposing scoring data, verifier logic, or result integrity to the candidate-producing agent. Container isolation alone is not the goal — separation of visibility, execution phase, and trust boundaries is.
This page summarizes the model implemented today. For per-family detail, see the SecureBench repo docs/SECURITY.md.
Threat model
Assumptions about agents
SecureBench assumes the evaluated agent (or the code it produces) may attempt to:
- Read hidden answers, tests, or gold patches from disk or environment
- Modify test files, runners, or dependency/config files
- Skip or replace verifier logic
- Exfiltrate provider API keys from the harness container
- Poison resumed result records (when
--resumeis used)
Trust boundaries
| Trusted | Untrusted |
|---|---|
| SecureBench framework code | Candidate text, patches, and workspace files |
Benchmark pack hidden/ and eval inputs | Agent-produced stdout/stderr |
| Verifier sandboxes (fresh per verify) | Prior candidates.jsonl rows when using --resume |
| Tester YAML command harness config | Agent-selected shell commands (provider CLIs run with bypass flags) |
Out of scope (today)
- Cryptographic signing of results
- Proving benchmark pack integrity via digests in output records
- Strong Python sandboxing for arbitrary candidate code
- Hiding provider credentials from the agent container
- Protecting against statistical leakage in poorly constructed datasets
Core design: separate production from evaluation
Agent phase Verifier phase
───────────── ──────────────
public payload only → hidden + eval inputs OK
harness sandbox fresh verifier sandbox
candidate extraction trusted checker codeThe agent never receives:
evalfields compiled ashiddenorevaluation_inputs- Verifier scripts from
hidden/(until verification mounts them in a controlled way for checkers)
ResourceBundle.view_for("agent") enforces this at the data layer; materialization and path policy enforce it at the filesystem layer.
Visibility lanes
| Lane | Purpose |
|---|---|
public | Prompts, public assets, instructions |
evaluation_inputs | Tests and checkers visible to verification staging, not agents |
hidden | Answers, rubrics, gold patches — evaluator only |
Result summaries redact non-public values. Verifier metadata redacts exact hidden matches and fields like expected_answer.
Filesystem and materialization controls
- Absolute paths,
.., and backslashes rejected for materialized paths - Reserved roots and mandatory denied paths (
/eval,/ground_truth, …) - Component-specific allowed roots under
securebench/public,securebench/evaluation_inputs,securebench/evaluator - File-backed assets must resolve under pack asset roots and must not be symlinks
- Docker bind mounts for public assets default to read-only
Sandbox defaults (Docker)
DockerSandbox defaults unless overridden:
- Network disabled (
network="none") for verifiers - All capabilities dropped; narrow
cap_addonly for approved dangerous verifier commands - Read-only root filesystem with tmpfs for
/tmp - Memory and PID limits
no-new-privileges
Host workspace staging uses resolve_sandbox_host_path() to block traversal and symlink escapes.
Command policy
CommandPolicy / PolicySandbox allow or deny commands by executable name. Used for auditing verifier dangerous command decisions.
Not a security boundary: agents can invoke denied tools via shell wrappers unless the environment prevents it. Treat policy as fail-closed guidance for known verifier requirements.
Per-family status
Text families (multiple_choice, short_answer, free_response)
Closest to safe-by-default for data isolation. Verifiers do not execute candidate code and run in-process on the host (no verifier Docker sandbox). Main risks are dataset design (answer leakage in prompts) and weak free-response scoring — not container isolation failures.
code_completion
Improved supervisor isolates import-time attacks partially, but candidate and hidden tests still share a Python interpreter in one container. Malicious code may mutate process state beyond current restoration logic.
Planned hardening: subprocess isolation, hidden tests never readable by candidate code.
repo_patch
Pre-apply path policy blocks many test-infra edits. Check command still runs in candidate-mutated repo.
Planned hardening: protected test application, read-only trusted harness paths, post-apply integrity checks.
terminal_task
Checkers run in fresh verifier sandboxes with eval inputs materialized at verification time. Some checkers execute agent-produced binaries — verifier sandbox hardening still matters.
Dangerous commands (chroot) require explicit tester opt-in via verification.disallow_dangerous_commands: false.
Deferred families
Compiler mappings exist; verifiers do not. Do not treat these as adversarial-agent safe until family-specific threat models and verifiers land.
Harness security
Command harness
- Tester-provided fixed command — trusted configuration, not agent-selected
- Agent sees only public materialization
- Runs in task Docker image
Provider CLI harnesses (Codex, Claude Code)
- Tooling overlay mounted read-only; separate writable tool home
- Egress allowlisted through framework proxy
- CLIs invoked with approval/sandbox bypass flags appropriate to automation
- Known gap:
OPENAI_API_KEY/ANTHROPIC_API_KEYforwarded into the agent container
Planned: provider broker sidecar, subscription auth mode, secret redaction in stdout/progress
Result and audit integrity
Current protections:
- Hidden values redacted in JSONL output
- Structured verifier metadata without full hidden rubric text
Known gaps:
- Results not signed
--resumetrusts existing records without digest verification- No benchmark pack / task row hashes in output
- Producer/verifier stdout may leak secrets if printed
Safe-by-default release criteria
Before claiming a family is safe for adversarial agents, the repo docs/SECURITY.md defines criteria including:
- Agent receives only public data
- Hidden/eval inputs not readable by candidate code
- Verifier runs from trusted paths
- Candidate writes constrained to allowed outputs
- Network/credentials denied by default
- Integrity metadata and tamper regression tests
Today: text families are closest; code_completion, repo_patch, and provider CLI harnesses need the most hardening.
Related reading
- Architecture — how controls are wired
- Tester Configuration — dangerous command opt-in
- Extending SecureBench — add controls when extending verifiers