Security Model

SecureBench exists so benchmark makers can evaluate agents without exposing scoring data or verifier logic to the candidate-producing agent, and without letting that agent compromise result integrity. Container isolation alone is not the goal; the goal is separation of visibility, execution phase, and trust boundaries.

This page summarizes the model as implemented today. For per-family detail, see docs/SECURITY.md in the SecureBench repo.

Threat model

Assumptions about agents

SecureBench assumes the evaluated agent (or the code it produces) may attempt to:

  • Read hidden answers, tests, or gold patches from disk or environment
  • Modify test files, runners, or dependency/config files
  • Skip or replace verifier logic
  • Exfiltrate provider API keys from the harness container
  • Poison resumed result records (when --resume is used)

Trust boundaries

Trusted | Untrusted
SecureBench framework code | Candidate text, patches, and workspace files
Benchmark pack hidden/ and eval inputs | Agent-produced stdout/stderr
Verifier sandboxes (fresh per verify) | Prior candidates.jsonl rows when using --resume
Tester YAML command harness config | Agent-selected shell commands (provider CLIs run with bypass flags)

Out of scope (today)

  • Cryptographic signing of results
  • Proving benchmark pack integrity via digests in output records
  • Strong Python sandboxing for arbitrary candidate code
  • Hiding provider credentials from the agent container
  • Protecting against statistical leakage in poorly constructed datasets

Core design: separate production from evaluation

Agent phase                  Verifier phase
─────────────                ──────────────
public payload only     →    hidden + eval inputs OK
harness sandbox              fresh verifier sandbox
candidate extraction         trusted checker code

The agent never receives:

  • eval fields compiled as hidden or evaluation_inputs
  • Verifier scripts from hidden/ (verification mounts them later, in a controlled way, for checkers only)

ResourceBundle.view_for("agent") enforces this at the data layer; materialization and path policy enforce it at the filesystem layer.
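The data-layer filtering described above can be illustrated with a minimal sketch. This is not the SecureBench implementation: the `LaneBundle` class, its field layout, and the role names other than "agent" are assumptions made for illustration. Only the lane names and the `view_for("agent")` call shape come from this page.

```python
from dataclasses import dataclass, field

# Lane names come from this page; the role-to-lane mapping is an assumption.
AGENT_LANES = frozenset({"public"})
VERIFIER_LANES = frozenset({"public", "evaluation_inputs", "hidden"})


@dataclass(frozen=True)
class LaneBundle:
    """Hypothetical stand-in for a lane-tagged resource bundle."""

    # Maps field name -> (lane, value).
    fields: dict = field(default_factory=dict)

    def view_for(self, role: str) -> dict:
        """Return only the values whose lane is visible to `role`."""
        if role == "agent":
            allowed = AGENT_LANES
        elif role == "verifier":
            allowed = VERIFIER_LANES
        else:
            # Fail closed: an unknown role sees nothing.
            allowed = frozenset()
        return {
            name: value
            for name, (lane, value) in self.fields.items()
            if lane in allowed
        }


bundle = LaneBundle(
    fields={
        "prompt": ("public", "What is 2 + 2?"),
        "unit_tests": ("evaluation_inputs", "assert add(2, 2) == 4"),
        "expected_answer": ("hidden", "4"),
    }
)
agent_view = bundle.view_for("agent")        # only the prompt
verifier_view = bundle.view_for("verifier")  # all three fields
```

Filtering at the point where the view is built means later stages cannot leak a field they never received.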

Visibility lanes

Lane | Purpose
public | Prompts, public assets, instructions
evaluation_inputs | Tests and checkers visible to verification staging, not agents
hidden | Answers, rubrics, gold patches (evaluator only)

Result summaries redact non-public values. Verifier metadata redacts exact hidden matches and fields like expected_answer.
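As a rough sketch of the two redaction behaviors just described, assuming a flat record and a per-field lane map. Both helper names and the `[REDACTED]` placeholder are hypothetical; the actual output format may differ.

```python
REDACTED = "[REDACTED]"


def redact_summary(record: dict, lanes: dict) -> dict:
    """Keep public values; replace everything else with a placeholder.

    A field with no lane entry is treated as non-public (fail closed).
    """
    return {
        name: value if lanes.get(name) == "public" else REDACTED
        for name, value in record.items()
    }


def redact_hidden_matches(metadata: dict, hidden_values: set) -> dict:
    """Blank out metadata strings that exactly equal a hidden value."""
    return {
        name: REDACTED if isinstance(value, str) and value in hidden_values else value
        for name, value in metadata.items()
    }


summary = redact_summary(
    {"prompt": "Capital of France?", "expected_answer": "Paris", "score": 1.0},
    {"prompt": "public", "score": "public", "expected_answer": "hidden"},
)
metadata = redact_hidden_matches(
    {"candidate": "Paris", "verdict": "match"}, hidden_values={"Paris"}
)
```

Exact-match redaction does not catch paraphrased or partially quoted hidden values, so it complements lane filtering and does not replace it.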

Filesystem and materialization controls

  • Absolute paths, .., and backslashes rejected for materialized paths
  • Reserved roots and mandatory denied paths (/eval, /ground_truth, …)
  • Component-specific allowed roots under securebench/public, securebench/evaluation_inputs, securebench/evaluator
  • File-backed assets must resolve under pack asset roots and must not be symlinks
  • Docker bind mounts for public assets default to read-only
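The first and third rules in the list above can be sketched as a single validation function. This is an illustrative approximation, not the framework's path policy: the function name, the `AGENT_ROOTS` constant, and the assumption that agents are limited to securebench/public are all hypothetical.

```python
from pathlib import PurePosixPath

# Assumption for illustration: the agent component may only use this root.
AGENT_ROOTS = ("securebench/public",)


def check_materialized_path(path: str, allowed_roots: tuple) -> PurePosixPath:
    """Validate a relative materialization path or raise ValueError."""
    if not path:
        raise ValueError("empty path")
    if "\\" in path:
        raise ValueError(f"backslash in path: {path!r}")
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        raise ValueError(f"absolute path: {path!r}")
    if ".." in candidate.parts:
        raise ValueError(f"parent traversal in path: {path!r}")
    # Compare whole path components, so "securebench/publicity" does not
    # pass as being under "securebench/public".
    for root in allowed_roots:
        root_parts = PurePosixPath(root).parts
        if candidate.parts[: len(root_parts)] == root_parts:
            return candidate
    raise ValueError(f"path outside allowed roots: {path!r}")


ok = check_materialized_path("securebench/public/prompt.txt", AGENT_ROOTS)
```

This check is purely lexical. Symlink rules such as the one for file-backed assets need a separate check against the real filesystem.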

Sandbox defaults (Docker)

DockerSandbox defaults unless overridden:

  • Network disabled (network="none") for verifiers
  • All capabilities dropped; narrow cap_add only for approved dangerous verifier commands
  • Read-only root filesystem with tmpfs for /tmp
  • Memory and PID limits
  • no-new-privileges
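For readers who think in terms of the Docker CLI, the defaults listed above correspond roughly to the following `docker run` flags. This sketch does not show how DockerSandbox is implemented; the function name, the image name, and the memory and PID values are placeholders chosen for illustration.

```python
def docker_run_args(image, command, memory="512m", pids_limit=128, cap_add=()):
    """Build a `docker run` argv that mirrors the hardened defaults."""
    args = [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access
        "--cap-drop", "ALL",                    # drop every capability
        "--read-only",                          # read-only root filesystem
        "--tmpfs", "/tmp",                      # writable scratch space only
        "--memory", memory,                     # placeholder limit
        "--pids-limit", str(pids_limit),        # placeholder limit
        "--security-opt", "no-new-privileges",
    ]
    # Capabilities are added back only when explicitly requested.
    for cap in cap_add:
        args += ["--cap-add", cap]
    return args + [image, *command]


args = docker_run_args("example/verifier:latest", ["python", "check.py"])
```

Building the argument list in one place makes the defaults easy to audit and makes any added capability visible in the final command line.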

Host workspace staging uses resolve_sandbox_host_path() to block traversal and symlink escapes.
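One common way to block both traversal and symlink escapes is to resolve the real path first and then confirm it is still inside the staging root. The sketch below shows that general technique under a hypothetical name, `resolve_under_root`; it is not the body of resolve_sandbox_host_path(), whose actual behavior is defined in the repo.

```python
import os
from pathlib import Path


def resolve_under_root(root: str, relative: str) -> Path:
    """Resolve `relative` under `root`, rejecting anything that escapes it."""
    root_real = Path(os.path.realpath(root))
    # realpath follows symlinks and collapses "..", so the containment check
    # below sees the location the OS would actually open.
    candidate = Path(os.path.realpath(root_real / relative))
    if candidate != root_real and root_real not in candidate.parents:
        raise ValueError(f"path escapes staging root: {relative!r}")
    return candidate
```

A check like this is subject to time-of-check/time-of-use races if untrusted code can modify the directory between resolution and use, so it fits best when staging happens before any untrusted process starts.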

Command policy

CommandPolicy / PolicySandbox allow or deny commands by executable name. They are used to audit decisions about dangerous verifier commands.

Not a security boundary: agents can invoke denied tools via shell wrappers unless the environment prevents it. Treat policy as fail-closed guidance for known verifier requirements.
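A small sketch shows why a name-based check is not a boundary. The `is_allowed` function and the allowlist contents are hypothetical and far simpler than CommandPolicy; the point is only the shape of the bypass.

```python
import shlex
from pathlib import PurePosixPath


def is_allowed(command: str, allowed: frozenset) -> bool:
    """Allow a command only if its executable name is on the allowlist."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unparseable input is denied (fail closed)
    if not argv:
        return False
    return PurePosixPath(argv[0]).name in allowed


ALLOWED = frozenset({"pytest", "sh"})

is_allowed("pytest -q", ALLOWED)                    # True
is_allowed("chroot /mnt /bin/sh", ALLOWED)          # False
is_allowed("sh -c 'chroot /mnt /bin/sh'", ALLOWED)  # True: the wrapper hides chroot
```

The third call passes because only the outer executable is inspected. Actually preventing the inner command requires the environment itself (missing binaries, dropped capabilities) to refuse it.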

Per-family status

Text families (multiple_choice, short_answer, free_response)

Closest to safe-by-default for data isolation. Verifiers do not execute candidate code and run in-process on the host (no verifier Docker sandbox). Main risks are dataset design (answer leakage in prompts) and weak free-response scoring — not container isolation failures.

code_completion

The improved supervisor partially isolates import-time attacks, but candidate code and hidden tests still share one Python interpreter in one container. Malicious code may mutate process state in ways the current restoration logic does not undo.

Planned hardening: subprocess isolation, hidden tests never readable by candidate code.
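To make the idea of subprocess isolation concrete, here is one possible shape, offered as a sketch of the general technique and not the planned design. Candidate source runs in a child interpreter and only JSON crosses the process boundary, so expected values stay in the parent. The `RUNNER` script and `call_candidate` helper are hypothetical.

```python
import json
import subprocess
import sys

# Child-side script: load the request, run the candidate, print the result.
RUNNER = """
import json, sys
request = json.load(sys.stdin)
namespace = {}
exec(request["source"], namespace)
result = namespace[request["entry_point"]](*request["args"])
json.dump({"result": result}, sys.stdout)
"""


def call_candidate(source: str, entry_point: str, args: list, timeout: float = 5.0):
    """Run candidate code in a separate interpreter and return its result."""
    request = json.dumps({"source": source, "entry_point": entry_point, "args": args})
    proc = subprocess.run(
        [sys.executable, "-I", "-c", RUNNER],  # -I: isolated mode
        input=request,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"candidate failed: {proc.stderr.strip()}")
    try:
        return json.loads(proc.stdout)["result"]
    except (ValueError, KeyError) as exc:
        # Extra output from the candidate corrupts the reply; treat as failure.
        raise RuntimeError("candidate produced an invalid reply") from exc


# Hidden expectations live only in the parent process.
assert call_candidate("def add(a, b):\n    return a + b\n", "add", [2, 3]) == 5
```

A separate process protects the parent's memory and the in-memory expected values, but it is not a sandbox on its own: the child can still read any file the container exposes, so hidden tests would also need to be kept off the child's filesystem.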

repo_patch

Pre-apply path policy blocks many test-infra edits, but the check command still runs in the candidate-mutated repo.

Planned hardening: protected test application, read-only trusted harness paths, post-apply integrity checks.
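A post-apply integrity check could, for example, hash a set of protected files before the patch is applied and compare afterwards. The sketch below assumes that approach; the helper names and the choice of protected paths are illustrative, and the planned design may differ.

```python
import hashlib
from pathlib import Path


def snapshot(root, protected):
    """Map each protected relative path to its SHA-256 digest (None if absent)."""
    digests = {}
    for rel in protected:
        path = Path(root) / rel
        if path.is_file():
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            digests[rel] = None
    return digests


def changed_paths(before: dict, after: dict) -> list:
    """Return protected paths that were modified, created, or deleted."""
    return sorted(rel for rel in before if before[rel] != after.get(rel))
```

Taking the first snapshot before the patch is applied and the second just before the check command runs would surface edits that slipped past the pre-apply path policy. Recording absent files as None also catches a protected file that the patch creates.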

terminal_task

Checkers run in fresh verifier sandboxes with eval inputs materialized at verification time. Some checkers execute agent-produced binaries — verifier sandbox hardening still matters.

Dangerous commands (chroot) require explicit tester opt-in via verification.disallow_dangerous_commands: false.

Deferred families

Compiler mappings exist; verifiers do not. Do not treat these as adversarial-agent safe until family-specific threat models and verifiers land.

Harness security

Command harness

  • Tester-provided fixed command — trusted configuration, not agent-selected
  • Agent sees only public materialization
  • Runs in task Docker image

Provider CLI harnesses (Codex, Claude Code)

  • Tooling overlay mounted read-only; separate writable tool home
  • Egress allowlisted through framework proxy
  • CLIs invoked with approval/sandbox bypass flags appropriate to automation
  • Known gap: OPENAI_API_KEY / ANTHROPIC_API_KEY forwarded into the agent container

Planned: provider broker sidecar, subscription auth mode, and secret redaction in stdout/progress.
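Secret redaction in captured output might look something like the following. It is a minimal sketch assuming exact-match replacement of the two environment variables named above; the function name and placeholder format are hypothetical.

```python
import os

SECRET_ENV_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def redact_secrets(text: str, env=None) -> str:
    """Replace literal occurrences of known secret values in `text`."""
    env = os.environ if env is None else env
    for name in SECRET_ENV_VARS:
        value = env.get(name)
        if value:  # skip unset or empty variables
            text = text.replace(value, f"[REDACTED:{name}]")
    return text


line = redact_secrets(
    "curl -H 'Authorization: Bearer sk-test-123'",
    env={"OPENAI_API_KEY": "sk-test-123"},
)
# line == "curl -H 'Authorization: Bearer [REDACTED:OPENAI_API_KEY]'"
```

Exact matching misses encoded or split secrets (base64, one character per line), so redaction reduces accidental leaks but does not stop a deliberate exfiltration attempt.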

Result and audit integrity

Current protections:

  • Hidden values redacted in JSONL output
  • Structured verifier metadata without full hidden rubric text

Known gaps:

  • Results not signed
  • --resume trusts existing records without digest verification
  • No benchmark pack / task row hashes in output
  • Producer/verifier stdout may leak secrets if printed
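To illustrate what digest verification on --resume could involve, here is a sketch that seals each record with an HMAC and checks it before reuse. None of this exists in SecureBench today according to the list above. The `digest` field name, the helper names, and the existence of a framework-held key that never enters the agent container are all assumptions. An HMAC is a shared-key integrity check, not a cryptographic signature.

```python
import hashlib
import hmac
import json


def record_digest(record: dict, key: bytes) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def seal(record: dict, key: bytes) -> dict:
    """Return a copy of the record with a `digest` field attached."""
    return {**record, "digest": record_digest(record, key)}


def is_intact(sealed: dict, key: bytes) -> bool:
    """Check that a sealed record has not been altered since sealing."""
    claimed = sealed.get("digest")
    if not isinstance(claimed, str):
        return False
    body = {k: v for k, v in sealed.items() if k != "digest"}
    expected = record_digest(body, key)
    # Constant-time comparison on bytes avoids leaking how much matched.
    return hmac.compare_digest(claimed.encode("utf-8"), expected.encode("utf-8"))


key = b"example-key-held-by-the-framework"
row = seal({"task_id": "t1", "candidate": "print(1)"}, key)
tampered = {**row, "candidate": "print(2)"}
```

With this in place, `is_intact(row, key)` holds while `is_intact(tampered, key)` does not, so a resumed run could skip or re-run rows that fail the check. A plain unkeyed hash stored next to the row would not help, because whoever edits the row can recompute it.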

Safe-by-default release criteria

Before claiming a family is safe for adversarial agents, the repo docs/SECURITY.md defines criteria including:

  • Agent receives only public data
  • Hidden/eval inputs not readable by candidate code
  • Verifier runs from trusted paths
  • Candidate writes constrained to allowed outputs
  • Network/credentials denied by default
  • Integrity metadata and tamper regression tests

Today: text families are closest; code_completion, repo_patch, and provider CLI harnesses need the most hardening.
