SecureBench Developer Documentation
SecureBench is a benchmark execution framework for evaluating AI systems while keeping public task data, candidate-producing harnesses, and trusted verifier data separated. It is designed for benchmark operators who need to run agents against tasks without letting those agents read answers, tamper with scoring logic, or weaken verification.
This site is the main reference for understanding, using, extending, and contributing to SecureBench.
What problem does SecureBench solve?
Most benchmark runners treat the agent environment and the evaluator as one process. That works for trusted models, but it breaks down when:
- Hidden answers or tests must not leak to the agent.
- Candidate code or patches could modify test infrastructure.
- Results need to be auditable and reproducible.
- Provider API keys or benchmark secrets must not live in the same shell as an evaluated agent.
SecureBench addresses this by compiling each task into visibility lanes (public, evaluation_inputs, hidden), running candidate production in a sandboxed harness, and running verification afterward from trusted code paths with component-specific resource views.
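As a rough illustration of the lane idea, the sketch below sorts the fields of one task row into the three lanes and builds an agent-facing payload from the public lane only. The lane names and agent_payload() come from the text above. CompiledTask, compile_row, the visibility mapping format, and the sample row are hypothetical stand-ins and not the actual SecureBenchTask API. Exposing only the public lane to the agent is also an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any

# Lane names come from the documentation; everything else is hypothetical.
LANES = ("public", "evaluation_inputs", "hidden")


@dataclass(frozen=True)
class CompiledTask:
    """Illustrative stand-in for a compiled task, not the real SecureBenchTask."""

    task_id: str
    public: dict[str, Any] = field(default_factory=dict)
    evaluation_inputs: dict[str, Any] = field(default_factory=dict)
    hidden: dict[str, Any] = field(default_factory=dict)

    def agent_payload(self) -> dict[str, Any]:
        # Assumption: only the public lane is ever handed to the harness.
        return {"task_id": self.task_id, **self.public}


def compile_row(row: dict[str, Any], visibility: dict[str, str]) -> CompiledTask:
    """Sort each field of a raw task row into the lane named by `visibility`.

    Unmapped fields are rejected instead of defaulted, so nothing becomes
    agent-visible by accident (a design choice of this sketch).
    """
    lanes: dict[str, dict[str, Any]] = {lane: {} for lane in LANES}
    for key, value in row.items():
        if key == "id":
            continue
        lane = visibility.get(key)
        if lane not in lanes:
            raise ValueError(f"field {key!r} has no valid visibility lane")
        lanes[lane][key] = value
    return CompiledTask(task_id=row["id"], **lanes)


# Hypothetical multiple-choice row and visibility mapping.
row = {"id": "mc-001", "question": "2 + 2 = ?", "choices": ["3", "4"], "answer": "4"}
visibility = {"question": "public", "choices": "public", "answer": "hidden"}

task = compile_row(row, visibility)
print(task.agent_payload())
```

In this sketch the printed payload contains the question and choices but not the answer, which stays in the hidden lane for a later verifier step.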
How SecureBench differs from a normal benchmark runner
| Typical runner | SecureBench |
|---|---|
| One process reads prompts and scores answers | Separate harness (producer) and verifier phases |
| Tests and answers may share a filesystem with the agent | Resources are materialized per component with path policy |
| Shell/network access is often unrestricted | Docker sandboxes default to no network; optional domain-allowlisted egress for provider CLIs |
| Scoring logic may be co-located with agent code | Verifiers run after candidate extraction; hidden data is never in agent_payload() |
Current implementation status
The active pipeline is benchmark pack + tester YAML:
- Manifest YAML + JSONL task rows
- Visibility-aware compilation into SecureBenchTask
- Harness types: command, codex, claude_code
- Implemented verifiers: multiple_choice, short_answer, free_response, code_completion, repo_patch, terminal_task
- JSONL result output (candidates.jsonl)
Additional families (tool_call, browser_task, desktop_task, artifact_task, multimodal_qa, preference_pair) have compiler visibility mappings but no registered verifiers yet. Tasks in those families compile but return verification_status: "pending".
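The pending behavior can be pictured as a lookup in a verifier registry that falls back when no entry exists. The following is a minimal sketch under stated assumptions: VERIFIERS, run_verifier, verify_multiple_choice, the "verified" status value, and the passed field are all hypothetical. Only the family names and verification_status: "pending" are taken from the text above.

```python
from typing import Callable

# A verifier takes (candidate, hidden_data) and returns a result dict.
Verifier = Callable[[str, dict], dict]


def verify_multiple_choice(candidate: str, hidden: dict) -> dict:
    # Hypothetical scoring rule: case-insensitive match on the answer label.
    passed = candidate.strip().upper() == str(hidden["answer"]).strip().upper()
    # "verified" and "passed" are assumed names, not confirmed result fields.
    return {"verification_status": "verified", "passed": passed}


# Hypothetical registry keyed by family name; only one family is wired up here.
VERIFIERS: dict[str, Verifier] = {"multiple_choice": verify_multiple_choice}


def run_verifier(family: str, candidate: str, hidden: dict) -> dict:
    """Dispatch to a registered verifier, or report the task as pending."""
    verifier = VERIFIERS.get(family)
    if verifier is None:
        # The family compiled, but nothing can score it yet.
        return {"verification_status": "pending"}
    return verifier(candidate, hidden)


print(run_verifier("multiple_choice", " b ", {"answer": "B"}))
print(run_verifier("tool_call", "{}", {}))
```

Returning an explicit pending status, instead of raising an error or recording a failure, keeps unscored tasks distinguishable from failed ones in the result file.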
Legacy Hugging Face–oriented modules remain under securebench/legacy/ and are not used by the current CLI.
Documentation map
| Section | Topics |
|---|---|
| Getting Started | Install, run smoke benchmarks, interpret outputs |
| Architecture | Components, lifecycle, data flow |
| Benchmark Packs | Manifests, JSONL rows, assets, families |
| Tester Configuration | Harness selection, verification policy, egress |
| Tasks & Evaluators | Schemas, candidate kinds, verifier behavior |
| Security Model | Threat model, controls, known gaps |
| Extending SecureBench | New families, verifiers, harnesses, sandboxes |
| Reference | CLI, YAML, schemas, types |
| Troubleshooting | Common errors and fixes |
Canonical standards in the SecureBench repo
The SecureBench repository also ships the following standards and supporting notes under docs/:
- benchmark-family-standard.html: benchmark pack and family schema
- benchmark-family-examples.html: practical family examples
- tester-yaml-standard.html: tester YAML and harness configuration
- SECURITY.md: detailed per-family security notes
- next.md: current engineering follow-ups
- legacy.md: archived prototype layout
This documentation site expands those standards with architecture context and implementation detail.