Introduction

SecureBench Developer Documentation

SecureBench is a benchmark execution framework for evaluating AI systems while keeping public task data, candidate-producing harnesses, and trusted verifier data separated. It is designed for benchmark operators who need to run agents against tasks without letting those agents read answers, tamper with scoring logic, or weaken verification.

This site is the main reference for understanding, using, extending, and contributing to SecureBench.

What problem does SecureBench solve?

Most benchmark runners treat the agent environment and the evaluator as one process. That works for trusted models, but it breaks down when:

  • Hidden answers or tests must not leak to the agent.
  • Candidate code or patches could modify test infrastructure.
  • Results need to be auditable and reproducible.
  • Provider API keys or benchmark secrets must not live in the same shell as an evaluated agent.

SecureBench addresses this by compiling each task into visibility lanes (public, evaluation_inputs, hidden), running candidate production in a sandboxed harness, and running verification afterward from trusted code paths with component-specific resource views.
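The lane idea can be illustrated with a minimal sketch. Everything here is hypothetical: the `CompiledTask` class, the `compile_row` helper, and the field-to-lane mapping are stand-ins, not the real `SecureBenchTask` API. It also assumes that only the public lane reaches the agent and that unmapped fields fall into the hidden lane.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

LANES = ("public", "evaluation_inputs", "hidden")


@dataclass(frozen=True)
class CompiledTask:
    """Hypothetical stand-in for a compiled task; the real type may differ."""

    task_id: str
    public: dict[str, Any] = field(default_factory=dict)
    evaluation_inputs: dict[str, Any] = field(default_factory=dict)
    hidden: dict[str, Any] = field(default_factory=dict)

    def agent_payload(self) -> dict[str, Any]:
        # Assumption: only the public lane is ever handed to the harness.
        return {"task_id": self.task_id, **self.public}


def compile_row(row: dict[str, Any], lane_of: dict[str, str]) -> CompiledTask:
    """Sort each field of a task row into its lane.

    Fields missing from `lane_of` go to "hidden", so a forgotten mapping
    cannot leak data to the agent (an assumed fail-closed default).
    """
    buckets: dict[str, dict[str, Any]] = {lane: {} for lane in LANES}
    for key, value in row.items():
        if key == "task_id":
            continue
        lane = lane_of.get(key, "hidden")
        if lane not in buckets:
            raise ValueError(f"unknown lane {lane!r} for field {key!r}")
        buckets[lane][key] = value
    return CompiledTask(task_id=row["task_id"], **buckets)


example = compile_row(
    {"task_id": "t1", "question": "2 + 2?", "answer": "4", "notes": "internal"},
    {"question": "public", "answer": "hidden"},
)
print(example.agent_payload())
```

In this sketch the `answer` field and the unmapped `notes` field never appear in `agent_payload()`; a verifier would read them from the `hidden` lane instead.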

How SecureBench differs from a normal benchmark runner

| Typical runner | SecureBench |
| --- | --- |
| One process reads prompts and scores answers | Separate harness (producer) and verifier phases |
| Tests and answers may share a filesystem with the agent | Resources are materialized per component with path policy |
| Shell/network access is often unrestricted | Docker sandboxes default to no network; optional domain-allowlisted egress for provider CLIs |
| Scoring logic may be co-located with agent code | Verifiers run after candidate extraction; hidden data is never in agent_payload() |
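As a rough illustration of the "no network by default" row, the sketch below builds a `docker run` command line. The function name and the `egress_network` parameter are hypothetical and do not describe how SecureBench actually launches its sandboxes. `--rm` and `--network none` are standard Docker CLI options. Domain allowlisting itself is not shown; it would need something like a filtering proxy attached to the named network, which is an assumption here.

```python
from __future__ import annotations


def docker_run_argv(
    image: str,
    command: list[str],
    egress_network: str | None = None,
) -> list[str]:
    """Build a `docker run` argv that is network-isolated unless told otherwise.

    With no `egress_network`, the container gets `--network none`.
    Passing a name attaches it to that user-defined Docker network instead
    (hypothetically, one whose only route out is an allowlisting proxy).
    """
    network = egress_network if egress_network is not None else "none"
    return ["docker", "run", "--rm", "--network", network, image, *command]


print(docker_run_argv("python:3.12-slim", ["python", "-c", "print('hi')"]))
```

Returning an argv list rather than a shell string keeps the command safe to pass to `subprocess.run` without shell quoting.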

Current implementation status

The active pipeline is driven by a benchmark pack plus a tester YAML file. It currently covers:

  • Manifest YAML + JSONL task rows
  • Visibility-aware compilation into SecureBenchTask
  • Harness types: command, codex, claude_code
  • Implemented verifiers: multiple_choice, short_answer, free_response, code_completion, repo_patch, terminal_task
  • JSONL result output (candidates.jsonl)
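A JSONL result file such as candidates.jsonl can be summarized with a few lines of standard-library Python. The record layout below is an assumption: `task_id` and the status values other than "pending" are made up for illustration, and the real output schema may use different fields.

```python
from __future__ import annotations

import io
import json
from collections import Counter
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def tally_status(records: Iterable[dict]) -> Counter:
    """Count records by their (assumed) verification_status field."""
    return Counter(r.get("verification_status", "unknown") for r in records)


# In-memory sample standing in for a real candidates.jsonl file.
sample = io.StringIO(
    '{"task_id": "t1", "verification_status": "passed"}\n'
    "\n"
    '{"task_id": "t2", "verification_status": "pending"}\n'
)
print(tally_status(read_jsonl(sample)))
```

To run it against a real file, replace the sample with `open("candidates.jsonl", encoding="utf-8")` in a `with` block.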

Additional families (tool_call, browser_task, desktop_task, artifact_task, multimodal_qa, preference_pair) have compiler visibility mappings but no registered verifiers yet. Tasks in those families compile but return verification_status: "pending".
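The "compiles but stays pending" behavior can be pictured as a registry lookup with a fallback. This is a sketch under stated assumptions, not SecureBench's verifier API: the `register` decorator, the `verify` function, the verifier signature, and the "passed"/"failed" values are all hypothetical. Only the "pending" status comes from the description above.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical verifier signature: (candidate, hidden_data) -> result dict.
Verifier = Callable[[dict, dict], dict]

VERIFIERS: dict[str, Verifier] = {}


def register(family: str) -> Callable[[Verifier], Verifier]:
    """Decorator that registers a verifier under a family name."""

    def decorator(fn: Verifier) -> Verifier:
        VERIFIERS[family] = fn
        return fn

    return decorator


@register("multiple_choice")
def verify_multiple_choice(candidate: dict, hidden: dict) -> dict:
    # Compare the agent's choice with the hidden answer key.
    ok = candidate.get("answer") == hidden.get("answer")
    return {"verification_status": "passed" if ok else "failed"}


def verify(family: str, candidate: dict, hidden: dict) -> dict:
    """Dispatch to the family's verifier, or report pending if none exists."""
    verifier = VERIFIERS.get(family)
    if verifier is None:
        return {"verification_status": "pending"}
    return verifier(candidate, hidden)


print(verify("multiple_choice", {"answer": "B"}, {"answer": "B"}))
print(verify("browser_task", {"trace": []}, {}))
```

Returning a pending result instead of raising lets a mixed pack run to completion while making the unscored tasks visible in the output.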

Legacy Hugging Face–oriented modules remain under securebench/legacy/ and are not used by the current CLI.

Documentation map

| Section | Topics |
| --- | --- |
| Getting Started | Install, run smoke benchmarks, interpret outputs |
| Architecture | Components, lifecycle, data flow |
| Benchmark Packs | Manifests, JSONL rows, assets, families |
| Tester Configuration | Harness selection, verification policy, egress |
| Tasks & Evaluators | Schemas, candidate kinds, verifier behavior |
| Security Model | Threat model, controls, known gaps |
| Extending SecureBench | New families, verifiers, harnesses, sandboxes |
| Reference | CLI, YAML, schemas, types |
| Troubleshooting | Common errors and fixes |

Canonical standards in the SecureBench repo

The SecureBench repository also ships standards (HTML) and supporting notes (Markdown) under docs/:

  • benchmark-family-standard.html — benchmark pack and family schema
  • benchmark-family-examples.html — practical family examples
  • tester-yaml-standard.html — tester YAML and harness configuration
  • SECURITY.md — detailed per-family security notes
  • next.md — current engineering follow-ups
  • legacy.md — archived prototype layout

This documentation site expands those standards with architecture context and implementation detail.
