A Secure Standard for Agentic Benchmarks

SecureBench is built for agentic AI: the model plus the harness, tools, and shell around it. It defines how benchmark families are structured, from pack layout to row schemas, and enforces that contract at run time. Answers are kept out of the agent's reach, execution is split across two sandboxes, and scoring runs in trusted code, so a result reflects the task rather than a shortcut.


Benchmark scores decide which models ship

Teams trust benchmarks to say whether a model is good. But modern evaluations run agents, a model plus the harness and tools around it, and most benchmarks were not built to resist one. As agents get better at reaching a goal, they get better at reaching it the wrong way, and a loose harness lets them. When that happens, the number stops measuring the task.

Leaked answers
Gold patches, expected state, and checker data sit where the agent can read them.
Tampered verification
The agent edits the tests, dependencies, or scorer that grade its own work.
Gamed scoring
Output-only checks pass on candidates that never solved the task.

One contract, two sandboxes, no trust in the agent

SecureBench treats the agent as untrusted and gives every benchmark the same shape. A visibility contract sets what the agent can see, and two separate sandboxes split producing a candidate from grading it.

Visibility contract

Every input is assigned a lane. The lane decides which component view can read it.

public
Prompts, instructions, and starter assets.
evaluation_inputs
Tests and checkers, staged only for grading.
hidden
Gold patches, expected state, and analysis data.
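
As an illustration of how a lane can gate reads, here is a minimal Python sketch. It is not SecureBench's actual API: the view names (agent, verifier, evaluator), the function names, and the file names are hypothetical, and the lane-to-view mapping is an assumption drawn from the descriptions on this page. The page does not say whether views overlap, so the sketch keeps them disjoint and denies by default.

```python
from enum import Enum


class Lane(Enum):
    PUBLIC = "public"
    EVALUATION_INPUTS = "evaluation_inputs"
    HIDDEN = "hidden"


# Assumed mapping from view to readable lanes (hypothetical view names).
VIEW_LANES = {
    "agent": frozenset({Lane.PUBLIC}),
    "verifier": frozenset({Lane.EVALUATION_INPUTS}),
    "evaluator": frozenset({Lane.HIDDEN}),
}


def can_read(view: str, lane: Lane) -> bool:
    """Deny by default: an unknown view can read nothing."""
    return lane in VIEW_LANES.get(view, frozenset())


def visible_inputs(view: str, inputs: dict[str, Lane]) -> list[str]:
    """Return the names of the inputs that the given view may read."""
    return sorted(name for name, lane in inputs.items() if can_read(view, lane))


# Hypothetical inputs for one task, each assigned to a lane.
inputs = {
    "prompt.md": Lane.PUBLIC,
    "tests/test_fix.py": Lane.EVALUATION_INPUTS,
    "gold.patch": Lane.HIDDEN,
}
agent_files = visible_inputs("agent", inputs)
```

The point of the design is that the lane is attached to the input, not decided by the component that asks for it, so a new component cannot widen its own access.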

Two sandboxes

The agent never runs in the same environment that grades it.

Sandbox 1

Candidate production

Sees public

Runs the agent harness under a path policy.
No network by default; egress is allowlisted.
Produces a candidate; never sees verifier code.

Sandbox 2

Verification

Sees evaluation_inputs

Fresh sandbox per check.
Trusted checker code computes the score.
Hidden values stay in the evaluator view.
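
The two-phase flow can be sketched in Python as follows. This only shows the control flow and is not the SecureBench implementation: temporary directories stand in for sandboxes and provide no real isolation, and every name here (stage, produce_candidate, verify, candidate.txt) is hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def stage(files: dict[str, str], root: Path) -> None:
    """Write a set of files into a directory."""
    for rel, text in files.items():
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)


def produce_candidate(public: dict[str, str], agent: Callable[[Path], str]) -> str:
    """Phase 1: the agent runs in a directory that holds only public inputs."""
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        stage(public, root)
        return agent(root)


def verify(
    candidate: str,
    evaluation_inputs: dict[str, str],
    checks: list[Callable[[Path], bool]],
) -> float:
    """Phase 2: each check gets a fresh directory; the score is computed here,
    in code the agent never ran in or wrote to."""
    passed = 0
    for check in checks:
        with tempfile.TemporaryDirectory() as d:
            root = Path(d)
            stage(evaluation_inputs, root)
            (root / "candidate.txt").write_text(candidate)
            passed += bool(check(root))
    return passed / len(checks) if checks else 0.0
```

Only the candidate crosses from the first phase to the second. Because each check starts from a freshly staged directory, state left behind by one check cannot influence the next.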

Modular by design. The repo_patch and terminal_task families ship today. A new benchmark type reuses the same lanes and two-phase execution by defining a family schema, so the security model does not change when the task does.
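
To make the idea of a family schema concrete, here is a hedged Python sketch in which a family declares its row fields and the lane each field belongs to. The FamilySchema class and the field names are hypothetical and do not describe the real repo_patch schema. The sketch only illustrates how a new family could reuse the same three lanes.

```python
from dataclasses import dataclass

LANES = ("public", "evaluation_inputs", "hidden")


@dataclass(frozen=True)
class FamilySchema:
    """Hypothetical family schema: each row field is assigned to one lane."""

    name: str
    field_lanes: dict[str, str]

    def __post_init__(self) -> None:
        for field, lane in self.field_lanes.items():
            if lane not in LANES:
                raise ValueError(f"{self.name}.{field}: unknown lane {lane!r}")

    def split_by_lane(self, row: dict[str, str]) -> dict[str, dict[str, str]]:
        """Validate a row against the schema and group its fields by lane."""
        missing = set(self.field_lanes) - set(row)
        unknown = set(row) - set(self.field_lanes)
        if missing or unknown:
            raise ValueError(
                f"{self.name}: missing={sorted(missing)} unknown={sorted(unknown)}"
            )
        out: dict[str, dict[str, str]] = {lane: {} for lane in LANES}
        for field, value in row.items():
            out[self.field_lanes[field]][field] = value
        return out


# Illustrative field names only, chosen to match the lane descriptions above.
repo_patch = FamilySchema(
    name="repo_patch",
    field_lanes={
        "prompt": "public",
        "tests": "evaluation_inputs",
        "gold_patch": "hidden",
    },
)
views = repo_patch.split_by_lane(
    {"prompt": "Fix the bug.", "tests": "test_fix.py", "gold_patch": "diff ..."}
)
```

In this sketch a row with an undeclared field is rejected instead of defaulting to a lane, so nothing reaches the agent unless the schema explicitly marks it public.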

Start reading

Getting started
Install, run a benchmark pack, and read JSONL output.
Benchmark packs
Manifests, JSONL rows, families, and visibility lanes.
Security model
Threat model, trust boundary, and known gaps.