SecureBench for Trustworthy Benchmark Review

For the Berkeley RDI team. This is a dedicated entry point we are sending alongside our email so you can get started quickly. It is a temporary page focused on your review; once the review is complete it will be folded into the main documentation and retired.

Why this page exists

Berkeley RDI’s work on trustworthy agent benchmarks showed that many benchmarks can be exploited through evaluator tampering, leaked answers, unsafe scoring, unrestricted egress, and misplaced trust in untrusted artifacts. This page provides a reviewer-oriented entry point for testing whether SecureBench resists those failure modes.

It is written for reviewers, not buyers. There is no marketing copy here. The goal is to make it easy to point a benchmark-scanning methodology at SecureBench and find where it holds and where it does not.

See the Berkeley RDI post: How We Broke Top AI Agent Benchmarks.

One-sentence claim

This is a narrow claim, not a guarantee of general benchmark security.

SecureBench is designed to make benchmark security properties explicit and enforceable through visibility lanes, isolated candidate production, trusted verification, path policy, and default-deny network behavior.

Failure-class mapping

This table is the core of the page. Each row maps a Berkeley failure class to the SecureBench control intended to address it and the adversarial test you can run to probe it.

Berkeley failure class	SecureBench control	How to test it
Agent can read answers/checker data	Public / evaluation / hidden visibility lanes	`leak_probe`
Agent and evaluator share mutable state	Separate candidate and verifier phases, trusted evaluator mounts	`terminal_poison`
Test infrastructure tampering	Repo-patch path policy blocks tests, CI, dependencies, SecureBench paths	`repo_tamper`
Unsafe `eval()` / command execution	Trusted verifier code, hardened terminal checkers	`evaluator_escape`
Network/tool exfiltration	No-network default, allowlisted egress, provider relay	`network_exfil`
Weak string matching / output-only scoring	Hidden checkers reject magic-output candidates	`output_only`
Trusting untrusted logs/reward files	Candidate artifacts treated as untrusted; verifier computes score	`artifact_forgery`
Result integrity / replay hardening	Known gap in v1	Not claimed yet
LLM judge prompt injection	Not in current implemented families	Not applicable yet
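
To make the "Test infrastructure tampering" row concrete, here is a minimal sketch of how a deny-list path policy for a repo patch could work. It is an illustration under stated assumptions, not SecureBench's implementation: the function name `blocked_paths`, the `DENIED_PATTERNS` tuple, and every pattern in it (including the `.securebench/` directory name) are hypothetical placeholders.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical deny patterns standing in for "tests, CI, dependencies,
# SecureBench paths". The real policy may use different paths and rules.
DENIED_PATTERNS = (
    "tests/*",
    ".github/*",
    "requirements*.txt",
    "pyproject.toml",
    ".securebench/*",
)


def blocked_paths(changed_paths, denied=DENIED_PATTERNS):
    """Return the changed paths that a deny-list policy would reject."""
    blocked = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Absolute paths and parent traversal are rejected outright, so a
        # patch cannot reach outside the repository root.
        if path.is_absolute() or ".." in path.parts:
            blocked.append(raw)
            continue
        # as_posix() drops "./" segments, so "./tests/x" matches "tests/*".
        normalized = path.as_posix()
        if any(fnmatchcase(normalized, pattern) for pattern in denied):
            blocked.append(raw)
    return blocked
```

A candidate patch touching `src/app.py` and `tests/test_app.py` would have only the second path reported as blocked under these example patterns. A reviewer probing `repo_tamper` might look for paths that influence test outcomes but fall outside whatever patterns the real policy uses.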

Reviewer quick start

Clone the repository, create a virtual environment, install the development extras, and run the included Berkeley test suite.


git clone https://github.com/nathanhoulamy/securebench.git
cd securebench
python -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
python -m pytest berkeley/tests/test_berkeley_run.py

Then run the live adversarial configs against a provider:


OPENAI_API_KEY=... python -m berkeley.run --config berkeley/configs/audit-terminal-live-openai-leak-probe.yaml
OPENAI_API_KEY=... python -m berkeley.run --config berkeley/configs/audit-terminal-live-openai-network-exfil.yaml
OPENAI_API_KEY=... python -m berkeley.run --config berkeley/configs/audit-repo-patch-live-openai-repo-tamper.yaml

Bring your own methodology

Berkeley reviewers do not need to trust our bundled adversary.

The external mode lets an independently written scanner or adversarial agent run under the same public-workspace contract and produce an ordinary SecureBench candidate.

This means you can substitute your own scanning methodology while keeping the same workspace contract, candidate format, and verifier path that the bundled adversary uses. The control surface SecureBench enforces does not change based on who produced the candidate.

See the contract: berkeley/agents/external_contract.md.

Expected artifacts

Each run produces artifacts you can inspect directly:

results.jsonl — normal SecureBench candidate and verifier result
summary.json — aggregate pass/fail and violation counts
traces/*.json — adversary actions, observations, errors, and violations

How to read them:

Treat a non-empty trace.violations as potential evidence of a framework-level weakness in SecureBench. A non-empty trace.errors usually means the adversary failed to run or produced invalid actions; it does not show that SecureBench is secure.
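
The reading rule above can be sketched as a small triage script. This is a rough sketch, assuming each file under traces/ is a JSON object with optional list-valued `violations` and `errors` keys. That schema is inferred from the names trace.violations and trace.errors and is not confirmed here. The function name `triage_traces` is hypothetical.

```python
import json
from pathlib import Path


def triage_traces(trace_dir):
    """Sort trace files into violations, errors-only, and clean buckets."""
    report = {"violations": [], "errors_only": [], "clean": []}
    for path in sorted(Path(trace_dir).glob("*.json")):
        trace = json.loads(path.read_text(encoding="utf-8"))
        # Assumed schema: top-level "violations" and "errors" lists.
        violations = trace.get("violations") or []
        errors = trace.get("errors") or []
        if violations:
            # Potential framework-level finding: inspect these first.
            report["violations"].append(path.name)
        elif errors:
            # The adversary likely failed to run; this is not evidence
            # that the control held.
            report["errors_only"].append(path.name)
        else:
            report["clean"].append(path.name)
    return report
```

Calling `triage_traces("traces")` from a run's output directory returns the file names in each bucket. Files in the errors-only bucket are worth re-running rather than counting as passes.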

Known limits

We would rather be direct about scope than overclaim.

Current families are repo_patch and terminal_task.
Result signing and resume/replay hardening are not claimed in v1.
Current families do not use LLM-as-judge scoring, so LLM judge prompt-injection defenses are out of scope for now.
Passing bundled adversarial profiles is not proof of general benchmark security.

Ask from Berkeley

We would value an independent review using your benchmark-scanning methodology. In particular, we are looking for cases where SecureBench exposes hidden/evaluation data, lets a candidate influence trusted verification, allows undeclared egress, trusts candidate-controlled artifacts, or produces a passing score without the task being solved.

The tone we are aiming for is “please try to break this,” not “we solved benchmark security.”