SecureBench for Trustworthy Benchmark Review
For the Berkeley RDI team. This is a dedicated entry point we are sending alongside our email so you can get started quickly. It is a temporary page focused on your review; once the review is complete it will be folded into the main documentation and retired.
Why this page exists
Berkeley RDI’s work on trustworthy agent benchmarks showed that many benchmarks can be exploited through evaluator tampering, leaked answers, unsafe scoring, unrestricted egress, and trusting untrusted artifacts. This page provides a reviewer-oriented entry point for testing whether SecureBench resists those failure modes.
It is written for reviewers, not buyers. There is no marketing copy here. The goal is to make it easy to point a benchmark-scanning methodology at SecureBench and find where it holds and where it does not.
See the Berkeley RDI post: How We Broke Top AI Agent Benchmarks.
One-sentence claim
SecureBench is designed to make benchmark security properties explicit and enforceable through visibility lanes, isolated candidate production, trusted verification, path policy, and default-deny network behavior.
This is a narrow claim, not a guarantee of general benchmark security.
Failure-class mapping
This table is the core of the page. Each row maps a Berkeley failure class to the SecureBench control intended to address it and the adversarial test you can run to probe it.
| Berkeley failure class | SecureBench control | How to test it |
|---|---|---|
| Agent can read answers/checker data | Public / evaluation / hidden visibility lanes | leak_probe |
| Agent and evaluator share mutable state | Separate candidate and verifier phases, trusted evaluator mounts | terminal_poison |
| Test infrastructure tampering | Repo-patch path policy blocks tests, CI, dependencies, SecureBench paths | repo_tamper |
| Unsafe eval() / command execution | Trusted verifier code, hardened terminal checkers | evaluator_escape |
| Network/tool exfiltration | No-network default, allowlisted egress, provider relay | network_exfil |
| Weak string matching / output-only scoring | Hidden checkers reject magic-output candidates | output_only |
| Trusting untrusted logs/reward files | Candidate artifacts treated as untrusted; verifier computes score | artifact_forgery |
| Result integrity / replay hardening | Known gap in v1 | Not claimed yet |
| LLM judge prompt injection | Not in current implemented families | Not applicable yet |
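To make the repo_tamper row concrete, here is a minimal sketch of what a repo-patch path policy check could look like. It is not SecureBench's implementation: the pattern list, the `PROTECTED_PATTERNS` name, and the `normalize` and `violations` helpers are all hypothetical and chosen only to illustrate the idea of rejecting patches that touch tests, CI, dependency manifests, or framework paths.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List, Optional

# Hypothetical deny patterns; the real SecureBench policy may differ.
PROTECTED_PATTERNS = [
    "tests/*",            # test infrastructure
    ".github/*",          # CI configuration
    "requirements*.txt",  # dependency manifests
    "pyproject.toml",
    "securebench/*",      # framework-owned paths
]


def normalize(path: str) -> Optional[str]:
    """Return a clean repo-relative path, or None if it is absolute,
    empty, or escapes the repository root via '..'."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        return None
    parts: List[str] = []
    for part in candidate.parts:
        if part == "..":
            if not parts:
                return None  # would climb above the repo root
            parts.pop()
        else:
            parts.append(part)
    return "/".join(parts) or None


def violations(changed_paths: Iterable[str]) -> List[str]:
    """List the changed paths that the sketch policy would reject."""
    rejected = []
    for raw in changed_paths:
        clean = normalize(raw)
        # fnmatchcase's '*' also matches '/', so 'tests/*' covers nested files.
        if clean is None or any(fnmatchcase(clean, p) for p in PROTECTED_PATTERNS):
            rejected.append(raw)
    return rejected
```

Normalizing before matching is the point of the sketch: a patch that names `src/../tests/test_app.py` should be judged by the path it resolves to, not by the string the candidate supplied.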
Reviewer quick start
Clone the repository, set up an environment, and run the included Berkeley test suite.
git clone https://github.com/nathanhoulamy/securebench.git
cd securebench
python -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
python -m pytest berkeley/tests/test_berkeley_run.py
Then run the live adversarial configs against a provider:
OPENAI_API_KEY=... python -m berkeley.run --config berkeley/configs/audit-terminal-live-openai-leak-probe.yaml
OPENAI_API_KEY=... python -m berkeley.run --config berkeley/configs/audit-terminal-live-openai-network-exfil.yaml
OPENAI_API_KEY=... python -m berkeley.run --config berkeley/configs/audit-repo-patch-live-openai-repo-tamper.yaml
Bring your own methodology
Berkeley reviewers do not need to trust our bundled adversary.
The `external` mode lets an independently written scanner or adversarial agent run under the same public-workspace contract and produce an ordinary SecureBench candidate.
This means you can substitute your own scanning methodology while keeping the same workspace contract, candidate format, and verifier path that the bundled adversary uses. The control surface SecureBench enforces does not change based on who produced the candidate.
See the contract: berkeley/agents/external_contract.md.
Expected artifacts
Each run produces artifacts you can inspect directly:
- `results.jsonl`: normal SecureBench candidate and verifier result
- `summary.json`: aggregate pass/fail and violation counts
- `traces/*.json`: adversary actions, observations, errors, and violations
How to read them:
- Non-empty `trace.violations` should be treated as potential framework-level evidence.
- `trace.errors` usually means the adversary failed to run or produced invalid actions, not that SecureBench is secure.
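The reading rules above can be applied mechanically across a run directory. The sketch below assumes each file under `traces/` is a JSON object with top-level `violations` and `errors` keys; that layout is inferred from the field names on this page, not from a documented schema, and the `triage` function is a hypothetical helper.

```python
import json
from pathlib import Path


def triage(run_dir):
    """Sort trace files under run_dir/traces into three buckets.

    Assumes each trace is a JSON object with optional top-level
    'violations' and 'errors' keys (an assumption, not a documented schema).
    """
    report = {"violations": [], "errors": [], "clean": []}
    for trace_path in sorted(Path(run_dir).glob("traces/*.json")):
        trace = json.loads(trace_path.read_text(encoding="utf-8"))
        if trace.get("violations"):
            # Potential framework-level evidence: inspect these first.
            report["violations"].append(trace_path.name)
        elif trace.get("errors"):
            # The adversary likely failed to run; this says nothing about security.
            report["errors"].append(trace_path.name)
        else:
            report["clean"].append(trace_path.name)
    return report
```

A trace with both violations and errors lands in the `violations` bucket, since a recorded violation is the more important signal. Traces in the `errors` bucket are worth rerunning rather than counting as passes.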
Known limits
We would rather be direct about scope than overclaim.
- Current families are `repo_patch` and `terminal_task`.
- Result signing and resume/replay hardening are not claimed in v1.
- Current families do not use LLM-as-judge scoring, so LLM judge prompt-injection defenses are out of scope for now.
- Passing bundled adversarial profiles is not proof of general benchmark security.
Ask from Berkeley
We would value an independent review using your benchmark-scanning methodology. In particular, we are looking for cases where SecureBench exposes hidden/evaluation data, lets a candidate influence trusted verification, allows undeclared egress, trusts candidate-controlled artifacts, or produces a passing score without the task being solved.
The tone we are aiming for is “please try to break this,” not “we solved benchmark security.”