
SecureBench for Trustworthy Benchmark Review

For the Berkeley RDI team. This is a dedicated entry point we are sending alongside our email so you can get started quickly. It is a temporary page focused on your review; once the review is complete it will be folded into the main documentation and retired.

Why this page exists

Berkeley RDI’s work on trustworthy agent benchmarks showed that many benchmarks can be exploited through evaluator tampering, leaked answers, unsafe scoring, unrestricted egress, and misplaced trust in untrusted artifacts. This page provides a reviewer-oriented entry point for testing whether SecureBench resists those failure modes.

It is written for reviewers, not buyers. There is no marketing copy here. The goal is to make it easy to point a benchmark-scanning methodology at SecureBench and find where it holds and where it does not.

See the Berkeley RDI post: How We Broke Top AI Agent Benchmarks.

One-sentence claim

This is a narrow claim, not a guarantee of general benchmark security.

SecureBench is designed to make benchmark security properties explicit and enforceable through visibility lanes, isolated candidate production, trusted verification, path policy, and default-deny network behavior.

Failure-class mapping

This table is the core of the page. Each row maps a Berkeley failure class to the SecureBench control intended to address it and the adversarial test you can run to probe it.

| Berkeley failure class | SecureBench control | How to test it |
| --- | --- | --- |
| Agent can read answers/checker data | Public / evaluation / hidden visibility lanes | leak_probe |
| Agent and evaluator share mutable state | Separate candidate and verifier phases, trusted evaluator mounts | terminal_poison |
| Test infrastructure tampering | Repo-patch path policy blocks tests, CI, dependencies, SecureBench paths | repo_tamper |
| Unsafe eval() / command execution | Trusted verifier code, hardened terminal checkers | evaluator_escape |
| Network/tool exfiltration | No-network default, allowlisted egress, provider relay | network_exfil |
| Weak string matching / output-only scoring | Hidden checkers reject magic-output candidates | output_only |
| Trusting untrusted logs/reward files | Candidate artifacts treated as untrusted; verifier computes score | artifact_forgery |
| Result integrity / replay hardening | Known gap in v1 | Not claimed yet |
| LLM judge prompt injection | Not in current implemented families | Not applicable yet |
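
To make the "Test infrastructure tampering" row concrete, here is a minimal sketch of what a repo-patch path policy could look like. It is not SecureBench's implementation: the deny patterns, the `DENIED_PATTERNS` name, and the `is_patch_path_allowed` helper are all hypothetical, chosen only to illustrate the idea of rejecting patches that touch tests, CI, dependency manifests, or framework paths.

```python
import posixpath
from fnmatch import fnmatchcase

# Hypothetical deny list for illustration only; the patterns SecureBench
# actually enforces are not documented on this page and may differ.
DENIED_PATTERNS = [
    "tests/*",            # test suites
    ".github/*",          # CI configuration
    "requirements*.txt",  # dependency manifests
    "pyproject.toml",
    ".securebench/*",     # assumed location of framework-owned files
]


def is_patch_path_allowed(path: str) -> bool:
    """Return True if a candidate patch may touch `path` (relative to repo root)."""
    # Normalize first so "src/../tests/x.py" cannot slip past the patterns.
    normalized = posixpath.normpath(path.replace("\\", "/"))
    # Reject absolute paths and anything that escapes the repository root.
    if normalized.startswith("/") or normalized == ".." or normalized.startswith("../"):
        return False
    # fnmatchcase's "*" also matches "/", so "tests/*" covers nested files.
    return not any(fnmatchcase(normalized, pattern) for pattern in DENIED_PATTERNS)
```

A reviewer probing this kind of control would typically try traversal (`src/../tests/test_app.py`), absolute paths, and symlinked paths. The sketch handles the first two; symlinks require checks against the real filesystem and are out of its scope.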

Reviewer quick start

Clone the repository, set up an environment, and run the included Berkeley test suite.

    git clone https://github.com/nathanhoulamy/securebench.git
    cd securebench
    python -m venv .venv
    . .venv/bin/activate
    pip install -e ".[dev]"
    python -m pytest berkeley/tests/test_berkeley_run.py

Then run the live adversarial configs against a provider:

    OPENAI_API_KEY=... python -m berkeley.run --config berkeley/configs/audit-terminal-live-openai-leak-probe.yaml
    OPENAI_API_KEY=... python -m berkeley.run --config berkeley/configs/audit-terminal-live-openai-network-exfil.yaml
    OPENAI_API_KEY=... python -m berkeley.run --config berkeley/configs/audit-repo-patch-live-openai-repo-tamper.yaml
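
If you prefer to run the three live configs in one go, a small wrapper such as the following may help. It is a convenience sketch, not part of SecureBench: `build_command` and `run_all` are hypothetical helper names, and only the `python -m berkeley.run --config ...` invocation and the config paths come from the commands above. How `berkeley.run` uses its exit code is not documented on this page, so the wrapper records the codes without interpreting them.

```python
import os
import subprocess
import sys

# Config paths copied from the commands above.
CONFIGS = [
    "berkeley/configs/audit-terminal-live-openai-leak-probe.yaml",
    "berkeley/configs/audit-terminal-live-openai-network-exfil.yaml",
    "berkeley/configs/audit-repo-patch-live-openai-repo-tamper.yaml",
]


def build_command(config_path):
    """Build the same invocation as the shell command, using the active interpreter."""
    return [sys.executable, "-m", "berkeley.run", "--config", config_path]


def run_all(configs=CONFIGS):
    """Run each config in sequence and return {config_path: exit_code}."""
    if not os.environ.get("OPENAI_API_KEY"):
        raise SystemExit("OPENAI_API_KEY is not set")
    exit_codes = {}
    for config_path in configs:
        # check=False: keep going so one failing profile does not hide the others.
        completed = subprocess.run(build_command(config_path), check=False)
        exit_codes[config_path] = completed.returncode
    return exit_codes
```

Call `run_all()` from the repository root with the virtual environment active, then inspect the artifacts of each run rather than relying on the exit codes alone.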

Bring your own methodology

Berkeley reviewers do not need to trust our bundled adversary.

The external mode lets an independently written scanner or adversarial agent run under the same public-workspace contract and produce an ordinary SecureBench candidate.

This means you can substitute your own scanning methodology while keeping the same workspace contract, candidate format, and verifier path that the bundled adversary uses. The control surface SecureBench enforces does not change based on who produced the candidate.

See the contract: berkeley/agents/external_contract.md.

Expected artifacts

Each run produces artifacts you can inspect directly:

  • results.jsonl — normal SecureBench candidate and verifier result
  • summary.json — aggregate pass/fail and violation counts
  • traces/*.json — adversary actions, observations, errors, and violations

How to read them:

A non-empty trace.violations should be treated as potential evidence of a framework-level weakness and investigated. A non-empty trace.errors usually means the adversary failed to run or produced invalid actions; it does not show that SecureBench is secure.
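
The reading rule above can be applied mechanically across a run directory. The sketch below assumes each file under traces/ is a JSON object with top-level `violations` and `errors` lists; that layout is inferred from the names trace.violations and trace.errors and is not a documented schema. The `triage_traces` helper and the status labels are hypothetical.

```python
import json
from pathlib import Path


def triage_traces(run_dir):
    """Classify each trace in `run_dir`/traces by the reading rule above."""
    findings = []
    for trace_path in sorted(Path(run_dir).glob("traces/*.json")):
        trace = json.loads(trace_path.read_text(encoding="utf-8"))
        # Assumed schema: top-level "violations" and "errors" lists.
        violations = trace.get("violations") or []
        errors = trace.get("errors") or []
        if violations:
            status = "investigate"    # potential framework-level evidence
        elif errors:
            status = "inconclusive"   # the adversary likely failed to run properly
        else:
            status = "clean"          # nothing recorded; not proof of security
        findings.append({
            "trace": trace_path.name,
            "status": status,
            "violations": len(violations),
            "errors": len(errors),
        })
    return findings
```

Violations take precedence over errors here, so a trace that contains both is still flagged for investigation.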

Known limits

We would rather be direct about scope than overclaim.

  • Current families are repo_patch and terminal_task.
  • Result signing and resume/replay hardening are not claimed in v1.
  • Current families do not use LLM-as-judge scoring, so LLM judge prompt-injection defenses are out of scope for now.
  • Passing bundled adversarial profiles is not proof of general benchmark security.

Ask from Berkeley

We would value an independent review using your benchmark-scanning methodology. In particular, we are looking for cases where SecureBench exposes hidden/evaluation data, lets a candidate influence trusted verification, allows undeclared egress, trusts candidate-controlled artifacts, or produces a passing score without the task being solved.

The tone we are aiming for is “please try to break this,” not “we solved benchmark security.”
