Extending SecureBench

SecureBench is scoped to agentic benchmark execution. The supported families are repo_patch and terminal_task. Most extensions should add verifier behavior, harness integrations, audit checks, or a new agentic family contract.

Extension points

| Goal | Where to implement |
| --- | --- |
| New agentic family schema | `securebench/families/<name>.py` and `families/registry.py` |
| New verifier | `securebench/verifiers/<name>.py` and `verifiers/registry.py` |
| New harness type | `securebench/harnesses/<name>.py`, registry, and `tester_config.py` |
| New sandbox backend | `securebench/sandboxes/<name>.py` |
| Eval visibility mapping | `benchmark_compiler.EVAL_VISIBILITY` |
| Materialization/path rules | `workspaces/materialization.py`, `workspaces/path_policy.py` |
| Audit coverage | `securebench/audit/`, `tests/audit/` |

Add tests under tests/ for every extension.

Add an agentic family

New families should produce either a patch-like artifact or a workspace-like artifact unless the project scope changes.

Define the contract

In families/registry.py:


FAMILY_CONTRACTS["my_workspace_task"] = FamilyContract(
    "my_workspace_task",
    "workspace",
    requires_workspace=True,
)

The runtime currently exercises patch and workspace candidate kinds.

Validate rows

Create families/my_workspace_task.py:


def validate(row, context: str) -> None:
    reject_unknown(row.input, {"instructions", "context"}, f"{context}.input")
    reject_unknown(row.eval, {"checker", "expected_state"}, f"{context}.eval")
    required_non_empty_string(row.input, "instructions", f"{context}.input")
    required_object(row.eval, "checker", f"{context}.eval")
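
The validator above leans on helpers such as reject_unknown, required_non_empty_string, and required_object, whose implementations are not shown in this guide. The following is a minimal sketch of how such helpers might behave, assuming they take a plain mapping and raise on the first problem. The RowValidationError type and the exact error messages are hypothetical; the real helpers may differ.

```python
from collections.abc import Mapping
from typing import Any


class RowValidationError(ValueError):
    """Hypothetical error type; the real helpers may raise something else."""


def reject_unknown(obj: Mapping[str, Any], allowed: set[str], context: str) -> None:
    # Fail on any key outside the allowed set so stray fields cannot slip through.
    unknown = sorted(set(obj) - allowed)
    if unknown:
        raise RowValidationError(f"{context}: unknown fields {unknown}")


def required_non_empty_string(obj: Mapping[str, Any], key: str, context: str) -> None:
    # Missing keys, non-strings, and whitespace-only strings are all rejected.
    value = obj.get(key)
    if not isinstance(value, str) or not value.strip():
        raise RowValidationError(f"{context}.{key}: expected a non-empty string")


def required_object(obj: Mapping[str, Any], key: str, context: str) -> None:
    # "Object" here means a JSON object, i.e. a mapping after parsing.
    if not isinstance(obj.get(key), Mapping):
        raise RowValidationError(f"{context}.{key}: expected an object")
```

Rejecting unknown fields, rather than ignoring them, keeps a misspelled or unexpected eval field from being silently dropped or mis-mapped later.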

Map eval visibility

In benchmark_compiler.py:


EVAL_VISIBILITY["my_workspace_task"] = {
    "checker": "evaluation_inputs",
    "expected_state": "hidden",
}
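
To illustrate how a visibility mapping like the one above could be consumed, here is a small sketch that groups a row's eval fields by their visibility label and fails closed when a field has no mapping. The partition_eval function is hypothetical and is not part of benchmark_compiler; the local EVAL_VISIBILITY dict only mirrors the example mapping.

```python
from collections import defaultdict
from typing import Any

# Local copy of the example mapping, for illustration only.
EVAL_VISIBILITY = {
    "my_workspace_task": {
        "checker": "evaluation_inputs",
        "expected_state": "hidden",
    }
}


def partition_eval(family: str, eval_fields: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Group eval fields by visibility label; reject fields with no mapping."""
    mapping = EVAL_VISIBILITY[family]
    buckets: dict[str, dict[str, Any]] = defaultdict(dict)
    for name, value in eval_fields.items():
        if name not in mapping:
            # Fail closed: an unmapped field must never default to a visible bucket.
            raise KeyError(f"{family}.eval.{name} has no visibility mapping")
        buckets[mapping[name]][name] = value
    return dict(buckets)
```

Failing on unmapped fields is a design choice for this sketch: it forces every new eval field to receive an explicit visibility decision.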

Implement a verifier

Without a registered verifier, result records remain pending. Production-facing families should define verifier behavior before public examples are added.

Document the threat model

Before treating a family as safe for adversarial agents, document:

files the agent may read and modify
checker and evaluator files that remain trusted
network or dangerous-command requirements
candidate tamper detection
audit or regression tests covering the family
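
One way to approach the candidate tamper detection item is to hash trusted files before and after the candidate runs and report any difference. The sketch below assumes trusted assets live under a single directory; snapshot_digests and tampered_paths are hypothetical names, not SecureBench APIs, and the sketch does not address symlinks or file metadata.

```python
import hashlib
from pathlib import Path


def snapshot_digests(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def tampered_paths(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified between two snapshots."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))
```

Taking the first snapshot before any candidate-controlled code runs, and comparing after it finishes, gives a regression test a concrete list of paths to assert on.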

Add a verifier


class MyVerifier(Verifier):
    def verify(self, task, candidate, **context):
        # Treat candidate patch/workspace as hostile.
        # Read trusted data through resource_value(task, name).
        # Placeholder result: derive status, passed, and score from the actual check outcome.
        return VerificationResult(
            task_id=task.id,
            status="passed",
            passed=True,
            score=1.0,
            metadata={"verifier": "my_workspace_task"},
        )

Verifier requirements:

Use DockerSandbox(network="none") for candidate-facing checks.
Materialize eval inputs at verification time.
Mount trusted checker assets read-only and outside candidate-controlled paths.
Accept verification_policy from context when the check relies on dangerous-command allowances.
Inject sandbox_factory in tests to avoid real Docker.
Return structured metadata without hidden values.
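
The sandbox_factory requirement can be exercised without Docker by passing a factory that returns a fake. The sketch below uses stand-in classes throughout: FakeSandbox, FakeResult, and CheckerVerifier are hypothetical, the result is a plain dict rather than VerificationResult, and the checker path is made up. It only illustrates the injection pattern.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    exit_code: int
    stdout: str = ""


@dataclass
class FakeSandbox:
    """Records commands instead of running them."""
    exit_code: int = 0
    commands: list = field(default_factory=list)

    def run(self, command, *, workdir=None, timeout=None):
        self.commands.append(command)
        return FakeResult(self.exit_code)


class CheckerVerifier:
    """Stand-in verifier; a real one would subclass Verifier."""

    def __init__(self, sandbox_factory):
        self._sandbox_factory = sandbox_factory

    def verify(self, task_id, candidate, **context):
        # Candidate-facing checks run with networking disabled.
        sandbox = self._sandbox_factory(network="none")
        result = sandbox.run(["/trusted/check.sh", "/candidate"])
        passed = result.exit_code == 0
        return {
            "task_id": task_id,
            "status": "passed" if passed else "failed",
            "passed": passed,
            "score": 1.0 if passed else 0.0,
            # Metadata names the verifier only; no hidden values are echoed back.
            "metadata": {"verifier": "my_workspace_task"},
        }
```

In a test, the factory can also capture its keyword arguments, which makes it easy to assert that the verifier asked for network="none".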

Add a harness type

Implement CandidateProducer:


class MyHarnessProducer(CandidateProducer):
    def produce(self, task: SecureBenchTask, **context) -> CandidateArtifact:
        # Materialize agent-visible resources only.
        # Run DockerSandbox.
        # Use extract_candidate(...) for patch/workspace extraction.
        raise NotImplementedError("return the extracted CandidateArtifact here")

Wire parser and registry support in:

tester_config.py
harnesses/registry.py
harnesses/__init__.py if exported

If the harness needs outbound network, use docker_egress_policy() from harnesses/network.py with an explicit domain allowlist.
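
The signature of docker_egress_policy() is not documented in this guide, so the sketch below does not call it. It only illustrates what an explicit domain allowlist check might look like, under the assumption that an entry permits the exact host and its subdomains. The host_allowed function is hypothetical, and the real policy may match differently.

```python
def _normalize(host: str) -> str:
    # Compare case-insensitively and ignore a trailing root dot.
    return host.strip().rstrip(".").lower()


def host_allowed(host: str, allowlist: list[str]) -> bool:
    """Return True if host equals an allowlist entry or is a subdomain of one."""
    host = _normalize(host)
    for entry in allowlist:
        entry = _normalize(entry)
        # The leading dot prevents "evilexample.com" from matching "example.com".
        if host == entry or host.endswith("." + entry):
            return True
    return False
```

An empty allowlist denies every host, which is the safe default for harnesses that do not need outbound network.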

Add a sandbox backend

Implement the Sandbox interface in sandboxes/base.py:


class MySandbox(Sandbox):
    def run(self, command, *, workdir=None, timeout=None) -> CommandResult: ...
    def write_file(self, path, content): ...
    def read_file(self, path) -> str: ...
    def extract_file(self, path) -> bytes: ...

Reuse resolve_sandbox_host_path() for host-backed roots. Wrap with PolicySandbox when command auditing is needed.
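
As an illustration of the four-method interface above, here is a minimal in-memory backend that could serve as a test double. The Sandbox base class and CommandResult dataclass are stand-ins defined locally so the sketch runs on its own; the real definitions in sandboxes/base.py may carry more fields or methods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CommandResult:  # stand-in; field names are assumptions
    exit_code: int
    stdout: str = ""
    stderr: str = ""


class Sandbox(ABC):  # stand-in mirroring the methods listed above
    @abstractmethod
    def run(self, command, *, workdir=None, timeout=None) -> CommandResult: ...

    @abstractmethod
    def write_file(self, path, content): ...

    @abstractmethod
    def read_file(self, path) -> str: ...

    @abstractmethod
    def extract_file(self, path) -> bytes: ...


class InMemorySandbox(Sandbox):
    """Stores files in a dict and records commands without executing them."""

    def __init__(self):
        self.files: dict[str, bytes] = {}
        self.commands: list = []

    def run(self, command, *, workdir=None, timeout=None) -> CommandResult:
        self.commands.append(command)
        return CommandResult(exit_code=0)

    def write_file(self, path, content):
        # Accept text or bytes; store bytes so extract_file can return them as-is.
        data = content.encode("utf-8") if isinstance(content, str) else bytes(content)
        self.files[path] = data

    def read_file(self, path) -> str:
        return self.files[path].decode("utf-8")

    def extract_file(self, path) -> bytes:
        return self.files[path]
```

A backend like this never touches the host filesystem, so it does not need resolve_sandbox_host_path(); host-backed backends do.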

Add benchmark imports

For external benchmark sources:

Convert them offline into manifest.yaml and tasks.jsonl.
Store public assets under asset_roots.public.
Store checker, test, and evaluator assets under asset_roots.eval.
Add a smoke pack and tests for the converted shape.
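
The offline conversion step might look like the sketch below, which turns external records into tasks.jsonl rows shaped like the my_workspace_task example. The source record keys (name, prompt, check_cmd) and the top-level row keys (id, family) are assumptions for illustration; check the real schema before relying on them. Writing manifest.yaml is left out because its format is not described here.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def convert_records(records: Iterable[dict], family: str) -> Iterator[dict]:
    """Map hypothetical external records onto input/eval rows."""
    for record in records:
        yield {
            "id": record["name"],
            "family": family,
            "input": {"instructions": record["prompt"]},
            "eval": {"checker": {"command": record["check_cmd"]}},
        }


def write_tasks(rows: Iterable[dict], out_dir: Path) -> Path:
    """Write one JSON object per line to out_dir/tasks.jsonl and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "tasks.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            # sort_keys keeps the output stable across runs, which helps diffs.
            fh.write(json.dumps(row, sort_keys=True) + "\n")
    return path
```

Running the converter as a separate offline step keeps the external source's code and network access out of benchmark execution.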

Test extensions


.venv/bin/python -m pytest tests/test_my_family_verifier.py -q

Use existing patterns:

tests/test_family_schemas.py for row validation
tests/test_harnesses.py and tests/test_docker_sandbox.py for harness and sandbox behavior
tests/audit/* for security regression checks

Checklist

Family validator rejects unknown fields.
Eval visibility is mapped in the compiler.
Verifier is registered.
Agent payload excludes hidden and evaluation-only resources.
Candidate writes are constrained or audited.
Tests cover pass, fail, and at least one tamper attempt.
Security model documentation is updated when behavior changes.