Tasks and evaluators

SecureBench separates the agent-visible task payload from the trusted scoring path. The active runtime supports two task families: repo_patch and terminal_task.

Normalized task model

After compilation, each row becomes a SecureBenchTask:


from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class SecureBenchTask:
    id: str
    benchmark_id: str
    task_type: str
    metadata: dict[str, Any]
    resources: ResourceBundle

Family validators enforce row-specific requirements. The active task model does not use specialized text or code-completion subclasses.

Component payloads

Method	Component	Visibilities included
`agent_payload()` / `public_payload()`	agent	public
`evaluation_payload()`	test_sandbox	public + evaluation_inputs
`hidden_payload()`	evaluator	public + hidden
`resource_summary()`	result	all names; non-public values redacted

Harnesses must not pass hidden or evaluation-only resources to agents.
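The visibility table above can be read as a filter over resources. The following is a minimal sketch of that idea, not the SecureBench implementation: `Resource`, `COMPONENT_VISIBILITIES`, `payload_for`, and the `"<redacted>"` marker are hypothetical names introduced for illustration, and the real ResourceBundle is not shown in this document.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for one entry of a ResourceBundle."""
    name: str
    visibility: str  # "public", "evaluation_inputs", or "hidden"
    value: Any


# Visibilities each component may read, mirroring the table above.
COMPONENT_VISIBILITIES = {
    "agent": {"public"},
    "test_sandbox": {"public", "evaluation_inputs"},
    "evaluator": {"public", "hidden"},
}


def payload_for(component: str, resources: list[Resource]) -> dict[str, Any]:
    """Return only the resources the given component is allowed to see."""
    allowed = COMPONENT_VISIBILITIES[component]
    return {r.name: r.value for r in resources if r.visibility in allowed}


def summarize(resources: list[Resource]) -> dict[str, Any]:
    """List every resource name; redact values that are not public."""
    return {
        r.name: r.value if r.visibility == "public" else "<redacted>"
        for r in resources
    }
```

With this shape, a harness that builds the agent payload through `payload_for("agent", ...)` cannot leak hidden or evaluation-only values by accident, because the filter is applied in one place.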

Family contracts

Each active family declares a candidate artifact contract.

Family	Candidate kind	Default extraction
`repo_patch`	`patch`	canonical `git diff HEAD --binary --full-index --no-ext-diff --no-textconv --`
`terminal_task`	`workspace`	final host workspace directory

CandidateArtifact.for_task() maps the artifact to the verifier input:

repo_patch: candidate patch text
terminal_task: workspace directory path

Unknown family names fail validation.
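As a rough illustration of the family contract, the mapping from artifact to verifier input might look like the sketch below. `CandidateSketch`, `FAMILY_KINDS`, and `verifier_input` are hypothetical stand-ins; the actual fields of CandidateArtifact and the body of `for_task()` are not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSketch:
    """Hypothetical stand-in for CandidateArtifact."""
    kind: str                # "patch" or "workspace"
    patch_text: str = ""     # used when kind == "patch"
    workspace_dir: str = ""  # used when kind == "workspace"


# Candidate kind declared by each active family (from the table above).
FAMILY_KINDS = {"repo_patch": "patch", "terminal_task": "workspace"}


def verifier_input(task_type: str, artifact: CandidateSketch) -> str:
    """Map an artifact to the string handed to the verifier."""
    expected = FAMILY_KINDS.get(task_type)
    if expected is None:
        # Unknown families fail instead of falling back to a default.
        raise ValueError(f"unknown task family: {task_type!r}")
    if artifact.kind != expected:
        raise ValueError(f"{task_type} expects a {expected} candidate, got {artifact.kind}")
    return artifact.patch_text if expected == "patch" else artifact.workspace_dir
```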

Candidate production


SecureBenchTask
  -> VisibilityAwareMaterializer(component=agent)
  -> write task.json + public assets
  -> DockerSandbox.run(harness command)
  -> extract_candidate(spec)
  -> CandidateArtifact

Extraction modes

Mode	Used by	Failure modes
`git_diff`	`repo_patch`	Git add/diff timeout or non-zero exit
`workspace`	`terminal_task`	Harness exit code non-zero or missing sandbox root
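A minimal sketch of the `git_diff` mode follows. The diff flags are the ones listed under Family contracts; the `git add -A` staging step, the function name `extract_patch`, and the timeout default are assumptions made for illustration.

```python
import subprocess

# Canonical diff command from the family contract table.
GIT_DIFF_ARGV = [
    "git", "diff", "HEAD",
    "--binary", "--full-index", "--no-ext-diff", "--no-textconv", "--",
]


def extract_patch(repo_dir: str, timeout_s: float = 60.0) -> str:
    """Stage all changes, then return the diff against HEAD as text.

    Raises subprocess.TimeoutExpired on timeout and
    subprocess.CalledProcessError on a non-zero exit, which correspond to
    the two failure modes listed for git_diff.
    """
    # Assumed staging step, so that newly created files show up in the diff.
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True, timeout=timeout_s)
    result = subprocess.run(
        GIT_DIFF_ARGV, cwd=repo_dir, check=True, capture_output=True, timeout=timeout_s
    )
    # surrogateescape keeps non-UTF-8 bytes round-trippable.
    return result.stdout.decode("utf-8", errors="surrogateescape")
```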

A producer timeout raises CandidateProductionTimeout. In that case, tester_run writes a scored failure record with failure_reason: producer_timeout.
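The timeout path can be sketched as follows. The exception class here is a local stand-in for the real CandidateProductionTimeout, and `run_producer`, the `passed` field, and the record layout are hypothetical; only the `failure_reason` value comes from the text above.

```python
from typing import Callable


class CandidateProductionTimeout(Exception):
    """Local stand-in for the exception raised on producer timeout."""


def run_producer(produce: Callable[[], str], task_id: str) -> dict:
    """Run a candidate producer and convert a timeout into a failure record."""
    try:
        candidate = produce()
    except CandidateProductionTimeout:
        # The timeout is recorded as a scored failure, not dropped.
        return {
            "task_id": task_id,
            "passed": False,  # hypothetical score field
            "failure_reason": "producer_timeout",
        }
    return {"task_id": task_id, "candidate": candidate, "verification_status": "pending"}
```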

Verifier model


class Verifier(ABC):
    @abstractmethod
    def verify(self, task: SecureBenchTask, candidate: str, **context) -> VerificationResult:
        ...

verify_candidate() in tester_run.py resolves the verifier through verifier_for_task_type() and passes verification_policy from tester YAML.

If no verifier is registered for the task type, verification is skipped and the record keeps verification_status: "pending". Both currently supported families have verifiers.
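The lookup and skip behavior described above could look roughly like this. It is a sketch under assumptions: the registry dictionary, `register`, the dict-based task and record, and the `"complete"` status are all hypothetical, and the real `verify_candidate()` signature is not shown in this document.

```python
from typing import Callable, Optional

VerifierFn = Callable[..., dict]

# Hypothetical registry keyed by task family name.
_REGISTRY: dict[str, VerifierFn] = {}


def register(task_type: str, fn: VerifierFn) -> None:
    _REGISTRY[task_type] = fn


def verifier_for_task_type(task_type: str) -> Optional[VerifierFn]:
    return _REGISTRY.get(task_type)


def verify_candidate(task: dict, candidate: str, record: dict,
                     verification_policy: Optional[dict] = None) -> dict:
    """Resolve a verifier and run it, or leave the record pending."""
    verifier = verifier_for_task_type(task["task_type"])
    if verifier is None:
        # No verifier registered: skip verification, keep the pending status.
        record["verification_status"] = "pending"
        return record
    # The policy from tester YAML is forwarded as a keyword in **context.
    record["verification"] = verifier(task, candidate, verification_policy=verification_policy)
    record["verification_status"] = "complete"  # hypothetical status value
    return record
```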

`repo_patch` verifier

The repo-patch verifier applies and scores an agent-produced git diff in the benchmark image.

Current behavior:

Requires image HEAD to match input.base_commit.
Requires candidate patches in canonical diff --git format. Headerless unified diffs fail closed.
Applies setup, candidate, and test patches through git apply --binary over stdin.
Commits setup state as a verifier-only baseline with hooks and signing disabled.
Checks candidate paths before apply and after apply using NUL-delimited Git output.
Runs the declared argv-array check command with network disabled by default.
Denies common test, CI, dependency, lockfile, shell-script, Python startup customization, and SecureBench-owned paths.
Supports eval.candidate_policy.allow_paths, allow_sensitive_paths, and patch_preserved_paths.
Reports containment_profile: "shared_runtime" and hidden_test_runtime_secrecy: false.
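The path checks in the list above can be illustrated with a small sketch. The deny patterns below are invented examples, not the verifier's actual deny list, and `exempt_patterns` is a hypothetical parameter: how allow_paths and allow_sensitive_paths interact with the deny list is not specified in this document.

```python
from fnmatch import fnmatchcase

# Hypothetical deny patterns for illustration only.
DENY_PATTERNS = ["tests/*", ".github/*", "*.lock", "*.sh", "sitecustomize.py"]


def parse_nul_paths(raw: bytes) -> list[str]:
    """Split NUL-delimited Git output (for example from a -z option) into paths."""
    return [
        part.decode("utf-8", errors="surrogateescape")
        for part in raw.split(b"\0")
        if part
    ]


def denied_paths(paths: list[str], exempt_patterns: tuple[str, ...] = ()) -> list[str]:
    """Return the candidate paths that match a deny pattern and are not exempt."""
    denied = []
    for path in paths:
        if any(fnmatchcase(path, pattern) for pattern in exempt_patterns):
            continue
        if any(fnmatchcase(path, pattern) for pattern in DENY_PATTERNS):
            denied.append(path)
    return denied
```

NUL-delimited output matters because file names may contain newlines or quotes; splitting on `\0` avoids misreading such a name as two paths.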

Known limitation: the check command runs in the candidate-mutated repository. Benchmark-authored commands, allowlists, setup patches, and hidden tests remain part of the trust base.

`terminal_task` verifier

The terminal-task verifier scores the final workspace state with trusted checker assets.

Current behavior:

Takes the produced workspace directory as the candidate.
Materializes evaluation inputs into workspace staging.
Mounts trusted checker files read-only under /opt/securebench/evaluator.
Runs Docker with network=none by default.
Supports pytest and script checker sources.
Honors dangerous-command policy for eval.needed_commands.
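The container settings above map onto standard Docker CLI options. The sketch below only builds an argument list; `docker_argv` and the `/workspace` mount point are hypothetical, while `/opt/securebench/evaluator` and the `none` network default come from the list above.

```python
def docker_argv(image: str, workspace_dir: str, checker_dir: str,
                command: list[str], network: str = "none") -> list[str]:
    """Build a docker run invocation for scoring a workspace."""
    return [
        "docker", "run", "--rm",
        "--network", network,  # no network unless explicitly overridden
        # Candidate workspace; the container path is an assumed name.
        "-v", f"{workspace_dir}:/workspace",
        # Trusted checker files are mounted read-only.
        "-v", f"{checker_dir}:/opt/securebench/evaluator:ro",
        image, *command,
    ]
```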

In pytest mode, the verifier removes candidate workspace import paths and invokes pytest with isolated Python. In script mode, it starts Bash without profiles, removes Python startup-path environment variables, disables user site loading, and exports python and python3 launchers that reject candidate-owned interpreters.
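One way to approximate these hardening steps is sketched below. The specific variable names in `PYTHON_STARTUP_VARS`, the use of `python3 -I`, and the helper names are assumptions; the verifier's actual commands are not reproduced in this document.

```python
# Assumed set of Python startup-path variables to drop.
PYTHON_STARTUP_VARS = ("PYTHONPATH", "PYTHONSTARTUP", "PYTHONHOME")


def script_mode_env(base_env: dict[str, str]) -> dict[str, str]:
    """Copy the environment without startup-path variables and disable user site."""
    env = {k: v for k, v in base_env.items() if k not in PYTHON_STARTUP_VARS}
    env["PYTHONNOUSERSITE"] = "1"  # skip the per-user site-packages directory
    return env


def script_mode_argv(script_path: str) -> list[str]:
    """Start Bash without reading profile or rc files."""
    return ["bash", "--noprofile", "--norc", script_path]


def pytest_mode_argv(test_path: str) -> list[str]:
    """Run pytest under Python's isolated mode (-I)."""
    return ["python3", "-I", "-m", "pytest", test_path]
```

Isolated mode ignores `PYTHON*` environment variables and keeps the current directory and the user site directory off `sys.path`, which limits how a candidate workspace can inject code into the checker process.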

Known limitation: checker code may intentionally execute agent-produced binaries or inspect agent-produced files. Checker authors must treat workspace state as hostile input.

Verifier requirements

Register the verifier in verifiers/registry.py.
Accept verification_policy in **context if verifier commands may need dangerous-command allowances.
Read trusted data through resource_value(task, name) or task payload helpers.
Treat candidate patches and workspaces as hostile.
Use Docker with network="none" unless the family requires network access.
Return structured metadata without hidden resource values.
Add tamper tests under tests/test_*_verifier.py.
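A toy verifier that follows several of these requirements is sketched below. Everything in it is hypothetical: `ResultSketch` stands in for VerificationResult, the task is a plain dict, and the comparison is deliberately trivial. It shows only the shape: the policy arrives through `**context`, and the returned metadata names the hidden resource without including its value.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ResultSketch:
    """Hypothetical stand-in for VerificationResult."""
    passed: bool
    metadata: dict[str, Any] = field(default_factory=dict)


class VerifierSketch(ABC):
    @abstractmethod
    def verify(self, task: dict, candidate: str, **context: Any) -> ResultSketch:
        ...


class ExactStateVerifier(VerifierSketch):
    """Toy verifier: passes when the candidate equals a hidden expected value."""

    def verify(self, task: dict, candidate: str, **context: Any) -> ResultSketch:
        # The policy is optional and read from **context.
        policy = context.get("verification_policy") or {}
        expected = task["hidden"]["expected_state"]  # trusted data, never echoed
        return ResultSketch(
            passed=(candidate == expected),
            metadata={"compared": "expected_state", "policy_keys": sorted(policy)},
        )
```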

Eval field visibility

Family	Field	Visibility
`repo_patch`	`tests`, `candidate_policy`	`evaluation_inputs`
`repo_patch`	`gold_patch`	`hidden`
`terminal_task`	`checker`, `needed_commands`, `run_tests`, `test_files`	`evaluation_inputs`
`terminal_task`	`expected_state`	`hidden`

Unknown eval keys fail family validation for active families.
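The table and the unknown-key rule can be combined into a small validation sketch. The dictionary below contains only the fields listed in the table above; the real family validators may accept other keys, and `validate_eval_keys` is a hypothetical name.

```python
# Visibility of each eval field, copied from the table above.
EVAL_FIELD_VISIBILITY = {
    "repo_patch": {
        "tests": "evaluation_inputs",
        "candidate_policy": "evaluation_inputs",
        "gold_patch": "hidden",
    },
    "terminal_task": {
        "checker": "evaluation_inputs",
        "needed_commands": "evaluation_inputs",
        "run_tests": "evaluation_inputs",
        "test_files": "evaluation_inputs",
        "expected_state": "hidden",
    },
}


def validate_eval_keys(task_type: str, eval_block: dict) -> dict[str, str]:
    """Return the visibility of each eval key, rejecting unknown keys."""
    known = EVAL_FIELD_VISIBILITY.get(task_type)
    if known is None:
        raise ValueError(f"unknown task family: {task_type!r}")
    unknown = sorted(set(eval_block) - set(known))
    if unknown:
        raise ValueError(f"unknown eval keys for {task_type}: {unknown}")
    return {key: known[key] for key in eval_block}
```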

Helper functions

In tasks.py:

resource_value(task, name, default=None)
resource_text(task, name, default="")

Use these helpers instead of reading raw JSON from candidate artifacts.