Tasks & Evaluators

SecureBench separates what the agent sees (compiled task + public materialization) from how answers are scored (family verifiers operating on candidate artifacts).

Normalized task model

After compilation, each row becomes a SecureBenchTask:

@dataclass(frozen=True)
class SecureBenchTask:
    id: str
    benchmark_id: str
    task_type: str  # family name
    metadata: dict[str, Any]
    resources: ResourceBundle

Two specialized subclasses, MultipleChoiceTask and CodeCompletionTask, additionally check for their required resources at compile time.

Component payloads

| Method | Component | Visibilities included |
|---|---|---|
| agent_payload() / public_payload() | agent | public |
| evaluation_payload() | test_sandbox | public + evaluation_inputs |
| hidden_payload() | evaluator | public + hidden (not evaluation_inputs) |
| resource_summary() | result | all names; non-public values redacted |

Harnesses must never pass hidden or evaluation-only resources to agents.
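The visibility rules in the table above can be sketched as a simple filter. This is a minimal illustration, not the real ResourceBundle: the Resource class, payload_for(), and the "<redacted>" placeholder are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Any

# Visibilities each component may read, following the table above.
COMPONENT_VISIBILITIES = {
    "agent": {"public"},
    "test_sandbox": {"public", "evaluation_inputs"},
    "evaluator": {"public", "hidden"},
}


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for one entry in a ResourceBundle."""
    name: str
    visibility: str
    value: Any


def payload_for(resources: list[Resource], component: str) -> dict[str, Any]:
    """Return only the resources the given component is allowed to see."""
    allowed = COMPONENT_VISIBILITIES[component]
    return {r.name: r.value for r in resources if r.visibility in allowed}


def resource_summary(resources: list[Resource]) -> dict[str, Any]:
    """List every resource name, redacting all non-public values."""
    return {
        r.name: r.value if r.visibility == "public" else "<redacted>"
        for r in resources
    }


resources = [
    Resource("question", "public", "What is 2 + 2?"),
    Resource("tests", "evaluation_inputs", "assert add(2, 2) == 4"),
    Resource("answer", "hidden", "4"),
]
agent_view = payload_for(resources, "agent")  # contains only "question"
```

Filtering by an allow-list per component means a newly added visibility level is invisible to every component until it is granted explicitly.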

Family contracts

Each active family declares a candidate kind in families/registry.py:

| Family | candidate_kind | Default extraction |
|---|---|---|
| multiple_choice, short_answer, free_response | text | stdout (or candidate.txt if stdout disabled) |
| code_completion | code | file candidate.py |
| repo_patch | patch | git diff --binary in task workdir |
| terminal_task | workspace | host workspace directory path |

CandidateArtifact.for_task() maps the artifact to the string passed to verifiers:

  • repo_patch → patch
  • terminal_task → workspace (directory path)
  • others → text
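The mapping above amounts to a three-way dispatch on the task type. The sketch below uses a hypothetical CandidateSketch class with assumed field names (text, patch, workspace); the real CandidateArtifact may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateSketch:
    """Hypothetical stand-in for CandidateArtifact; field names are assumed."""
    text: str = ""
    patch: str = ""
    workspace: str = ""

    def for_task(self, task_type: str) -> str:
        """Pick the string handed to the verifier for this family."""
        if task_type == "repo_patch":
            return self.patch
        if task_type == "terminal_task":
            return self.workspace  # a directory path, not file contents
        return self.text  # every other family falls back to text


artifact = CandidateSketch(text="B", patch="diff --git ...", workspace="/tmp/ws")
```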

Schema validation

Validation happens at compile time via validate_benchmark_row_family():

  • Active families — strict field sets; unknown fields raise ConfigError
  • Deferred families — no validator; rows remain permissive until a verifier exists

Validators live in securebench/families/<family>.py. Shared helpers are in families/base.py (reject_unknown, required_non_empty_string, etc.).
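A strict validator for an active family might look like the sketch below. The helper names come from families/base.py, but their signatures here are guesses, ConfigError is a local stand-in, and the "id" and "question" row fields are illustrative assumptions.

```python
from typing import Any, Mapping


class ConfigError(ValueError):
    """Local stand-in for the project's ConfigError."""


def reject_unknown(row: Mapping[str, Any], allowed: set[str], family: str) -> None:
    """Raise if the row carries any field outside the family's strict set."""
    unknown = sorted(set(row) - allowed)
    if unknown:
        raise ConfigError(f"{family}: unknown fields {unknown}")


def required_non_empty_string(row: Mapping[str, Any], key: str, family: str) -> str:
    """Return row[key], raising unless it is a non-blank string."""
    value = row.get(key)
    if not isinstance(value, str) or not value.strip():
        raise ConfigError(f"{family}: '{key}' must be a non-empty string")
    return value


def validate_short_answer_row(row: Mapping[str, Any]) -> None:
    """Illustrative validator; the allowed field set is an assumption."""
    allowed = {"id", "question", "accepted_answers", "tolerance"}
    reject_unknown(row, allowed, "short_answer")
    required_non_empty_string(row, "id", "short_answer")
    required_non_empty_string(row, "question", "short_answer")
```

Rejecting unknown fields at compile time surfaces misspelled keys immediately, rather than letting them pass through silently as unvalidated resources.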

Candidate production flow

SecureBenchTask → VisibilityAwareMaterializer (component=agent) → write task.json + public assets → DockerSandbox.run(harness command) → extract_candidate(spec) → CandidateArtifact

Extraction modes

Defined in candidates/extraction.py:

| Mode | When used | Failure modes |
|---|---|---|
| stdout | Default text families | Empty stdout still produces candidate; verifier may fail |
| file | Code completion; optional text | ConfigError if file missing |
| git_diff | Repo patch | Timeout on diff command |
| workspace | Terminal task | Requires harness exit code 0 and sandbox.root |

Producer timeout raises CandidateProductionTimeout; tester_run writes a scored failure record with failure_reason: producer_timeout.
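The stdout and file modes can be sketched as follows. The function signature is hypothetical and much simpler than whatever extract_candidate(spec) really accepts; ConfigError is a local stand-in, and the git_diff and workspace modes are omitted because they need a real sandbox.

```python
import tempfile
from pathlib import Path


class ConfigError(ValueError):
    """Local stand-in for the project's ConfigError."""


def extract_candidate(mode: str, stdout: str, workdir: Path,
                      filename: str = "candidate.py") -> str:
    """Illustrative extraction for the stdout and file modes only."""
    if mode == "stdout":
        # Empty stdout is still a candidate; the verifier decides its fate.
        return stdout
    if mode == "file":
        path = workdir / filename
        if not path.is_file():
            raise ConfigError(f"candidate file missing: {filename}")
        return path.read_text(encoding="utf-8")
    raise ConfigError(f"unsupported extraction mode in this sketch: {mode}")


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "candidate.py").write_text("def add(a, b):\n    return a + b\n",
                                          encoding="utf-8")
    code = extract_candidate("file", stdout="", workdir=workdir)
```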

Verifier model

Interface

class Verifier(ABC):
    def verify(self, task: SecureBenchTask, candidate: str, **context) -> VerificationResult: ...

verify_candidate() in tester_run.py resolves the verifier via verifier_for_task_type() and passes verification_policy from tester YAML.

VerificationResult

| Field | Description |
|---|---|
| task_id | Task ID |
| status | passed or failed (verifier-specific) |
| passed | Boolean |
| score | Float (typically 0.0 or 1.0) |
| stdout, stderr | Verifier sandbox output |
| metadata | Structured details; redacted before JSONL write |

If no verifier is registered, verification is skipped and the record keeps verification_status: "pending".
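The result fields and the pending fallback can be pictured with the sketch below. The dataclass mirrors the field table, but the defaults, the REGISTRY dict, and the record keys other than verification_status are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class VerificationResult:
    """Fields follow the table above; defaults are assumed."""
    task_id: str
    status: str
    passed: bool
    score: float
    stdout: str = ""
    stderr: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


# Hypothetical registry keyed by task_type.
REGISTRY: dict[str, Callable[[str, str], VerificationResult]] = {}


def verify_or_pending(task_id: str, task_type: str, candidate: str) -> dict[str, Any]:
    """Run the registered verifier, or leave the record pending."""
    record: dict[str, Any] = {"task_id": task_id, "verification_status": "pending"}
    verifier = REGISTRY.get(task_type)
    if verifier is None:
        return record  # no verifier registered: verification is skipped
    result = verifier(task_id, candidate)
    record.update(verification_status=result.status,
                  passed=result.passed, score=result.score)
    return record


def exact_verifier(task_id: str, candidate: str) -> VerificationResult:
    """Toy verifier used only to exercise the sketch."""
    passed = candidate.strip() == "42"
    return VerificationResult(task_id, "passed" if passed else "failed",
                              passed, 1.0 if passed else 0.0)


REGISTRY["short_answer"] = exact_verifier
```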

Verifier implementations

multiple_choice

  • Compares normalized candidate text against hidden answer labels and choice text
  • Supports answer labels (A, B, …), choice text, and “Final answer:” parsing heuristics
  • Does not execute candidate code
  • Runs in-process on the host (no Docker sandbox)
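A rough picture of the matching described above, assuming a single correct label and a dict of choices. The regular expression and the clean() normalization are illustrative guesses, not the verifier's actual heuristics.

```python
import re

# Takes the text after the last "Final answer:" marker, if one is present.
FINAL_ANSWER = re.compile(r"final answer\s*:\s*(.+)", re.IGNORECASE)


def clean(text: str) -> str:
    """Lowercase, collapse whitespace, and drop surrounding parens or periods."""
    return " ".join(text.lower().split()).strip(" ().")


def extract_answer(candidate: str) -> str:
    """Prefer an explicit 'Final answer:' line; otherwise use the whole text."""
    matches = FINAL_ANSWER.findall(candidate)
    return clean(matches[-1] if matches else candidate)


def matches_choice(candidate: str, answer_label: str, choices: dict[str, str]) -> bool:
    """Accept either the correct label or the correct choice text."""
    answer = extract_answer(candidate)
    return answer in {clean(answer_label), clean(choices[answer_label])}


choices = {"A": "TCP", "B": "UDP"}
```

The candidate is only ever compared as a string, which is consistent with the verifier never executing candidate content.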

short_answer

  • Deterministic matching: exact, substring, numeric tolerance
  • Hidden accepted answers never appear in agent payload
  • Runs in-process (no Docker sandbox)
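The three deterministic strategies can be sketched in a few lines. The mode parameter and its values are hypothetical; accepted_answers and tolerance correspond to the hidden fields for this family.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def matches_short_answer(candidate: str, accepted_answers: list[str],
                         mode: str = "exact", tolerance: float = 0.0) -> bool:
    """Illustrative matcher; returns True if any accepted answer matches."""
    cand = normalize(candidate)
    for accepted in accepted_answers:
        acc = normalize(accepted)
        if mode == "exact" and cand == acc:
            return True
        # Guard against an empty accepted answer, which is a substring of anything.
        if mode == "substring" and acc and acc in cand:
            return True
        if mode == "numeric":
            try:
                if abs(float(cand) - float(acc)) <= tolerance:
                    return True
            except ValueError:
                continue  # non-numeric text simply does not match
    return False
```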

free_response

  • Requires a structured hidden rubric object with type: "contains_any" (default), non-empty accepted_answers, optional rejected_answers, and optional min_token_f1 (0–1, default 1.0)
  • Scoring uses substring containment and token F1 fallback — not a judge model
  • Known limitation: intentionally weak; suitable for smoke tests, not production TruthfulQA-style evaluation
  • Runs in-process (no Docker sandbox)
  • Planned: stronger rubric representation or trusted judge interface (docs/next.md)
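To make the contains_any scoring concrete, here is a minimal sketch with substring containment first and token F1 as the fallback. Whitespace tokenization and the rule that a rejected answer overrides everything else are assumptions; the real verifier may tokenize and order these checks differently.

```python
from collections import Counter
from typing import Any


def tokens(text: str) -> list[str]:
    """Whitespace tokenization on lowercased text (an assumed choice)."""
    return text.lower().split()


def token_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting duplicates."""
    cand, ref = tokens(candidate), tokens(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def contains(text: str, phrase: str) -> bool:
    """Substring containment on normalized text; empty phrases never match."""
    needle = " ".join(tokens(phrase))
    return bool(needle) and needle in " ".join(tokens(text))


def score_free_response(candidate: str, rubric: dict[str, Any]) -> bool:
    """Illustrative contains_any scoring with a token F1 fallback."""
    if any(contains(candidate, r) for r in rubric.get("rejected_answers", [])):
        return False
    accepted = rubric["accepted_answers"]
    if any(contains(candidate, a) for a in accepted):
        return True
    threshold = rubric.get("min_token_f1", 1.0)
    return any(token_f1(candidate, a) >= threshold for a in accepted)


rubric = {
    "type": "contains_any",
    "accepted_answers": ["nothing happens"],
    "rejected_answers": ["you will die"],
    "min_token_f1": 0.5,
}
```

The sketch also shows why the approach is weak: a reordered or padded answer can clear the F1 threshold without being correct.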

code_completion

  • Python only
  • Writes candidate module, trusted test runner, and supervisor script into a fresh Docker sandbox (network=none)
  • Hidden tests run via supervisor that restores import state and blocks some exit/monkeypatch attacks
  • Known limitation: candidate and tests share one interpreter after import

repo_patch

  • Applies candidate patch in benchmark image with path policy checks
  • Applies setup/test patches from evaluation-input paths
  • Runs declared check command
  • Default deny policy blocks edits to tests, CI, lockfiles, shell scripts, etc.
  • Known limitation: check command runs in candidate-mutated repository
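A path policy check of the kind described above could be sketched like this. The deny patterns are illustrative, not the actual default policy, and the parser only reads simple "diff --git" headers: quoted paths, renames, and the ---/+++ lines would also need handling in a real verifier.

```python
import fnmatch
import re

# Illustrative deny list; the real default policy is not reproduced here.
DENY_PATTERNS = ["tests/*", "*/tests/*", ".github/*", "*.lock", "*.sh"]

DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def touched_paths(patch: str) -> set[str]:
    """Collect old and new paths from every 'diff --git' header."""
    paths: set[str] = set()
    for old, new in DIFF_HEADER.findall(patch):
        paths.update((old, new))
    return paths


def denied_paths(patch: str) -> list[str]:
    """Return paths that escape the repo or match a deny pattern."""
    bad = []
    for path in sorted(touched_paths(patch)):
        if path.startswith("/") or ".." in path.split("/"):
            bad.append(path)  # absolute path or directory traversal
        elif any(fnmatch.fnmatchcase(path, pat) for pat in DENY_PATTERNS):
            bad.append(path)
    return bad


ok_patch = (
    "diff --git a/src/app.py b/src/app.py\n"
    "--- a/src/app.py\n+++ b/src/app.py\n"
    "@@ -1 +1 @@\n-x = 1\n+x = 2\n"
)
```

Running the check before applying the patch keeps a hostile patch from touching the filesystem at all.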

terminal_task

  • Takes produced workspace directory as candidate
  • Materializes evaluation inputs into workspace staging
  • Runs trusted eval.checker.command in Docker with network=none
  • Honors dangerous command policy for eval.needed_commands

Writing custom verifiers safely

  1. Register the verifier in verifiers/registry.py for your task_type.
  2. Implement Verifier.verify(); accept verification_policy in **context if you run sandboxed commands that may need dangerous command allowances.
  3. Read secrets only from task.hidden_payload() or resource_value(task, name) — never from candidate-controlled files without validation.
  4. Treat candidate input as hostile:
    • Text: parse defensively; do not eval() candidate content
    • Code: prefer separate processes/interpreters (see code_completion gaps)
    • Patches: validate paths before apply (follow RepoPatchVerifier patterns)
    • Workspaces: run checkers in fresh sandboxes; do not trust agent-modified test scripts
  5. Use Docker with network="none" unless the family explicitly requires network.
  6. Return structured metadata without echoing hidden resource values; rely on tester_run._redact_verifier_metadata() as a backstop, not the primary control.
  7. Add tests under tests/test_*_verifier.py covering pass, fail, and tamper cases.
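The checklist above might translate into a verifier along these lines. Everything here is a self-contained sketch: Task and Result are hypothetical stand-ins for SecureBenchTask and VerificationResult, the "keyword" hidden resource is invented for the example, and registration in verifiers/registry.py is not shown.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for SecureBenchTask."""
    id: str
    task_type: str
    hidden: dict[str, Any]


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for VerificationResult."""
    task_id: str
    status: str
    passed: bool
    score: float
    metadata: dict[str, Any] = field(default_factory=dict)


class Verifier(ABC):
    @abstractmethod
    def verify(self, task: Task, candidate: str, **context: Any) -> Result: ...


class KeywordVerifier(Verifier):
    """Passes when the candidate text contains the hidden keyword."""

    def verify(self, task: Task, candidate: str, **context: Any) -> Result:
        # Accepted for interface compatibility; unused because nothing is sandboxed.
        _policy = context.get("verification_policy")
        expected = str(task.hidden["keyword"])  # secret comes from the task only
        # Candidate is treated as plain text: compared, never evaluated.
        passed = expected.lower() in candidate.lower()
        return Result(
            task_id=task.id,
            status="passed" if passed else "failed",
            passed=passed,
            score=1.0 if passed else 0.0,
            # Metadata describes the candidate without echoing the secret.
            metadata={"candidate_length": len(candidate)},
        )


task = Task(id="t1", task_type="keyword", hidden={"keyword": "s3cret"})
result = KeywordVerifier().verify(task, "no idea", verification_policy=None)
```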

Optional injection for tests:

verifier = TerminalTaskVerifier(sandbox_factory=my_factory)

Eval field visibility reference

From benchmark_compiler.EVAL_VISIBILITY:

| Family | Field | Visibility |
|---|---|---|
| multiple_choice | answer | hidden |
| short_answer | accepted_answers, tolerance | hidden |
| free_response | reference_answer, rubric | hidden |
| code_completion | tests | evaluation_inputs |
| code_completion | reference_solution, canonical_solution | hidden |
| repo_patch | tests | evaluation_inputs |
| repo_patch | gold_patch | hidden |
| terminal_task | checker, needed_commands, run_tests, test_files | evaluation_inputs |
| terminal_task | expected_state | hidden |

Unknown eval keys for a family default to hidden.
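The default-to-hidden rule can be expressed as a nested lookup. The dict below copies a few rows from the table; the actual shape of benchmark_compiler.EVAL_VISIBILITY and the eval_field_visibility() helper name are assumptions.

```python
# Partial copy of the table above; the nested-dict shape is an assumption.
EVAL_VISIBILITY = {
    "multiple_choice": {"answer": "hidden"},
    "code_completion": {
        "tests": "evaluation_inputs",
        "reference_solution": "hidden",
        "canonical_solution": "hidden",
    },
    "repo_patch": {"tests": "evaluation_inputs", "gold_patch": "hidden"},
}


def eval_field_visibility(family: str, field_name: str) -> str:
    """Look up a field's visibility; anything unlisted is treated as hidden."""
    return EVAL_VISIBILITY.get(family, {}).get(field_name, "hidden")
```

Defaulting to hidden is the fail-closed choice: a new or misspelled eval key is withheld from both the agent and the test sandbox until it is classified explicitly.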

Helper functions for verifiers

In tasks.py:

  • resource_value(task, name, default=None)
  • resource_text(task, name, default="")
  • optional_resource_text(task, name)
  • resource_tuple(task, name)

Use these rather than accessing raw JSON from candidate artifacts.
