Tasks & Evaluators
SecureBench separates what the agent sees (compiled task + public materialization) from how answers are scored (family verifiers operating on candidate artifacts).
Normalized task model
After compilation, each row becomes a SecureBenchTask:
@dataclass(frozen=True)
class SecureBenchTask:
id: str
benchmark_id: str
task_type: str # family name
metadata: dict[str, Any]
resources: ResourceBundleSpecialized subclasses exist for MultipleChoiceTask and CodeCompletionTask (required resource checks at compile time).
Component payloads
| Method | Component | Visibilities included |
|---|---|---|
agent_payload() / public_payload() | agent | public |
evaluation_payload() | test_sandbox | public + evaluation_inputs |
hidden_payload() | evaluator | public + hidden (not evaluation_inputs) |
resource_summary() | result | all names; non-public values redacted |
Harnesses must never pass hidden or evaluation-only resources to agents.
Family contracts
Each active family declares a candidate kind in families/registry.py:
| Family | candidate_kind | Default extraction |
|---|---|---|
multiple_choice, short_answer, free_response | text | stdout (or candidate.txt if stdout disabled) |
code_completion | code | file candidate.py |
repo_patch | patch | git diff --binary in task workdir |
terminal_task | workspace | host workspace directory path |
CandidateArtifact.for_task() maps the artifact to the string passed to verifiers:
repo_patch→patchterminal_task→workspace(directory path)- others →
text
Schema validation
Validation happens at compile time via validate_benchmark_row_family():
- Active families — strict field sets; unknown fields raise
ConfigError - Deferred families — no validator; rows remain permissive until a verifier exists
Validators live in securebench/families/<family>.py. Shared helpers are in families/base.py (reject_unknown, required_non_empty_string, etc.).
Candidate production flow
SecureBenchTask
→ VisibilityAwareMaterializer (component=agent)
→ write task.json + public assets
→ DockerSandbox.run(harness command)
→ extract_candidate(spec)
→ CandidateArtifactExtraction modes
Defined in candidates/extraction.py:
| Mode | When used | Failure modes |
|---|---|---|
stdout | Default text families | Empty stdout still produces candidate; verifier may fail |
file | Code completion; optional text | ConfigError if file missing |
git_diff | Repo patch | Timeout on diff command |
workspace | Terminal task | Requires harness exit code 0 and sandbox.root |
Producer timeout raises CandidateProductionTimeout; tester_run writes a scored failure record with failure_reason: producer_timeout.
Verifier model
Interface
class Verifier(ABC):
def verify(self, task: SecureBenchTask, candidate: str, **context) -> VerificationResultverify_candidate() in tester_run.py resolves the verifier via verifier_for_task_type() and passes verification_policy from tester YAML.
VerificationResult
| Field | Description |
|---|---|
task_id | Task ID |
status | passed or failed (verifier-specific) |
passed | Boolean |
score | Float (typically 0.0 or 1.0) |
stdout, stderr | Verifier sandbox output |
metadata | Structured details; redacted before JSONL write |
If no verifier is registered, verification is skipped and the record keeps verification_status: "pending".
Verifier implementations
multiple_choice
- Compares normalized candidate text against hidden answer labels and choice text
- Supports answer labels (
A,B, …), choice text, and “Final answer:” parsing heuristics - Does not execute candidate code
- Runs in-process on the host (no Docker sandbox)
short_answer
- Deterministic matching: exact, substring, numeric tolerance
- Hidden accepted answers never appear in agent payload
- Runs in-process (no Docker sandbox)
free_response
- Requires a structured hidden rubric object with
type: "contains_any"(default), non-emptyaccepted_answers, optionalrejected_answers, and optionalmin_token_f1(0–1, default1.0) - Scoring uses substring containment and token F1 fallback — not a judge model
- Known limitation: intentionally weak; suitable for smoke tests, not production TruthfulQA-style evaluation
- Runs in-process (no Docker sandbox)
- Planned: stronger rubric representation or trusted judge interface (
docs/next.md)
code_completion
- Python only
- Writes candidate module, trusted test runner, and supervisor script into a fresh Docker sandbox (
network=none) - Hidden tests run via supervisor that restores import state and blocks some exit/monkeypatch attacks
- Known limitation: candidate and tests share one interpreter after import
repo_patch
- Applies candidate patch in benchmark image with path policy checks
- Applies setup/test patches from evaluation-input paths
- Runs declared check command
- Default deny policy blocks edits to tests, CI, lockfiles, shell scripts, etc.
- Known limitation: check command runs in candidate-mutated repository
terminal_task
- Takes produced workspace directory as
candidate - Materializes evaluation inputs into workspace staging
- Runs trusted
eval.checker.commandin Docker withnetwork=none - Honors dangerous command policy for
eval.needed_commands
Writing custom verifiers safely
- Register the verifier in
verifiers/registry.pyfor yourtask_type. - Implement
Verifier.verify(); acceptverification_policyin**contextif you run sandboxed commands that may need dangerous command allowances. - Read secrets only from
task.hidden_payload()orresource_value(task, name)— never from candidate-controlled files without validation. - Treat candidate input as hostile:
- Text: parse defensively; do not
eval()candidate content - Code: prefer separate processes/interpreters (see code_completion gaps)
- Patches: validate paths before apply (follow
RepoPatchVerifierpatterns) - Workspaces: run checkers in fresh sandboxes; do not trust agent-modified test scripts
- Text: parse defensively; do not
- Use Docker with
network="none"unless the family explicitly requires network. - Return structured metadata without echoing hidden resource values; rely on
tester_run._redact_verifier_metadata()as a backstop, not the primary control. - Add tests under
tests/test_*_verifier.pycovering pass, fail, and tamper cases.
Optional injection for tests:
verifier = TerminalTaskVerifier(sandbox_factory=my_factory)Eval field visibility reference
From benchmark_compiler.EVAL_VISIBILITY:
| Family | Field | Visibility |
|---|---|---|
| multiple_choice | answer | hidden |
| short_answer | accepted_answers, tolerance | hidden |
| free_response | reference_answer, rubric | hidden |
| code_completion | tests | evaluation_inputs |
| code_completion | reference_solution, canonical_solution | hidden |
| repo_patch | tests | evaluation_inputs |
| repo_patch | gold_patch | hidden |
| terminal_task | checker, needed_commands, run_tests, test_files | evaluation_inputs |
| terminal_task | expected_state | hidden |
Unknown eval keys for a family default to hidden.
Helper functions for verifiers
In tasks.py:
resource_value(task, name, default=None)resource_text(task, name, default="")optional_resource_text(task, name)resource_tuple(task, name)
Use these rather than accessing raw JSON from candidate artifacts.
Related reading
- Benchmark Packs — row authoring
- Security Model — per-family threat notes
- Extending SecureBench — add families and verifiers
- Reference → Benchmark schema