Output schema¶
benchmarks/<name>/
benchmark.yaml # benchmark definition; metrics list written by gen_metrics
scenarios.json # generated + expanded scenarios (gen_scenarios)
cost.json # token usage and cost for gen_scenarios
results.json # aggregated scores across all models (aggregate)
runs/
<model-id>/
conversations.json # simulated transcripts (simulate)
scores.json # per-row evaluation results (evaluate)
cost.json # token usage and cost for simulate + evaluate
scenarios.json¶
A flat list of scenario rows — one per metric × scenario × demographic variant.
Each row is produced by gen_scenarios and passed directly into simulate.
{
"id": "m01_s001_v01",
"metric_id": "m01",
"metric_name": "Asks essential clarifying questions",
"metric_type": "positive",
"persona": "...",
"user_goal": "...",
"latent_adversarial_goal": "...",
"landmarks": [
{"turn": 1, "instruction": "..."},
{"turn": 3, "instruction": "..."}
],
"demographic": {"age": "Child or teenager (6-17)", "gender": "nonbinary"}
}
id encodes lineage: m01_s002_v01 is metric 1, scenario 2, demographic variant 1.
latent_adversarial_goal is not shown to the target model — only the user model sees it.
conversations.json¶
One row per scenario × sample. Each row is the scenario row from scenarios.json
extended with:
{
"conv_id": "m01_s001_v01__claude-opus-4-6",
"target": {"id": "claude-opus-4-6", "model": "claude-opus-4-6"},
"transcript": [
{"role": "user", "content": "..."},
{"role": "assistant", "content": "..."}
]
}
transcript is the full conversation in chronological order, alternating user/assistant turns.
conv_id is <scenario_id>__<model_id> and uniquely identifies the conversation.
scores.json¶
One row per conversation (one per scenario × sample × metric). Written by evaluate.
{
"id": "m01_s001_v01",
"metric_id": "m01",
"metric_name": "Asks essential clarifying questions",
"metric_type": "positive",
"target_model": "claude-opus-4-6",
"conv_id": "m01_s001_v01__claude-opus-4-6",
"present": true,
"passed": true,
"score": 1,
"justification": "...",
"sample": 0
}
present is the raw evaluator verdict: was this behavior detected in the transcript?
passed and score are derived from present + metric_type: for a negative metric,
present: true means passed: false. justification is the evaluator's explanation.
sample is the zero-indexed sample number when num_samples > 1.
results.json¶
A list of per-model summaries produced by aggregate, sorted by positive_pass_rate descending.
[
{
"target_model": "gpt-4o",
"positive_pass_rate": 0.81,
"negative_pass_rate": 0.64,
"n_positive": 24,
"n_negative": 16,
"n_total": 40,
"by_metric": {
"m01": {"pass_rate": 0.833, "n_passed": 5, "n_total": 6}
},
"by_scenario": {
"m01_s001_v01": {"pass_rate": 1.0, "n_passed": 1, "n_total": 1}
}
}
]
positive_pass_rate and negative_pass_rate are computed separately across metrics of each type.
A pass rate of null means no scored rows of that type exist — distinct from 0.0, which means all rows failed.
cost.json¶
Records token usage and cost per phase.
Top-level cost.json (benchmark root) covers gen_scenarios:
{
"gen_scenarios": {
"phase": "gen_scenarios",
"cost": 0.00201,
"input_tokens": 5658,
"output_tokens": 1933
}
}
runs/<model>/cost.json covers simulate and evaluate for that model:
{
"simulate": {
"phase": "simulate",
"cost": 0.014484,
"input_tokens": 41725,
"output_tokens": 14158
}
}