Open Benchmark of AI Impact on Humans¶

A structured pipeline for running LLM behavioral benchmarks. It generates realistic simulated conversations, evaluates them against defined behavioral metrics, and aggregates results across models.

View on GitHub

The pipeline¶

Each benchmark defines a set of metrics: specific behaviors an AI assistant should or shouldn't exhibit. A run moves through five phases, each cached so an interrupted run resumes where it left off.

flowchart LR
    GM[gen_metrics] --> GS[gen_scenarios] --> S[simulate] --> E[evaluate] --> A[aggregate]

gen_metrics: generates metrics from a benchmark name and description.
gen_scenarios: writes adversarial test scenarios for each metric, then expands them across demographic variants.
simulate: a user model and a target model hold a conversation per scenario; the user model is adversarially prompted to probe for failures.
evaluate: an evaluator model scores each conversation. Is the target behavior present?
aggregate: pass rates and breakdowns are written to results.json.

Running it¶

uv sync && cp .env.example .env  # add API keys
python main.py <benchmark> all   # run all five phases
python main.py all               # all benchmarks x all targets

Use --config to point at a different config file. All run behavior (concurrency, force re-run, dry run) is set in config.yaml.

Code layout¶

lib/core/        pure functions: gen_metrics, generate, simulate, evaluate, aggregate
lib/pipeline/    phase runners: wire core logic with caching + concurrency
lib/task/        decorators: row_cache, concurrent, retry, write_json
prompts/         prompt templates for each LLM call
benchmarks/      one directory per benchmark, each with benchmark.yaml
main.py          CLI entry point
config.yaml      models, concurrency, generation settings, target list

Each pipeline phase wraps its core function in three decorators:

@concurrent(workers)   # fan out over rows in a thread pool
@retry(3)              # retry on failure with exponential backoff
@row_cache(cache_dir)  # skip rows already on disk
def step(row): ...

Every phase is resumable, parallel, and fault-tolerant with no phase-specific code for any of it. See The decorator stack.

Where to go next¶

How It Works: the design of multi-turn adversarial simulation.
Quickstart: install and run your first benchmark.
Configuration: all config options explained.
Writing a benchmark: define metrics and scenarios.
Internals: full code architecture.