Architecture
The engine is split into three layers under lib/, driven by a thin main.py entry point; each has one job:
main.py                  # CLI + orchestration only
lib/
  task/                  # pipeline mechanics: knows nothing about benchmarks
    cache.py             # row_cache: file-based JSON cache decorator
    retry.py             # retry: exponential backoff decorator
    concurrent.py        # concurrent: thread-pool fan-out decorator
    io.py                # write_json: atomic file write
  core/                  # domain logic: pure functions, no I/O
    client.py            # LLM call + cost tracking via litellm
    gen_metrics.py       # metric generation from benchmark description
    generate.py          # scenario generation + demographic expansion
    simulate.py          # single conversation loop
    evaluate.py          # conversation scoring
    aggregate.py         # pass rates, grouping
    prompts.py           # template loader
    cost.py              # cost reporting
    usage.py             # Usage value type
  pipeline/              # wires task mechanics + core logic together
    utils.py             # load_yaml, save_yaml
    gen_metrics.py / gen_scenarios.py / simulate.py / evaluate.py / aggregate.py
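The mechanics in lib/task/ are generic enough to sketch without any benchmark knowledge. The block below is a minimal illustration, not the project's actual code. The names retry and write_json come from the tree above, but their signatures here (attempts, base_delay, and the injectable sleep parameter) are assumptions made for the example.

```python
import functools
import json
import os
import tempfile
import time


def retry(attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry a function with exponential backoff (hypothetical signature)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    # Out of attempts: re-raise the last error unchanged.
                    if attempt == attempts - 1:
                        raise
                    # Delay doubles each time: base, 2*base, 4*base, ...
                    sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator


def write_json(path, data):
    """Write JSON atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Passing sleep as a parameter is a choice made for this sketch: it lets a test record the backoff delays instead of actually waiting.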
Layer rules:
lib/core/: pure functions, no I/O, fully testable in isolation.
lib/task/: pipeline mechanics, knows nothing about benchmarks.
lib/pipeline/: composes the two; one file per phase.
main.py: CLI flags, skip/force logic, orchestration; nothing domain-specific.
This is why lib/core/ can be unit-tested without mocking the world, and why
the mechanics (caching, retry, concurrency) are reusable across all five phases.
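The separation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: pass_rate, fan_out, run_aggregate, and the cache-key scheme are hypothetical, and row_cache's real signature is not given in this section. For simplicity, fan_out is written as a plain function rather than the decorator that concurrent.py provides.

```python
import functools
import hashlib
import json
import os
from concurrent.futures import ThreadPoolExecutor


# --- core layer: a pure function, testable with plain asserts ---
def pass_rate(scores, threshold=0.5):
    """Fraction of scores at or above the threshold (hypothetical core logic)."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


# --- task layer: generic mechanics, no benchmark knowledge ---
def row_cache(cache_dir):
    """Cache each call's JSON-serialisable result in a file keyed by its argument."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(row):
            # Key the cache file on a stable hash of the input row.
            key = hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()
            path = os.path.join(cache_dir, f"{key}.json")
            if os.path.exists(path):
                with open(path, encoding="utf-8") as f:
                    return json.load(f)
            result = fn(row)
            with open(path, "w", encoding="utf-8") as f:
                json.dump(result, f)
            return result
        return wrapper
    return decorator


def fan_out(fn, rows, max_workers=4):
    """Apply fn to every row on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, rows))


# --- pipeline layer: wires the two together for one phase ---
def run_aggregate(rows, cache_dir):
    @row_cache(cache_dir)
    def score_row(row):
        return {"id": row["id"], "pass_rate": pass_rate(row["scores"])}
    return fan_out(score_row, rows)
```

Because pass_rate takes plain values and returns a plain value, a unit test can assert on it directly with no mocks; only run_aggregate touches the filesystem and the thread pool.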