Open source · the prove step
TokenJam Bench
Measure whether a cheaper model actually holds up.
A downsize recommendation is a hypothesis. Bench tests it: run the cheaper candidate against the original on real task suites and measure, per suite, whether accuracy holds, with real statistics rather than a vibe.
$ tjb run --suite humaneval,gsm8k --baseline opus --candidate haiku
running 200 paired tasks per suite…
humaneval · 200/200
gsm8k · 200/200
humaneval REGRESSED
gsm8k HOLDS
verdict → ./bench-report.json
TokenJam Bench · per-suite verdict
Why a benchmark, and why this one
Downsize surfaces cheaper-model candidates from your real usage. Whether a candidate is actually safe is a measurement question, not a guess. Bench answers it on task suites that look like your work, and reports the result per suite, never as one blended "quality" number.
Cost and accuracy, with the statistics
For each suite, Bench runs the original and the candidate on the same prompts, then compares the paired outcomes. It reports the pass rate for each model with a Wilson confidence interval, and checks whether the difference is statistically significant with McNemar's test, which uses only the cases where the two models disagree.
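To make "paired outcomes" concrete, here is a minimal sketch of how per-prompt pass/fail results could be tabulated into the four cells of a 2x2 table. The function name `paired_counts`, the dictionary keys, and the toy data are hypothetical illustrations, not Bench's actual internals or output format.

```python
from collections import Counter

def paired_counts(baseline_pass, candidate_pass):
    """Tabulate paired pass/fail outcomes into the four cells of a 2x2 table.

    Both arguments are lists of booleans, one entry per prompt, in the
    same prompt order. (Hypothetical helper, not part of tokenjam-bench.)
    """
    if len(baseline_pass) != len(candidate_pass):
        raise ValueError("both models must be scored on the same prompts")
    cells = Counter(zip(baseline_pass, candidate_pass))
    return {
        "both_pass": cells[(True, True)],
        "both_fail": cells[(False, False)],
        "baseline_only": cells[(True, False)],   # baseline passed, candidate failed
        "candidate_only": cells[(False, True)],  # candidate passed, baseline failed
    }

# Toy data for illustration only: five prompts scored for each model.
baseline = [True, True, False, True, False]
candidate = [True, False, False, False, True]
counts = paired_counts(baseline, candidate)
print(counts)
```

The two discordant cells, `baseline_only` and `candidate_only`, are the only ones McNemar's test looks at; prompts where both models pass or both fail carry no information about which model is better.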
A worked example shows why this matters. Downsizing from opus to haiku regressed on a coding suite (McNemar p < 0.001) while holding on a math suite. The lesson is the whole reason Bench exists: blanket downsizing is unsafe, and the only way to know for your tasks is to measure them.
Pass rate per suite
Each suite gets its own pass rate and Wilson interval. A model that holds on math and breaks on code reads as exactly that, with no averaging to hide it.
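For readers who want to see what a Wilson interval looks like in practice, the sketch below computes one from a pass count and a task count using the standard Wilson score formula. The function name `wilson_interval` and the example counts are assumptions for illustration; the page does not show how Bench computes or formats its intervals.

```python
import math

def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate.

    passes: number of tasks passed
    n:      number of tasks attempted
    z:      normal quantile (1.96 corresponds to roughly 95% coverage)
    (Hypothetical helper, not part of tokenjam-bench.)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)

# Illustrative counts only: 170 passes out of 200 tasks.
low, high = wilson_interval(170, 200)
print(f"pass rate 0.850, 95% Wilson CI [{low:.3f}, {high:.3f}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly when a pass rate is close to 0% or 100%, which is common on easy or very hard suites.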
Paired significance
McNemar's test on the disagreements tells you whether a drop is signal or noise, with a p-value you can cite instead of a single confidence percentage.
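As a rough illustration of the paired test, the sketch below implements the exact (binomial) form of McNemar's test on the two disagreement counts. Whether Bench uses the exact form or the chi-square approximation is not stated on this page, so treat the choice here, along with the name `mcnemar_exact` and the example counts, as assumptions.

```python
from math import comb

def mcnemar_exact(baseline_only, candidate_only):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    baseline_only:  prompts the baseline passed and the candidate failed
    candidate_only: prompts the candidate passed and the baseline failed
    Under the null hypothesis each disagreement is equally likely to go
    either way, so the smaller count follows Binomial(n, 0.5).
    (Hypothetical helper, not part of tokenjam-bench.)
    """
    n = baseline_only + candidate_only
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(baseline_only, candidate_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Illustrative counts only: the candidate loses 10 prompts and wins none.
p_value = mcnemar_exact(baseline_only=10, candidate_only=0)
print(f"McNemar exact p = {p_value:.4f}")
```

A lopsided split among the disagreements gives a small p-value, while an even split gives a p-value near 1, regardless of how many prompts both models agree on.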
Run it yourself
Bench is open source. Install it, point it at a suite, give it the original and the candidate, and read the verdict.
$ pip install tokenjam-bench
$ tjb run --suite humaneval --baseline opus --candidate haiku
→ pass rate + Wilson CI per suite, plus a McNemar p-value
Source: github.com/Metabuilder-Labs/tokenjam-bench · tokenjam-bench on PyPI. Pairs with the Downsize analyzer.