Open source · the prove step

TokenJam Bench

Measure whether a cheaper model actually holds up.

A downsize recommendation is a hypothesis. Bench tests it: run the cheaper candidate against the original on real task suites and measure, per suite, whether accuracy holds, with real statistics rather than a vibe.



$ tjb run --suite humaneval,gsm8k --baseline opus --candidate haiku
  running 200 paired tasks per suite…
  humaneval · 200/200   gsm8k · 200/200
  humaneval REGRESSED   gsm8k HOLDS
  verdict → ./bench-report.json

TokenJam Bench · per-suite verdict

haiku vs opus · 200 paired tasks per suite

Pass rate per suite (haiku / opus)

humaneval · 71% / 89% · regressed (McNemar p ≈ 0.000)
gsm8k · 94% / 95% · holds

Run summary: 1 of 2 suites held. Intervals are Wilson CIs. Results are measured, never "certified".

Why a benchmark, and why this one

Downsize surfaces cheaper-model candidates from your real usage. Whether a candidate is actually safe is a measurement question, not a guess. Bench answers it on task suites that look like your work, and reports the result per suite, never as one blended "quality" number.

Cost and accuracy, with the statistics

For each suite Bench runs the original and the candidate on the same prompts, then compares the paired outcomes. It reports the pass rate for each model with a Wilson confidence interval, and checks whether the difference is statistically significant with McNemar's test. That test looks only at the tasks where the two models disagree, because tasks that both models pass or both fail say nothing about which one is better.

A worked example shows why this matters. Downsizing from opus to haiku regressed on a coding suite (McNemar p < 0.001, shown in the report as p ≈ 0.000) while holding on a math suite. The lesson is the whole reason Bench exists: blanket downsizing is unsafe, and the only way to know for your tasks is to measure them.
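The paired comparison described above can be sketched in a few lines of Python. This is an illustration of the idea, not Bench's own code: the function name `tabulate_pairs`, the dictionary keys, and the toy outcomes are all assumptions made for the example.

```python
from collections import Counter


def tabulate_pairs(baseline_pass, candidate_pass):
    """Cross-tabulate paired pass/fail outcomes for one suite.

    Both arguments are lists of booleans, one entry per task, in the
    same task order. (Hypothetical helper, not part of tokenjam-bench.)
    """
    if len(baseline_pass) != len(candidate_pass):
        raise ValueError("outcomes must be paired task-for-task")
    counts = Counter(zip(baseline_pass, candidate_pass))
    return {
        "n": len(baseline_pass),
        "both_pass": counts[(True, True)],
        "both_fail": counts[(False, False)],
        # Discordant pairs: the only tasks McNemar's test uses.
        "baseline_only": counts[(True, False)],   # candidate lost this task
        "candidate_only": counts[(False, True)],  # candidate won this task
    }


# Toy outcomes for six tasks, invented for illustration only.
baseline = [True, True, True, False, True, False]
candidate = [True, False, True, False, False, True]

table = tabulate_pairs(baseline, candidate)
print(table)
```

The two discordant counts, `baseline_only` and `candidate_only`, are what a paired significance test works from. The concordant counts only affect the pass rates.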

Pass rate per suite

Each suite gets its own pass rate and Wilson interval. A model that holds on math and breaks on code reads as exactly that, with no averaging to hide it.
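For readers who want to see what a Wilson interval involves, here is a minimal sketch using the standard Wilson score formula. The function name `wilson_interval` and the default `z=1.96` (a 95% interval) are assumptions for this example. The source does not say which confidence level Bench uses or how it implements the calculation.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate of `passes` out of `n` tasks.

    z=1.96 corresponds to a 95% interval (an assumption; Bench's actual
    confidence level is not stated in the source).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative counts: 142 passes out of 200 tasks is a 71% pass rate.
low, high = wilson_interval(142, 200)
print(f"pass rate 71%, 95% Wilson CI: {low:.3f} to {high:.3f}")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside 0 to 1 and behaves sensibly when a pass rate is close to 0% or 100%, which is common on easy suites.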

Paired significance

McNemar's test on the disagreements tells you whether a drop is signal or noise, and gives you a p-value you can cite rather than an unexplained confidence score.
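As a rough sketch of the test itself, the exact (binomial) form of McNemar's test fits in a few lines. The function name `mcnemar_exact` and the example counts are hypothetical, and the source does not say whether Bench uses the exact form or the chi-square approximation, so treat this as one reasonable variant rather than the tool's implementation.

```python
import math


def mcnemar_exact(baseline_only, candidate_only):
    """Two-sided exact McNemar p-value from the two discordant counts.

    baseline_only:  tasks the baseline passed and the candidate failed
    candidate_only: tasks the candidate passed and the baseline failed
    Under the null hypothesis each discordant task is a fair coin flip.
    """
    n = baseline_only + candidate_only
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(baseline_only, candidate_only)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, invented for illustration.
print(mcnemar_exact(41, 5))  # lopsided: very small p-value
print(mcnemar_exact(6, 5))   # balanced: large p-value
```

A lopsided split of the disagreements yields a small p-value, while a roughly even split yields a large one, however many tasks the two models agree on.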

Run it yourself

Bench is open source. Install it, point it at a suite, give it the original and the candidate, and read the verdict.


$ pip install tokenjam-bench
$ tjb run --suite humaneval --baseline opus --candidate haiku
  → pass rate + Wilson CI per suite, plus a McNemar p-value

Source: github.com/Metabuilder-Labs/tokenjam-bench · tokenjam-bench on PyPI. Pairs with the Downsize analyzer.