Optimization technique

TokenJam Downsize

Pay Opus prices for Opus work, not for everything else.

Find tasks where a cheaper model would suffice.

tj optimize

$ pipx install tokenjam$ tj onboard --claude-code  → Detects Claude Code installation  → Reads ~/.claude/projects/<project>/<session>.jsonl files (last 30d)  → Ingests ~9,800 spans across ~247 sessions into the local DuckDB  Done. 4M tokens analyzed, $4,287 implied API value, 23 days of history. $ tj optimize Analyzing 247 sessions, 9.8K spans, 4.0M tokens (last 30d, claude-code, api)…   ① Model downgrade: 47% of sessions look Haiku-eligible     • 116 of 247 sessions match structural heuristics:       short input (<5K tokens), short output (<500 tokens), ≤5 tool calls     • Currently running on: claude-opus-4-8, claude-sonnet-4-6     • Suggested target: claude-haiku-4-5     • At current usage: $1,886 actual → $254 estimated (-86% on flagged)     • Estimated monthly savings: $1,632 (-38% total)      ⚠ Structural heuristic only. Run `tj optimize --validate` to replay-test       on your actual sessions before applying.

The problem

Coding agents default to premium models on every call. Most calls don't need it. A "fix this typo" task gets routed to Claude Opus 4 the same way a "refactor the entire authentication module" task does.

The structural shape of those two requests is wildly different. The first one runs fine on Haiku for a fraction of the price. Downsize finds the calls that look like the cheap kind, computes what they would have cost on a smaller model, and shows you the difference, grounded in your own session history rather than a vendor benchmark.

How it works

Walk every LLM call in your captured trace history. Classify each session by structural shape: input token count, output token count, tool-call count, single-turn vs multi-turn.

Sessions matching a known small-task pattern get flagged as candidates for a cheaper model. Compute what those sessions would have cost on the target model and report the difference. The recommendation is a structural heuristic, not a quality claim, so you decide whether to apply it.

Prove it before you switch

A downsize candidate is a structural heuristic, not a guarantee. Before you move a task to a cheaper model, measure it: TokenJam Bench runs the candidate against the original on real task suites and reports, per suite, whether accuracy holds, with Wilson confidence intervals and a McNemar p-value. Measured, never "certified". You decide what to apply.

What you do with it

Recommendations land in your existing tools: the terminal, an MCP-capable agent, or an exportable config.

CLI
tj optimize downsize
MCP
should_downgrade_for_task
query from inside any MCP-capable agent
Export
tj optimize export-config --target claude-code-settings
drops into your Claude Code settings.json

The research behind it

FrugalGPT

Chen, Zaharia, Zou — Stanford 2023

Introduced the cascade-confidence framework that grounds the validation work.
RouteLLM

LMSys — ICLR 2025

Matrix-factorization router with transfer properties. Our primary routing engine; MIT-licensed.
BFCL — Berkeley Function-Calling Leaderboard

Patil et al. — ICML 2025

AST-based evaluation of tool-call structure without execution. Makes replay validation cheap for coding agents.

Downsize is in the open-source CLI. Install once, analyze everything.

Get started GitHub

The problem

How it works

Prove it before you switch

What you do with it

The research behind it

FrugalGPT

RouteLLM

BFCL — Berkeley Function-Calling Leaderboard

Downsize is in the open-source CLI. Install once, analyze everything.