About TokenJam

What is TokenJam?

TokenJam is the cost-optimization layer for your AI agents. It reads your agent's telemetry and surfaces insights that reduce your AI bill, through five sub-products that each implement a well-researched cost-optimization technique. A full observability stack comes bundled too: traces, drift detection, budgets, sensitive-action alerts, and more.

More specifically:

TokenJam "Downsize": Walks every LLM call in your agent's trace history, classifies each session by structural shape, identifies candidates for cheaper models in the same family, and shows you the candidates with potential savings and example sessions to spot-check. Downsize never claims quality equivalence; it only surfaces the candidates, and you make the call.
TokenJam "Trim": Runs every captured prompt through a local classifier (LLMLingua-2, ~280 MB, runs on your machine) that predicts which regions of the prompt the model likely ignores. Flags those low-significance regions as bloat candidates and shows you exactly which sections of your prompt template are safe to cut. No API calls, no per-run cost.
TokenJam "Cache": Two parts to this one. First, it shows your current caching ratio per (provider, model) so you can see how much of your spend is already cached. Second, it scans your real prompt history for stable prefixes that appear across many calls and suggests where to place cache_control breakpoints.
TokenJam "Script": Clusters your sessions by their (tool_name, argument_shape) signature. When the same sequence runs N or more times with zero branching, it flags those sessions as candidates for replacing an agent loop with a plain Python script. Deterministic work is far cheaper than re-deriving the same plan from an LLM every time.
TokenJam "Reuse": Clusters your sessions by plan shape and isolates the planning portion. When the same skeleton repeats across many runs with only small details changing, it flags the cluster as a reuse candidate and quantifies what you spent regenerating the plan. It then exports the skeleton as a reviewable template you can serve from a cache or convert into a slash command.
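
To make the "Script" idea more concrete, here is a minimal Python sketch of clustering sessions by their (tool_name, argument_shape) signature. It is an illustration only, not TokenJam's implementation: the session format (a list of dicts with "tool_name" and "arguments" keys), the helper names argument_shape, session_signature and script_candidates, and the min_runs threshold are all assumptions made for this example. It also reads "zero branching" in the simplest way, as every session in a cluster having an identical ordered signature.

```python
from collections import defaultdict
from typing import Any

# Hypothetical session format (an assumption for this sketch, not TokenJam's schema):
# a session is an ordered list of tool calls, each with a name and an arguments dict.
ToolCall = dict[str, Any]


def argument_shape(arguments: dict[str, Any]) -> tuple[tuple[str, str], ...]:
    """Reduce arguments to sorted (key, type name) pairs, discarding the values."""
    return tuple(sorted((key, type(value).__name__) for key, value in arguments.items()))


def session_signature(session: list[ToolCall]) -> tuple:
    """Ordered sequence of (tool_name, argument_shape) pairs for one session."""
    return tuple(
        (call["tool_name"], argument_shape(call["arguments"])) for call in session
    )


def script_candidates(
    sessions: dict[str, list[ToolCall]], min_runs: int = 3
) -> dict[tuple, list[str]]:
    """Group session ids by signature and keep groups seen at least min_runs times."""
    clusters: dict[tuple, list[str]] = defaultdict(list)
    for session_id, calls in sessions.items():
        clusters[session_signature(calls)].append(session_id)
    return {sig: ids for sig, ids in clusters.items() if len(ids) >= min_runs}


# Three sessions share the same read-then-write shape; the fourth takes a different path.
sessions = {
    "s1": [
        {"tool_name": "read_file", "arguments": {"path": "a.txt"}},
        {"tool_name": "write_file", "arguments": {"path": "out/a.txt", "content": "A"}},
    ],
    "s2": [
        {"tool_name": "read_file", "arguments": {"path": "b.txt"}},
        {"tool_name": "write_file", "arguments": {"path": "out/b.txt", "content": "B"}},
    ],
    "s3": [
        {"tool_name": "read_file", "arguments": {"path": "c.txt"}},
        {"tool_name": "write_file", "arguments": {"path": "out/c.txt", "content": "C"}},
    ],
    "s4": [
        {"tool_name": "read_file", "arguments": {"path": "d.txt"}},
        {"tool_name": "search", "arguments": {"query": "todo", "limit": 5}},
    ],
}

candidates = script_candidates(sessions, min_runs=3)
print(list(candidates.values()))  # [['s1', 's2', 's3']]
```

Because the signature keeps argument names and types but drops their values, sessions that differ only in file paths or contents land in the same cluster, while a session that calls a different tool does not. A real detector would likely need a richer notion of shape (nested arguments, optional keys) than this sketch uses.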

Where does TokenJam get its data from?

From anywhere agent telemetry is already being collected. Today, that's 20+ integration surfaces across 15+ companies:

2 coding agents: Claude Code and Codex, with zero-code setup (tj onboard --claude-code / --codex)
5 LLM provider SDKs: Anthropic, OpenAI, Gemini, Bedrock, and LiteLLM (which itself fans out to 100+ providers)
7 agent frameworks: LangChain, LangGraph, CrewAI, AutoGen, LlamaIndex, OpenAI Agents SDK, NemoClaw
5+ OpenTelemetry-native runtimes: Google ADK, AWS Strands, Microsoft Semantic Kernel, deepset Haystack, Pydantic AI (plus any other OTLP-compatible source)
2 observability platforms: backfill from Langfuse and Helicone (offline ingest of your historical data)

No telemetry? No problem.

TokenJam ships with its own observability stack, so you can drop in one of our SDK patches and start optimizing today. Everything runs on your machine: no cloud, no signup, no data egress.

Contact

TokenJam was founded and is maintained by Anil Murty, with contributions from other open source developers. Anil can be reached via LinkedIn, X (formerly Twitter), or email at anil@metabldr.com. Follow TokenJam on LinkedIn and X.