Watching Claude Code with OTel: what Cursor and /cost won't show you

The Cursor-vs-Claude-Code framing is well-trodden ground at this point. There are excellent posts on the Arize blog and elsewhere comparing the two on autonomy, code quality, latency, and UX. We don’t need to redo that work.

What none of those posts engage with is the observability dimension. Which of these two tools lets you actually see what the agent did at 2am? Which one will tell you it spent $437 before the invoice does? The answer is asymmetric enough that it’s worth its own post.

The visibility surfaces, side by side

Both tools have built-in surfaces and external surfaces. Built-in is what ships in the product. External is what you can wire up.

Surface                       | Claude Code                                | Cursor Agent
------------------------------|--------------------------------------------|----------------------------------
Per-session cost view         | /cost (CLI), read-only                     | In-product usage tab
Persistent local cost history | No (resets per session)                    | Partial (in product, not on disk)
Real-time alert hooks         | No (built-in)                              | No
OTel telemetry pipe           | Yes (CLAUDE_CODE_ENABLE_TELEMETRY=1)       | No public knob
Session JSON on disk          | Yes (~/.claude/projects/<hash>/*.jsonl)    | Limited, internal
Tool-call attribution         | Yes (gen_ai.tool.* attrs)                  | Internal logs only
Behavioral drift detection    | None (built-in)                            | None
Budget cap                    | --max-budget-usd flag                      | In-product limit

The shape of the asymmetry is: Claude Code optimizes for terminal-native developers who’ll wire their own infrastructure. Cursor optimizes for IDE-native developers who want the visibility surface inside the IDE. Both are coherent product choices. Only one gives you a wire to monitor from outside.

What /cost actually shows you

The output of /cost inside Claude Code looks roughly like:

Total cost:    $0.43
Total duration (API):  1m 12.4s
Total duration (wall): 4m 28.1s
Total code changes:    3 files, 47 lines added, 12 removed
Usage by model:
  - claude-opus-4-7   12 calls   in 11.3k   out  4.2k    $0.41
  - claude-haiku-4    8 calls   in  2.1k   out  0.4k    $0.02

This is great for a spot-check while you’re sitting at the terminal. It is missing four things that turn out to matter:

  1. Persistence across sessions. Close the terminal, open a new one, /cost resets. There is no “what did Claude Code cost me this week” surface.
  2. Tool-level breakdown. You see cost by model. You don’t see cost by tool. The actual signal for retry loops is “this session called bash 200 times”, and /cost doesn’t surface tool call counts at all.
  3. The wire. It’s a slash command in a TUI. You can’t pipe it to anything. You can’t aggregate it. You can’t write an alert against it.
  4. The 2am case. You’re asleep. Nobody is typing /cost. The brake pedal has to live somewhere other than your fingers.

/cost is fine. It is also obviously not designed to be the only visibility surface for an autonomous agent.
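
The persistence gap can be partly closed after the fact from the session transcripts Claude Code leaves on disk. A minimal sketch, assuming each JSONL line is an object whose optional message.usage carries input_tokens and output_tokens alongside message.model. Those field names, and the helper names sum_usage and sum_usage_on_disk, are assumptions for illustration, not a documented schema:

```python
import json
from collections import defaultdict
from pathlib import Path


def sum_usage(lines):
    """Sum token usage per model from an iterable of JSONL lines.

    Assumed (hypothetical) record shape:
      {"message": {"model": "...", "usage": {"input_tokens": N, "output_tokens": M}}}
    Blank lines, malformed JSON, and records without usage are skipped.
    """
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        message = record.get("message") if isinstance(record, dict) else None
        if not isinstance(message, dict):
            continue
        usage = message.get("usage")
        if not isinstance(usage, dict):
            continue
        model = message.get("model", "unknown")
        for key in ("input_tokens", "output_tokens"):
            totals[model][key] += int(usage.get(key) or 0)
    return dict(totals)


def sum_usage_on_disk(root=Path.home() / ".claude" / "projects"):
    """Aggregate usage across every *.jsonl transcript under root."""
    def all_lines():
        for path in Path(root).glob("*/*.jsonl"):
            with path.open(encoding="utf-8", errors="replace") as handle:
                yield from handle
    return sum_usage(all_lines())
```

This is forensics, not alerting: it answers “what did last week cost” only after the sessions have ended. A weekly view would additionally filter records by timestamp.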

What the OTel pipe actually emits

When you set CLAUDE_CODE_ENABLE_TELEMETRY=1 and point OTEL_EXPORTER_OTLP_ENDPOINT at any collector that speaks http/protobuf, Claude Code starts emitting OTel spans. The shape of the wire is roughly:

session start
├── claude.session  (root span)
│   ├── gen_ai.request    model=claude-opus-4-7
│   │     attributes:
│   │       gen_ai.system = "anthropic"
│   │       gen_ai.request.model = "claude-opus-4-7"
│   │       gen_ai.usage.input_tokens = 11420
│   │       gen_ai.usage.output_tokens = 412
│   │       gen_ai.response.model = "claude-opus-4-7"
│   │       gen_ai.response.finish_reasons = ["tool_use"]
│   ├── gen_ai.tool.call  tool=bash
│   │     attributes:
│   │       gen_ai.tool.name = "bash"
│   │       gen_ai.tool.arguments = "{...}"   (if capture on)
│   ├── gen_ai.tool.call  tool=read_file
│   ├── ... (many more) ...
│   └── claude.session.end
└── ... next session ...

This matches the OTel GenAI semantic conventions closely enough that off-the-shelf OTel viewers (Phoenix, Jaeger, your own collector) render the spans usefully. It also matches the conventions LangChain / LangGraph / CrewAI emit through their own auto-instrumentation, which means you can aggregate across agents from different frameworks on the same backend.

A few things are worth flagging that the wire shape implies:

  • Tool calls are first-class spans. They aren’t buried in log lines or attribute blobs. You can count them, filter them, group by tool name.
  • Token counts come from the wire. You don’t have to estimate. The provider’s tokenization is what ends up on the span attribute.
  • Cost is not on the span. Claude Code emits tokens; cost calculation is on you (or on the collector). This is actually the right separation. Pricing changes; tokens don’t.
  • Session boundaries are explicit. A claude.session span wraps the whole run. Closing the parent span emits the end-of-session event, which is what we hook for cost_budget_session and drift_detected alerts.

This is enough to build everything /cost shows you, plus everything /cost doesn’t show you. The wire is the API.
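
Since cost is not on the span, the collector has to derive it. A rough sketch of that step, assuming spans arrive as plain dicts of attributes using the gen_ai.* names shown above. The price table holds placeholder numbers, not real pricing, and span_cost_usd and session_cost_usd are hypothetical helper names:

```python
# Placeholder USD prices per million tokens. These are NOT real rates;
# substitute your provider's current price sheet.
PRICES_PER_MTOK = {
    "claude-opus-4-7": {"input": 1.0, "output": 2.0},
}


def span_cost_usd(attributes, prices=PRICES_PER_MTOK):
    """Return the cost of one gen_ai.request span, or None if it can't be priced.

    A missing token count is treated as zero, so the result is a lower bound
    whenever the wire has gaps.
    """
    model = (attributes.get("gen_ai.response.model")
             or attributes.get("gen_ai.request.model"))
    rates = prices.get(model)
    if rates is None:
        return None
    input_tokens = attributes.get("gen_ai.usage.input_tokens") or 0
    output_tokens = attributes.get("gen_ai.usage.output_tokens") or 0
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000


def session_cost_usd(spans, prices=PRICES_PER_MTOK):
    """Sum priced spans; also count spans that could not be priced."""
    total, unpriced = 0.0, 0
    for attributes in spans:
        cost = span_cost_usd(attributes, prices)
        if cost is None:
            unpriced += 1
        else:
            total += cost
    return total, unpriced
```

Returning the count of unpriced spans alongside the total keeps an unknown model from silently reading as “free”.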

What Cursor’s surface looks like by comparison

To be fair to Cursor: it has a usage view inside the IDE that gives you a cost-per-session breakdown that’s roughly as informative as /cost. The settings page exposes a budget cap that does the same thing as --max-budget-usd. For users whose workflow lives inside Cursor and who never run the agent unattended, these surfaces are appropriate to the product.

What’s missing is the external surface. There’s no documented environment variable that says “emit my agent’s tool calls to this OTLP endpoint.” There’s no per-tool span breakdown you can scrape. There’s no analog to the session JSON on disk that you can pipe to a script.

Concretely, the things you can do with Claude Code that you can’t easily do with Cursor:

  1. Run a separate process that watches every span in real time. Possible with Claude Code’s OTel pipe. Not exposed for Cursor.
  2. Compute cost across all sessions in a window. Possible with either, but only Claude Code makes the raw data trivially scriptable on disk.
  3. Detect retry loops automatically. Both could in principle. Claude Code’s wire makes it ~50 lines of Python. Cursor would require reverse-engineering internal log formats.
  4. Fire a Discord alert at 2am when a tool call has been repeating for 6 spans. Possible with Claude Code via any OTel-aware tool. Not possible with Cursor’s exposed surfaces.
  5. Build a behavioral baseline from the last N sessions of the same agent. Possible with Claude Code (and how we ship drift detection). Requires scraping internals on Cursor.

This isn’t a Cursor critique. It’s a recognition that IDE-embedded agents and terminal agents have different observability surface areas. If your workflow is unattended overnight runs, terminal agents are where the wire actually exists.

What the wire catches that /cost doesn’t

Here’s the specific value the OTel pipe provides over /cost, mapped to real horror-story shapes.

The $1,700 overnight ralph loop. GH #37686: “I tried a ralph loop overnight for the first time and woke up to $1700 worth of charges.” The shape: the same tool call repeated hundreds of times. /cost shows totals; it doesn’t show the loop structure. The OTel wire emits each tool call as a span, so a watcher running against the wire sees the same tool name + args hash repeating within a short window of recent spans and fires.
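
A minimal sketch of such a watcher, using the “4 identical tool calls in the last 6” rule described later in this post. The class and function names are hypothetical, and it assumes the tool name and arguments have already been pulled off each gen_ai.tool.call span:

```python
import hashlib
import json
from collections import Counter, deque


def call_fingerprint(tool_name, arguments):
    """Hash tool name + arguments so identical calls compare equal."""
    payload = json.dumps({"tool": tool_name, "args": arguments},
                         sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RetryLoopDetector:
    """Fires when `threshold` of the last `window` tool calls are identical."""

    def __init__(self, window=6, threshold=4):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # oldest fingerprints fall off

    def observe(self, tool_name, arguments):
        """Record one tool-call span; return True if the rule fires."""
        self.recent.append(call_fingerprint(tool_name, arguments))
        _, count = Counter(self.recent).most_common(1)[0]
        return count >= self.threshold
```

Hashing the arguments rather than comparing tool names alone matters here: 200 bash calls with different commands is normal work, while four identical ones in a row rarely is.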

The $50/day silent generation. r/AI_Agents: “Logs said ‘done.’ Vault was empty.” The shape: lots of output tokens, no filesystem writes, no exit-code-0 tool returns. /cost reports the spend; it doesn’t separate productive from unproductive spans. The OTel wire makes the join trivial: span attributes tell you which spans had downstream side effects, and you alert when the ratio collapses.
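
One way to sketch that join, assuming spans are dicts with a name and an attributes map. The write-tool names and the tool.exit_code attribute are hypothetical stand-ins for whatever side-effect signal your wire actually carries, and the thresholds are arbitrary:

```python
# Hypothetical names for tools that write to disk; adjust to your agent.
WRITE_TOOLS = frozenset({"write_file", "edit_file"})


def is_productive(span):
    """True if a tool-call span looks like it had a real side effect."""
    if span.get("name") != "gen_ai.tool.call":
        return False
    attributes = span.get("attributes", {})
    if attributes.get("gen_ai.tool.name") in WRITE_TOOLS:
        return True
    # "tool.exit_code" is an assumed attribute, not part of the schema above.
    return attributes.get("tool.exit_code") == 0


def silent_generation(spans, min_output_tokens=20_000, min_productive=1):
    """Flag sessions that generate many tokens but produce no side effects."""
    output_tokens = sum(
        span.get("attributes", {}).get("gen_ai.usage.output_tokens") or 0
        for span in spans
        if span.get("name") == "gen_ai.request"
    )
    productive = sum(1 for span in spans if is_productive(span))
    return output_tokens >= min_output_tokens and productive < min_productive
```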

The 233 ghost subagents. r/ClaudeAI: “233 background agents I never asked for consumed 23% of my agent token spend.” The shape: subsessions spawning subsessions spawning subsessions. /cost shows you the bottom-line number; it doesn’t show you the tree. The OTel wire emits parent-child span relationships. Count distinct child sessions per parent and you see the fan-out the in-product view hides.
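
A sketch of that count, assuming each span is a dict with span_id, parent_span_id, and name, and that subagent runs show up as nested claude.session spans. That nesting is an assumption about the wire, and the function name is hypothetical:

```python
from collections import Counter


def child_sessions_per_parent(spans, session_name="claude.session"):
    """Count, for each session span, how many sessions it directly spawned.

    A child session's immediate parent may be a tool-call span, so we walk
    up the ancestry until we reach the nearest enclosing session span.
    """
    by_id = {span["span_id"]: span for span in spans}
    fan_out = Counter()
    for span in spans:
        if span["name"] != session_name:
            continue
        parent_id = span.get("parent_span_id")
        seen = set()  # guards against malformed, cyclic parent links
        while parent_id in by_id and parent_id not in seen:
            seen.add(parent_id)
            parent = by_id[parent_id]
            if parent["name"] == session_name:
                fan_out[parent["span_id"]] += 1
                break
            parent_id = parent.get("parent_span_id")
    return fan_out
```

A parent whose count keeps climbing during a run is the fan-out signal; an alert threshold on fan_out.most_common(1) is one simple way to use it.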

The post-compaction rule collapse. GH #40801: “Compliance starts at 100% in the first 30 minutes → drops to 50–60% after first compaction → hits 20% after second compaction.” The shape: the agent is statistically a different agent after compaction events. /cost is a single number; it has no cross-session shape. The OTel wire gives you per-session statistics (token totals, durations, tool distributions) you can build a baseline from.
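
A minimal sketch of a baseline check in the ±3σ style, assuming you have already reduced each past session to a dict of numeric statistics. The metric names in the usage note and the function name are hypothetical:

```python
import statistics


def drift_flags(baseline_sessions, current, sigmas=3.0):
    """Return metrics of `current` that sit outside mean ± sigmas * stdev.

    baseline_sessions: list of dicts of per-session numbers.
    current: dict of the same metrics for the session being judged.
    """
    flags = {}
    for metric, value in current.items():
        history = [s[metric] for s in baseline_sessions if metric in s]
        if len(history) < 2:
            continue  # a sample standard deviation needs two or more points
        mean = statistics.fmean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            drifted = value != mean  # perfectly flat baseline
        else:
            drifted = abs(value - mean) > sigmas * stdev
        if drifted:
            flags[metric] = {"value": value, "mean": mean, "stdev": stdev}
    return flags
```

For example, with five baseline sessions averaging 11 tool calls, drift_flags(baseline, {"tool_calls": 40}) flags tool_calls while {"tool_calls": 12} does not. A handful of sessions is a thin baseline, so treat early flags as hints rather than verdicts.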

Claude Code’s own cost estimate being wrong. GH #34972: “Claude Code (Opus) repeatedly provided incorrect cost estimates… resulting in a spend of ~$1,700 against an estimate of ~$300.” The agent is unreliable about its own runtime. /cost shows you what the agent reports. The OTel wire has the raw token counts, so you can compute the cost independently from the agent’s self-report. The discrepancy is exactly the signal you want.

These are five distinct failure shapes. None of them are observable from /cost alone. All five are observable from the OTel wire with off-the-shelf rules.

The minimum collector setup

The lightest path from CLAUDE_CODE_ENABLE_TELEMETRY=1 to “alerts in Discord while it’s happening” is roughly:

# 1. Turn on the wire
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:7391
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf

# 2. Run a collector that does something with the spans.
#    Options, in order of operational tax:
#    - Local CLI daemon (tj / claude-usage variants): one binary
#    - Self-hosted Langfuse: Docker Compose + ClickHouse
#    - Hosted dashboard (LangSmith, Langfuse Cloud): cloud account + bearer token

# 3. Wire alerts against the span stream
#    - retry_loop: 4 identical gen_ai.tool.call spans in last 6
#    - cost_budget_session: accumulated cost > threshold
#    - sensitive_action: tool args match dangerous-pattern set
#    - drift_detected: session stats outside ±3σ from baseline

# 4. Test the wire actually reaches your phone
#    Fire a synthetic alert; confirm Discord push notification arrives

The first three lines are the entire Claude Code-side change. Everything else is downstream of the wire, and the wire format is the same for any OTel-aware backend.
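
Of the rules in step 3, sensitive_action is the simplest to sketch: match the captured tool arguments against a set of patterns. The patterns below are illustrative rather than exhaustive, the function name is hypothetical, and the rule only works if argument capture is turned on:

```python
import re

# Illustrative dangerous-pattern set; extend for your own environment.
DANGEROUS_PATTERNS = [
    # rm with both -r and -f flags, in either order
    re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f|\brm\s+-[a-zA-Z]*f[a-zA-Z]*r"),
    # force-pushing over remote history
    re.compile(r"\bgit\s+push\b.*--force"),
    # piping a download straight into a shell
    re.compile(r"\bcurl\b[^|]*\|\s*(sudo\s+)?(ba)?sh\b"),
    # destructive SQL
    re.compile(r"\bDROP\s+(TABLE|DATABASE)\b", re.IGNORECASE),
]


def sensitive_matches(tool_arguments):
    """Return the patterns that match a tool call's captured arguments."""
    if isinstance(tool_arguments, str):
        text = tool_arguments
    else:
        text = str(tool_arguments)
    return [p.pattern for p in DANGEROUS_PATTERNS if p.search(text)]
```

Regex matching on shell strings is easy to evade and prone to false positives, so this belongs in the “page a human” tier rather than the “block the call” tier.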

If you want the shortest possible path, we ship one as tj:

pip install tokenjam
tj onboard --claude-code      # writes config, daemonizes, sets the env vars
tj demo run retry-loop        # synthetic alert in ~60 seconds

That writes ~/.tj/config.toml with a random ingest secret, installs a launchd/systemd daemon, sets CLAUDE_CODE_ENABLE_TELEMETRY=1 and the OTLP endpoint in your Claude Code settings, and registers an MCP server so Claude Code can query its own observability mid-run.

If you’d rather assemble it from parts (Langfuse self-hosted, Phoenix in a notebook, a custom OTel collector), go for it. The wire format is the same, which is most of the point.

“But Cursor has been ignoring this on purpose”

A reasonable response to all of the above: maybe Cursor’s choice not to expose an OTel wire is correct for their audience. If you’re an IDE-embedded developer who never runs the agent unattended, the in-product view is sufficient and exposing an OTLP endpoint is operational complexity you don’t need.

That’s plausible. We’re not arguing Cursor should ship OTel; we’re arguing that if your workflow is unattended Claude Code, Codex CLI, or any ralph/--dangerously-skip-permissions loop, the OTel wire is the thing that makes the unattended part safe. The asymmetry is workflow-dependent, not value-dependent.

The corollary worth saying out loud: if you’re using Cursor for interactive coding and Claude Code for unattended work, the observability story is only asymmetric on the unattended side. Cursor’s in-product view is plausibly enough for the interactive lane. The wire matters where the human isn’t there.

What Codex CLI looks like in this picture

Briefly, because it comes up: Codex CLI emits OTel similar to Claude Code. The variable names are slightly different (CODEX_ENABLE_TELEMETRY=1 for the analogous toggle, last we checked), the span schema is OTel GenAI-ish but with provider-specific extensions, and tj onboard --codex handles the bootstrap the same way it does for Claude Code. If you’re running both, the same collector handles both wires.

LangChain, LangGraph, CrewAI, AutoGen: same story. The auto-instrumentation in the standard Python OTel libraries emits gen_ai.* attributes. Anything OTLP-compliant lands on the same backend.

The honest limitations

Three caveats worth being clear about.

1. Tool-output capture is opt-in. By default gen_ai.tool.output is not on the span (in Claude Code’s emission and in TokenJam’s default capture config). This is intentional for privacy / disk-cost reasons. It does mean that some forensics workflows (schema validation, openevals export) need an explicit flip. We’re tracking the right default. It’s a real open question.

2. The wire has gaps. Not every OTel GenAI semconv attribute is emitted by every agent. gen_ai.response.finish_reasons is reliable on Claude Code; gen_ai.usage.cache_read_input_tokens is patchier. If you’re computing cost off the wire, you have to handle missing attributes gracefully.

3. Sleeping laptops don’t emit spans. If the machine running Claude Code is asleep, nothing on the machine is running, including your monitor. Run unattended agents on a machine that stays awake: caffeinate -i on macOS, the equivalent suspend-inhibit on Linux, or just a cloud VM. The wire only helps if the wire is live.

These are not deal-breakers. They are the kinds of things that look obvious in hindsight and bite people in real time.

The wider point

Claude Code shipped an OTel pipe and it is, as of mid-2026, the single most important thing the product did for unattended-agent safety. It turned the agent from a black box into something you can build infrastructure around. The five failure shapes the community keeps publishing (overnight retry loops, silent generations, ghost subagents, post-compaction drift, agent-self-report discrepancies) all became externally observable in one move.

Cursor hasn’t made that move. That’s a workflow choice, and it’s defensible for IDE-embedded interactive use. It also means that if your workflow involves the agent running while you’re not looking, the answer to “how do I monitor this” is meaningfully different depending on which tool you picked.

/cost is a meter. The OTel wire is the wire. They are not the same thing. The community has been confusing them for long enough that the confusion has cost real money. The fix isn’t a better dashboard. It’s a better understanding of what each surface actually emits.

export CLAUDE_CODE_ENABLE_TELEMETRY=1
pip install tokenjam
tj onboard --claude-code

Common questions

Does Cursor have any equivalent to CLAUDE_CODE_ENABLE_TELEMETRY?
Not at the level Claude Code does, as of mid-2026. Cursor has an in-product usage view and a budget cap setting, but there's no documented public knob to redirect tool-call and LLM-call spans to an arbitrary OTLP endpoint. If you need real-time external observability on an unattended autonomous loop, that's a meaningful difference between the two tools.

Can I get the same data from Claude Code's session JSON files instead?
Partially. Claude Code writes session transcripts to ~/.claude/projects/<hash>/*.jsonl. You can parse those for cost and tool-call attribution after the fact. The OTel wire gives you the same data in real time, which is the difference between forensics (the next morning, after the bill arrived) and alerting (now, while it's still fixable). Most people end up using both: the JSONL files for post-hoc analysis and the OTel wire for runtime alerts.

What happens to the wire when I'm on a flight without internet?
If your collector is local (TokenJam, claude-usage, Phoenix-in-notebook), nothing changes. The spans flow to the local DuckDB or SQLite. If your collector is a cloud dashboard (LangSmith, Langfuse Cloud, Helicone) and you're offline, spans buffer or drop depending on the SDK's retry logic. This is the reason we ship local-first: the wire and the storage live on your machine, so the network can be anything.

Does Cursor support Claude Opus and other models the same way Claude Code does?
Yes. Both Cursor and Claude Code can use Anthropic, OpenAI, and other provider models. The model choice isn't the asymmetry. The asymmetry is observability surface area: Claude Code exposes OTel spans for tool calls and LLM calls; Cursor exposes the usage view in the IDE. Same models, different visibility surfaces around them.

Is the OTel wire the same as what LangSmith / Langfuse / Helicone consume?
Yes. They all speak OTLP. You can point Claude Code's OTLP endpoint at LangSmith's ingestion URL, at a Langfuse instance, at Helicone's gateway, at Phoenix, or at TokenJam. Same wire format. The downstream UX, cost, and alerting capabilities differ, but the integration on the Claude Code side is identical. Pick the collector based on what you need to do with the spans, not based on the wire.

What's the privacy posture on the wire?
Out of the box, Claude Code emits span metadata (model names, token counts, tool names) but not tool inputs/outputs unless you explicitly enable content capture. Whether that content is then retained also depends on the capture configuration of your downstream collector. If you're privacy-sensitive, local-first collectors (DuckDB on your laptop) are the lowest-exposure choice. Nothing leaves the machine.
