<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>TokenJam Blog</title><description>Researching the agentic AI ecosystem.</description><link>https://tokenjam.dev/</link><language>en-us</language><item><title>What is Agent Memory and why does it matter?</title><link>https://tokenjam.dev/blog/2026-05-13-agent-memory/</link><guid isPermaLink="true">https://tokenjam.dev/blog/2026-05-13-agent-memory/</guid><description>How AI agents persist state across sessions, why memory is different from RAG, and the open-source projects building this layer.</description><pubDate>Wed, 13 May 2026 00:00:00 GMT</pubDate><content:encoded>import TLDR from &apos;@/components/TLDR.astro&apos;;
import FAQBlock from &apos;@/components/FAQBlock.astro&apos;;

&lt;TLDR&gt;
- Agent memory is persistent state (facts, decisions, relationships) that an AI agent carries across sessions and beyond a single context window.
- Memory is not RAG. RAG retrieves static documents; memory is dynamic state the agent writes to as it learns.
- The hard parts: bounded context windows, approximate semantic recall, and deciding what to remember versus forget.
- Four useful categories: short-term (in-context), long-term (persisted), semantic (facts), and episodic (time-stamped events).
- The active projects: Letta (tiered OS-style memory), Mem0 (vector + graph hybrid), Zep (temporal knowledge graphs), LangMem, Cognee, Supermemory.
&lt;/TLDR&gt;

**Agent memory** is the persistent state an AI agent maintains across sessions and beyond the LLM&apos;s context window. It stores facts the agent has learned, decisions it has made, and relationships it has tracked, so a future interaction can retrieve and act on them. Without memory, every session starts from zero.

## Why it matters

A stateless LLM forgets everything when the conversation ends. That is fine for one-off questions, where the model behaves like a search engine. But for agents to be more useful than a search engine, they need to track a multi-week conversation, make progress toward a goal, and improve over time based on what they learn from the user.

Memory is what turns a chatbot into something you can hand ongoing work to. It is also where the hardest unsolved problems live: how to compress conversations into useful facts, how to retrieve the right fact at the right time, and how to handle the moment when a user&apos;s stated preference today contradicts what they said three months ago.

## Why memory is hard

Three problems make this a live research area.

**Context windows are bounded.** Claude Sonnet 4.5 has a 200K context. GPT-5 reaches 400K. Even at the high end, an agent serving one customer over six months accumulates more conversational data than any context can hold. You cannot just stuff history into the prompt.

**Semantic recall is approximate.** Vector embeddings let you ask &quot;find facts similar to this query,&quot; but the result quality depends on phrasing, embedding model, and how facts were chunked when stored. Multi-hop reasoning (&quot;connect fact A and fact B to answer question C&quot;) and temporal reasoning (&quot;was that true last month?&quot;) both stress current approaches. Graph-based memory helps with multi-hop questions, at the cost of curating structure from unstructured chat.

**Deciding what to forget is itself a design problem.** Should the agent store every word, or distill summaries? When a user contradicts an earlier preference, do you delete the old fact, mark it invalid with a timestamp, or keep both and let retrieval pick? There are no universal answers. The right policy depends on whether you are building a personal assistant, a customer-support agent, or a coding agent that needs to remember repo conventions.

## Categories of memory

Memory systems organize knowledge along two axes: temporal scope (within a session or across sessions) and representation (what form the knowledge takes).

### Short-term and long-term

**Short-term memory** lives in the LLM&apos;s context window. It is the transcript of the current exchange. Cheap to implement, capped by context size, and gone when the session ends.

**Long-term memory** persists outside the context window in a database, vector store, or knowledge graph. The agent compresses short-term context into long-term facts before a session ends, then retrieves the relevant slice in the next session.

### Semantic and episodic

**Semantic memory** holds knowledge without a timestamp: &quot;this user prefers dark mode,&quot; &quot;the team lead is Sarah,&quot; &quot;our API rate limit is 1000 req/sec.&quot; It answers &quot;what is true&quot; questions. Vector indexes and knowledge graphs are the usual representations.

**Episodic memory** is tied to time and context: &quot;on 2026-04-12 the user reported a checkout bug,&quot; &quot;in session 147 the agent escalated to a human.&quot; It answers &quot;what happened&quot; questions and underwrites causal reasoning. Event logs or timestamped graph edges are typical.

Production systems blend both. Zep tracks when facts were true. Mem0 combines vector retrieval with graph relationships. Letta tiers everything through an OS-style hierarchy.
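
A minimal sketch of the two record shapes, using plain Python dataclasses (field names are illustrative, not any particular library&apos;s schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SemanticFact:
    # Timeless knowledge: answers &quot;what is true&quot; questions.
    subject: str      # e.g. &quot;user:42&quot;
    predicate: str    # e.g. &quot;prefers&quot;
    value: str        # e.g. &quot;dark mode&quot;

@dataclass
class EpisodicEvent:
    # Time-stamped experience: answers &quot;what happened&quot; questions.
    occurred_at: datetime
    session_id: str
    description: str  # e.g. &quot;user reported a checkout bug&quot;

facts = [SemanticFact(&quot;user:42&quot;, &quot;prefers&quot;, &quot;dark mode&quot;)]
events = [EpisodicEvent(datetime(2026, 4, 12), &quot;session-147&quot;,
                        &quot;user reported a checkout bug&quot;)]
```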

## Memory is not RAG

This is the distinction worth being precise about, because the two get conflated constantly.

**RAG (retrieval-augmented generation)** reads from a fixed external corpus: a product manual, a docs site, a corpus of papers. The LLM consults that corpus at inference time. It does not write to it. The corpus is authoritative; the agent is a reader. RAG is excellent for &quot;what is the API rate limit?&quot; because the answer lives in one place and does not change based on conversation.

**Agent memory** is bidirectional. The agent writes facts during conversations (&quot;the user prefers tea&quot;), reads from memory to personalize responses, and updates memory when facts change. Memory is about the agent&apos;s own accumulated experience, not an external reference. An agent serving the same customer five times hits the same product docs each visit via RAG, and recalls what the customer asked about last time via memory.
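
The write direction is the whole distinction, and it fits in a few lines. A toy in-memory sketch (nothing here is a real library; production memory uses embeddings and a persistent store):

```python
class MemoryStore:
    &quot;&quot;&quot;Toy read-write memory keyed by user. Illustrative only.&quot;&quot;&quot;
    def __init__(self):
        self.facts: dict[str, list[str]] = {}

    def write(self, user_id: str, fact: str) -&gt; None:
        self.facts.setdefault(user_id, []).append(fact)

    def search(self, user_id: str, query: str) -&gt; list[str]:
        # Real systems use embedding similarity; substring match stands in here.
        return [f for f in self.facts.get(user_id, []) if query in f]

RAG_CORPUS = [&quot;The API rate limit is 1000 req/sec.&quot;]   # fixed, read-only

memory = MemoryStore()
memory.write(&quot;user:42&quot;, &quot;prefers tea&quot;)                  # agent writes as it learns
print(memory.search(&quot;user:42&quot;, &quot;tea&quot;))                   # agent reads next session
print([d for d in RAG_CORPUS if &quot;rate limit&quot; in d])      # RAG: read-only lookup
```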

The [xMemory paper](https://arxiv.org/abs/2602.02007) put it this way: RAG targets large heterogeneous corpora with diverse passages; agent memory deals with bounded, coherent dialogue streams whose spans are highly correlated. Most production agents use both. RAG for reference knowledge, memory for personalization and continuity.

## Notable projects

The agent memory space matured fast across 2024 and 2025. Here are the systems worth knowing.

### Letta (formerly MemGPT)

[Letta](https://www.letta.com/) grew out of the [MemGPT](https://arxiv.org/abs/2310.08560) research project from UC Berkeley. MemGPT proposed a tiered architecture borrowed from operating systems: a small &quot;core&quot; context that acts like CPU cache, an &quot;archival&quot; store that acts like RAM, and a vector index for semantic retrieval. The agent decides what to keep in core context and what to push to archival, writing explicit calls like `core_memory_replace()` as part of its action loop. Letta now offers a framework for building, inspecting, and deploying agents with multi-level memory, with both open-source and managed deployment paths.
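
A toy version of the tiered pattern (hypothetical names, not Letta&apos;s actual API): the agent keeps a small core block in the prompt and explicitly evicts overflow to archival storage.

```python
class TieredMemory:
    &quot;&quot;&quot;Toy MemGPT-style tiers: a small in-context core plus unbounded archival.&quot;&quot;&quot;
    CORE_LIMIT = 3  # facts that fit in the prompt (stand-in for a token budget)

    def __init__(self):
        self.core: list[str] = []      # always in the prompt (CPU-cache analogue)
        self.archival: list[str] = []  # persisted outside the prompt (RAM analogue)

    def core_memory_replace(self, old: str, new: str) -&gt; None:
        # The agent issues this as an explicit action in its loop.
        self.core[self.core.index(old)] = new

    def remember(self, fact: str) -&gt; None:
        if len(self.core) &lt; self.CORE_LIMIT:
            self.core.append(fact)
        else:
            self.archival.append(fact)  # evict to archival instead of dropping
```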

### Mem0

[Mem0](https://mem0.ai/) is a drop-in memory layer with a hybrid architecture: vector store for semantic search, graph store for relationship reasoning, key-value store for direct lookups. The platform extracts facts from conversations automatically, classifies them, and routes them to the appropriate backend. Storage is pluggable (Pinecone, Neo4j, others). Mem0 also publishes research on memory-aware LLM evaluation.

### Zep

[Zep](https://www.getzep.com/) built [Graphiti](https://github.com/getzep/graphiti), a temporal knowledge graph engine that tracks not just facts but when those facts were true. Graphiti uses a bi-temporal model: transaction time (when the fact was learned) and valid time (when the fact was true in the world). That lets agents query historical state and avoid the &quot;user once said coffee, now says tea&quot; contradiction problem. Zep reports strong results on the Deep Memory Retrieval benchmark relative to MemGPT.
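
The bi-temporal idea in miniature (field names are illustrative, not Graphiti&apos;s schema): each fact carries both when the system learned it and when it held in the world, so a contradiction closes a validity interval instead of deleting history.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class BiTemporalFact:
    fact: str
    transaction_time: datetime        # when the system learned it
    valid_from: datetime              # when it became true in the world
    valid_to: datetime | None = None  # None = still true

facts = [
    BiTemporalFact(&quot;user prefers coffee&quot;, datetime(2026, 1, 5),
                   valid_from=datetime(2026, 1, 5), valid_to=datetime(2026, 4, 1)),
    BiTemporalFact(&quot;user prefers tea&quot;, datetime(2026, 4, 1),
                   valid_from=datetime(2026, 4, 1)),
]

def true_at(when: datetime) -&gt; list[str]:
    return [f.fact for f in facts
            if f.valid_from &lt;= when and (f.valid_to is None or when &lt; f.valid_to)]

print(true_at(datetime(2026, 2, 1)))  # [&apos;user prefers coffee&apos;]
```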

### LangMem

[LangMem](https://www.langchain.com/blog/langmem-sdk-launch) is LangChain&apos;s lightweight SDK for long-term memory in LangGraph agents, released in early 2025. It ships pre-built tools for extracting procedural, episodic, and semantic memories, a background manager that consolidates memories over time, and integration with LangGraph&apos;s long-term memory store. Storage-backend-agnostic, which makes it a reasonable choice if you are already invested in LangChain.

### Cognee

[Cognee](https://www.cognee.ai/) frames itself as a memory control plane: a unified layer for building knowledge graphs from conversational data. Cognee ingests from 30+ sources (Notion, Slack, email, S3), enriches with embeddings and relationship extraction, and exposes four operations: remember, recall, forget, improve. The &quot;memify&quot; process continuously prunes stale knowledge and strengthens frequently-used connections.

### Supermemory

[Supermemory](https://supermemory.ai/) combines a custom vector-graph engine with ontology-aware edges, hybrid vector and keyword search, and automatic ingestion from common tools (Gmail, Drive, Slack). It reports #1 results on three benchmarks: LongMemEval, LoCoMo, and ConvoMem. It also ships a browser extension and an MCP server, which makes memory accessible to any compatible agent.

## Evaluating memory

How do you measure whether an agent is remembering the right things? The honest answer: poorly, and the field knows it.

[LongMemEval](https://arxiv.org/abs/2410.10813), published in 2024, was the first serious attempt. It tests five abilities: information extraction (recalling specific facts from long histories), multi-session reasoning (synthesizing across separate conversations), temporal reasoning (understanding when things happened), knowledge updates (correcting itself when facts change), and abstention (knowing what it does not know). The benchmark embeds 500 curated questions in realistic chat histories spanning 115K tokens at the short end and up to 1.5M tokens at the long end. Even GPT-4o lands around 30 to 70 percent accuracy depending on the slice, which gives you a sense of how unsolved this is.

LoCoMo and ConvoMem cover overlapping ground from different angles. None of them measures usefulness in production, where the question is whether memory actually improved the user experience, not whether retrieval was technically correct.

In practice, teams evaluate memory through retrieval accuracy (did the system return the fact you stored?), behavioral change (did the agent&apos;s next response reflect what it learned?), temporal consistency (after a contradiction, does the agent know the current truth?), and context efficiency (did memory reduce the need to pass long history every turn?). Observability tools like [LangSmith](https://smith.langchain.com/) can log memory operations. Automated evaluation of what *should* have been remembered remains mostly manual.
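
A sketch of the retrieval-accuracy check, written against any store with `write` and `search` methods (such as the toy `MemoryStore` above; the harness itself is illustrative):

```python
def check_recall(memory, cases) -&gt; float:
    &quot;&quot;&quot;cases: list of (stored_fact, later_query) pairs. Returns recall rate.&quot;&quot;&quot;
    hits = 0
    for fact, query in cases:
        memory.write(&quot;user:test&quot;, fact)              # 1. insert the fact
        results = memory.search(&quot;user:test&quot;, query)  # 2. fresh-session lookup
        hits += any(fact in r for r in results)      # 3. did the fact come back?
    return hits / len(cases)

# Usage with the toy MemoryStore defined earlier in this post:
# check_recall(MemoryStore(), [(&quot;prefers tea&quot;, &quot;tea&quot;)])
```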

## Common questions

&lt;FAQBlock items={[
  {
    question: &quot;Isn&apos;t memory just RAG with a vector store?&quot;,
    answer: &quot;No. RAG reads from a fixed external corpus. Memory is dynamic state the agent writes to as it learns. You can build memory on top of a vector store, but the distinction is direction: RAG is read-only against authoritative content; memory is read-write against the agent&apos;s own experience. Production systems use both, for different jobs.&quot;
  },
  {
    question: &quot;Do I need memory for a short-running agent?&quot;,
    answer: &quot;Probably not. If your agent handles single-turn or within-session interactions, short-term context history is enough. Long-term memory pays off when the agent needs to recognize returning users, track multi-session goals, or adapt to a specific person over time. A chatbot handling 100 independent queries a day does not need it. A personal assistant working with the same user for weeks does.&quot;
  },
  {
    question: &quot;Can I just use a 1M-token context window instead of building memory?&quot;,
    answer: &quot;Not for long-running agents. A 200K or 400K context sounds large until you do the math: six months of daily conversations with one user runs into millions of tokens. Stuffing all of it into every call is expensive and wasteful, because most of it is irrelevant to the current turn. Memory systems exist to retrieve the right slice. Long context and memory are complements, not substitutes.&quot;
  },
  {
    question: &quot;How do I evaluate whether my memory system is working?&quot;,
    answer: &quot;Start with a manual test loop. Insert a fact via the agent, pause, query memory directly to confirm storage, then resume the agent in a fresh session and ask about that fact. If it recalls correctly, retrieval works end to end. Then add harder cases: multi-hop queries that require combining two facts, temporal queries that ask whether something was true at a specific time, and behavioral checks that test whether agent decisions actually shift based on memory. Formal benchmarks like LongMemEval exist if you want to compare across systems, but they require non-trivial setup.&quot;
  },
  {
    question: &quot;Vector, graph, or hybrid? How do I choose?&quot;,
    answer: &quot;Start with vectors. They are simpler and fast, and most queries are &apos;find facts similar to this.&apos; Add graph reasoning if you discover you have multi-hop questions that vectors handle badly: &apos;find people this user knows who work in fintech.&apos; Hybrid systems like Mem0, Zep, and Cognee combine both. Pick a hybrid system from day one if you already know your queries are relationship-heavy.&quot;
  }
]} /&gt;

## Further reading

- [MemGPT: Towards LLMs as Operating Systems](https://arxiv.org/abs/2310.08560).
- [Zep: A Temporal Knowledge Graph Architecture for Agent Memory](https://arxiv.org/abs/2501.13956).
- [LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory](https://arxiv.org/abs/2410.10813).
- [RAG is not Agent Memory](https://www.letta.com/blog/rag-vs-agent-memory) (Letta blog).</content:encoded></item><item><title>What is agent evaluation?</title><link>https://tokenjam.dev/blog/2026-05-12-agent-evaluation/</link><guid isPermaLink="true">https://tokenjam.dev/blog/2026-05-12-agent-evaluation/</guid><description>Agent evaluation: measuring multi-step trajectories, tool use, and open-ended outputs. Why benchmarks alone don&apos;t tell you whether an agent works in production.</description><pubDate>Tue, 12 May 2026 00:00:00 GMT</pubDate><content:encoded>import TLDR from &apos;@/components/TLDR.astro&apos;;
import FAQBlock from &apos;@/components/FAQBlock.astro&apos;;

&lt;TLDR&gt;
- Agent evaluation splits across four categories on two axes: pre-built benchmark tasks vs. your own tasks, and OSS tooling vs. managed platforms.
- Capability benchmarks (GAIA, SWE-bench, WebArena, OSWorld) and domain benchmarks (Vals AI, Coval) bring pre-built tasks. OSS frameworks (Inspect AI, DeepEval) and commercial platforms (Braintrust, Galileo) let you write your own.
- LLM-as-judge is cheap and fast. It fails on subjective quality, novel reasoning, and on grading itself.
- All major agent benchmarks were shown reward-hackable to near-perfect scores (UC Berkeley, April 2026).
- The four eval categories rarely connect to each other or to production traces. Bridging offline eval and live behavior is where the category is currently weakest.
&lt;/TLDR&gt;

**Agent evaluation** is the practice of measuring whether an AI agent does what it&apos;s supposed to do, repeatedly, across diverse inputs: multi-step trajectories, tool use, and open-ended outputs that traditional ML evaluation doesn&apos;t capture. Single-turn language model evaluation grades one output. Agent evals must verify that an agent navigates complex environments, calls the right tools at the right time, and recovers from failures in its own reasoning.

## Why traditional ML evaluation falls short

Traditional machine learning evaluation was built around static inputs and single-turn outputs. A classifier either predicts the correct label or it doesn&apos;t. An LLM either generates the right summary or it doesn&apos;t.

Agents are different. An agent&apos;s behavior unfolds over many steps: it observes an environment, makes a decision, takes an action, observes the result, and repeats. A single step can be correct while the overall trajectory fails. An agent might call the right tool and pass it the wrong arguments. It might retrieve information correctly and then fail to synthesize it into an answer. It might get stuck in a loop and never terminate.

Agents need their own evaluation regime. Traditional eval frameworks miss what matters (a trajectory-level check is sketched after this list):

- **Multi-step trajectories:** A single correct output isn&apos;t enough if the path to get there involved hallucinating intermediate steps, calling tools redundantly, or exploring dead ends.
- **Tool use:** Did the agent call the tool? Did it use the output? Did it handle errors gracefully? Did it know when *not* to use a tool?
- **Open-ended outputs:** Many agent tasks don&apos;t have a single correct answer. Evaluation must grade on relevance and task completion, not on string matching.
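
Here is the trajectory-level check in miniature, assuming each step is logged as a `(tool, status)` pair. The shape is illustrative; real traces carry arguments, outputs, and timing.

```python
def grade_trajectory(steps: list[tuple[str, str]], final_ok: bool) -&gt; dict:
    &quot;&quot;&quot;steps: ordered (tool_name, status) pairs from one agent run.&quot;&quot;&quot;
    tools = [t for t, _ in steps]
    return {
        &quot;task_completed&quot;: final_ok,
        &quot;no_redundant_calls&quot;: len(tools) == len(set(tools)),  # naive loop check
        &quot;errors_recovered&quot;: all(
            status != &quot;error&quot; or i + 1 &lt; len(steps)   # an error step must not
            for i, (_, status) in enumerate(steps)    # be the final step
        ),
    }

print(grade_trajectory(
    [(&quot;search&quot;, &quot;ok&quot;), (&quot;fetch&quot;, &quot;error&quot;), (&quot;fetch_retry&quot;, &quot;ok&quot;)], final_ok=True))
```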

## Categories of evaluation

Agent evaluation splits into four categories along two rough axes: whether the tasks come pre-built or you write your own, and whether the tooling is public/OSS or commercial/managed. Most production teams use two or three of these at once.

|  | Pre-built tasks (run *their* tests against your agent) | Your own tasks (write the tests yourself) |
|---|---|---|
| **General-purpose / OSS** | Capability benchmarks | OSS eval frameworks |
| **Specialized / Managed** | Domain benchmarks | Commercial eval platforms |

### Capability benchmarks

Capability benchmarks are pre-built, general-purpose task suites designed to measure what agents can do at all. You point your agent at them and get a score comparable to public leaderboards:

- **GAIA:** 466 reasoning tasks (economic data lookup, currency conversion, multi-step logic) with access to web tools, file parsers, and calculators. Answers are graded as exact string matches against human-annotated ground truth.
- **SWE-bench:** 2,294 GitHub issues across 12 popular Python repositories. The benchmark measures whether agents can identify bugs, propose fixes, and pass test suites. SWE-bench Verified, a human-validated subset, contains 500 curated samples.
- **WebArena:** A self-hostable sandbox with websites mimicking real services (e-commerce, map search, content management). Agents control a simulated browser to complete tasks like booking flights or updating accounts.
- **OSWorld:** Agents receive a desktop screenshot and a natural language instruction. They must interact via mouse and keyboard, producing new screenshots with each action. The benchmark tests GUI understanding and navigation.
- **TAU-bench:** Simulates customer service interactions (airline, retail, telecom, manufacturing) where agents must handle real-time conversations, use domain-specific APIs, and follow business guidelines. Includes both text and voice modalities.
- **TerminalBench:** Agents interact with a shell environment, executing commands and navigating file systems to complete coding and system administration tasks.

### Domain benchmarks

Domain benchmarks are pre-built test suites specialized for industries where the stakes (and the task texture) differ from general-purpose evals:

- **Vals AI:** Benchmarks legal and financial AI agents on domain-specific tasks. The legal evaluation tests document Q&amp;A, redlining, transcript analysis, and legal research. The finance benchmark evaluates agents on 537 questions covering retrieval, market research, and financial modeling.
- **Coval:** An evaluation platform for voice and chat agents that simulates thousands of real-world scenarios and measures performance on domain-specific metrics (latency, conversation quality, goal completion) alongside voice-specific measurements (STT accuracy, TTS clarity).

### Open-source eval frameworks

OSS eval frameworks don&apos;t ship benchmark tasks. They ship the machinery (test runners, scoring functions, LLM-judge helpers) so you can write evals for *your* use case in code you control:

- **Inspect AI** (UK AI Safety Institute): A Python framework for scripted evals with tool calls and model-graded rubrics. Includes 200+ pre-built evaluations ready to run on any model.
- **DeepEval:** An LLM-evaluation framework similar to Pytest, offering metrics like G-Eval, task completion, answer relevancy, and hallucination detection. Runs locally without external services.
- **Promptfoo:** A CLI and Node.js tool for testing and red-teaming LLM applications. Tests are defined declaratively in YAML, making it easy to compare models and harden prompts against adversarial inputs.
- **RAGAS:** A Python library for evaluating retrieval-augmented generation (RAG) pipelines. Provides reference-free metrics for retrieval quality and generation quality, integrating with LangChain and other frameworks.

### Commercial eval platforms

Commercial eval platforms cover the same &quot;you bring the tasks&quot; workflow as OSS frameworks, with managed infrastructure, dashboards, dataset versioning, and CI/CD integration:

- **Braintrust:** A managed eval and observability platform with SDK wrappers for the OpenAI Agents SDK, LangGraph, LangChain, and CrewAI. Bundles eval definitions, scoring, and trace storage in a single hosted product.
- **Galileo:** Focuses on building a reliability stack for complex agents. Its Luna evaluation models compress expensive LLM-as-judge evaluators into compact models that run at sub-200ms latency with significantly lower cost.
- **Maxim:** An evaluation and observability platform emphasizing agent simulation. Teams can simulate agent behavior across hundreds of scenarios before production, then monitor quality in real time after deployment.
- **Patronus:** Provides runtime guardrails and evaluation for production agents, focusing on safety and compliance.
- **Confident AI:** An LLM evaluation framework that specializes in LLM-as-judge grading and evaluation pipeline management.

A practical observation cuts across all four categories: they don&apos;t connect to each other. Capability benchmark scores rarely flow into production monitoring. Production observability rarely surfaces eval regressions. Domain benchmarks live in their own dashboards. Commercial eval platforms increasingly bundle eval and production tracing, and even there most teams still wire the bridge themselves with scripts and shared spreadsheets.

## The LLM-as-judge pattern

LLM-as-judge uses an LLM (often with an evaluation prompt and rubric) to grade agent outputs. It&apos;s fast to set up and works well in specific domains.

**Where it works:**

- Factual correctness (e.g., &quot;Is this fact accurate?&quot;)
- Format compliance (e.g., &quot;Does the output follow the required schema?&quot;)
- Simple preference comparisons (e.g., &quot;Is response A better than response B?&quot;)

**Where it fails:**

- Subjective quality judgments (what constitutes a &quot;good&quot; explanation is context-dependent)
- Novel reasoning (judges can&apos;t reliably grade reasoning steps they don&apos;t understand)
- Multi-step coherence (judges may miss subtle logical inconsistencies)
- Grading the graders (different LLMs disagree on grades, introducing inconsistent scores)

The honest tradeoff: LLM-as-judge is cheap and fast, but it introduces a new failure mode. If your judge is miscalibrated, you&apos;ll optimize your agent toward the judge&apos;s biases, not toward the actual task. That is particularly dangerous in high-stakes domains (legal, financial, healthcare) where judge errors compound across decisions.
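
A sketch of a judge call that bakes in the mitigations covered in the FAQ below: temperature 0, a pinned model version, and a rubric with worked examples. `call_llm` is a stand-in for whatever client you use.

```python
RUBRIC = &quot;&quot;&quot;Score the answer 1-5 for factual correctness.
5 = fully correct (worked example: ...). 3 = partially correct (example: ...).
1 = wrong (example: ...). Reply with the number only.&quot;&quot;&quot;

def judge(question: str, answer: str, call_llm) -&gt; int:
    prompt = f&quot;{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}\nScore:&quot;
    reply = call_llm(
        prompt,
        model=&quot;claude-3-5-sonnet-20241022&quot;,  # pin the exact version, not an alias
        temperature=0,                        # remove sampling nondeterminism
    )
    return int(reply.strip())
```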

## Benchmark gaming

In April 2026, researchers at UC Berkeley&apos;s Center for Responsible, Decentralized Intelligence published findings that all eight major agent benchmarks could be reward-hacked to near-perfect scores without solving the actual tasks. Their exploit agent achieved:

- **SWE-bench Verified:** 100% (500/500) via a 10-line conftest.py that hooks into pytest and rewrites test results to &quot;passed&quot;
- **WebArena:** ~100% by navigating to file:// URLs and reading answers directly from the task config
- **GAIA:** 98% through answer leakage via public databases and answer normalization collisions
- **OSWorld:** 73% through direct environment manipulation
- **CAR-bench, FieldWorkArena, Terminal-Bench:** 100%

The paper, &quot;How We Broke Top AI Agent Benchmarks,&quot; documents vulnerability patterns systematically. The Berkeley team open-sourced their exploit toolkit as a diagnostic tool for benchmark maintainers, signaling that traditional benchmarks have fundamental measurement problems.

This echoes a broader pattern. Whenever a metric becomes a target, it ceases to be a good metric. Goodhart&apos;s law applies to agent benchmarks just as it does to college admissions tests or corporate KPIs.

## Production eval vs offline eval

Most teams discover the gap between benchmark performance and production reliability only after deployment. A 90% GAIA score doesn&apos;t guarantee a 90% success rate in production. Three reasons.

**Distribution shift:** Benchmark tasks are curated and balanced. Real-world agent queries are noisy, ambiguous, and adversarial. Agents that did well on clean benchmark data see scenarios in production they&apos;ve never encountered.

**Environment variability:** Benchmark environments are static and deterministic. Production environments have latency, failures, rate limits, and unexpected state changes. An agent that succeeds 95% of the time on a clean WebArena instance might succeed 60% of the time when handling real web services with occasional downtime.

**Feedback loops:** Benchmarks are evaluated once, offline. In production, failures compound. If an agent makes a mistake, downstream actions amplify the error. Benchmark evals don&apos;t capture this cascade.

Sophisticated teams close the gap with multi-layer evaluation (a regression-check sketch follows the list):

- **Shadow evaluation:** Run the agent on real production queries but don&apos;t act on its outputs. Grade the outputs against human ground truth. This reveals how well the agent generalizes without risk.
- **Regression evaluation:** After an agent ships, run periodic evals on a fixed set of known tasks. This catches drift: did this week&apos;s model still handle the tasks it handled last week?
- **A/B evaluation:** Compare agent versions on the same real queries in production. Measure not just task completion but also latency, human intervention rate, and user satisfaction.
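
The regression layer in miniature. `run_agent` and `grade` are stand-ins for your own harness; the golden file format is illustrative.

```python
import json

def regression_check(golden_path: str, run_agent, grade,
                     threshold: float = 0.9) -&gt; bool:
    &quot;&quot;&quot;golden_path: JSON list of {&apos;input&apos;: ..., &apos;expected&apos;: ...} frozen tasks.&quot;&quot;&quot;
    with open(golden_path) as f:
        tasks = json.load(f)
    passed = sum(grade(run_agent(t[&quot;input&quot;]), t[&quot;expected&quot;]) for t in tasks)
    rate = passed / len(tasks)
    print(f&quot;golden-set pass rate: {rate:.0%} ({passed}/{len(tasks)})&quot;)
    return rate &gt;= threshold  # gate the deploy on this in CI
```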

Offline benchmarks tell you what your agent could do on curated tasks. Production evals tell you what it actually does in the wild. Most teams have the first and very little of the second, and the bridge between them is where the category is currently weakest.

## Common questions

&lt;FAQBlock items={[
  {
    question: &quot;How do I write evals for my agent?&quot;,
    answer: &quot;Start narrow. Pick 5 to 10 representative tasks your agent should handle. Grade them manually or with a human rubric. Then decide: can this be graded programmatically (function call, exit code)? If yes, automate it. If no, build a lightweight annotation interface and grade manually or with a jury of evaluators. Once you have a working eval, expand to 50 to 100 tasks. Then scale with LLM-as-judge or a pre-built framework like DeepEval or Inspect AI.&quot;
  },
  {
    question: &quot;Why do my LLM-as-judge scores swing by 20% on the same agent output?&quot;,
    answer: &quot;Judge nondeterminism. A few causes. First, sampling: if your judge runs at temperature greater than 0, repeated calls produce different scores on identical inputs. Set temperature to 0 and seed the model if your provider supports it. Second, the judge prompt is underspecified: if you ask &apos;is this answer correct?&apos; without a rubric, the judge hallucinates criteria differently each time. Pin the rubric with worked examples (good answer A scores 5, partial answer B scores 3, wrong answer C scores 1). Third, position bias: when the judge compares two responses, A vs. B, it tends to favor whichever comes first. Randomize the order on each comparison. Fourth, model drift: a managed LLM judge can change behavior week-to-week as the provider updates the model. Pin the model version (for example claude-3-5-sonnet-20241022 rather than claude-3-5-sonnet) in your eval config.&quot;
  },
  {
    question: &quot;My agent passes the eval suite but breaks on the first real user query. What&apos;s going on?&quot;,
    answer: &quot;Two failure modes that compound. First, your eval set doesn&apos;t match the production distribution. Eval tasks tend to be clean and well-specified. Real user queries are messy: misspellings, ambiguous goals, missing context, follow-up questions that reference earlier turns. If your evals are single-turn and your production agent gets multi-turn conversations, the eval was measuring a different thing. Second, your eval set has leaked into the training data or prompt. Some teams iterate on the same eval tasks long enough that the agent (or the prompt) gets overfit to them. Diagnostic: take five real user queries from the last 48 hours, add them to the eval suite, and watch the pass rate drop. The fix is continuous. Pull a fresh sample of real queries each week and add them to the eval set, while keeping a frozen &apos;golden&apos; subset to detect regressions.&quot;
  },
  {
    question: &quot;How do I connect my offline eval scores to what&apos;s happening in production?&quot;,
    answer: &quot;Most teams don&apos;t, and the gap is the source of a lot of bad surprises. The standard hack: tag each production trace with the model version and prompt version that produced it, then run your eval suite against the same prompt/model combo whenever you ship a change. That gives you correlation, not causation. Enough to catch regressions before users do. The harder version is matching the distribution of production queries (not just rerunning your eval set), which means sampling real queries weekly into your eval suite and watching the pass rate. Teams that do this seriously end up with three concentric eval rings: a frozen golden set (regression detection), a sliding production sample (distribution tracking), and the underlying benchmark (capability ceiling). Few products thread these together cleanly today. Most teams build the connection themselves with scripts and shared dashboards.&quot;
  }
]} /&gt;

## Further reading

- UC Berkeley&apos;s &quot;How We Broke Top AI Agent Benchmarks&quot; (rdi.berkeley.edu, April 2026). Comprehensive analysis of benchmark vulnerabilities with open-source exploit toolkit.
- Anthropic&apos;s &quot;Demystifying evals for AI agents&quot;. Practical guidance on eval strategy for multi-step agent tasks.
- Thoughtworks&apos; &quot;LLM benchmarks, evals and tests: A mental model&quot;. Framework for thinking about the differences between benchmarks, evals, and production tests.

See also:

- [What is an AI agent?](/blog/2026-05-08-agents-101)
- [What is agent observability?](/blog/2026-05-09-agent-observability)</content:encoded></item><item><title>What is an LLM gateway?</title><link>https://tokenjam.dev/blog/2026-05-11-llm-gateways/</link><guid isPermaLink="true">https://tokenjam.dev/blog/2026-05-11-llm-gateways/</guid><description>LLM gateways unify provider APIs, add fallbacks and caching, and centralize key management: what they do, when you need one, and the tools that exist.</description><pubDate>Mon, 11 May 2026 00:00:00 GMT</pubDate><content:encoded>import TLDR from &apos;@/components/TLDR.astro&apos;;
import FAQBlock from &apos;@/components/FAQBlock.astro&apos;;

&lt;TLDR&gt;
- An LLM gateway is a unified API layer that abstracts away provider differences, sitting between your app and OpenAI, Anthropic, Bedrock, or any other LLM provider.
- Core functions: single API interface, key management, rate limiting, automatic fallbacks, retries, response caching, and observability of API traffic.
- Teams adopt them to reduce vendor lock-in, control costs, swap models without code changes, and gain visibility into LLM usage.
- Gateways and observability tools are converging, though they solve different problems: routing decisions vs. measurement.
- You need one if you use multiple providers or run agents in production; single-provider hobby projects don&apos;t require one.
&lt;/TLDR&gt;

An LLM gateway is a unified API layer that sits between your application and one or more LLM providers, abstracting provider-specific APIs into a single interface and adding cross-cutting concerns like routing, fallbacks, key management, and caching. Instead of writing integration code for OpenAI&apos;s SDK, then Anthropic&apos;s, then AWS Bedrock&apos;s, you write once against a gateway and let it handle the details of talking to each provider.

A gateway acts as a reverse proxy for LLM traffic. It intercepts your requests, enforces policies, applies transformations, routes to the appropriate backend, and logs what happened. Some gateways are thin (basic API translation). Others are thick (guardrails, cost control, agentic features).
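
Most gateways expose an OpenAI-compatible endpoint, so integration is a base-URL change. A sketch using the OpenAI Python SDK pointed at a gateway running locally (the URL, key, and model name are placeholders for your deployment):

```python
from openai import OpenAI

# Point the standard SDK at the gateway instead of api.openai.com.
client = OpenAI(
    base_url=&quot;http://localhost:4000/v1&quot;,  # your gateway&apos;s endpoint
    api_key=&quot;gateway-virtual-key&quot;,        # gateway-issued key, not a provider key
)

# The gateway maps the model name to a provider and applies its policies.
resp = client.chat.completions.create(
    model=&quot;claude-3-5-sonnet&quot;,
    messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Hello&quot;}],
)
print(resp.choices[0].message.content)
```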

## Why teams adopt gateways

**Provider diversification without integration overhead.** Writing per-provider SDKs into production code couples your app to each provider&apos;s API shape. A gateway abstracts that away. You call one interface and can swap providers, or add fallbacks, without touching your application code.

**Cost control and transparency.** Production LLM spend tends to surprise teams. A gateway intercepts every request and can log cost, tokens consumed, latency, and error rates in one place. Some gateways support per-user budgets and per-model cost caps with real-time spend tracking. When a cheaper model works as well as an expensive one, you can route to it and measure the impact.

**Fallback routing when providers fail.** If your primary provider is rate-limited or down, a gateway can automatically retry against a secondary provider. This is critical for production agents where &quot;sorry, OpenAI is having issues&quot; is not an acceptable failure mode.

**Model swaps without code changes.** A/B testing models, rolling out new ones, or deprecating old ones becomes a configuration change in the gateway instead of a deployment. This matters when you&apos;re optimizing for latency, cost, or quality and need to experiment quickly.

**Simpler developer experience.** One API. One set of credentials to manage. One place to understand rate limits and retry behavior. Valuable when many developers and services are calling LLMs.

## Core capabilities

Most gateways provide a baseline set of features:

- **Unified API:** an OpenAI-compatible REST API or SDK, so clients don&apos;t have to change when you swap providers.
- **Key management:** credentials for each provider are stored centrally (encrypted at rest), so your application never sees raw API keys. This reduces credential sprawl and the risk of keys leaking in logs or code.
- **Rate limiting and quotas:** enforce per-user, per-team, or per-model rate limits and budget caps. When limits are hit, the gateway rejects requests gracefully instead of letting them propagate to the provider and incur charges.
- **Automatic fallback and retries:** if a request fails, the gateway can retry the same provider, fall back to an alternate provider, or both. Configurable policies let you set retry counts, backoff strategies, and provider precedence (a sketch of the pattern follows this list).
- **Request and response caching:** cache identical or semantically similar requests to avoid redundant API calls. Some gateways support semantic caching (caching based on meaning, not exact string match), which saves cost and latency for agents that reuse context.
- **Traffic observability:** log every request: latency, tokens (input and output), cost, error messages, provider, model used, user or app identifier. This data is what makes debugging, cost allocation, and performance monitoring possible.
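
Stripped of any particular gateway&apos;s config format, the fallback-and-retry policy reduces to a few lines (the provider callables and error type are illustrative):

```python
import time

class TransientError(Exception):
    &quot;&quot;&quot;Stand-in for retryable failures: 5xx, timeouts, rate limits.&quot;&quot;&quot;

def complete_with_fallback(prompt: str, providers: list, retries: int = 2) -&gt; str:
    &quot;&quot;&quot;providers: ordered callables, primary first (provider precedence).&quot;&quot;&quot;
    for call in providers:
        for attempt in range(retries):
            try:
                return call(prompt)          # success: stop here
            except TransientError:
                time.sleep(2 ** attempt)     # exponential backoff, then retry
        # retries exhausted on this provider: fall back to the next one
    raise RuntimeError(&quot;all providers exhausted&quot;)
```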

## The convergence: gateways and observability

The boundary between gateways and observability platforms has blurred. Many gateways now ship observability features. Many observability platforms now offer proxy-like routing. Conceptually they solve different problems.

**Gateways make routing decisions.** They choose which provider to call. They decide when to retry, and when to fall back. They&apos;re about control: deciding what happens to each request.

**Observability measures what happened.** It captures latency, tokens, cost, errors, and other metrics. It&apos;s about visibility: understanding the system after the fact.

A concrete example. You deploy a gateway configured to route 80% of requests to Claude Opus and 20% to a cheaper model. The gateway controls the routing (the gateway&apos;s job). Your observability platform logs each request and shows you that the cheaper model has a higher error rate and longer latency (observability&apos;s job). Armed with that data, you adjust the split to 90/10.

[Helicone](https://helicone.ai) is the canonical example of a hybrid: a proxy gateway (with routing and fallback logic) and an observability-first platform (with dashboards, evals, and experiment tracking built in). The two functions remain distinct. Many teams run a lightweight gateway (like LiteLLM) alongside a separate observability tool, or vice versa.

## Notable tools

- **[LiteLLM](https://litellm.ai)** is an open-source Python SDK and proxy server supporting 100+ providers (OpenAI, Anthropic, Bedrock, Vertex AI, Cohere, and more) via an OpenAI-compatible interface, with built-in cost tracking, guardrails, load balancing, and logging. Highly extensible and widely used in production.
- **[OpenRouter](https://openrouter.ai)** is a managed gateway exposing 500+ models from 60+ providers through a single OpenAI-compatible API, with intelligent routing based on cost, latency, and availability, plus automatic failover when providers are down or rate-limited. Ideal for teams that want to use many models without managing multiple accounts and keys.
- **[Vercel AI Gateway](https://vercel.com/docs/ai-gateway)** is a Vercel-native managed gateway supporting bring-your-own-key (BYOK) authentication with no additional markup, tightly integrated with the Vercel AI SDK and built for fast model iteration in production applications.
- **[Cloudflare AI Gateway](https://developers.cloudflare.com/ai-gateway/)** is edge-deployed across Cloudflare&apos;s 330-city network, supporting multiple providers (OpenAI, Anthropic, Hugging Face, Bedrock, and more) with caching, rate limiting, retries, and model fallback. Particularly strong for latency-sensitive workloads due to edge placement.
- **[Portkey](https://portkey.ai)** is a managed gateway and production control plane supporting 1,600+ LLMs with enterprise-grade governance (RBAC, SSO, granular budgets), compliance certifications (SOC2, ISO 27001, GDPR, HIPAA), and deployment options (SaaS, hybrid, or air-gapped). Designed for teams with strict security and audit requirements.
- **[Bifrost](https://docs.getbifrost.ai)** is a high-performance open-source gateway written in Go, optimized for low-latency, high-RPS workloads (the project claims roughly 40x lower latency than LiteLLM), supporting 15+ providers and offering adaptive load balancing, clustering, and guardrails. Best for teams that prioritize infrastructure performance over feature breadth.
- **[Helicone](https://helicone.ai)** is an open-source hybrid observability and gateway platform with a cloud-hosted API and self-hosted options, supporting 100+ providers with zero-markup pricing and built-in evals, experiments, and monitoring dashboards. Strong choice if you want observability and routing in one platform.
- **[Kong AI Gateway](https://konghq.com/products/kong-ai-gateway)** is an enterprise API gateway with dedicated AI connectivity features, supporting LLM, MCP, and A2A routing with usage analytics, provider-agnostic routing, and deployment options (Konnect SaaS or self-hosted Enterprise). Designed for large organizations already using Kong for general API management.

## When you need a gateway (and when you don&apos;t)

### You probably don&apos;t need one if

- You are building a toy or hobby project that only calls one provider (e.g., a ChatGPT wrapper for personal use).
- You are prototyping and don&apos;t care about observability, fallbacks, or cost tracking yet.
- Your application is so simple that it doesn&apos;t benefit from vendor diversification or cost control.

### You almost certainly do need one if

- You are running agents or applications in production, even single-provider, because you want observability and automatic retries.
- You use multiple LLM providers and want a single interface without per-provider integration overhead.
- You need to control costs: budget per user, per team, or globally; route to cheaper models when quality permits; or measure spend by application.
- You want to experiment with models or providers without code changes. Swapping models should be a config update, not a deployment.
- You need automatic fallback routing: if provider A is down or rate-limited, automatically try provider B.
- You are building multi-tenant applications or services where different customers may have different provider preferences or budgets.

If your LLM use is scattered across multiple services and providers, a gateway pays for itself in visibility and operational safety.

## Common questions

&lt;FAQBlock items={[
  {
    question: &quot;Why is my gateway slower than calling the LLM provider directly?&quot;,
    answer: &quot;Three common causes. First, the gateway hop adds network latency: every request now goes through an extra service before reaching the provider. A well-deployed gateway adds single-digit milliseconds. A misconfigured one (deployed in a different region than the provider&apos;s endpoint, behind a slow load balancer, or running on undersized hardware) can add 50 to 200ms. Second, retry logic: if your gateway is configured to retry on transient errors, a flaky connection that succeeds on retry will look like the gateway added latency, when really the gateway hid an underlying problem. Third, semantic-cache overhead: a misconfigured cache that runs similarity search on every request can add 20 to 50ms even when nothing is cached. Profile with the gateway disabled and enabled, compare p50/p95/p99 across the same workload. Most &apos;gateway is slow&apos; complaints turn out to be retries firing on a flaky provider, not the gateway itself.&quot;
  },
  {
    question: &quot;Why does my gateway keep falling back when the primary provider works fine?&quot;,
    answer: &quot;Most fallback policies fire on any error, not only on the ones you care about. The usual culprits: rate-limit response headers from the provider that the gateway misreads as failure; transient timeouts because your gateway&apos;s timeout is set tighter than the provider&apos;s actual p99; &apos;soft&apos; errors like a malformed completion that the gateway counts as a hard failure. Open the gateway&apos;s trace logs and look at which provider got hit, which error fired, and what the policy did with it. Most gateways let you scope fallback policies to specific error codes (5xx and rate-limit responses, but not 4xx auth/validation errors). If the same fallback fires for thousands of requests in a row, it&apos;s almost always a misconfigured policy rather than a real provider outage.&quot;
  },
  {
    question: &quot;Do I need a gateway if I&apos;m only using one provider?&quot;,
    answer: &quot;Not strictly. If you use OpenAI exclusively and don&apos;t care about observability or fallbacks, calling their SDK directly is simpler. Even single-provider deployments benefit from a lightweight gateway for rate limiting, cost tracking, and automatic retries. Many teams find that a thin gateway (like LiteLLM in passthrough mode) adds minimal overhead and meaningful resilience.&quot;
  },
  {
    question: &quot;How much latency does a gateway add?&quot;,
    answer: &quot;It depends. A well-designed gateway adds single-digit milliseconds (Bifrost claims under 100 microseconds of overhead). A poorly designed one can add hundreds of milliseconds. If latency-sensitive workloads matter, benchmark the specific gateway and configuration in your environment. Edge-deployed gateways like Cloudflare&apos;s can reduce latency by sitting closer to users.&quot;
  },
  {
    question: &quot;Can I self-host a gateway?&quot;,
    answer: &quot;Yes. LiteLLM, Helicone, Bifrost, and Kong are all open-source and self-hostable. Vercel and Portkey offer managed services that also support self-hosted or hybrid deployment. Self-hosting trades operational overhead (you run the infrastructure and handle upgrades) for control and privacy. Many teams start with a managed gateway and self-host later.&quot;
  },
  {
    question: &quot;What&apos;s the difference between a gateway and an API management platform?&quot;,
    answer: &quot;API management platforms (Kong, Apigee, AWS API Gateway) are general-purpose tools for managing HTTP API traffic: RESTful services, webhooks, microservices, and the rest. They can be used as gateways for LLM traffic. They aren&apos;t LLM-specific. LLM gateways are purpose-built to handle things like model routing, token accounting, provider-specific quirks, and cost tracking. Many teams use an API management platform for general infrastructure and a specialized LLM gateway for LLM traffic.&quot;
  }
]} /&gt;

## Further reading

- [LiteLLM GitHub](https://github.com/BerriAI/litellm). Source for the most widely deployed open-source gateway.
- [LiteLLM Docs](https://docs.litellm.ai). Configuration and deployment guides.
- [OpenRouter Docs](https://openrouter.ai/docs). Model catalog and routing configuration.
- [Vercel AI Gateway](https://vercel.com/docs/ai-gateway). Integration with the Vercel AI SDK.
- [Cloudflare AI Gateway Docs](https://developers.cloudflare.com/ai-gateway/). Edge deployment and multi-provider setup.
- [Portkey AI Gateway](https://portkey.ai). Enterprise features and deployment models.
- [Bifrost Documentation](https://docs.getbifrost.ai). Performance benchmarks and Go-based architecture.
- [Helicone GitHub](https://github.com/Helicone/helicone). Source for the observability-focused hybrid platform.
- [Kong AI Gateway](https://konghq.com/products/kong-ai-gateway). Enterprise API management for AI connectivity.

See also: [What is agent observability?](/blog/2026-05-09-agent-observability). A post on agent token economics is forthcoming.
import FAQBlock from &apos;@/components/FAQBlock.astro&apos;;

&lt;TLDR&gt;
- OpenTelemetry is the CNCF standard for vendor-neutral observability instrumentation. Write once, export anywhere.
- Three core components: SDKs (in your code), OTLP (the wire protocol), and collectors/backends (where data lives).
- The GenAI semantic conventions define a shared schema for LLM traces: `gen_ai.request.model`, `gen_ai.usage.input_tokens`, and others. They&apos;re actively evolving but already widely adopted.
- Claude Code natively emits OTLP traces with `CLAUDE_CODE_ENABLE_TELEMETRY=1`; agent frameworks like LangChain, LlamaIndex, and others follow the same pattern.
- Lock-in is the real problem OTel solves. Instrument once, and any OTel-aware backend can consume your traces without re-architecting.
&lt;/TLDR&gt;

OpenTelemetry is the Cloud Native Computing Foundation&apos;s standard for collecting and exporting observability signals (traces, metrics, and logs) from applications. Instead of locking you into a single vendor&apos;s telemetry format, OpenTelemetry defines how applications emit telemetry data in a vendor-neutral way, using the OpenTelemetry Protocol (OTLP). You instrument your code once and can send your telemetry to any compatible backend: Datadog, New Relic, Grafana, Jaeger, or any other system that speaks OTLP.

## The three components: SDKs, OTLP, and backends

Three moving parts.

### 1. SDKs (instrumentation in your code)

An OpenTelemetry SDK is a library that runs in your application. It collects traces, metrics, and logs from your code and hands them off for export. You install it, configure it, and call its APIs (or rely on auto-instrumentation) to emit telemetry. For Python agents, the OpenTelemetry Python SDK is the foundation. For TypeScript, OpenTelemetry JavaScript serves the same purpose.

SDKs do the heavy lifting. They manage span lifecycle, batch telemetry, apply sampling policies, and handle backpressure when backends are slow. Different instrumentation libraries (for LangChain, Anthropic, Ollama, and others) sit on top of an SDK and emit standardized spans into it.

### 2. OTLP: the wire protocol

OTLP (OpenTelemetry Protocol) is how telemetry gets from your SDK to a backend. OTLP runs over gRPC or HTTP/1.1, uses Protocol Buffers for encoding, and specifies backpressure handling and retry semantics. You don&apos;t think about OTLP directly. It&apos;s configured via environment variables like `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS`. It&apos;s the contract between your SDK and any backend that claims OpenTelemetry support.

### 3. Collectors and backends

An OpenTelemetry collector is a standalone service that receives telemetry data via OTLP and routes it to backends, applies transformations, and handles batching at scale. A backend (Datadog, Grafana Tempo, Jaeger, Honeycomb, and others) stores and queries your traces. You can skip the collector for small workloads. Many apps export directly to a cloud backend via OTLP. Collectors give you flexibility: they let you filter and enrich telemetry before it hits your backend, and they buffer data when backends are slow.
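
The three pieces in code, using the standard opentelemetry-python packages: the SDK does the instrumentation, the OTLP exporter is the wire protocol, and whatever listens at the endpoint is the backend (the endpoint here is a placeholder):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# SDK: manages span lifecycle, batching, and export.
provider = TracerProvider()
# OTLP: ships spans to a collector or backend over gRPC.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=&quot;http://localhost:4317&quot;))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(&quot;my.agent&quot;)
with tracer.start_as_current_span(&quot;agent.step&quot;):
    pass  # your agent logic here
```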

## Why agents need OpenTelemetry specifically

Vendor lock-in is real for observability. If you instrument your agent to emit telemetry in Datadog&apos;s proprietary format, switching to New Relic means rewriting instrumentation across your codebase. For organizations with many agents and teams, this tax is enormous.

OpenTelemetry fixes this by making the instrumentation the constant, not the vendor. Your agent code emits OTLP. Your backend is the variable. You can migrate backends, or use multiple backends simultaneously, without touching your instrumentation layer.

This matters more for agent teams because agent complexity is growing. A modern agent traces LLM calls, tool invocations, retrieval steps, and agent reasoning across multiple frameworks and runtimes. A shared observability standard means you&apos;re not training teams to emit telemetry differently for each agent tool; they all follow the same conventions.

## The GenAI semantic conventions

OpenTelemetry includes a specification for semantic conventions: standardized attribute names and meanings that make spans interoperable across backends. For generative AI, the [OpenTelemetry GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) define how to structure traces from LLM calls and agent steps.

Key attributes include:

- `gen_ai.system`: the GenAI system or LLM provider (e.g., `openai`, `anthropic`, `ollama`).
- `gen_ai.request.model`: the name of the model being invoked (e.g., `claude-3-5-sonnet`).
- `gen_ai.operation.name`: the operation type (e.g., `chat`, `completion`, `embedding`).
- `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`: token counts from the LLM response.
- `gen_ai.response.id`: the response ID from the model provider.
- `gen_ai.agent.id`, `gen_ai.agent.name`, `gen_ai.agent.version`: identity and version of the agent.
- `gen_ai.conversation.id`: unique identifier for a conversation thread (for multi-turn traces).

These conventions are actively evolving; the specification is not frozen. That&apos;s by design. As new use cases emerge (tool use, function calling, retrieval-augmented generation, multi-agent coordination), the spec grows. Tools that adopt the conventions now benefit immediately. They gain interoperability across backends even as the spec matures.

Adopting these conventions in your agent instrumentation means any OpenTelemetry-aware backend can parse and query your traces without custom parsing logic. You get consistent dashboards and analytics across vendors.
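
Setting the conventions on a hand-instrumented span, reusing the `tracer` from the setup sketch above (attribute values are illustrative):

```python
with tracer.start_as_current_span(&quot;chat claude-3-5-sonnet&quot;) as span:
    span.set_attribute(&quot;gen_ai.system&quot;, &quot;anthropic&quot;)
    span.set_attribute(&quot;gen_ai.operation.name&quot;, &quot;chat&quot;)
    span.set_attribute(&quot;gen_ai.request.model&quot;, &quot;claude-3-5-sonnet&quot;)
    # ... make the LLM call here, then record usage from the response:
    span.set_attribute(&quot;gen_ai.usage.input_tokens&quot;, 412)
    span.set_attribute(&quot;gen_ai.usage.output_tokens&quot;, 187)
```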

## How real agent runtimes emit OTel today

OpenTelemetry adoption in the agent ecosystem is accelerating. Concrete examples:

### Claude Code

Claude Code natively emits OpenTelemetry traces when you set the `CLAUDE_CODE_ENABLE_TELEMETRY=1` environment variable. You then configure where traces go using standard OTEL environment variables:

```bash
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-backend.example.com:4317
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```

For full configuration details, see the [Claude Code environment variables documentation](https://docs.claude.com/en/docs/claude-code/settings).

### LangChain

LangChain supports OpenTelemetry instrumentation via the `opentelemetry-instrumentation-langchain` package. You instrument your LangChain app and export via any OTLP-compatible backend:

```python
from opentelemetry.instrumentation.langchain import LangchainInstrumentor

LangchainInstrumentor().instrument()
```

Traces follow the GenAI conventions, so your LangChain chains are observable across any OpenTelemetry-aware platform.

### LlamaIndex + OpenInference

LlamaIndex integrates with [OpenInference](https://github.com/Arize-ai/openinference), a set of conventions built on top of OpenTelemetry for AI observability. OpenInference spans are valid OTLP traces, so you get the same portability as native OpenTelemetry.

### OpenLLMetry by Traceloop

[OpenLLMetry](https://www.traceloop.com/docs/openllmetry) is a collection of OpenTelemetry instrumentations for LLM apps. It provides ready-made instrumentation for LangChain, Anthropic, Ollama, Pinecone, Qdrant, and many other LLM-adjacent services. Because it&apos;s built on OpenTelemetry, any instrumentation you install works with any OTLP backend.

## Notable tools and SDKs

The OpenTelemetry ecosystem for agents includes:

- **OpenTelemetry Python SDK:** the core SDK for Python agents. Use this as your foundation for any Python-based agent instrumentation.
- **OpenTelemetry JavaScript SDK:** the equivalent for Node.js and browser-based agents.
- **OpenLLMetry:** pre-built instrumentations for LangChain, Anthropic, OpenAI, LlamaIndex, Ollama, Qdrant, and others. Reduces boilerplate if your agent uses popular frameworks.
- **OpenInference by Arize:** a semantic convention and instrumentation set for AI workloads. Integrates with OpenTelemetry and works with any OTel backend, including Arize Phoenix, Jaeger, and Datadog.
- **Phoenix by Arize:** an open-source observability tool for ML and LLM apps that consumes OpenInference (and thus OpenTelemetry) traces.
- **Collector distributions:** OpenTelemetry Collector is the standard. Vendor-specific distributions (e.g., Datadog Agent, New Relic Agent) also speak OTLP.

## Common questions

&lt;FAQBlock items={[
  {
    question: &quot;Why are my LLM calls showing up as HTTP spans instead of GenAI spans?&quot;,
    answer: &quot;You probably have base HTTP instrumentation without an LLM-aware layer on top. The default OpenTelemetry HTTP instrumentation captures your LLM API calls as plain HTTP spans (POST /v1/messages, 200 OK, 142ms). They show up. They&apos;re just missing the actually-useful attributes: model name, token counts, response ID. To get GenAI semantic-convention spans, install an LLM-aware instrumentor: OpenLLMetry&apos;s Anthropic or OpenAI instrumentor, OpenInference, or use a framework that emits GenAI spans natively (Claude Code, LangChain via its OTel package). Install the instrumentor (e.g., opentelemetry-instrumentation-anthropic from OpenLLMetry) and initialize it before your code creates the LLM client. After that, calls to client.messages.create() should produce gen_ai.* spans alongside the HTTP spans, and you can filter on gen_ai.system in your backend.&quot;
  },
  {
    question: &quot;Which OpenTelemetry SDK should I use with my agent framework?&quot;,
    answer: &quot;Depends on your language and framework. Python agents use the OpenTelemetry Python SDK. If you&apos;re on LangChain, LlamaIndex, or another framework, look for that framework&apos;s OTel instrumentation package first (via OpenLLMetry or framework-native support). If no instrumentation exists, you can hand-instrument your code using the SDK directly.&quot;
  },
  {
    question: &quot;What&apos;s OTLP?&quot;,
    answer: &quot;OTLP is the OpenTelemetry Protocol: the wire format and transport mechanism for sending telemetry data from your SDK to a collector or backend. It&apos;s built on Protocol Buffers and runs over gRPC or HTTP/1.1. You don&apos;t configure OTLP directly. You set environment variables like OTEL_EXPORTER_OTLP_ENDPOINT to point your SDK at a backend.&quot;
  },
  {
    question: &quot;Does setting CLAUDE_CODE_ENABLE_TELEMETRY=1 send my data to Anthropic?&quot;,
    answer: &quot;No. The flag tells Claude Code to emit OpenTelemetry traces to whatever OTLP endpoint you configure via OTEL_EXPORTER_OTLP_ENDPOINT. If you don&apos;t set an endpoint, the SDK has nowhere to send them and they&apos;re dropped on the floor. Anthropic doesn&apos;t receive your traces from this path. That&apos;s distinct from Anthropic&apos;s usage-and-billing telemetry, which is sent to Anthropic regardless of the OTel flag because it&apos;s how the API gets metered. The OTel data is for you: send it to Datadog, Grafana, a local Jaeger, or wherever you run observability.&quot;
  },
  {
    question: &quot;How do I set up telemetry export in my agent?&quot;,
    answer: &quot;Standard pattern: install the OpenTelemetry SDK for your language, install instrumentation packages for your frameworks (LangChain, Anthropic, and so on), initialize the instrumentation in your agent startup code, then set OTEL environment variables to point at your backend. The variables you&apos;ll need are OTEL_EXPORTER_OTLP_ENDPOINT (your backend&apos;s OTLP endpoint), OTEL_EXPORTER_OTLP_PROTOCOL (grpc or http/protobuf), and OTEL_EXPORTER_OTLP_HEADERS (auth headers, if needed). Your backend&apos;s documentation will list the specific OTLP endpoint URL to use.&quot;
  },
  {
    question: &quot;Can I use OpenTelemetry with multiple backends simultaneously?&quot;,
    answer: &quot;Yes. Configure multiple exporters in your SDK, or use an OpenTelemetry Collector to fan telemetry out to multiple destinations. Common during a backend migration, or when you want redundancy.&quot;
  }
]} /&gt;

## Further reading

- [OpenTelemetry GenAI semantic conventions specification](https://opentelemetry.io/docs/specs/semconv/gen-ai/).
- [OTLP protocol specification](https://opentelemetry.io/docs/specs/otlp/).
- [Claude Code environment variables and telemetry setup](https://docs.claude.com/en/docs/claude-code/settings).
- [OpenLLMetry documentation and instrumentations](https://www.traceloop.com/docs/openllmetry).
- [OpenInference specification for AI observability](https://github.com/Arize-ai/openinference).

See also: [What is agent observability?](/blog/2026-05-09-agent-observability), [Agents 101: Reasoning, Actions &amp; Autonomy](/blog/2026-05-08-agents-101).</content:encoded></item><item><title>What is agent observability?</title><link>https://tokenjam.dev/blog/2026-05-09-agent-observability/</link><guid isPermaLink="true">https://tokenjam.dev/blog/2026-05-09-agent-observability/</guid><description>How AI agent observability works: capturing tool calls, token costs, traces, and behavioral patterns at production scale.</description><pubDate>Sat, 09 May 2026 00:00:00 GMT</pubDate><content:encoded>import TLDR from &apos;@/components/TLDR.astro&apos;;
import FAQBlock from &apos;@/components/FAQBlock.astro&apos;;

&lt;TLDR&gt;
- Agent observability captures what an agent did (tool calls, token costs, latency, reasoning chains) at detail sufficient to debug and audit behavior in production.
- Traditional logs and metrics aren&apos;t enough; you need traces that record the LLM&apos;s step-by-step decisions, tool invocations, and outcomes.
- Agents are harder to observe than services because of nondeterminism, deeply nested calls, prompts and completions as data, and vocabulary that didn&apos;t exist three years ago.
- OpenTelemetry GenAI semantic conventions are the emerging standard for agent telemetry.
&lt;/TLDR&gt;

Agent observability is the practice of capturing what an AI agent did (tool calls, token costs, behavioral patterns, and outcomes) at a level of detail sufficient to debug, optimize, and audit agent behavior in production. You record the agent&apos;s full journey: every decision point, every tool invocation, every LLM call with inputs and outputs, latencies, costs, and errors. Service observability captures what your code did. Agent observability captures the reasoning chain itself: the sequence of thoughts and decisions that led the agent to act.

## Why agent observability is harder than service observability

Service observability is built on a predictable model. A request comes in, your code executes a series of steps, a response goes out. Each step is deterministic. Logs tell you what happened. Metrics tell you how long it took and whether it succeeded.

Agents break this model.

**Nondeterminism is the core problem.** The same input to an agent with the same model and parameters might produce different outputs on different runs. The LLM samples from a probability distribution. You can&apos;t debug an agent from logs alone. You have to capture the complete trace of that specific run to understand what reasoning led to that specific output.

**Tool calls are deeply nested.** A service call stack might be five or ten levels deep. An agentic system can have an agent call a tool, which triggers a retrieval operation, which calls an embedding model, which calls a database, which triggers another tool. The nesting is deep and irregular. A trace that doesn&apos;t capture every step in this chain will miss the real bottleneck.

**Prompts and completions are your actual data.** In a service, your data is SQL queries and JSON payloads. In an agent, your data is the prompt sent to the LLM and the completion it returned. These are large and unstructured. They&apos;re often sensitive: they contain user context, proprietary information, internal state. Traditional logging systems don&apos;t handle this well. Observability for agents has to be built around capturing and safely storing these artifacts.

**The vocabulary didn&apos;t exist three years ago.** Terms like &quot;token usage,&quot; &quot;tool selection,&quot; &quot;context window,&quot; and &quot;hallucination&quot; are specific to the agentic context. Existing APM tools (application performance monitoring tools like Datadog, New Relic, Dynatrace) were built for microservices. They have no native concept of an LLM call, a token count, or a tool invocation. Shoehorning agent data into these systems works. It&apos;s also awkward.

## The three pillars, adapted for agents

Observability has three pillars: traces, metrics, logs. The definitions shift when you apply them to agents.

**Traces** capture the complete execution path of a request. In a microservice, a trace is a sequence of function calls and RPC hops. In an agent, a trace is the agent&apos;s full journey: the user input, each LLM call (with prompt and completion), each tool invocation and result, latency at each step, token usage at each step, and the final output. A trace is the highest-fidelity record you have. It answers questions like &quot;Why did the agent choose tool X instead of tool Y?&quot; or &quot;Where did the latency spike occur?&quot;

**Metrics** are aggregations: counts and percentiles. In services, you track request latency, error rate, throughput. For agents, you track cost per request (sum of token usage × model pricing), latency per LLM call, tool invocation frequency, error rates (both LLM errors and tool errors), and token efficiency (useful output tokens vs. wasted context). Metrics let you spot trends over time and set up alerts when something goes wrong at scale.
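
As a concrete version of that cost metric, here is the token usage × model pricing sum as code; the prices are placeholders, not real rates:

```python
# Hypothetical per-million-token prices; substitute your model&apos;s actual rates.
PRICE_PER_MTOK = {&quot;input&quot;: 3.00, &quot;output&quot;: 15.00}

def request_cost(llm_calls: list[dict]) -&gt; float:
    # Cost per request = sum of token usage × model pricing over every LLM call.
    return sum(
        call[&quot;input_tokens&quot;] / 1_000_000 * PRICE_PER_MTOK[&quot;input&quot;]
        + call[&quot;output_tokens&quot;] / 1_000_000 * PRICE_PER_MTOK[&quot;output&quot;]
        for call in llm_calls
    )

# Two LLM calls in one agent request: $0.1125 at these placeholder rates.
print(request_cost([
    {&quot;input_tokens&quot;: 12_000, &quot;output_tokens&quot;: 800},
    {&quot;input_tokens&quot;: 15_500, &quot;output_tokens&quot;: 1_200},
]))
```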

**Logs** are raw events: &quot;This LLM call failed,&quot; &quot;Token limit exceeded,&quot; &quot;Tool returned an error.&quot; In a service, logs focus on errors. In an agent, logs are also informational: &quot;Agent selected tool X.&quot; &quot;Retry attempt 2 of 3.&quot; Logs are lower resolution than traces. They&apos;re faster to query and more storage-efficient.

## What you actually capture

A production-grade agent observability system captures the following (sketched as a record type after this list):

- **LLM calls:** model name, parameters (temperature, max_tokens, top_p), the prompt sent, the completion received, token counts (input and output), latency, cost, success or failure. This is the core of agent observation.
- **Tool invocations:** tool name, input parameters, output, latency, whether the tool succeeded or failed, and any retry information. Tools are where your agent touches the outside world. They cause most of your latency and most of your errors.
- **Token usage per call:** not just total tokens consumed. A breakdown: how many tokens in the context window, how many in the prompt, how many in the response. This helps you optimize context and identify tokens wasted on irrelevant content.
- **The agent&apos;s reasoning chain:** the intermediate thoughts or justifications the agent produced at each step. Some LLM frameworks (like ReAct) explicitly generate these; others encode them implicitly. Capturing this chain is what lets you debug why an agent made a particular decision.
- **Model and parameters:** which model was used, which version, what temperature and sampling parameters. This matters because the same agent with different parameters can behave very differently.
- **Errors and retries:** when a tool call failed, did the agent retry? How many times? Did it eventually succeed or give up? This tells you if your agent is robust or brittle.
- **Latency per layer:** total latency is a sum of LLM latency + tool latency + overhead. Breaking this down tells you where to optimize.

These signals should conform to the OpenTelemetry semantic conventions for generative AI. The conventions define a standard schema for representing LLM calls, tool use, embeddings, and agent systems in trace data. Adopting the standard means your agent traces can be ingested by any OpenTelemetry-compatible backend (Jaeger, Datadog, Elastic, or a custom system) without vendor lock-in.
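
One way to keep that checklist actionable is to treat it as a record your instrumentation fills in per call. A sketch; the field names are illustrative, not a schema from the conventions:

```python
from dataclasses import dataclass

@dataclass
class LLMCallRecord:
    # One record per LLM call, mirroring the capture list above.
    model: str            # model name and version
    params: dict          # temperature, max_tokens, top_p
    prompt: str
    completion: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float
    succeeded: bool
    retries: int = 0      # retry count, for spotting brittle tools and calls
    reasoning: str = &quot;&quot;  # intermediate thoughts, when the framework emits them
```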

## Common questions

&lt;FAQBlock items={[
  {
    question: &quot;Why does my trace show 47 LLM calls when I only invoked the agent once?&quot;,
    answer: &quot;Three common causes. First, the framework you&apos;re using (LangChain, LlamaIndex, AutoGen, CrewAI) might be making nested chains where each step is itself an LLM call: a planning call, an action call, a reflection call, a synthesis call. A single user request fans out fast. Second, retries: if a tool call returns an unexpected error or the LLM produces malformed output, many frameworks silently retry with backoff, multiplying calls. Third, agent loops: if the agent can&apos;t converge on an answer, it keeps reasoning and acting until it hits a max-iteration limit. Open the trace tree and look at timestamps. Tightly clustered calls with the same model and parameters mean retries. Spread-out calls with different prompts mean the framework is decomposing the task more than you expected.&quot;
  },
  {
    question: &quot;My agent traces are 50MB each. Should I be worried?&quot;,
    answer: &quot;Yes, in a specific way. Trace size is dominated by prompt and completion text. A 50MB trace means you&apos;re sending massive prompts to the LLM: huge system prompts, retrieved documents, long conversation history, included file contents. The cost is real: that&apos;s a lot of input tokens per call. The performance hit is also real because most trace UIs struggle to render or query traces above ~10MB. Two fixes work. First, reduce what you put in the prompt: tighter system prompts, smarter retrieval, summarize conversation history rather than passing it raw. Second, configure your observability tool to truncate long fields above a threshold (Langfuse, Arize Phoenix, and Datadog all support this). Truncated traces are still useful for navigation, and you can fetch the full prompt from your application logs if you actually need it.&quot;
  },
  {
    question: &quot;Can I use my existing APM (Datadog, New Relic) for agents?&quot;,
    answer: &quot;Partially. Datadog and New Relic have built LLM modules onto their existing platforms. They work, but they weren&apos;t designed for agents from the ground up. They&apos;re better at capturing that an LLM call happened than at capturing the reasoning chain or the interaction between multiple tool calls. If you&apos;re already in Datadog, LLM Observability is a reasonable choice. If you&apos;re starting fresh, a tool built for agents will give you more signal.&quot;
  },
  {
    question: &quot;What should I capture in production agent traces?&quot;,
    answer: &quot;Start with: every LLM call (prompt and completion), every tool invocation (name and result), latency per call, total token usage, and final outcome (success or failure). Add error details if the agent failed. Once that&apos;s stable, add cost breakdown per model and tool selection reasoning. Don&apos;t try to capture everything on day one.&quot;
  },
  {
    question: &quot;How do I avoid storing sensitive data in traces?&quot;,
    answer: &quot;Most tools support redaction: marking which fields should not be logged (API keys, user PII, secrets). Some (like Datadog LLM Observability) ship with automatic PII detection. Build redaction into your SDK wrapper early; it&apos;s easier to add than to retrofit. Also consider sampling. You don&apos;t need to trace every request, just a statistically significant sample.&quot;
  },
  {
    question: &quot;How much overhead does observability add?&quot;,
    answer: &quot;Good observability SDKs are asynchronous. Traces are queued locally and sent in batches in the background, so they add minimal latency to your agent&apos;s response time. Expect overhead of 5 to 15 percent at the p99, depending on the tool and your stack. That&apos;s a worthwhile trade-off for production visibility.&quot;
  }
]} /&gt;

## Further reading

- [OpenTelemetry semantic conventions for generative AI](https://opentelemetry.io/docs/specs/semconv/gen-ai/). The emerging standard for agent telemetry. Start with the GenAI spans spec.
- [What is an AI agent?](/blog/2026-05-08-agents-101). Background on agent architecture, ReAct, and how agents differ from chatbots and workflows.
- What is OpenTelemetry for AI agents (forthcoming). Deep dive into OpenTelemetry&apos;s GenAI semantic conventions and how to instrument agents with OTel.
- What is agent token economics (forthcoming). How to measure and optimize the cost side of observability once you have traces.
- [Braintrust: Agent observability tracing guide](https://www.braintrust.dev/docs/guides/tracing). Practical walkthrough of what to trace.
- [Langfuse: AI agent observability](https://langfuse.com/docs/observability/overview). Case study of agent tracing in practice.</content:encoded></item><item><title>Agents 101: Reasoning, Actions &amp; Autonomy</title><link>https://tokenjam.dev/blog/2026-05-08-agents-101/</link><guid isPermaLink="true">https://tokenjam.dev/blog/2026-05-08-agents-101/</guid><description>A foundational definition: what AI agents are, how they differ from chatbots and workflows, and the components that make them work.</description><pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate><content:encoded>import TLDR from &apos;@/components/TLDR.astro&apos;;
import FAQBlock from &apos;@/components/FAQBlock.astro&apos;;

&lt;TLDR&gt;
- An AI agent uses an LLM to reason about a goal and decide what actions to take, calling tools and observing results until the goal is reached.
- Agents differ fundamentally from chatbots (which don&apos;t act) and workflows (which don&apos;t decide).
- The ReAct pattern (reasoning + acting) is the dominant architecture in modern agent systems.
- Agents range from copilots that suggest actions to fully autonomous systems that run unattended for hours.
- Key components: the LLM (reasoning), tools (actions), context/memory (state), and a control loop (orchestration).
&lt;/TLDR&gt;

An AI agent is a system that uses a large language model to make decisions and take actions in pursuit of a goal. It calls tools, observes what they return, and iterates until the goal is reached. A chatbot waits for the next message; an agent plans and executes its own sequence of steps.

## Why it matters

The term entered the mainstream in early 2023, when projects like AutoGPT showed that LLMs could direct their own execution. The concept wasn&apos;t new. Researchers had been studying goal-directed autonomous systems for decades. What changed was accessibility: capable base models (GPT-4, Claude) and standardized tool-calling APIs made it practical to build a working agent in a few dozen lines of code.

The word *agent* now gets used loosely. Some vendors call a chatbot with a search feature an agent. Others claim that any LLM inference with retrieval is &quot;agentic.&quot; This inflation matters. It obscures what&apos;s actually new and what&apos;s repackaging. Precision helps you know what you&apos;re building or evaluating.

Agents represent a shift in how LLMs are deployed. The old model: user asks a question, system returns an answer, conversation ends. Agents invert that. The system receives a goal, decides on sub-goals, gathers information, corrects itself, and iterates without waiting for permission between steps. New architecture. New error handling. New thinking about safety and observability.

## Agents vs. chatbots vs. workflows vs. traditional AI

A quick way to distinguish these four categories is to ask: does it use an LLM to decide what to do next? And can it call tools to act on those decisions?

**Chatbots** use an LLM to generate text. They don&apos;t call tools, and they don&apos;t pursue goals across steps. A customer-service chatbot answers your question. It doesn&apos;t modify your account or call internal APIs unless you ask. Even then, it tends to suggest options or retrieve data rather than decide and act. The LLM&apos;s job is to understand and respond.

**Workflows** call tools and pursue goals. They don&apos;t use an LLM to decide which tool to call or how to interpret the result. A workflow might be: fetch customer data, run a validation rule, log an event, send an email. Each step is predefined. Branching is rule-based. The LLM is not in the loop. Workflows are predictable and cheap. They break when the task is ambiguous or open-ended.

**Agents** combine both. The LLM observes the current state and decides which tool to call next. It adapts and self-corrects as it goes. If a tool call fails, the agent reasons about why and tries something else. The flexibility costs you something. Agents are less predictable, more expensive per inference, and harder to debug. The reward is open-ended tasks, where the path isn&apos;t predetermined.

**Traditional AI/ML systems** (classifiers, regressions, recommenders) optimize a fixed function learned from data. They have no LLM, and they don&apos;t pursue multi-step goals. They are specialized and efficient. Generalizing to a new task means retraining.

| Aspect | Chatbot | Workflow | Agent | Traditional ML |
| --- | --- | --- | --- | --- |
| Uses LLM to decide next step? | No (generates text) | No (follows rules) | Yes | No |
| Calls tools? | Rarely; usually retrieval only | Yes; predefined sequence | Yes; chosen by LLM | No |
| Pursues multi-step goal? | No (responds to input) | Yes; fixed path | Yes; adaptive path | No |
| Handles ambiguous tasks? | Moderate (can discuss) | Poor (requires rigid structure) | Good (can reason and adapt) | Poor |

## The ReAct pattern and core components

Most agents built since 2023 follow a pattern called **ReAct (Reasoning and Acting)**, introduced in Yao et al.&apos;s 2022 paper from Google Research and Princeton. The idea is straightforward. The LLM produces reasoning steps (thinking aloud about what it needs to do) interleaved with actions (tool calls). It observes the result, then reasons further.

A ReAct loop looks like this:

1. **Observation:** the agent observes the current state (the original goal, prior tool results, conversation history).
2. **Reasoning:** the LLM thinks through the problem: &quot;I need to fetch the user&apos;s account, check their history, then decide whether to approve the request.&quot;
3. **Action:** the agent calls a tool, say `fetch_account(user_id)`.
4. **Observation:** the agent receives the result and feeds it back to the LLM.
5. **Loop:** the LLM reasons again, decides on the next action, and repeats until it either reaches the goal or determines that the goal isn&apos;t achievable.

The pattern works because the reasoning traces make the LLM&apos;s decisions interpretable. You can see why it chose an action. They also enable self-correction: if a tool result is unexpected, the LLM can reason about what went wrong.

An agent&apos;s core components are:

- **The LLM (reasoning engine):** decides what action to take based on the goal and current state. The decision-making layer.
- **Tools (action layer):** functions the agent can call: APIs, database queries, code execution, web searches, file operations. Tools are how the agent affects the world.
- **Context and memory (state):** everything the agent knows: the original goal, conversation history, prior tool results, and any persistent state it needs. Without good memory management, agents hallucinate and repeat mistakes.
- **Control loop (orchestration):** the code that runs the loop. It calls the LLM, parses the output for tool calls, executes them, and feeds results back. Modern frameworks (Anthropic&apos;s Claude SDK, LangChain, LlamaIndex) handle this. You can also implement it from scratch, as the sketch below shows.
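
For the shape of that loop in code, a minimal sketch follows; `llm` stands in for any chat client that returns either a tool call or a final answer, and `fetch_account` is the hypothetical tool from the walkthrough above:

```python
import json

# Hypothetical tool registry; real agents call APIs, databases, shells.
TOOLS = {
    &quot;fetch_account&quot;: lambda user_id: {&quot;user_id&quot;: user_id, &quot;status&quot;: &quot;active&quot;},
}

def run_agent(goal: str, llm, max_steps: int = 10) -&gt; str:
    history = [{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: goal}]
    for _ in range(max_steps):    # hard budget so the loop always terminates
        reply = llm(history)      # reason: pick the next action, or finish
        if reply[&quot;type&quot;] == &quot;final_answer&quot;:
            return reply[&quot;content&quot;]
        result = TOOLS[reply[&quot;tool&quot;]](**reply[&quot;args&quot;])  # act
        history.append({&quot;role&quot;: &quot;tool&quot;, &quot;content&quot;: json.dumps(result)})  # observe
    return &quot;Stopped: step budget exhausted before reaching the goal.&quot;
```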

## Levels of autonomy

Agents exist on a spectrum. At one end are suggestion-based copilots that nudge you; at the other, autonomous systems that run unattended for hours.

**Copilot mode (suggestion):** the agent observes what you&apos;re doing and suggests the next action. You approve before it executes. Example: Cursor&apos;s autocomplete suggests the next line of code; you hit Tab to accept or Escape to reject. The model is doing some reasoning. You stay in control of execution.

**Agentic mode (supervised autonomy):** the agent makes and executes decisions within a scope you define. You might say &quot;add tests for this file&quot; and the agent writes tests, runs them, and shows you the result, all without asking permission between steps. You can pause or override at any point. Example: Claude Code in an IDE, or an agent working a bounded coding task. The agent is autonomous within the scope, not globally.

**Autonomous agent (unattended):** the agent pursues a goal with minimal human oversight. You set a goal (&quot;reduce our average response time by 10%&quot;) and the agent decides what to measure, what to try, what to roll back, and what to keep. It might run for days, making changes and watching outcomes. Example: an agent managing an experimentation platform, or optimizing an ad-bidding algorithm. These are rare and tend to be domain-specific. The cost of mistakes is too high for general-purpose deployment.

## Notable tools

The agent landscape is wide. Grouping by category is more useful than a flat list. Below: the categories that matter as of 2026, with prominent examples in each.

### Coding agents

The most visible category, and the one most builders encounter first.

- **[Claude Code](https://anthropic.com/product/claude-code)** (Anthropic): agentic coding tool in the terminal, IDE, and browser. Native OTLP telemetry support.
- **[Codex](https://openai.com/codex)** (OpenAI): CLI and IDE-based coding agent. Recently rebuilt; supports OAuth-based authentication.
- **[Cursor](https://cursor.com)**: AI code editor with agent mode. Autonomously explores codebases, edits files, runs tests.
- **[OpenHands](https://openhands.dev)** (formerly OpenDevin): open-source autonomous agent for software engineering. Runs in a Docker sandbox.
- **[Aider](https://aider.chat)**: open-source AI pair programmer for the terminal. Integrates with git, supports multiple LLM providers.
- **[Continue](https://continue.dev)**: open-source IDE extension for VS Code and JetBrains.

### Personal / general-purpose agents

This category emerged sharply in 2026. These agents aren&apos;t tied to a single domain like coding; they bridge messaging, scheduling, search, and personal automation.

- **[OpenClaw](https://openclaw.ai/)** (Peter Steinberger, MIT-licensed): the breakout OSS agent of 2026. Local-first personal assistant running across WhatsApp, Telegram, Slack, Discord, iMessage, and 20+ other channels. At 369k+ GitHub stars, currently the most-starred GitHub repo in history; defines the personal-agent category.
- **[Hermes Agent](https://hermes-agent.nousresearch.com/)** (Nous Research, MIT-licensed): open-source self-improving agent with persistent memory and skill learning. ~32k stars in two months. Built around the `agentskills.io` standard; differentiates by retaining what it learns across sessions.
- **[NemoClaw](https://www.nvidia.com/en-us/ai/nemoclaw/)** (NVIDIA, built on OpenClaw): enterprise-hardened OpenClaw distribution with sandboxing, audit logging, and on-device inference. Targets DGX Spark for local enterprise workloads.

### Agent frameworks and SDKs

For builders, not end users. These are how you build agents rather than run pre-built ones.

- **[LangChain Agents / LangGraph](https://langchain.com)**: the LangChain ecosystem. LangGraph is the newer state-machine-based approach; LangChain Agents is the older flexible API. Widely used despite ongoing critique of the abstraction layers.
- **[OpenAI Agents SDK](https://developers.openai.com/api/docs/guides/agents)**: OpenAI&apos;s official SDK for building agents on their models. Native HITL primitives, tool calling, and tracing.
- **[Anthropic Agent SDK](https://code.claude.com/docs/en/agent-sdk/overview)**: `claude-agent-sdk`, built-in tool use, prompt caching, and agentic patterns.
- **[CrewAI](https://crewai.com)**: multi-agent orchestration framework, organized around &quot;crews&quot; of role-defined agents that collaborate.
- **[AutoGen](https://github.com/microsoft/autogen)** (Microsoft): multi-agent conversation framework. Heavier than CrewAI, more research-flavored.
- **[Mastra](https://mastra.ai)**: TypeScript-native agent framework. Newer, growing fast in the JS/TS ecosystem.
- **[smolagents](https://github.com/huggingface/smolagents)** (Hugging Face): minimal-abstraction agent framework, designed to be small enough to read end-to-end.
- **[LlamaIndex](https://llamaindex.ai)**: primarily a RAG framework, but ships agent capabilities for retrieval-heavy use cases.

### Web-acting / computer-use agents

A distinct emerging category: agents that control browsers or full desktops rather than calling APIs.

- **[Anthropic Computer Use](https://docs.anthropic.com/en/docs/build-with-claude/computer-use)**: Claude can control a computer via screenshots and mouse/keyboard.
- **[Browser Use](https://github.com/browser-use/browser-use)**: open-source library for browser-controlling agents.
- **[Skyvern](https://skyvern.com)**: browser automation agent with vision capabilities.

(OpenAI&apos;s Operator was in this category but was reportedly retired in early 2026.)

### Vertical and domain-specific agents

- **[Devin](https://cognition.ai)** (Cognition): autonomous software-engineering agent. The original &quot;agent that does the whole job&quot; demo.
- **[Sierra](https://sierra.ai)**: customer-service agent platform.
- **[Manus](https://manus.im/)**: Chinese personal-agent platform; heavy integration with Chinese consumer apps.

### Historical mention

- **AutoGPT** (2023): open-source autonomous agent framework that brought the concept of LLM-driven agents to a wide audience. Architecturally important; today more historical than active.

## Common questions

&lt;FAQBlock items={[
  {
    question: &quot;How is an agent different from a chatbot?&quot;,
    answer: &quot;A chatbot responds. An agent pursues. Ask a chatbot &apos;book me a flight&apos; and it asks clarifying questions, then waits for you to confirm. Ask an agent and it gathers options, checks your calendar, considers your budget, and books, without asking permission between steps. The chatbot reacts. The agent acts.&quot;
  },
  {
    question: &quot;What&apos;s the difference between an agent and a workflow?&quot;,
    answer: &quot;A workflow is a fixed sequence of steps determined in advance. You define &apos;do A, then B, then C, with these rules for branching.&apos; A workflow always takes the same path for the same inputs. An agent reasons about which steps to take and in what order, adapting based on intermediate results. Workflows are predictable and efficient. Agents trade predictability for flexibility.&quot;
  },
  {
    question: &quot;Why does my agent keep calling the same tool five times in a row?&quot;,
    answer: &quot;That&apos;s a loop, and the LLM probably doesn&apos;t recognize what the tool returned as the answer it was looking for. Common causes: the tool returned an error and the agent retried with the same inputs; the response shape was different from what the LLM expected, so it kept trying; the system prompt left the goal vague enough that the LLM thrashes between candidates. Fixes that work: clearer descriptions in your tool schema, explicit error messages from the tool (&apos;not found&apos; rather than null), and a hard call-count budget so the loop terminates rather than burning tokens.&quot;
  },
  {
    question: &quot;How autonomous do agents actually get?&quot;,
    answer: &quot;Depends on the task and the risk. In low-risk domains (code suggestions, documentation), agents run nearly unsupervised. In higher-risk domains (financial transactions, customer-facing decisions), agents operate under constraints: bounded scope, human review loops, or escalation to a human when confidence is low. Most production agents are supervised autonomy, not full autonomy.&quot;
  },
  {
    question: &quot;Is it normal for a single Claude Code session to cost $40?&quot;,
    answer: &quot;Not normal, not rare. A long session that maintains a big context and frequently re-reads files will pile up tokens fast. Three places to look. First, prompt caching: is the run hitting the cache, or rebuilding the prompt every turn? Second, context bloat: huge system prompts, large repos, and many open files multiply per-call cost. Third, model choice: Opus is meaningfully pricier than Sonnet on the same workload. Set a hard spend cap and watch tokens per turn. Most overruns trace to context size, not call count.&quot;
  },
  {
    question: &quot;Why do some agents get stuck or make silly mistakes?&quot;,
    answer: &quot;Agents inherit their LLM&apos;s limitations. An LLM can hallucinate or misinterpret what a tool returned. Across multiple reasoning steps, these errors compound. A bad tool result leads the agent down the wrong path. Confirmation bias makes it ignore contradictory evidence. Good design mitigates the failure modes: clear tool descriptions, explicit error signals from tools, and a memory model that lets the agent backtrack rather than press on with bad state.&quot;
  }
]} /&gt;

## Further reading

- [ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629). Yao et al., 2022 (ICLR 2023). The foundational paper introducing the ReAct pattern.
- [Building Effective AI Agents](https://www.anthropic.com/engineering/building-effective-agents). Anthropic&apos;s guide to architecture patterns, tool design, and implementation frameworks for single and multi-agent systems.
- [Writing Effective Tools for AI Agents](https://www.anthropic.com/engineering/writing-tools-for-agents). Anthropic&apos;s technical advice on tool design for agentic systems.
- [Anthropic Cookbook: Patterns and Agents](https://github.com/anthropics/anthropic-cookbook). Reference implementations and code examples.</content:encoded></item></channel></rss>