A user reports a bad answer. The product manager forwards it to engineering. Engineering opens the LLM API dashboard and sees the request succeeded. The model returned 200. Latency was normal. Tokens looked reasonable. There is no error to investigate, no exception to grep, no stack trace to read. And yet the answer was wrong.
This is the moment most teams discover that traditional logging does not work for LLM systems. The interesting failures are not crashes. They are decisions — which document the retriever surfaced, which tool the model called, which version of the prompt was in production at 2:14pm, what the temperature was, whether a cache returned a stale result, whether a guardrail rewrote the output. Without LLM tracing, every one of those questions is a guess.
This guide is a practitioner view of LLM tracing in production: what to capture, how to structure spans for AI agent workflows, why the OpenTelemetry GenAI semantic conventions matter, and what breaks once you scale beyond a single chat endpoint. It is part of the larger question of why your AI experiments are failing — because the system underneath the chat box is exactly where these failures hide.
What LLM Tracing Actually Is
LLM tracing applies distributed tracing — the same OpenTelemetry pattern that powers Datadog APM and Honeycomb — to the components of an AI system. A single user interaction becomes a tree of spans:
- The HTTP request that started the workflow
- Each retrieval call (vector search, SQL, API)
- Each prompt construction step
- Each LLM call (with model, parameters, token counts)
- Each tool call the model makes
- Each guardrail or post-processing step
- The final response sent back
Each span has a start time, end time, status, and structured attributes. They link to their parents. Together, they form a trace. A trace is the receipt for one decision an AI system made.
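As a rough illustration of that structure, the sketch below models spans as plain Python objects linked by parent ID and prints the resulting tree. The `Span` class, the span names, and the timings are hypothetical stand-ins for illustration, not the OpenTelemetry SDK's own types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical in-memory span; a real SDK span carries more fields."""
    span_id: str
    name: str
    parent_id: Optional[str]
    start_ms: int
    end_ms: int
    status: str = "ok"
    attributes: dict = field(default_factory=dict)


def children_of(spans, parent_id):
    """Direct children of a span, ordered by start time."""
    kids = [s for s in spans if s.parent_id == parent_id]
    return sorted(kids, key=lambda s: s.start_ms)


def render(spans, parent_id=None, depth=0):
    """Flatten the trace into indented lines, root first."""
    lines = []
    for s in children_of(spans, parent_id):
        lines.append(f"{'  ' * depth}{s.name} ({s.end_ms - s.start_ms}ms)")
        lines.extend(render(spans, s.span_id, depth + 1))
    return lines


# Made-up timings for one user interaction.
trace = [
    Span("a", "http.request", None, 0, 900),
    Span("b", "workflow.agent", "a", 5, 890),
    Span("c", "retrieval.vector_search", "b", 10, 190),
    Span("d", "llm.call", "b", 200, 650,
         attributes={"gen_ai.request.model": "example-model"}),
    Span("e", "tool.call", "b", 660, 850),
]
print("\n".join(render(trace)))
```

The parent link is what separates this from a list of log events: the tree can be rebuilt from the spans alone, in any arrival order.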
This is different from logging. Logs are events. Traces are causal chains. For AI agents — where a single request fans out into retrieval, reasoning, tool execution, and reflection — you need the chain, not just the events.
It is also different from AI agent observability as a system property. Tracing is a mechanism; observability is the outcome. Tracing is the data you must capture to be able to ask questions you have not yet thought of.
What to Capture: The Five Span Types That Matter
A production-grade LLM tracing setup captures five span types. Skip any of them and you blind yourself to a class of failure.
1. LLM Call Spans
The atomic unit. One span per call to a model. Every LLM call span must record:
- Model identity: the exact model string the provider returned (gpt-5.1-2026-02-04, not gpt-5). Aliases hide silent upgrades.
- Provider and endpoint: which vendor, which region, which gateway. This is how you debug “it’s slow in EU.”
- Request parameters: temperature, top_p, max_tokens, response_format, tools list. Most “the model is acting weird” tickets are temperature drift from a config change.
- Token counts: input and output, separately. Output tokens are 3–5x more expensive at most providers, and they drive latency.
- Latency: time-to-first-token and total time. These are different problems with different fixes.
- Finish reason: stop, length, tool_calls, content_filter. A workload where 12% of completions hit length has a token-limit bug (max_tokens or the context window), not a quality bug.
The OpenTelemetry GenAI semantic conventions standardize this with attributes like gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons. Adopting the standard is not religion — it is portability. You can change vendors without rewriting your dashboards.
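A minimal sketch of what those attributes could look like for one call, plus the finish-reason check described above. The attribute keys follow the gen_ai.* names. The helper functions, model strings, and sample numbers are assumptions made for illustration.

```python
def llm_call_attributes(requested_model, response_model, temperature,
                        max_tokens, input_tokens, output_tokens, finish_reason):
    """Build the structural attribute dict for one LLM call span."""
    return {
        "gen_ai.request.model": requested_model,
        "gen_ai.response.model": response_model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.request.max_tokens": max_tokens,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Finish reasons are modeled as a list, one entry per returned choice.
        "gen_ai.response.finish_reasons": [finish_reason],
    }


def length_finish_rate(spans):
    """Share of LLM call spans whose completion stopped on the token limit."""
    if not spans:
        return 0.0
    hits = sum(1 for attrs in spans
               if "length" in attrs["gen_ai.response.finish_reasons"])
    return hits / len(spans)


# Hypothetical model names and token counts.
spans = [
    llm_call_attributes("example-latest", "example-2025-01-01", 0.2, 512, 900, 120, "stop"),
    llm_call_attributes("example-latest", "example-2025-01-01", 0.2, 512, 950, 512, "length"),
    llm_call_attributes("example-latest", "example-2025-01-01", 0.2, 512, 800, 90, "stop"),
    llm_call_attributes("example-latest", "example-2025-01-01", 0.2, 512, 700, 60, "tool_calls"),
]
print(length_finish_rate(spans))  # 0.25
```

In a real setup the dict would be attached to the span through the tracing SDK. Keeping the construction in one function makes it harder for two call sites to drift apart on attribute names.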
2. Tool Call Spans
When the model decides to call a tool — a function, an MCP server, an API — that call gets its own span, parented to the LLM call that requested it.
Capture:
- Tool name and version
- Arguments the model produced (this is the single most useful debugging artifact in agent systems)
- Result or error
- Latency
- Whether the result was truncated before going back to the model
Most production agent failures are not LLM failures. They are tool failures: a stale API key, a 429, an argument the model hallucinated, a result that exceeded the context window and got silently chopped. You will not see any of this without tool spans.
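One way to make those failures visible is to wrap every tool invocation so the arguments, the error, and any truncation are recorded alongside the latency. This is a sketch under assumptions: the record keys, the `traced_tool_call` helper, and the character budget are hypothetical, and a real system would write the record onto a span rather than return a dict.

```python
import json
import time

MAX_RESULT_CHARS = 2000  # hypothetical budget for what goes back to the model


def traced_tool_call(name, version, fn, arguments, max_chars=MAX_RESULT_CHARS):
    """Run a tool and return a span-like record of what happened."""
    started = time.perf_counter()
    record = {
        "tool.name": name,
        "tool.version": version,
        # The arguments exactly as the model produced them.
        "tool.arguments": json.dumps(arguments, sort_keys=True),
        "error": None,
        "result": None,
        "truncated": False,
    }
    try:
        result = str(fn(**arguments))
        record["truncated"] = len(result) > max_chars
        record["result"] = result[:max_chars]
    except Exception as exc:  # record the failure instead of hiding it
        record["error"] = f"{type(exc).__name__}: {exc}"
    record["latency_ms"] = (time.perf_counter() - started) * 1000
    return record


def lookup_invoice(customer_id):
    """Stand-in tool that returns an oversized payload."""
    return "x" * 5000


ok = traced_tool_call("lookup_invoice", "1.0", lookup_invoice, {"customer_id": "42"})
# The model invents a parameter name; the record shows exactly what it sent.
bad = traced_tool_call("lookup_invoice", "1.0", lookup_invoice, {"cust_id": "42"})
print(ok["truncated"], bad["error"])
```

The second call is the hallucinated-argument case: without the recorded arguments, all that surfaces downstream is a failed tool call.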
3. Retrieval Spans
For RAG systems and any agent with memory, retrieval is where quality dies. A retrieval span should capture:
- The query (the rewritten query, not just the user’s words)
- The index or collection searched
- Top-k, filters, hybrid weights
- The IDs of returned chunks (not the full text — events handle that)
- Relevance scores
- Latency
When users complain “it forgot what I told it,” 80% of the time the retriever returned the wrong chunks. The other 20%, it returned the right chunks and the prompt template dropped them. Both are visible in retrieval spans.
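The second case, where the right chunks were retrieved and then dropped, can be checked mechanically once chunk IDs are on the span. A minimal sketch: the `retrieval.*` keys are illustrative names rather than standardized attributes, and `prompt_chunk_ids` is assumed to be recorded by whatever builds the prompt.

```python
def retrieval_attributes(query, index, top_k, hits):
    """Structural attributes for one retrieval span.

    hits is a list of (chunk_id, relevance_score) pairs in rank order.
    Chunk text is deliberately left out; only IDs and scores are kept.
    """
    return {
        "retrieval.query": query,
        "retrieval.index": index,
        "retrieval.top_k": top_k,
        "retrieval.chunk_ids": [chunk_id for chunk_id, _ in hits],
        "retrieval.scores": [score for _, score in hits],
    }


def dropped_chunks(retrieval_attrs, prompt_chunk_ids):
    """Chunks the retriever returned that never reached the prompt."""
    included = set(prompt_chunk_ids)
    return [cid for cid in retrieval_attrs["retrieval.chunk_ids"]
            if cid not in included]


attrs = retrieval_attributes(
    query="refund policy for annual plans",  # the rewritten query
    index="kb-v3",
    top_k=3,
    hits=[("doc-17#2", 0.91), ("doc-04#1", 0.84), ("doc-88#5", 0.62)],
)
print(dropped_chunks(attrs, prompt_chunk_ids=["doc-17#2", "doc-04#1"]))
# ['doc-88#5']
```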
4. Workflow / Agent Spans
The outer span that wraps the whole interaction. For multi-step agents, this includes a span per reasoning step. Capture:
- Workflow name and version
- Agent name (for multi-agent systems)
- Step number
- Decision the agent made at this step
- Why it stopped (completed, max steps, escalation, error)
This is the span you graph by p95 latency and error rate. It is also the span your business KPIs join to.
5. Guardrail and Post-Processing Spans
The often-forgotten layer. PII redaction, content moderation, JSON validation, schema repair — each one is a span. When a guardrail rewrites an output and the user complains the answer is incomplete, you need a span proving the rewrite happened and what it changed.
Do not put full prompts in span attributes
Attributes are indexed, size-limited, and exposed to anyone with read access to your tracing backend. Putting full prompts and completions there is an anti-pattern that the OpenTelemetry GenAI conventions explicitly call out; that content belongs in span events. Events can be sampled, filtered, or dropped at the Collector level without touching application code, which matters when one of those prompts contains a customer’s social security number.
OpenTelemetry GenAI: Why the Standard Matters
The ecosystem fractured early. LangSmith, Langfuse, Arize, Helicone, Datadog LLM Observability — each shipped its own schema. If you instrumented for one, you were locked in. If you wanted to use two, you instrumented twice.
The OpenTelemetry GenAI semantic conventions fix this. They define a standard attribute namespace (gen_ai.*) for LLM operations, agent spans, MCP tool spans, and content events. Most commercial vendors now ingest OTel-formatted spans, and the major SDKs (OpenAI, Anthropic, LangChain, LlamaIndex, AutoGen) ship OTel instrumentation either natively or via a single wrapper.
The pragmatic stance for a 2026 production system:
- Instrument with OpenTelemetry GenAI conventions. Even if the spec is still in development status, the cost of switching from a vendor-specific schema later is much higher than the cost of tracking a stabilizing spec now.
- Use a Collector in the middle. Send spans to your APM and your LLM-specific tool from the same source. Filter, sample, and redact at the Collector — not in app code.
- Treat content as events, not attributes. Prompts, completions, tool arguments, retrieved chunks all go into span events. Then you can sample them down to 1% or 0% in environments where you cannot store user content.
This separation — structural attributes everywhere, content as configurable events — is the difference between a tracing system you can ship to a regulated customer and one you cannot.
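The separation can be sketched as a single split at export time. The content key names, the `split_span_payload` and `export` helpers, and the `store_content` flag are assumptions made for illustration. In the setup described above, the drop decision would live in the Collector rather than in application code.

```python
# Hypothetical names for the fields that carry user content.
CONTENT_KEYS = {"prompt", "completion", "tool_arguments", "retrieved_chunks"}


def split_span_payload(payload):
    """Separate structural attributes from content destined for span events."""
    attributes = {k: v for k, v in payload.items() if k not in CONTENT_KEYS}
    events = [{"name": k, "body": v} for k, v in payload.items()
              if k in CONTENT_KEYS]
    return attributes, events


def export(payload, store_content):
    """Always ship attributes; ship content events only where allowed."""
    attributes, events = split_span_payload(payload)
    return {"attributes": attributes, "events": events if store_content else []}


payload = {
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 812,
    "prompt": "Summarize the customer's last three invoices.",
    "completion": "The customer was billed three times in March.",
}
regulated = export(payload, store_content=False)
internal = export(payload, store_content=True)
print(len(regulated["events"]), len(internal["events"]))  # 0 2
```

Both exports carry the same attributes, so latency and cost dashboards look identical in the regulated and unregulated environments.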
What Sampling Looks Like When You Can’t Store Everything
At 1M LLM calls per day, capturing every prompt and completion is impractical. Span storage is cheap; content storage is not, and content is what regulators care about.
A workable production sampling policy:
- 100% of structural spans (attributes only). You always need to know what happened.
- 100% of error and slow traces with content. Errors and tail-latency are where bugs live.
- 5–10% of successful traces with content. Enough to power LLM evals on real production data.
- 0% of content for traces matching PII or regulated-data signatures. The redactor runs at the Collector. The structural span still ships, so you keep latency and cost data.
Most teams overcomplicate this. The goal is not perfect fidelity. The goal is enough signal to debug what is broken and to feed the regression suite. Anything more is a storage bill.
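The policy above fits in one function. This sketch makes three assumptions: the PII rule wins when it conflicts with the error rule, the "slow" threshold is a made-up 5 seconds, and the trace arrives as a dict with hypothetical `has_pii`, `error`, and `duration_ms` fields. Structural spans are not handled here because they are always kept.

```python
import random


def keep_content(trace, success_rate=0.05, slow_ms=5000, rng=random.random):
    """Decide whether to retain prompt and completion content for a trace."""
    # Assumption: regulated data is never stored, even for failing traces.
    if trace.get("has_pii"):
        return False
    # Errors and tail-latency traces always keep their content.
    if trace.get("error") or trace.get("duration_ms", 0) >= slow_ms:
        return True
    # Successful traces are sampled down to a small fraction.
    return rng() < success_rate


print(keep_content({"error": True, "duration_ms": 300}))    # True
print(keep_content({"error": True, "has_pii": True}))       # False
print(keep_content({"duration_ms": 9000}))                  # True
```

Passing `rng` as a parameter keeps the decision deterministic in tests while using `random.random` in production.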
What Breaks in Production
A list of things that look obvious in retrospect and consume weeks of debugging when you do not have them traced.
Silent model upgrades. A provider promotes their latest alias to a new snapshot. Your evals were run against the old one. Quality drops 4% overnight. Without gen_ai.response.model (the returned model, not the requested one) on every span, this is invisible.
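With both attributes on every span, detecting a silent upgrade is a group-by. The sketch below assumes spans are available as attribute dicts. The helper names and the model strings are hypothetical.

```python
from collections import Counter


def response_models_by_alias(spans):
    """Count which concrete models actually served each requested alias."""
    seen = {}
    for attrs in spans:
        alias = attrs["gen_ai.request.model"]
        seen.setdefault(alias, Counter())[attrs["gen_ai.response.model"]] += 1
    return seen


def drifted_aliases(spans):
    """Aliases that resolved to more than one snapshot in this window."""
    return sorted(alias for alias, counts in response_models_by_alias(spans).items()
                  if len(counts) > 1)


window = [
    {"gen_ai.request.model": "example-latest", "gen_ai.response.model": "example-2025-01-01"},
    {"gen_ai.request.model": "example-latest", "gen_ai.response.model": "example-2025-01-01"},
    {"gen_ai.request.model": "example-latest", "gen_ai.response.model": "example-2025-06-01"},
    {"gen_ai.request.model": "example-small", "gen_ai.response.model": "example-small"},
]
print(drifted_aliases(window))  # ['example-latest']
```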
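Prompt drift between environments. Staging works. Production is wrong. Someone shipped a new prompt version to staging only. A prompt version attribute on every LLM span (written here as gen_ai.prompt.version; treat it as a custom attribute unless the version of the conventions you use defines it) makes this a 30-second diagnosis.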
Tool argument hallucinations. The model invents a parameter name. The tool returns a 400. The agent retries, then gives up and tells the user “I’m having trouble.” Without the tool span capturing the actual arguments, you have no idea this is happening — only that “the agent feels worse this week.”
Context window starvation. Retrieved chunks exceed the budget. The prompt builder silently truncates the last one. The model answers without the critical context. Retrieval spans showing what was retrieved, plus LLM spans showing token counts, expose this immediately.
Latency from the wrong place. P95 is 8 seconds. The team optimizes prompts and shaves 200ms. The actual cause is one tool that calls a SaaS API that times out 3% of the time. Tool spans with latency distributions tell you in one query.
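A sketch of that one query, using a nearest-rank percentile over (span name, latency) pairs. The span names and latencies are made up, and the 3% timeout rate mirrors the scenario above.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def percentile_by_span_name(spans, pct):
    """spans is an iterable of (span_name, latency_ms) pairs."""
    by_name = defaultdict(list)
    for name, latency_ms in spans:
        by_name[name].append(latency_ms)
    return {name: percentile(vals, pct) for name, vals in by_name.items()}


# 100 calls each: the LLM is steady, the tool times out 3% of the time.
spans = [("llm.call", 1200)] * 100
spans += [("tool.saas_lookup", 300)] * 97 + [("tool.saas_lookup", 30000)] * 3

print(percentile_by_span_name(spans, 95))
print(percentile_by_span_name(spans, 99))
```

In this made-up data, a 3% timeout rate leaves the tool's own p95 at 300ms and only shows up at p99. The per-span query is worth running at more than one percentile, because a request that calls the tool several times can still feel that tail at the workflow level.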
Cost concentration. Three users generate 40% of token spend. The agent is doing pathological retries on their inputs. LLM spans rolled up by user ID expose this — and feed your cost attribution and per-tenant quota model directly.
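The rollup itself is a few lines once every LLM span carries a user identifier. In this sketch the `user.id` key and the sample numbers are assumptions. Tokens are summed without weighting by price, so this shows token share, not dollar share.

```python
from collections import defaultdict


def token_share_by_user(spans):
    """Rank users by their share of total input plus output tokens."""
    totals = defaultdict(int)
    for attrs in spans:
        totals[attrs["user.id"]] += (attrs["gen_ai.usage.input_tokens"]
                                     + attrs["gen_ai.usage.output_tokens"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(user, tokens / grand_total) for user, tokens in ranked]


spans = [
    {"user.id": "u1", "gen_ai.usage.input_tokens": 3000, "gen_ai.usage.output_tokens": 1000},
    {"user.id": "u1", "gen_ai.usage.input_tokens": 3000, "gen_ai.usage.output_tokens": 1000},
    {"user.id": "u2", "gen_ai.usage.input_tokens": 1500, "gen_ai.usage.output_tokens": 500},
]
print(token_share_by_user(spans))  # [('u1', 0.8), ('u2', 0.2)]
```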
Each of these is what the companion pieces on why the prompt is not the product and on production readiness before you scale AI are pointing at. The “AI product” is the prompt, the retrieval, the tools, the guardrails, the routing, and the observability that ties them together. If you cannot see the tree of spans for a single bad answer, you do not own a production AI system. You own a demo with traffic.
Tracing a Multi-Agent Workflow: A Worked Example
Consider an agent that handles a customer support ticket. The trace tree:
workflow.support_ticket (root span, 9.2s)
├── retrieval.knowledge_base (180ms)
├── llm.classifier (520ms) → returns category="billing"
├── workflow.billing_agent (7.8s)
│ ├── retrieval.account_history (340ms)
│ ├── llm.billing_reasoner (2.1s)
│ ├── tool.stripe_lookup (4.1s) ← p99 outlier
│ ├── llm.billing_reasoner (1.0s)
│ └── guardrail.pii_redact (90ms)
└── guardrail.tone_check (150ms)
What this trace tells you in one glance:
- The Stripe lookup is the latency hot spot. Not the LLM.
- The billing reasoner ran twice — first to plan the tool call, then to interpret the result. That is a single agent loop, not a bug.
- The classifier added 520ms upstream. If volume is high, route by heuristic first and only LLM-classify the ambiguous ones.
- PII redaction happened before tone check. If the redaction garbled the message and the tone check then flagged it, the user gets a worse response. Span ordering exposes this design choice.
You cannot reason about any of this from log lines. You need spans.
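The first observation can be reproduced by encoding the tree above as data and asking which leaf span is slowest. The nested-tuple layout and the helper functions are an illustrative choice, not a tracing backend's API. The names and durations are copied from the example trace.

```python
# (name, duration_ms, children), with values taken from the trace above.
TRACE = ("workflow.support_ticket", 9200, [
    ("retrieval.knowledge_base", 180, []),
    ("llm.classifier", 520, []),
    ("workflow.billing_agent", 7800, [
        ("retrieval.account_history", 340, []),
        ("llm.billing_reasoner", 2100, []),
        ("tool.stripe_lookup", 4100, []),
        ("llm.billing_reasoner", 1000, []),
        ("guardrail.pii_redact", 90, []),
    ]),
    ("guardrail.tone_check", 150, []),
])


def leaves(node):
    """Yield (name, duration_ms) for every span that has no children."""
    name, duration_ms, children = node
    if not children:
        yield name, duration_ms
    for child in children:
        yield from leaves(child)


def slowest_leaf(node):
    return max(leaves(node), key=lambda pair: pair[1])


def time_by_kind(node):
    """Sum leaf durations by span-name prefix (llm, tool, retrieval, ...)."""
    totals = {}
    for name, duration_ms in leaves(node):
        kind = name.split(".")[0]
        totals[kind] = totals.get(kind, 0) + duration_ms
    return totals


print(slowest_leaf(TRACE))  # ('tool.stripe_lookup', 4100)
print(time_by_kind(TRACE))
```

Summed by kind, the three LLM calls account for 3,620ms and the single Stripe call for 4,100ms, which is the comparison that the tree makes visible at a glance.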
How Tracing Connects to Evals and Rollback
Tracing is the foundation layer. Two systems sit on top of it:
Evals. Your LLM evals regression suite needs inputs. The best inputs are real production traces, sampled, sanitized, and graded. Without tracing, your eval dataset is a synthetic guess at what users actually do.
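Rollback. When a prompt change degrades quality, you need to identify which prompt version was in flight, on which slice of traffic, and when the regression started. Tagging every LLM span with a prompt version and a deployment ID (written here as gen_ai.prompt.version and gen_ai.deployment.id; treat both as custom attributes unless the version of the conventions you use defines them) turns rollback from a 4-hour incident into a 4-minute config change.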
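A sketch of the rollback query, assuming each LLM span has been joined with a graded outcome. The `prompt_version`, `ts`, and `bad` fields are hypothetical names, with `bad` standing in for whatever your eval grader emits.

```python
def summarize_by_prompt_version(spans):
    """Per prompt version: call count, bad-answer rate, and first appearance."""
    summary = {}
    for span in spans:
        entry = summary.setdefault(
            span["prompt_version"],
            {"calls": 0, "bad": 0, "first_seen": span["ts"]},
        )
        entry["calls"] += 1
        entry["bad"] += 1 if span["bad"] else 0
        entry["first_seen"] = min(entry["first_seen"], span["ts"])
    for entry in summary.values():
        entry["bad_rate"] = entry["bad"] / entry["calls"]
    return summary


# Timestamps are ISO 8601 strings in one timezone, so they sort as text.
spans = [
    {"prompt_version": "v12", "ts": "2026-03-01T13:50:00Z", "bad": False},
    {"prompt_version": "v12", "ts": "2026-03-01T14:05:00Z", "bad": False},
    {"prompt_version": "v13", "ts": "2026-03-01T14:14:00Z", "bad": True},
    {"prompt_version": "v13", "ts": "2026-03-01T14:20:00Z", "bad": True},
    {"prompt_version": "v13", "ts": "2026-03-01T14:31:00Z", "bad": False},
    {"prompt_version": "v13", "ts": "2026-03-01T14:40:00Z", "bad": True},
]
report = summarize_by_prompt_version(spans)
print(report["v13"]["first_seen"], report["v13"]["bad_rate"])
```

The `first_seen` timestamp for the worse version gives the start of the regression window, and the version label says what to roll back to.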
These three — tracing, evals, rollback — are the minimum viable observability stack for any AI system serving paying customers. metacto builds this stack as part of every Operational AI engagement. It is not optional. It is the substrate every other improvement compounds on.
An Implementation Order That Actually Works
If you are starting from zero, the sequence that produces the most signal per engineering hour:
- Week 1: Auto-instrument LLM client libraries with OpenTelemetry. Ship structural attributes only. No content. Wire to an existing APM.
- Week 2: Add tool call spans. This is where the next month of bugs lives.
- Week 3: Add retrieval spans. This is where the quality lives.
- Week 4: Add prompt version and deployment ID attributes. This is where the rollback lives.
- Week 5: Add content as span events, sampled and redacted at the Collector. This is where the evals live.
- Week 6 onward: Use the data. Build dashboards, set alerts, feed the eval suite, route slow traces to triage.
The teams that stall in production almost always have weeks 1–4 partially done and skipped to a vendor-specific SaaS for weeks 5–6. The result is lock-in and dashboards their CFO does not believe. The teams that scale start with OpenTelemetry, keep the data, and add vendors as features on top.
Conclusion
LLM tracing is the price of admission for production AI. Without it, every quality regression is a guess, every cost spike is a fire drill, every customer complaint becomes an archaeology project. With it, your team debugs AI failures the same way they debug distributed systems — by reading the trace.
This is one layer of the system underneath the chat box. The next layers — evals that ship with every release, LLM-as-judge, prompt versioning, observability as a system property — all build on traces. Build this floor first.
LLM Tracing in Production: FAQ
What is LLM tracing and how is it different from logging?
LLM tracing applies distributed tracing — the OpenTelemetry pattern — to LLM and agent workflows, producing a tree of causally-linked spans for a single interaction. Logging captures discrete events. Tracing captures the chain of decisions an AI system made — which retrieval, which model, which tool, which guardrail — so you can debug a bad answer instead of guessing at it.
What should every LLM tracing span capture at minimum?
For LLM call spans: model identity (the exact returned model string), provider, request parameters (temperature, max_tokens, response_format), input and output token counts, finish reason, and latency split into time-to-first-token and total time. For agent systems, add tool call spans, retrieval spans, workflow spans, and guardrail spans. Each one links to its parent in the trace: a tool call to the LLM call that requested it, and everything else to the workflow span that wraps the interaction.
Why use OpenTelemetry GenAI semantic conventions instead of a vendor SDK?
Vendor schemas lock you in. The OpenTelemetry GenAI semantic conventions define a standard attribute namespace (gen_ai.*) for LLM operations, agent spans, MCP tool spans, and content events. Most LLM observability vendors now ingest OTel-formatted spans, so you can run multiple backends from one instrumentation and change vendors without rewriting dashboards. Even though the spec is still maturing, lock-in is the more expensive bet.
Should I store full prompts and completions in span attributes?
No. Attributes are indexed, size-limited, and exposed to anyone with read access to your tracing backend. The OpenTelemetry GenAI conventions explicitly recommend storing prompt and completion content in span events instead. Events can be sampled, filtered, or redacted at the OpenTelemetry Collector without changing application code — which is what makes the system safe for regulated workloads.
How do I sample LLM traces at production scale?
Capture 100% of structural spans (attributes only), 100% of errors and tail-latency traces with content, 5–10% of successful traces with content, and 0% content for traces matching PII or regulated-data signatures (redacted at the Collector). The structural layer always ships so cost and latency analytics are complete; the content layer is sampled because storage and compliance, not capture, are the binding constraint.
How does LLM tracing connect to evals and prompt rollback?
Tracing is the foundation. Production traces — sampled, sanitized, graded — become the inputs to your evals regression suite. Span attributes for prompt version and deployment ID let you isolate which change caused a quality regression and roll back in minutes instead of hours. Without tracing, evals are synthetic guesses and rollback is an archaeology project.