The CFO opened the meeting with a screenshot of the OpenAI invoice. The number had tripled in eight weeks. Then she asked the question every engineering leader dreads: “Which customers are responsible for this?” The room was quiet. Not because nobody cared — because nobody could tell. The product had been instrumented for latency and errors, not for who, what, and why every token had been spent.
This is the moment cost attribution stops being a “nice to have” and starts being the blocker between a working AI product and a viable AI business. You cannot price what you cannot measure. You cannot optimize what you cannot trace. And in multi-tenant SaaS — where one heavy customer can subsidize their plan with another customer’s margin — flying blind on cost is how a healthy-looking ARR line meets a collapsing gross margin.
This guide is for engineering and product teams whose LLM costs have crossed the threshold where “look at the invoice” no longer answers any useful question. It covers the dimensions that matter (user, feature, tenant, model, route), the four-layer token accounting model that has become best practice, the metering pipeline that captures it, and how attribution data flows into pricing, rate limiting, and unit economics. It sits inside the larger question of why AI experiments fail in production — and it is one of the clearest signals separating demos from systems that ship.
What “Cost Attribution” Actually Means
Cost attribution is not the line item on the invoice. The provider tells you how many tokens you bought, by model and (sometimes) by API key. That’s cost reporting. Attribution is the harder problem: mapping that spend to the dimensions your business actually runs on — customers, features, workflows, teams, agent runs — so that pricing, capacity, and optimization decisions can be made with evidence.
The minimum dimensions a production attribution system should answer for any window of time:
- Per user: which end-users drove which spend, especially the long tail of heavy users
- Per feature: which product surfaces (chat, summarization, agent runs, batch jobs) consumed the budget
- Per tenant / customer: in multi-tenant SaaS, which accounts are profitable and which are not
- Per model and per route: which model choices are paying off and which are overkill for their task
- Per agent run / workflow: which long-running tasks ate disproportionate tokens
- Per layer: prompt, tool, memory / context, and response tokens, reported separately
Braintrust’s 2026 playbook on tracking LLM costs and CloudZero’s customer-cost framing converge on the same point: cost data without dimensions is reporting; cost data with dimensions is leverage.
Cost Per Customer Is the Margin Question
In subscription SaaS, average cost per customer is a lie. Distributions are skewed: a small fraction of users typically drives a large fraction of LLM spend. Without per-customer attribution, you can’t tell whether your AI feature has 80% gross margin (great) or whether five outlier customers are eating the spread. Most teams discover the shape of this curve only after they instrument it.
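As a rough illustration of why the average misleads, the sketch below sums cost per customer and then asks what share of total spend the top accounts drive. The customer names and dollar figures are invented for the example, and `spend_by_customer` and `top_share` are hypothetical helper names.

```python
from collections import defaultdict

def spend_by_customer(calls):
    """Sum cost per customer from (customer_id, cost_usd) pairs."""
    totals = defaultdict(float)
    for customer_id, cost_usd in calls:
        totals[customer_id] += cost_usd
    return dict(totals)

def top_share(totals, top_n):
    """Fraction of total spend driven by the top_n most expensive customers."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return 0.0
    top = sorted(totals.values(), reverse=True)[:top_n]
    return sum(top) / grand_total

# Invented data: one heavy account and four light ones.
calls = [
    ("acme", 40.0), ("acme", 40.0),
    ("b", 5.0), ("c", 5.0), ("d", 5.0), ("e", 5.0),
]
totals = spend_by_customer(calls)
average = sum(totals.values()) / len(totals)
share = top_share(totals, top_n=1)
```

With this data the average cost per customer is 20 dollars, yet a single account drives 80% of the spend and the other four cost 5 dollars each. The average describes none of them.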
The Four-Layer Token Accounting Model
The most useful framing to emerge in the last year is that the “input tokens / output tokens” buckets the provider exposes are too coarse to optimize against. Every model call actually consumes four distinct layers, each with different cost behavior and different optimization levers.
The categorization, popularized in SoftwareSeni’s multi-tenant cost governance writeup and echoed across the cost-tracking literature:
- Prompt tokens — the system prompt, instructions, and any static scaffolding. Mostly fixed per route. Compressible via prompt versioning and template optimization.
- Tool tokens — tool definitions exposed to the model plus tool call outputs that re-enter the context. Often the silent dominant cost in agentic workloads.
- Memory / context tokens — retrieved chunks, memories, prior turns, scratchpad contents. The variable that scales with conversation length and retrieval depth.
- Response tokens — what the model generates. Output tokens are usually priced higher than input but make up the smallest layer in most workloads.
Why this matters: each layer responds to different interventions. Prompt tokens shrink with prompt versioning discipline. Tool tokens shrink with tighter tool exposure and tool output compression. Memory and context tokens shrink with the context management framework — write, select, compress, isolate. Response tokens are mostly a function of task design.
Bucketing all four into “input tokens” hides where the spend lives and prevents targeted optimization. A production-grade attribution system reports them separately.
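Providers report a single input-token total, so the per-layer split has to be derived on your side. One possible approach, sketched below under that assumption, is to count each segment locally before the call and then scale those counts so they sum to the provider-reported input total. The function names, the sample counts, and the per-million-token prices are all hypothetical.

```python
def apportion(estimates, reported_input):
    """Scale locally counted segment sizes so they sum to the provider total."""
    total = sum(estimates.values())
    if total == 0:
        return {name: 0 for name in estimates}
    scaled = {name: count * reported_input / total for name, count in estimates.items()}
    result = {name: int(value) for name, value in scaled.items()}
    # Hand any leftover tokens to the layers with the largest fractional parts.
    leftover = reported_input - sum(result.values())
    by_fraction = sorted(scaled, key=lambda name: scaled[name] - result[name], reverse=True)
    for name in by_fraction[:leftover]:
        result[name] += 1
    return result

def layer_costs(input_layers, response_tokens, input_price, output_price):
    """Cost in USD per layer; prices are USD per million tokens."""
    costs = {name: tokens * input_price / 1_000_000 for name, tokens in input_layers.items()}
    costs["response"] = response_tokens * output_price / 1_000_000
    return costs

# Local estimates for one call, then the totals the provider actually reported.
estimates = {"prompt": 300, "tool": 500, "memory": 200}
split = apportion(estimates, reported_input=1100)
costs = layer_costs(split, response_tokens=100, input_price=3.0, output_price=15.0)
```

In this made-up call the tool layer is half of the input, which is the kind of finding the input/output split alone would never surface.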
What to Tag, and Where
Cost attribution depends entirely on what you tag at request time. Once a call has hit the provider without a tenant ID attached, the attribution is gone — and rebuilding it later from billing reports is at best a sampling exercise.
The minimum tags that should land on every span:
- user_id and customer_id / tenant_id (both; they’re different in B2B SaaS)
- feature or route (the product surface that triggered the call)
- agent_run_id and workflow_step (for multi-step agents)
- model and model_version
- environment (prod / staging; you do not want them mixed in cost reports)
- request_origin (user-triggered, scheduled, system-initiated)
- A correlation ID that ties this call to the upstream user action
These tags belong on the span at creation time, not bolted on later. The discipline is: tag at the harness layer (the wrapper around your LLM SDK call), not in each feature’s code. If a feature can call the model without going through the harness, your attribution is already incomplete.
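A minimal sketch of that harness discipline follows. It assumes a hypothetical `provider_call` callable standing in for your SDK client, and the exact set of required tags is an illustrative choice rather than a fixed list. The point it shows is that the harness refuses to make a call it cannot attribute.

```python
REQUIRED_TAGS = (
    "user_id", "tenant_id", "feature", "model",
    "environment", "request_origin", "correlation_id",
)

class MissingTagsError(ValueError):
    """Raised when a call reaches the harness without full attribution tags."""

def metered_call(provider_call, prompt, tags):
    """Run one model call and return (response, span).

    provider_call is a hypothetical stand-in for the SDK client; it must
    return a dict carrying the provider-reported token counts.
    """
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing:
        raise MissingTagsError(f"missing attribution tags: {missing}")
    response = provider_call(model=tags["model"], prompt=prompt)
    span = dict(tags)
    span["input_tokens"] = response["input_tokens"]
    span["output_tokens"] = response["output_tokens"]
    return response, span

def fake_provider(model, prompt):
    # Stand-in for a real SDK call; counts words instead of tokens.
    return {"text": "ok", "input_tokens": len(prompt.split()), "output_tokens": 1}

tags = {
    "user_id": "u-17", "tenant_id": "acme", "feature": "summarization",
    "model": "example-model", "environment": "prod",
    "request_origin": "user", "correlation_id": "req-001",
}
response, span = metered_call(fake_provider, "summarize this ticket", tags)
```

Because features receive only `metered_call`, never the raw client, an untagged call fails loudly in development instead of silently disappearing from the cost report.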
OpenTelemetry GenAI Conventions: The Standard You Should Use
The good news for teams starting today: the standardization problem is largely solved. The OpenTelemetry GenAI semantic conventions define a vendor-neutral schema for spans, metrics, and attributes covering token usage, model identity, operation duration, and tool calls.
Key conventions worth knowing:
- gen_ai.usage.input_tokens and gen_ai.usage.output_tokens: token counts on every span
- gen_ai.request.model and gen_ai.response.model: what was requested and what was actually billed
- gen_ai.operation.name: the operation type (chat, embeddings, tool call)
- gen_ai.provider.name: Anthropic, OpenAI, Google, etc.
- gen_ai.client.token.usage: the metric for aggregated token consumption
- gen_ai.client.operation.duration: for latency-cost correlation
The spec explicitly addresses billable tokens: when the provider reports both used and billable tokens, instrumentation MUST report billable. This matters for prompt caching (where input tokens are cheaper than they appear) and for reasoning models (where reasoning tokens are billed but invisible to the application).
Datadog, Honeycomb, OneUptime, Traceloop, Langfuse, Helicone, and the broader observability ecosystem have aligned on these conventions. Building your metering on top of OTel GenAI means you can swap vendors without re-instrumenting — which matters because the cost-tracking tools space is consolidating faster than the prices it’s tracking. This is the same instrumentation layer that powers LLM tracing in production; cost attribution and tracing are two views of the same data.
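To make the attribute names concrete, here is a small sketch that builds the attribute set for one call as a plain dictionary, so it runs without any OpenTelemetry dependency. In a real setup these key/value pairs would be attached to a span through your OTel SDK. The `genai_attributes` helper, the model names, and the `app.` prefix for business dimensions are assumptions for illustration; the GenAI conventions do not define tenant or feature attributes.

```python
def genai_attributes(provider, operation, request_model, response_model,
                     input_tokens, output_tokens, business_tags):
    """Build span attributes using OTel GenAI names plus custom business tags."""
    attrs = {
        "gen_ai.provider.name": provider,
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": request_model,
        "gen_ai.response.model": response_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    # Business dimensions are outside the GenAI conventions; the "app."
    # namespace is an assumption made for this example, not a standard.
    for key, value in business_tags.items():
        attrs[f"app.{key}"] = value
    return attrs

attrs = genai_attributes(
    provider="openai",
    operation="chat",
    request_model="example-model",
    response_model="example-model-2026-01",
    input_tokens=1100,
    output_tokens=100,
    business_tags={"tenant_id": "acme", "feature": "summarization"},
)
```

Keeping the standard keys untouched and namespacing your own tags separately is what preserves portability: a new backend understands the `gen_ai.*` half immediately, and only the business half needs mapping.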
The Metering Pipeline: From Span to Bill
The architecture that has settled into production looks like a four-stage pipeline. Each stage has a different concern and a different reliability profile.
Stage 1 — Capture. A thin harness around the model SDK creates a span on every call, attaches the tags, records the token counts (from the provider response, never estimated), and emits via OTel. Reliability requirement: this must not fail the user request. Cost telemetry that can take down the product is worse than no cost telemetry.
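One way to honor the "must not fail the user request" requirement is to treat span export as best-effort, as in the sketch below. The `emit` and `call_model` callables are hypothetical placeholders for your exporter and your harness, and a production version would likely also buffer or retry rather than simply drop.

```python
import logging

logger = logging.getLogger("llm_metering")

def emit_safely(emit, span):
    """Try to export a cost span; never let a telemetry failure propagate."""
    try:
        emit(span)
        return True
    except Exception:
        # Telemetry is best-effort: log and drop rather than fail the request.
        logger.warning("dropped cost span", exc_info=True)
        return False

def handle_request(call_model, emit, prompt):
    """Return the model's answer even if the exporter is down."""
    answer, span = call_model(prompt)
    emit_safely(emit, span)
    return answer

def broken_exporter(span):
    raise ConnectionError("collector unreachable")

def fake_model(prompt):
    # Stand-in for the harness; returns an answer and its span.
    return "ok", {"input_tokens": 3, "output_tokens": 1}

answer = handle_request(fake_model, broken_exporter, "hello")
```

The user still gets their answer when the collector is unreachable. The cost of that choice is a gap in attribution, which is why dropped spans should be counted and alerted on.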
Stage 2 — Enrich. Spans flow into a stream processor that joins user/tenant lookups, computes cost in your accounting currency from a versioned price table, and assigns spans to billing dimensions. Provider prices change, so each span should be costed at the rate in effect when the call was made; versioning the price table is how you reconcile retroactively when a vendor changes rates.
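A versioned price table can be as simple as a list of effective dates per model. The sketch below assumes prices are quoted in USD per million tokens; the model name, dates, and rates are invented, and `price_on` and `span_cost` are hypothetical helpers.

```python
from bisect import bisect_right
from datetime import date

# Each model maps to price versions sorted by effective date.
PRICE_TABLE = {
    "example-model": [
        (date(2025, 1, 1), {"input": 3.00, "output": 15.00}),
        (date(2025, 7, 1), {"input": 2.00, "output": 10.00}),
    ],
}

def price_on(model, day):
    """Return the price version in effect for a model on a given day."""
    versions = PRICE_TABLE[model]
    index = bisect_right([start for start, _ in versions], day) - 1
    if index < 0:
        raise LookupError(f"no price for {model} on {day}")
    return versions[index][1]

def span_cost(model, day, input_tokens, output_tokens):
    """Cost of one span in USD at the rate in effect on the day of the call."""
    price = price_on(model, day)
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

before = span_cost("example-model", date(2025, 3, 1), 1_000_000, 100_000)
after = span_cost("example-model", date(2025, 8, 1), 1_000_000, 100_000)
```

The same token counts cost 4.50 dollars in March and 3.00 dollars in August here. Because the lookup is keyed on the call date, re-running enrichment over old spans reproduces the historical cost rather than rewriting it at today's rates.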
Stage 3 — Aggregate. Enriched spans land in an analytics store (a columnar warehouse like Snowflake or BigQuery, or a purpose-built observability backend like Helicone or Langfuse). Aggregations roll up to dashboards by user, tenant, feature, model, and time.
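The roll-up itself is ordinary grouped aggregation. As a stand-in for a real warehouse, the sketch below uses an in-memory SQLite database; the table layout, column names, and rows are assumptions made for the example.

```python
import sqlite3

# Enriched spans: one row per model call, already costed in USD.
rows = [
    ("2026-01-05", "acme", "agent_run", "example-large", 1.25),
    ("2026-01-05", "acme", "agent_run", "example-large", 0.50),
    ("2026-01-05", "acme", "chat", "example-small", 0.25),
    ("2026-01-05", "globex", "chat", "example-small", 0.125),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans (day TEXT, tenant_id TEXT, feature TEXT, model TEXT, cost_usd REAL)"
)
conn.executemany("INSERT INTO spans VALUES (?, ?, ?, ?, ?)", rows)

# Daily spend per tenant and feature, most expensive first.
query = """
    SELECT day, tenant_id, feature, SUM(cost_usd) AS cost
    FROM spans
    GROUP BY day, tenant_id, feature
    ORDER BY cost DESC
"""
rollup = conn.execute(query).fetchall()
conn.close()
```

Swapping the `GROUP BY` columns gives the per-model or per-user view from the same rows, which is the practical payoff of tagging every span at capture time.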
Stage 4 — Act. Attribution data feeds three downstream consumers: dashboards (for FinOps and product visibility), rate limiting and token quota enforcement (to enforce per-tenant ceilings), and the billing system (when AI usage is metered to customers).
The mistake to avoid: trying to do attribution from billing reports rather than from spans. Provider invoices are aggregated, delayed, and untagged with your business dimensions. They are useful for reconciliation; they cannot be the source of truth for attribution.
Per-Tenant Cost Governance in Multi-Tenant SaaS
Multi-tenant LLM products have a margin problem that single-tenant systems don’t: tenant cost distribution is skewed and adversarial usage exists. The patterns that hold up:
Daily spend caps per tenant, with automated rate-limit tightening on threshold approach. Caps are not the same as hard cutoffs — most production systems tighten quotas progressively, so a heavy tenant’s experience degrades before it disappears.
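Progressive tightening can be expressed as a step function from cap usage to a rate-limit multiplier. The thresholds and multipliers below are illustrative assumptions, not recommended values, and `quota_multiplier` is a hypothetical helper.

```python
def quota_multiplier(spend_today, daily_cap):
    """Map a tenant's cap usage to a multiplier on their normal rate limit."""
    if daily_cap <= 0:
        raise ValueError("daily_cap must be positive")
    used = spend_today / daily_cap
    if used < 0.8:
        return 1.0   # normal service
    if used < 0.9:
        return 0.5   # first tightening step
    if used < 1.0:
        return 0.25  # approaching the cap
    return 0.1       # over the cap: heavily throttled, not cut off

steps = [quota_multiplier(spend, 100.0) for spend in (50.0, 85.0, 95.0, 120.0)]
```

The floor at 0.1 rather than zero reflects the degrade-before-disappear idea: a tenant over their cap slows down sharply but is not locked out without warning.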
Per-tenant baselines with anomaly alerts. A tenant whose daily spend deviates significantly from their 7-day rolling baseline is either growing or being abused. Either case deserves a human review.
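A baseline check of this kind might look like the sketch below. It compares today's spend to the mean of the previous seven days and flags a deviation in either direction. The factor-of-two threshold is an assumption chosen for illustration; what counts as significant depends on how noisy each tenant's spend is.

```python
from statistics import mean

def is_anomalous(history, today, window=7, threshold=2.0):
    """Flag spend that is far above or far below the rolling baseline.

    history is a list of prior daily spend values, oldest first.
    """
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough data to form a baseline yet
    baseline = mean(recent)
    if baseline == 0:
        return today > 0
    ratio = today / baseline
    return ratio >= threshold or ratio <= 1 / threshold

week = [10.0] * 7
spike = is_anomalous(week, 25.0)
normal = is_anomalous(week, 11.0)
drop = is_anomalous(week, 4.0)
```

Flagging drops as well as spikes is deliberate in this sketch: a tenant whose spend collapses may have hit an integration failure or be churning, which is as worth knowing as a surge.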
Cost-to-revenue ratio at the tenant level. The right metric is not “what did this tenant cost” — it’s “what is the gross margin on this tenant’s revenue?” When that number turns red, the product, pricing, or workflow needs to change.
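The arithmetic is simple once cost is attributed per tenant. A minimal sketch, with invented figures and a hypothetical `tenant_gross_margin` helper that also accepts non-LLM cost of goods sold:

```python
def tenant_gross_margin(revenue_usd, llm_cost_usd, other_cogs_usd=0.0):
    """Gross margin as a fraction of revenue, or None when there is no revenue."""
    if revenue_usd <= 0:
        return None
    return (revenue_usd - llm_cost_usd - other_cogs_usd) / revenue_usd

healthy = tenant_gross_margin(500.0, 100.0)    # 0.8, an 80% margin
underwater = tenant_gross_margin(99.0, 240.0)  # negative: costs exceed revenue
```

A tenant paying 99 dollars a month while consuming 240 dollars of tokens has a negative margin, and no amount of aggregate reporting will show it until the tenant-level number is computed.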
Pricing surface honesty. If your AI features are sold as “unlimited,” cost attribution will eventually force a conversation. The teams that handle this well move to usage-based pricing or fair-use tiers before the margin compression bites; the teams that don’t end up subsidizing the heaviest users with everyone else’s spend.
Opsmeter’s per-tenant margin operating model makes the broader case well: token attribution is not an observability problem, it’s a unit economics problem.
This is also where attribution intersects multi-tenant AI application architecture. The same tenant boundary that protects data isolation should be the one enforcing cost isolation. If they aren’t the same boundary, one of them is leaking.
Closing the Cost Attribution Gap?
The path from invoice to insight is metering, tagging, and a pipeline that connects token spend to unit economics. metacto helps engineering teams stand up production cost attribution that holds up at scale — and turns it into a pricing and optimization lever, not just a dashboard.
Connecting Attribution to Optimization
Attribution data is only valuable if it changes what gets built. The optimization paths it unlocks, roughly in order of typical impact:
Right-sized model routing. Once you can see cost per feature, the “we’re using GPT-class for everything” pattern becomes indefensible. Cheap models handle simple intents; expensive models are reserved for cases where the quality delta justifies the cost delta. See LLM routing in production for the routing patterns.
Caching where it matters. Cost attribution reveals which routes have high prompt overlap (and therefore high cache-hit potential). Prompt and semantic caching strategies are evaluated against cost-per-call, not theoretical hit rates.
Prompt and context discipline. When you can see the four-layer breakdown, the bloated system prompt or the over-eager retrieval that’s burning the budget becomes visible. The fix lives in prompt versioning and context management.
Memory ROI. Are your agent memory writes reducing total spend (by avoiding re-explanation) or increasing it (by inflating prompts)? Per-feature attribution is the only way to know.
Tool tax reduction. Tool tokens are the silent budget killer in agentic workloads. Attribution that breaks out tool tokens shows which tools are paying their way and which are dead weight.
The compounding effect of these optimizations is usually larger than the team predicts. Teams that instrument cost properly commonly find 30 to 50% reduction opportunities once the four-layer view is visible.
A Pragmatic Implementation Sequence
For teams whose LLM costs have outgrown their visibility:
- Add the harness. Wrap every LLM call through a single client that attaches user, tenant, feature, and model_version tags. Reject calls that bypass it.
- Adopt OTel GenAI conventions. Use the standard attribute names. You’ll thank yourself when you swap observability tools.
- Stand up the four-layer breakdown. Track prompt, tool, memory, and response tokens separately. This is where the optimization decisions live.
- Build the per-tenant dashboard. Daily spend, baseline deviation, cost-to-revenue ratio. Make it visible to engineering and finance both.
- Wire spend caps and quotas. Attribution without enforcement is theater. Per-tenant caps are how you prevent the outlier from becoming the incident.
- Close the loop to optimization. Use attribution data to drive routing, caching, prompt tightening, and memory decisions. Re-measure after every change.
The teams that ship cost attribution well treat it as production infrastructure: instrumented at the harness, standardized on OTel, and consumed by both dashboards and enforcement. The teams that don’t are left with a screenshot of last month’s invoice and a hope that this month is better, which is, in practice, how shelfware happens on the cost dimension.
If you’re standing this up across multiple products or multiple tenants, the architectural decisions compound. Our Operational AI solutions include exactly this layer: turning the spend question from “what was the bill?” into “what is the unit economics of this product?”
Frequently Asked Questions About LLM Cost Attribution
What is LLM cost attribution?
Cost attribution is the practice of mapping LLM spend to the dimensions your business runs on — users, features, tenants, models, routes, and agent runs — rather than just the aggregated number on the provider invoice. It captures token usage at the request level with business-relevant tags so you can answer 'which customers, features, or workflows are driving this spend?' The provider tells you the bill; attribution tells you why.
How do I track LLM cost per user in a multi-tenant SaaS application?
Tag every LLM request at creation time with user_id and tenant_id at the harness layer (the wrapper around your model SDK calls), not in each feature's code. Use OpenTelemetry GenAI semantic conventions so the attributes are standardized. Pipe spans through a stream that joins user lookups, computes cost from a versioned price table, and aggregates into per-user and per-tenant dashboards. Without tags applied at the request, no later report can reconstruct the attribution.
What is the four-layer token accounting model?
Four-layer accounting separates token spend into prompt tokens (system instructions and scaffolding), tool tokens (tool definitions and tool call outputs), memory and context tokens (retrieved chunks, memories, prior turns), and response tokens (model generations). Each layer responds to different optimization levers — prompt versioning, tool exposure tightening, context management, task design. The traditional input/output split hides where the spend lives and prevents targeted optimization.
How do OpenTelemetry GenAI conventions help with LLM cost tracking?
The OTel GenAI semantic conventions define standard attribute names like gen_ai.usage.input_tokens, gen_ai.request.model, gen_ai.operation.name, and gen_ai.provider.name that the observability ecosystem (Datadog, Honeycomb, Langfuse, Helicone, Traceloop) has aligned on. Instrumenting against the standard means cost data is portable across vendors. The spec also requires reporting billable tokens when providers expose them, which matters for prompt caching and reasoning models.
How do I prevent runaway LLM costs in multi-tenant products?
Combine attribution with enforcement: per-tenant daily spend caps with progressive rate-limit tightening as thresholds approach, anomaly alerts on per-tenant baseline deviation, and cost-to-revenue ratio monitoring at the tenant level. Attribution without enforcement is theater. The same tenant boundary that protects data isolation should enforce cost isolation; if they're not the same boundary, one of them is leaking.
What tools can I use for LLM cost attribution in production?
The cost-tracking ecosystem in 2026 includes Helicone, Langfuse, Traceloop, OpenMeter, Portkey, and Vantage on the observability side, with CloudZero and similar platforms handling the FinOps roll-up to business dimensions. Datadog and other APM vendors have added native LLM cost views. The common substrate is the OpenTelemetry GenAI semantic conventions; building on the standard lets you swap vendors without re-instrumenting.
How does cost attribution connect to AI SaaS pricing?
You cannot price what you cannot measure. Cost attribution surfaces the distribution of cost per customer — which is typically skewed, with a small fraction of users driving a large fraction of spend. That data drives the conversation about usage-based pricing, fair-use tiers, and whether features sold as 'unlimited' have a viable margin profile. Teams that instrument before they price avoid the painful retrofit when margin compression bites.