LLM Rate Limiting: Token Quotas and Cost Control for Production Systems

LLM rate limiting is two problems disguised as one. A guide to upstream provider TPM/RPM limits, internal per-tenant token quotas, request vs token-based throttling, and the AI gateway layer that mediates between them.

By Garrett Fritz Partner & CTO

The product had been live for a week. The team had been carefully watching cost: daily charts, Slack alerts on the OpenAI spend dashboard, the whole routine. Then on Thursday morning, half the user base started seeing 500 errors. The agent endpoints were returning “Upstream provider rate limit exceeded.” Cost was fine. What had happened?

A single customer had wired their internal automation to the product’s API and was hammering it. They were not abusive; they had simply written a loop that fired a request every 200ms. Their account was nowhere near its monthly token budget. But the team’s OpenAI TPM ceiling was a single number shared by all of the organization’s traffic, and this one customer’s burst had eaten it all. Every other customer was now waiting for a quota window the team did not control.

This is the moment teams discover that LLM rate limiting is not one problem. It is two problems wearing the same name.

Upstream provider limits (OpenAI’s TPM and RPM, Anthropic’s RPM/ITPM/OTPM, Gemini’s quotas) are caps on what your API key can pull from the provider. They protect the provider’s infrastructure. They are fixed by your tier and your relationship; you negotiate them, but you do not design them.

Internal per-tenant quotas are caps your application enforces on what each customer can consume. They protect your unit economics and your fairness story. They are entirely your design choice.

Most teams build the first by accident (because the provider’s 429 errors force them to) and forget the second until it costs them a customer or a runaway bill. This guide covers both, and the AI gateway layer that increasingly mediates between them. It is part of the broader question of why your AI experiments are failing — rate limiting is one of the layers of the system underneath the chat box that quietly decides whether your product survives a traffic spike.

The Two Rate Limit Problems

A clean way to keep them straight:

| | Upstream provider limits | Internal per-tenant quotas |
|---|---|---|
| Who sets it | OpenAI, Anthropic, Google | You |
| What it protects | Provider infrastructure | Your unit economics, your fairness |
| What it caps | All traffic from your API key | Each tenant’s consumption |
| Granularity | Per organization or per key | Per tenant, per feature, per user |
| Time window | Per minute (and per day) | Per minute, hour, day, month |
| Failure mode | 429 from provider | Your own 429 (or downgrade) |
| Design decision | Choose tier, request increases | Choose budget shape, enforcement |

If you only solve one, you will hit the other in production. The first will surface as cascading 429s during traffic spikes. The second will surface as a single tenant exhausting capacity for everyone, or as a finance team asking why one customer paid $200 and consumed $2,000 of inference.

Upstream Provider Limits: What the Big Three Actually Enforce

The cap shape varies meaningfully across providers, which matters when you are routing across them. Numbers below are illustrative of early-2026 tier limits per public docs and recent comparisons; treat your provider dashboard as the source of truth.

OpenAI

OpenAI uses tier-based limits keyed to RPM (requests per minute) and TPM (tokens per minute), scaling with cumulative spend and account age. As OpenAI’s rate limits documentation describes, tiers upgrade automatically as usage grows, but exact limits vary by model. As of early 2026, GPT-5 Tier 1 typically offers around 500K TPM and roughly 1,000 RPM, with Tier 5 (reachable after roughly $1,000 of spend over 30+ days) offering 10,000 RPM and millions of TPM.

The headline rule: in practice TPM is the limit that matters most for production, because real workloads usually hit token ceilings before request ceilings.

Anthropic

Anthropic uses a spend-based tier system with limits measured separately as RPM, input tokens per minute (ITPM), and output tokens per minute (OTPM). A single combined “TPM” number can be misleading — the per-direction split means input-heavy workloads (long-context RAG) and output-heavy workloads (generation) hit different ceilings. According to recent comparisons, Tier 1 starts at roughly 40K TPM for Opus, 80K for Sonnet, and 100K for Haiku; Tier 4 scales to 2M TPM for Sonnet and Haiku and 1M for Opus.

If you mostly call Anthropic for long-context retrieval, your ITPM is the bottleneck. If you mostly use it for code generation, your OTPM is. Plan capacity against the direction that matters for your workload, not a single rolled-up number.

Google Gemini

Gemini’s free tier is generous for prototyping; the paid tier scales by project and by model. Quotas are visible in the Cloud console and can be requested upward. Like the others, the limit you actually hit depends on workload shape.

What this means for your architecture

Three implications worth designing for:

  • The provider limit is one number across your whole traffic. If you serve multiple tenants on one API key, they compete for the same window. Allocating headroom across tenants is a design problem you own, not one the provider solves.
  • Tier upgrades take time. You cannot resolve a Friday-afternoon spike by emailing OpenAI; the tier rises with sustained spend, not on demand. Design for the limits you have now, not the ones you wish you had.
  • The per-direction split (Anthropic) and the model-specific tiers complicate single-knob thinking. Your effective capacity is a vector, not a scalar.
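
To make the "vector, not a scalar" point concrete, here is a minimal sketch that works out which limit binds first for a given workload shape. The `ProviderLimits` type, the `binding_limit` helper, and all of the numbers are illustrative assumptions, not any provider's real tier values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderLimits:
    rpm: int   # requests per minute
    itpm: int  # input tokens per minute
    otpm: int  # output tokens per minute


def binding_limit(limits: ProviderLimits, avg_input: int, avg_output: int) -> tuple[str, float]:
    """Return the limit that is hit first and the sustainable requests per minute."""
    ceilings = {
        "rpm": float(limits.rpm),
        "itpm": limits.itpm / avg_input,    # requests/min the input budget allows
        "otpm": limits.otpm / avg_output,   # requests/min the output budget allows
    }
    name = min(ceilings, key=ceilings.get)
    return name, ceilings[name]


# Made-up limits for illustration only.
limits = ProviderLimits(rpm=1000, itpm=400_000, otpm=80_000)
rag = binding_limit(limits, avg_input=20_000, avg_output=500)       # long-context retrieval
codegen = binding_limit(limits, avg_input=1_000, avg_output=4_000)  # generation-heavy
print(rag, codegen)
```

With these made-up numbers both workloads sustain about 20 requests per minute, but for different reasons: the retrieval workload is capped by input tokens and the generation workload by output tokens.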

Internal Per-Tenant Quotas: The Budget You Actually Control

If upstream limits cap what your whole system can draw from the provider, per-tenant quotas cap what each tenant can draw from your system. They serve four purposes simultaneously:

  1. Cost control. Each tenant’s spend is bounded by a number you set, not by their imagination.
  2. Fairness. No tenant can starve others on shared provider capacity. This is critical in multi-tenant architectures — see our guide to multi-tenant AI applications.
  3. Plan enforcement. Free, pro, and enterprise tiers have meaningfully different ceilings, and the ceiling is the product feature you sell.
  4. Abuse and runaway protection. A buggy customer integration or a stuck agent loop hits a wall instead of consuming your entire budget.

The mistake is treating per-tenant quotas as one knob. They are at least three:

  • Token budget. Total tokens (input + output) per time window. The natural unit of cost.
  • Request rate. Requests per minute. Protects you from request floods even when tokens are within budget.
  • Concurrency cap. Simultaneous in-flight calls per tenant. Protects you from one tenant eating your connection pool.

Tenants can be within their token budget and still take down the system via concurrency. They can be within concurrency caps and still blow the monthly budget. You need all three.
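
A minimal sketch of an admission check that looks at all three knobs before a call is allowed through. The `TenantQuota` and `TenantUsage` types and the `admit` function are hypothetical names for illustration; a real gateway would keep the usage counters in a shared store rather than in process memory.

```python
from dataclasses import dataclass


@dataclass
class TenantQuota:
    tokens_per_minute: int
    requests_per_minute: int
    max_concurrency: int


@dataclass
class TenantUsage:
    tokens_this_minute: int = 0
    requests_this_minute: int = 0
    in_flight: int = 0


def admit(quota: TenantQuota, usage: TenantUsage, estimated_tokens: int) -> tuple[bool, str]:
    """Check every axis; return (allowed, reason) so the caller can report which cap was hit."""
    if usage.in_flight >= quota.max_concurrency:
        return False, "concurrency"
    if usage.requests_this_minute >= quota.requests_per_minute:
        return False, "request_rate"
    if usage.tokens_this_minute + estimated_tokens > quota.tokens_per_minute:
        return False, "token_budget"
    return True, "ok"


quota = TenantQuota(tokens_per_minute=50_000, requests_per_minute=60, max_concurrency=4)
# Well inside the token budget, but every concurrency slot is already taken.
print(admit(quota, TenantUsage(tokens_this_minute=1_000, in_flight=4), estimated_tokens=500))
```

Returning the reason alongside the decision is a deliberate choice here: it lets the gateway tell the tenant which of the three limits they hit.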

Request-Based vs Token-Based Limiting

A common engineering mistake is using only request-based limiting because it is the default everywhere. Express middleware, API gateways, and standard rate-limiter libraries all count requests. LLM costs do not.

Request-based limiting caps requests per time window. Easy to implement, easy to reason about, available in every gateway. It fails the LLM use case because requests are not the cost unit. A single 100K-token RAG query costs roughly as much as 1,000 short chats of about 100 tokens each. Capping requests means the small chats hit the wall while the expensive query sails through.

Token-based limiting caps tokens per time window. Closer to actual cost. Harder to implement because you do not know the output token count until the response completes. You estimate or reserve before the call and reconcile after.

The pragmatic pattern: both, with token-based as the primary enforcement and request-based as a guardrail. Token budget answers “how much did this tenant spend?” Request rate answers “is this tenant trying to overwhelm the gateway?” You need both signals; they catch different failure modes.

Implementation notes that hold up in production:

  • Reserve before, reconcile after. Before sending the request, debit the budget by your estimated token cost (typically prompt tokens + a conservative output estimate). After the response arrives, reconcile to actual usage. This prevents over-commitment under concurrency.
  • Sliding windows beat fixed buckets. Fixed-bucket rate limiting lets a tenant spend up to twice the limit around a window boundary (a full allowance at 11:59 and another at 12:00). Sliding windows smooth the burst.
  • Token estimates should be conservative. Underestimating output tokens during reservation lets tenants burst past their cap. Round up.
  • Separate input and output budgets if you serve Anthropic-style workloads. The provider already does; mirror the split internally where it matters.
  • Apply quotas at the gateway, not in each service. Per-service quotas are how the same tenant ends up double-budgeted in confusing ways.
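
The reserve-then-reconcile and sliding-window notes above can be sketched together as follows. `SlidingTokenBudget` is a hypothetical class, it is single-process and not thread-safe, and it assumes the caller sets a maximum output length so that the reservation is a true upper bound. A production version would typically keep this state in a shared store.

```python
import time
from collections import deque


class SlidingTokenBudget:
    def __init__(self, limit: int, window_seconds: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.entries = deque()  # [timestamp, tokens] pairs, oldest first

    def _used(self) -> int:
        """Drop entries older than the window, then sum what remains."""
        cutoff = self.clock() - self.window
        while self.entries and self.entries[0][0] <= cutoff:
            self.entries.popleft()
        return sum(tokens for _, tokens in self.entries)

    def reserve(self, prompt_tokens: int, max_output_tokens: int):
        """Debit the worst-case cost up front; return a handle, or None if over budget."""
        estimate = prompt_tokens + max_output_tokens
        if self._used() + estimate > self.limit:
            return None
        entry = [self.clock(), estimate]
        self.entries.append(entry)
        return entry

    def reconcile(self, entry, actual_tokens: int) -> None:
        """Replace the reserved estimate with actual usage once the response completes."""
        entry[1] = actual_tokens


budget = SlidingTokenBudget(limit=1_000)
handle = budget.reserve(prompt_tokens=200, max_output_tokens=500)  # holds 700 tokens
budget.reconcile(handle, actual_tokens=320)                        # frees the unused 380
```

Reserving the worst case means a tenant can briefly be refused while capacity is only held, not spent. That is the conservative side of the trade-off described above.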

The Refund Problem

A tenant whose request fails mid-stream (because the provider returned a 429, or because your own quota kicked in late) often has already consumed input tokens. Decide explicitly whether failed requests count against the budget. Refunding gives the tenant a better experience but creates an obvious abuse vector (fire 10x requests, get billed for 1x). Counting them is fairer but feels punitive on transient failures. Pick one, document it, and surface it to the tenant.
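
One way to make the decision explicit is to encode the policy as a single value the gateway consults when it settles a request. The sketch below is illustrative: `FailurePolicy` and `tokens_to_charge` are hypothetical names, and the third option is the middle ground described in the FAQ (count input tokens, refund output tokens).

```python
from enum import Enum


class FailurePolicy(Enum):
    COUNT_ALL = "count_all"                # failed requests are charged like successful ones
    REFUND_ALL = "refund_all"              # failed requests cost the tenant nothing
    COUNT_INPUT_ONLY = "count_input_only"  # charge input tokens, refund output tokens


def tokens_to_charge(policy: FailurePolicy, input_tokens: int,
                     output_tokens: int, failed: bool) -> int:
    """Return how many tokens to debit from the tenant's budget for one request."""
    if not failed or policy is FailurePolicy.COUNT_ALL:
        return input_tokens + output_tokens
    if policy is FailurePolicy.REFUND_ALL:
        return 0
    return input_tokens  # COUNT_INPUT_ONLY


# A request that failed mid-stream after 1,200 input and 300 output tokens.
for policy in FailurePolicy:
    print(policy.value, tokens_to_charge(policy, 1_200, 300, failed=True))
```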

The AI Gateway Pattern

By 2026, mature production AI systems have converged on a pattern: a dedicated AI gateway sits between application code and provider APIs, mediating rate limiting, quotas, routing, retries, caching, and observability. LiteLLM, Portkey, Kong AI Gateway, and AWS Bedrock’s gateway functionality all instantiate the pattern; many teams build their own thin version.

What the gateway is responsible for:

  • Per-tenant authentication and quota enforcement. The gateway is the only thing with the provider’s API key. Application code authenticates to the gateway with a tenant token; the gateway maps that to the provider call and enforces the tenant’s budget.
  • Provider abstraction. The same gateway endpoint can route to OpenAI, Anthropic, or a self-hosted model based on availability, cost, and policy. This is where routing logic lives; see LLM routing in production for the full design.
  • Retries with jittered backoff. Provider 429s are recoverable if you retry correctly. The gateway handles this so every service does not reimplement it.
  • Adaptive concurrency. When the gateway sees the upstream provider slowing down, it reduces in-flight concurrency before the provider returns 429s. This is the difference between graceful degradation and falling over.
  • Cost attribution. Every call is tagged with tenant, user, feature, and routed model, producing the data you need for billing and budgeting. The detail design is in LLM cost attribution per user, feature, and tenant.
  • Caching. Prompt-level and semantic caches live at the gateway, deflecting calls before they hit the provider — and before they consume your token quota.

The reason this pattern keeps winning is that all of these concerns share state. Rate limiting needs to know which tenant. Retries need to know which model and which budget. Cost attribution needs the same per-call context. Putting them in one place is cheaper than coordinating them across services.
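
As an illustration of the adaptive concurrency item above, here is a minimal sketch using additive increase and multiplicative decrease (AIMD). AIMD is one common control scheme chosen for this example, not something any particular gateway is known to use. The class name, thresholds, and defaults are assumptions.

```python
class AdaptiveConcurrency:
    """Grow the in-flight limit slowly while healthy; halve it on signs of overload."""

    def __init__(self, initial: int = 16, minimum: int = 1, maximum: int = 64,
                 latency_target_s: float = 5.0):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.latency_target_s = latency_target_s

    def on_response(self, latency_s: float, status: int) -> int:
        """Update the limit from one completed upstream call and return the new value."""
        overloaded = status == 429 or latency_s > self.latency_target_s
        if overloaded:
            self.limit = max(self.minimum, self.limit // 2)  # back off quickly
        else:
            self.limit = min(self.maximum, self.limit + 1)   # probe upward slowly
        return self.limit


controller = AdaptiveConcurrency(initial=16)
controller.on_response(latency_s=1.0, status=200)  # healthy: limit rises by one
controller.on_response(latency_s=9.0, status=200)  # slow response: limit is halved
```

Treating rising latency as an overload signal is what lets the gateway shed load before the provider starts returning 429s.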

Allocating Upstream Headroom Across Tenants

The hardest problem in multi-tenant LLM systems is sharing one upstream TPM ceiling across N tenants fairly. A few patterns that hold up:

Reserved + burstable. Each tenant gets a guaranteed slice of the upstream TPM (say, 10% of total) plus access to a shared burst pool when other tenants are quiet. This is how cloud providers think about IOPS. The reserved slice protects fairness; the burst pool keeps utilization high.
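
A rough sketch of how a reserved-plus-burstable split could be computed for one time window. The `allocate_tpm` function and the tenant names are hypothetical, and the single proportional pass over the burst pool is a simplification: a real allocator might iterate until the pool is fully used.

```python
def allocate_tpm(upstream_tpm: int, reserved: dict[str, int], demand: dict[str, int]) -> dict[str, int]:
    """Grant each tenant its demand up to the reserved slice, then share what is left."""
    grant = {t: min(demand[t], reserved[t]) for t in demand}
    pool = upstream_tpm - sum(grant.values())  # unreserved plus unused reserved capacity
    unmet = {t: demand[t] - grant[t] for t in demand if demand[t] > grant[t]}
    total_unmet = sum(unmet.values())
    for tenant, want in unmet.items():
        # Burst share is proportional to unmet demand and never exceeds it.
        grant[tenant] += min(want, pool * want // total_unmet)
    return grant


reserved = {"a": 10_000, "b": 10_000, "c": 10_000}
demand = {"a": 200_000, "b": 2_000, "c": 30_000}
print(allocate_tpm(100_000, reserved, demand))
```

In this example the quiet tenant gets everything it asked for, the two busy tenants keep their guaranteed slices, and the remaining capacity is split between them in proportion to what they still need.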

Priority queues. Enterprise tenants on the SLA tier get scheduled before self-serve tenants when contention exists. Implementable cleanly at the gateway by reading a tenant priority field and queueing differently.
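
A minimal sketch of priority scheduling at the gateway using Python's standard `heapq` module. The `PriorityRequestQueue` class and the convention that a lower number means higher priority are assumptions for illustration.

```python
import heapq
import itertools


class PriorityRequestQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def push(self, tenant_priority: int, request_id: str) -> None:
        """Queue a request; lower tenant_priority values are served first."""
        heapq.heappush(self._heap, (tenant_priority, next(self._counter), request_id))

    def pop(self) -> str:
        """Return the next request to send upstream."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityRequestQueue()
queue.push(2, "self-serve-1")
queue.push(0, "enterprise-1")
queue.push(2, "self-serve-2")
queue.push(0, "enterprise-2")
order = [queue.pop() for _ in range(len(queue))]
print(order)
```

Strict priority like this can starve low-priority tenants under sustained contention, so it is usually paired with a reserved slice or an aging rule.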

Backpressure and degradation. When approaching upstream limits, the gateway returns “model unavailable” only to lower-priority tenants, or routes them to a cheaper, less-congested model. This is better UX than 500s, and it signals to the tenant that they are above their normal envelope.

Spillover routing. When the primary provider is saturated, route to a backup provider with the same or comparable model. This is where the provider abstraction at the gateway pays for itself: spillover is a config change, not a code change. The cost-quality tradeoff is significant and worth thinking through.

Per-model partitioning. Different upstream limits apply to different models. If your Sonnet quota is full but Haiku is fine, downgrade selected workloads. Per-model awareness lives at the gateway.
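
Per-model awareness and spillover can both be expressed as walking a preference chain until a model with enough headroom is found. This is a sketch: `pick_model` is a hypothetical helper, and the model names are plain labels, not real API model identifiers.

```python
from typing import Optional


def pick_model(preference_chain: list[str], headroom_tpm: dict[str, int],
               needed_tokens: int) -> Optional[str]:
    """Return the first model in the chain with enough remaining tokens this minute."""
    for model in preference_chain:
        if headroom_tpm.get(model, 0) >= needed_tokens:
            return model
    return None  # nothing has room: queue the request or return a quota error


# Remaining upstream tokens this minute, per model (illustrative values).
headroom = {"sonnet": 0, "haiku": 50_000, "backup-provider-model": 200_000}
chain = ["sonnet", "haiku", "backup-provider-model"]
print(pick_model(chain, headroom, needed_tokens=2_000))
```

Keeping the chain in configuration rather than code is what makes spillover a config change.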

What Breaks in Production

A short list of rate-limit failures we see repeatedly, with the design responses.

Single hot tenant exhausts upstream limits. Without per-tenant quotas, one customer’s burst eats everyone’s capacity. Add per-tenant token budgets and concurrency caps before the first hot tenant shows up.

Estimated tokens diverge from actual. The gateway reserves 1K output tokens; the model returns 8K. Tenants exceed budgets the operator thought were enforced. Either reserve conservatively (round up estimates) or reconcile aggressively (reject the tenant’s next request as soon as the discrepancy is known).

Retries amplify outages. Provider returns 429; gateway retries; provider returns 429 again; gateway retries again. Naive retries multiply the load that caused the limit. Use jittered exponential backoff and cap retry count per request.
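
A minimal sketch of capped, jittered exponential backoff. `RateLimited` is a hypothetical exception standing in for whatever your provider client raises on HTTP 429, and the delay values are placeholders to tune.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical: raised by the caller's provider client on HTTP 429."""


def call_with_backoff(send, max_attempts: int = 4, base_delay: float = 0.5,
                      max_delay: float = 8.0, sleep=time.sleep, rng=random.random):
    """Call send(); on RateLimited, wait a random time under a doubling ceiling and retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # retry budget spent: surface the error instead of looping
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * ceiling)  # "full jitter": uniform between 0 and the ceiling


attempts = {"count": 0}


def flaky_send():
    """Simulated provider call that is rate limited twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimited()
    return "ok"


result = call_with_backoff(flaky_send, sleep=lambda seconds: None)  # skip real sleeping in the demo
```

The jitter matters as much as the exponent: without it, every blocked request retries at the same instant and recreates the spike.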

Background jobs ignore quotas. A nightly batch job uses the same API key without the tenant context. It consumes upstream capacity that user-facing traffic depends on. Background jobs must route through the same gateway and have their own quota allocation.

Concurrency limits are global, not per-tenant. The system caps total in-flight requests but lets one tenant occupy all of them. Per-tenant concurrency caps prevent this; global caps without them are an illusion of fairness.

Quota enforcement is too coarse for the use case. Daily token budgets feel fine until a customer realizes they can blow the entire day’s budget in 60 seconds. Hour-level or minute-level windows protect against burst abuse even when the daily total looks fine.

No surfacing to the tenant. The tenant has no idea they are near their quota until they hit it. Expose usage in dashboards and headers; let the tenant manage their own spend before they hit a wall.
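
A small sketch of surfacing quota state on every response. The header names here are hypothetical; pick whatever convention suits your API and document it.

```python
def usage_headers(limit_tokens: int, used_tokens: int, reset_seconds: int) -> dict[str, str]:
    """Build response headers that tell the tenant how much budget is left."""
    remaining = max(0, limit_tokens - used_tokens)  # never report a negative balance
    return {
        "X-Quota-Limit-Tokens": str(limit_tokens),
        "X-Quota-Remaining-Tokens": str(remaining),
        "X-Quota-Reset-Seconds": str(reset_seconds),
    }


print(usage_headers(limit_tokens=100_000, used_tokens=99_000, reset_seconds=12))
```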

What Good Looks Like

A production-ready LLM rate limiting and quota architecture has these properties:

  1. There is one AI gateway (built or bought) that all LLM traffic flows through.
  2. Per-tenant quotas exist along three axes: token budget, request rate, concurrency. All three are enforced.
  3. Budgets are reserved before the call and reconciled against actual usage after.
  4. Upstream provider headroom is allocated across tenants explicitly, with reserved slices and a shared burst pool.
  5. Retries use jittered exponential backoff with a per-request cap, not naive immediate retries.
  6. Background jobs route through the same gateway and have their own quota allocation.
  7. Tenants can see their current usage and remaining budget in real time.
  8. Failure modes — 429s, quota exhaustion, provider outages — degrade gracefully via routing and downgrade, not via 500s.

This is one layer of the system underneath the chat box: the gap between a working demo and a multi-tenant product that survives a traffic spike. It is closely tied to multi-tenant architecture, agent secrets management, and LLM routing, and it is one of the practical questions covered in “Before You Scale: Making AI Production-Ready.” If your team is sizing this work, it is part of what we build as Operational AI.

Frequently Asked Questions

What is the difference between TPM and RPM in LLM rate limits?

RPM (requests per minute) caps how many API calls you can make. TPM (tokens per minute) caps how many tokens those calls can consume. They are independent — a workload can be within RPM but hit TPM (one large request), or within TPM but hit RPM (many tiny requests). In practice, TPM is usually the binding constraint for production workloads because costs and provider load scale with tokens, not request count.

Should I use request-based or token-based rate limiting for LLMs?

Both, with token-based as the primary enforcement. Token budgets reflect actual cost — a single long RAG query can equal a thousand short chats — so capping tokens enforces real economic fairness. Request rate limits catch concurrency and abuse patterns that token budgets miss. Use token-based budgets for cost and fairness, request-based limits as a guardrail against burst behavior. Both are necessary.

How do I prevent one tenant from exhausting our upstream OpenAI or Anthropic rate limit?

Per-tenant quotas at your AI gateway, enforced before requests leave for the provider. Each tenant gets their own token budget, request rate, and concurrency cap. Allocate upstream headroom explicitly — a reserved slice per tenant plus a shared burst pool — so no single tenant can consume the entire shared TPM ceiling. Without explicit allocation, upstream limits are effectively first-come-first-served, which is not the same as fair.

What is an AI gateway and do I need one?

An AI gateway is a service that sits between application code and provider APIs, handling rate limiting, per-tenant quotas, retries, routing across providers, cost attribution, and caching. LiteLLM, Portkey, and Kong AI Gateway are common examples; many teams build a thin custom version. You need one once you have more than one tenant, more than one model, or more than one feature consuming LLM calls — which is roughly the moment you have a real production AI product.

How should I handle 429 errors from OpenAI or Anthropic in production?

Retry with jittered exponential backoff, capped at a small number of attempts per request. Naive immediate retries amplify the load that triggered the limit and can delay recovery. Couple retries with adaptive concurrency at the gateway (reduce in-flight requests when you see latency or 429s climbing) so the system degrades gracefully rather than queuing indefinitely. For long outages, route to a backup provider if your gateway supports it.

Should failed LLM requests count against a tenant's token budget?

Decide explicitly and document the decision. Counting them is fairer to other tenants and prevents a refund-loop abuse vector. Refunding them feels better to the affected tenant on transient failures. The pragmatic middle ground is to count input tokens regardless (they were consumed) but refund output tokens on requests that failed before generation. Surface the policy clearly in your API documentation so customers are not surprised.
