Most production AI systems start with one model and one API key. They end with three providers, eleven model versions, six different fallback paths, and an on-call engineer asking why an internal tool is suddenly answering customer questions with Claude when the runbook says GPT.
That is what unmanaged routing looks like. It is also why “we just call the API” stops scaling around the time your bill crosses five figures a month or your first model-side incident knocks an entire feature offline. LLM routing is the discipline of deciding, per request, which model handles it — and making that decision explicit, observable, and reversible.
This guide covers what production LLM routing actually requires: the tiers, the fallbacks, the gateways, and the most under-discussed failure mode of all — what happens when the model you pinned to silently changes under you. It is one layer of the system underneath the chat box, and it is part of the larger question of why your AI experiments are failing once they leave the demo environment.
What LLM Routing Actually Means
LLM routing is the request-time decision logic that maps an incoming prompt to a specific model invocation. At minimum, it answers four questions:
- Which provider — OpenAI, Anthropic, Google, Mistral, an in-house model, a fine-tune?
- Which model tier — frontier, mid-tier, small/fast, or specialized (vision, embeddings, reasoning)?
- Which exact version — the dated snapshot, not a moving alias.
- What happens when that target fails — fallback, retry, degrade, or surface an error.
In a one-model system, the answer to all four is implicit and hardcoded. In a production system serving real users, those four answers vary by feature, by tenant, by request type, sometimes by time of day. Routing is what turns a pile of if/else statements into a policy.
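One way to make those four answers explicit is a small policy table instead of scattered conditionals. The sketch below is illustrative only: the `RouteTarget` type, the feature and plan names, and the `example-...` model IDs are all hypothetical placeholders, not real identifiers.

```python
from dataclasses import dataclass
from enum import Enum


class OnFailure(Enum):
    FALLBACK = "fallback"
    RETRY = "retry"
    DEGRADE = "degrade"
    ERROR = "error"


@dataclass(frozen=True)
class RouteTarget:
    provider: str          # which provider
    tier: str              # which model tier
    model: str             # which exact version (a dated snapshot, not an alias)
    on_failure: OnFailure  # what happens when the target fails


# Hypothetical policy keyed by (feature, plan). Model IDs are placeholders.
POLICY = {
    ("support_chat", "free"): RouteTarget(
        "provider-a", "small", "example-small-20250101", OnFailure.DEGRADE),
    ("support_chat", "enterprise"): RouteTarget(
        "provider-a", "frontier", "example-frontier-20250101", OnFailure.FALLBACK),
}

DEFAULT = RouteTarget("provider-a", "mid", "example-mid-20250101", OnFailure.ERROR)


def resolve(feature: str, plan: str) -> RouteTarget:
    """Look up the explicit policy for a request, falling back to a default."""
    return POLICY.get((feature, plan), DEFAULT)
```

Because the table is data, it can be reviewed, diffed, and logged alongside each request, which is harder to do with nested if/else branches.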
The reason it matters is economics and reliability. The frontier-model price gap is roughly 10–30x between the cheapest capable model and the most expensive one for the same task. Sending every request to the top tier is the most common cause of runaway LLM spend — a pattern we have written about extensively in our breakdowns of Anthropic’s API pricing, Google Gemini’s true cost, and the broader topic of AI cost optimization.
The Three Routing Dimensions
Production routing decisions hinge on three axes. A mature router treats all three explicitly.
Capability Routing
Match the task to the cheapest model that can actually do it. The hard part is defining “can do it” — that is what an eval suite is for. In practice, teams converge on a tiered approach:
| Tier | Typical use cases | Cost profile |
|---|---|---|
| Frontier (e.g., Claude Opus, GPT-5 Pro) | Complex reasoning, code generation, multi-step agentic tasks, judge prompts | $$$$ |
| Mid-tier (e.g., Claude Sonnet, GPT-5) | Most production traffic — RAG synthesis, summarization, structured extraction | $$ |
| Small/fast (e.g., Haiku, GPT-5 mini, Gemini Flash) | Classification, routing decisions themselves, simple chat turns | $ |
| Specialized | Embeddings, vision, long-context tasks | varies |
The mistake is choosing the tier statically per feature. The right move is choosing per request, often with a small classifier that runs on the small/fast tier to decide whether to escalate.
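A minimal sketch of per-request tier selection follows. The `keyword_classifier` function is a stand-in for a real call to a small/fast model, and its keywords and word-count threshold are arbitrary assumptions chosen only to make the example runnable.

```python
from typing import Callable

TIERS = ["small", "mid", "frontier"]  # ordered cheapest first


def keyword_classifier(prompt: str) -> str:
    """Stand-in for a small/fast-tier classifier call; returns a tier name."""
    text = prompt.lower()
    if any(word in text for word in ("refactor", "prove", "multi-step")):
        return "frontier"
    if len(text.split()) > 50:
        return "mid"
    return "small"


def choose_tier(prompt: str, floor: str = "small",
                classify: Callable[[str], str] = keyword_classifier) -> str:
    """Pick the tier per request, never going below the feature's floor."""
    wanted = classify(prompt)
    # max() with key=TIERS.index returns whichever tier sits higher in the list.
    return max(wanted, floor, key=TIERS.index)
```

The `floor` argument lets a feature set a minimum tier while still allowing individual requests to escalate above it.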
Cost Routing
Cost routing layers a budget constraint on top of capability. Examples:
- Free-tier users get the small/fast tier; paid users get the mid-tier; enterprise users get the frontier tier.
- Internal experimentation gets capped at $X/day per developer.
- Batch jobs route to whatever provider has cheaper offline pricing.
This connects directly to per-tenant and per-feature accounting, which we cover in detail in LLM cost attribution. You cannot enforce a budget you cannot see.
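As a rough illustration of a budget constraint layered on a plan-to-tier mapping, the sketch below uses a hypothetical in-memory `DailyBudget` class. A real system would persist spend, reset it each day, and take cost estimates from token counts; none of that is shown here.

```python
from typing import Optional

# Plan-to-tier mapping taken from the examples above.
PLAN_TIER = {"free": "small", "paid": "mid", "enterprise": "frontier"}


class DailyBudget:
    """Tracks spend per key (developer, tenant, or feature) against a cap."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent = {}  # key -> USD spent so far today

    def allow(self, key: str, estimated_cost_usd: float) -> bool:
        return self.spent.get(key, 0.0) + estimated_cost_usd <= self.cap_usd

    def record(self, key: str, cost_usd: float) -> None:
        self.spent[key] = self.spent.get(key, 0.0) + cost_usd


def tier_for(plan: str, key: str, budget: DailyBudget,
             estimated_cost_usd: float) -> Optional[str]:
    """Return the plan's tier, or None when the request would exceed the cap."""
    if not budget.allow(key, estimated_cost_usd):
        return None
    return PLAN_TIER.get(plan, "small")
```

Returning `None` leaves the caller to decide whether an over-budget request is rejected, queued, or downgraded.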
Latency Routing
For user-facing interactions, p95 latency is a first-class constraint. Frontier models are often slower, not just more expensive. Routes that need sub-second time-to-first-token (streaming chat, autocomplete) often have to route to the small/fast tier even when frontier capability would be preferable. Latency routing also covers regional concerns — pinning EU traffic to EU-hosted endpoints for both speed and data residency.
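The latency and residency constraints can be expressed as a filter followed by a preference. In this sketch the `Endpoint` records and their p95 figures are made-up numbers for illustration; in practice they would come from your own measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    tier: str
    region: str
    p95_ttft_ms: int  # measured p95 time-to-first-token (values here are invented)


ENDPOINTS = [
    Endpoint("frontier", "us", 1800),
    Endpoint("frontier", "eu", 2100),
    Endpoint("mid", "us", 700),
    Endpoint("mid", "eu", 900),
    Endpoint("small", "us", 250),
    Endpoint("small", "eu", 300),
]

TIER_RANK = {"small": 0, "mid": 1, "frontier": 2}


def pick_endpoint(region: str, ttft_budget_ms: int) -> Optional[Endpoint]:
    """Most capable endpoint in the required region that fits the latency budget."""
    candidates = [e for e in ENDPOINTS
                  if e.region == region and e.p95_ttft_ms <= ttft_budget_ms]
    if not candidates:
        return None
    return max(candidates, key=lambda e: TIER_RANK[e.tier])
```

Region is treated as a hard filter rather than a preference, which matches the data-residency case: EU traffic never leaves EU endpoints even when a faster option exists elsewhere.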
The routing decision is itself an LLM cost
If you use an LLM to classify requests before routing, that classifier is part of the budget. Keep it on the cheapest tier, cache its decisions where possible, and measure whether it is paying for itself. A bad classifier that escalates 90% of traffic to the frontier tier can cost more than no router at all, because you pay for the classifier on every request and still send most traffic to the most expensive model.
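Whether a classifier pays for itself is simple arithmetic. The sketch below compares expected cost per request with a router against sending everything to the frontier tier; the function names and the prices in the usage note are assumptions for illustration, not real provider pricing.

```python
def cost_with_router(escalation_rate: float, classifier_cost: float,
                     cheap_cost: float, frontier_cost: float) -> float:
    """Expected cost per request when a classifier decides whether to escalate."""
    routed = escalation_rate * frontier_cost + (1 - escalation_rate) * cheap_cost
    return classifier_cost + routed


def break_even_escalation_rate(classifier_cost: float, cheap_cost: float,
                               frontier_cost: float) -> float:
    """Escalation rate above which routing costs more than always using frontier."""
    return 1 - classifier_cost / (frontier_cost - cheap_cost)
```

With invented per-request costs of 0.004 for the classifier, 0.002 for the cheap tier, and 0.030 for the frontier tier, the break-even escalation rate is about 86%. A classifier that escalates 90% of traffic is therefore a net loss under those numbers, while one that escalates 20% cuts the expected cost per request by more than half.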
LLM Gateways: What They Do and What They Do Not
Most teams do not build their routing layer from scratch. They use an LLM gateway — a thin proxy that sits between your application and the model providers and centralizes the routing, retries, caching, and observability.
The three names you will hear in 2026 are LiteLLM, Portkey, and OpenRouter. They occupy distinct slots:
| Gateway | Posture | Where it fits |
|---|---|---|
| LiteLLM | Open-source, self-hosted, OpenAI-compatible API across 100+ providers | Teams that want full control, on their own infra, with their own keys |
| Portkey | Hybrid — open-source core (Apache 2.0 as of March 2026) plus a managed platform with guardrails and analytics | Teams that want production guardrails (PII redaction, jailbreak detection) without building them |
| OpenRouter | Fully managed, single-key access to many models, pay-as-you-go | Prototypes, internal tools, teams that want zero ops at the cost of routing through a third party |
The trade-offs are well documented — a recent comparison puts Portkey’s added latency under 1ms, LiteLLM around 8ms p95, and OpenRouter in the 100–150ms range because traffic crosses an extra public hop. Latency overhead matters most for streaming chat; it matters less for batch summarization.
What a gateway gives you out of the box:
- One API surface for many providers (usually OpenAI-compatible).
- Provider keys centralized — your application never holds raw provider credentials.
- Built-in retry and fallback logic with configurable conditions.
- Per-model rate limits and budgets, often per-tenant.
- Request and response logging with token accounting.
- A consistent place to layer caching and guardrails.
What a gateway does not give you:
- A capability policy. It does not know which model is “good enough” for your task. That is your eval suite’s job.
- Eval integration. You still have to test prompt+model combinations and decide which is the production target.
- Business logic. Per-tenant policies, feature flags, and product-tier routing are still your code.
The gateway is plumbing. The policy is yours.
Fallback Strategy: Designing for Provider Failure
Every major model provider has had multi-hour outages. Multiple times. If your product depends on a single provider, its uptime can be no better than that provider's, and in practice it is worse.
A production fallback policy answers three questions:
1. What counts as failure?
- Hard failure: HTTP 5xx, timeouts, connection errors.
- Soft failure: rate-limit responses (429), context-length errors, refused completions, output that fails a schema check.
- Quality failure: output that passes schema but fails an eval (a harder problem — usually handled by evals and a regression suite, not by routing).
2. What do you fall back to?
Three common patterns:
- Same-tier, different provider. Claude Sonnet fails → GPT-5 picks it up. Keeps capability roughly constant, isolates you from a single provider’s incidents. Requires that you keep prompts portable.
- Same-provider, lower tier. GPT-5 fails → GPT-5 mini handles it. Faster recovery, lower cost, but lower capability. Acceptable when the task is forgiving.
- Degraded path. Frontier model fails → return a templated response, a cached answer, or escalate to a human queue. The right move for high-stakes flows where a wrong answer is worse than a delayed answer.
3. How fast do you give up?
Use aggressive timeouts (5–10s for non-streaming, 30s for streaming) and a single retry on the original provider, then fall back immediately. Long retry storms during a provider incident make the incident worse, not better, for you and for the provider.
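The three questions can be combined into one small control loop. This is a sketch under stated assumptions: each `caller` is a hypothetical function that takes `(prompt, timeout_s)` and returns `(http_status, body)`, raises `TimeoutError` or `ConnectionError` when the provider is unreachable, and is expected to return a JSON object with an `answer` key. Quality failures are out of scope, as noted above.

```python
import json
from typing import Callable, Optional, Sequence, Tuple

Caller = Callable[[str, float], Tuple[int, str]]


def is_failure(status: int, body: str, required_keys: Sequence[str]) -> bool:
    """Non-2xx responses (5xx, 429, ...) and schema-check failures count as failures."""
    if not 200 <= status < 300:
        return True
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return True
    return not isinstance(parsed, dict) or any(k not in parsed for k in required_keys)


def attempt(caller: Caller, prompt: str, timeout_s: float,
            required_keys: Sequence[str]) -> Optional[str]:
    """One call; returns the body on success and None on any failure."""
    try:
        status, body = caller(prompt, timeout_s)
    except (TimeoutError, ConnectionError):
        return None
    return None if is_failure(status, body, required_keys) else body


def route(prompt: str, chain: Sequence[Tuple[str, Caller]], degraded: str,
          timeout_s: float = 10.0,
          required_keys: Sequence[str] = ("answer",)) -> Tuple[str, str]:
    """chain is [(name, caller), ...] with the primary first.

    The primary gets one retry, each fallback gets a single attempt, and the
    degraded response is returned when every target fails.
    """
    for index, (name, caller) in enumerate(chain):
        tries = 2 if index == 0 else 1
        for _ in range(tries):
            body = attempt(caller, prompt, timeout_s, required_keys)
            if body is not None:
                return name, body
    return "degraded", degraded
```

Returning the name of the target that actually served the request is deliberate: it is the field that lets you compute fallback rates later.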
Fallbacks are silent until they are not
A well-designed fallback path means a Claude outage looks like a small latency blip on your dashboard instead of a full incident. That is the goal — until you discover your fallback prompt has been broken for three months because no one tested the path. Test fallbacks the way you test backups: by actually triggering them on a schedule.
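A scheduled fallback drill can be as small as the sketch below. It assumes your router exposes some way to force the primary to fail; the `force_primary_failure` flag, `demo_route`, and `is_valid` are hypothetical stand-ins for that hook and for your real schema check.

```python
import json


def demo_route(prompt: str, force_primary_failure: bool = False):
    """Hypothetical router stub: returns (target_used, raw_output)."""
    if not force_primary_failure:
        return "primary", json.dumps({"answer": f"primary: {prompt}"})
    return "fallback", json.dumps({"answer": f"fallback: {prompt}"})


def is_valid(output: str) -> bool:
    """Schema check: output must be a JSON object with an 'answer' key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "answer" in parsed


def run_fallback_drill(route_call, prompts, validate):
    """Force the primary to fail for each prompt and collect the prompts whose
    fallback path was skipped or produced invalid output."""
    broken = []
    for prompt in prompts:
        used, output = route_call(prompt, force_primary_failure=True)
        if used == "primary" or not validate(output):
            broken.append(prompt)
    return broken
```

Run on a schedule against a handful of representative prompts, a non-empty result is the alert: the fallback path is broken before an outage forces you to find out.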
Model Version Pinning
This is the section most teams wish they had read six months earlier.
Every major provider exposes two kinds of model identifier:
- Snapshot identifiers: `gpt-4o-2024-11-20`, `claude-opus-4-5-20251101`, `gemini-1.5-pro-002`. These are pinned to a specific weights release.
- Aliases: `gpt-4o`, `claude-opus-latest`, `gemini-pro`. These point at whatever the provider's current default is.
Anthropic’s documentation is explicit: every Claude model ID with a date is a pinned snapshot; aliases are rolling pointers. OpenAI works the same way with its dated snapshots. The recommendation in every provider’s production guidance is the same — pin the snapshot.
The reason is simple and is the most predictable cause of unexplained quality regressions in production AI systems: when an alias rolls to a new default, your prompts, evals, and downstream parsers do not know about it. You ship no code, deploy nothing, and your output distribution changes overnight. We have seen — and the industry has seen — production automation pipelines that had been running reliably for weeks suddenly produce incoherent results after a model behind an alias was updated.
A pinning and upgrade discipline that survives contact with reality
- Pin every production call to a dated snapshot. No aliases in production code, ever. This is a lint rule, not a guideline.
- Pin in configuration, not source code. Store the model ID in an environment variable or feature-flag config so you can roll it without a deploy.
- Track provider deprecation calendars. OpenAI and Anthropic both publish retirement dates per snapshot. Add them to a calendar. Owners get notified 30 days out.
- When a new snapshot ships, treat it as a code change. Run your full eval suite against the new snapshot on a side branch. Diff the outputs. Compare cost. Compare latency. Only then move production traffic.
- Shadow-test for at least a few days. Mirror a percentage of production traffic to the new snapshot, log both responses, score them with LLM-as-judge or human review. Promote when confident.
- Roll out behind a feature flag. Move 1% → 10% → 50% → 100% with quality and cost gates at each step.
- Keep the old snapshot warm. Until the provider sunsets it, your fallback target is the old version. That is your “undo” button if the new model regresses.
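The first two rules (no aliases, pin in configuration) can be enforced with a startup check. The sketch below assumes a dated snapshot ends in `YYYYMMDD` or `YYYY-MM-DD`; that pattern is an assumption and does not cover every scheme (a numeric suffix such as `-002` would be rejected), so adapt it to the providers you use. The environment variable name is hypothetical.

```python
import os
import re

# Assumption: a pinned snapshot ID ends with a date. Adjust for other schemes.
DATED_SNAPSHOT = re.compile(r"-(\d{8}|\d{4}-\d{2}-\d{2})$")


def is_pinned(model_id: str) -> bool:
    """True when the model ID ends in a date suffix."""
    return bool(DATED_SNAPSHOT.search(model_id))


def load_model_id(env_var: str = "SUMMARIZER_MODEL_ID") -> str:
    """Read the model ID from configuration and refuse to start on an alias."""
    model_id = os.environ.get(env_var, "")
    if not is_pinned(model_id):
        raise ValueError(f"{env_var}={model_id!r} is not a dated snapshot")
    return model_id
```

The same `is_pinned` check can run in CI over configuration files, which is what turns the rule from a guideline into a lint failure.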
The shorthand: treat model versions like database schema versions. You do not rename columns in production without a migration, so do not swap models in production without one either.
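For the staged rollout, deterministic bucketing keeps each user or tenant on the same snapshot as the percentage grows. This is one possible sketch; the choice of SHA-256 and of the request key are assumptions, and quality and cost gates between steps are not shown.

```python
import hashlib


def bucket(request_key: str) -> int:
    """Map a stable key (user or tenant ID) to a bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def model_for(request_key: str, old_snapshot: str, new_snapshot: str,
              rollout_percent: int) -> str:
    """Send rollout_percent of keys to the new snapshot; the rest stay on the old one."""
    return new_snapshot if bucket(request_key) < rollout_percent else old_snapshot
```

Because a key's bucket never changes, moving from 1% to 10% only adds keys to the new snapshot, and setting the percentage back to 0 returns everyone to the old one.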
Routing in Practice: A Working Architecture
A reasonable production routing stack looks like this:
[app code]
↓ (OpenAI-compatible API call)
[gateway: LiteLLM / Portkey / your own]
↓
├── policy: capability + cost + latency rules
├── version: dated snapshot from config
├── cache: exact-match + semantic lookup (see caching guide)
├── retry: 1 attempt on transient errors
└── fallback: defined per-route
↓
[provider A] [provider B] [provider C]
The gateway is the single point of policy. The cache (covered in our LLM caching guide) sits inside the gateway so cache hits skip the model call entirely. The quotas and per-tenant guards live alongside — we cover those in LLM rate limiting and token quotas. And every routed call emits structured telemetry that flows into your cost attribution pipeline and your observability stack.
That is the system. Each piece is replaceable; none is optional.
What to Measure
The routing layer is worthless if you cannot see what it is doing. The minimum observable surface:
- Requests by route, model, and version — the table you scan first during any incident.
- Cost per route, per tenant, per feature — the table your CFO will eventually ask for.
- Fallback rate per route — climbing fallback rates mean a provider is degrading before they admit it.
- Cache hit rate per route — climbing miss rates mean prompt drift or new traffic patterns.
- Time-to-first-token and total latency by model — for streaming UX SLOs.
- Snapshot in use vs. snapshot configured — catches the day someone slipped an alias back in.
These feed alerts, dashboards, and the monthly review where someone asks “why did spend on this feature double” and you actually have an answer.
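Two of these signals fall straight out of per-request telemetry. The sketch below assumes each routed call emits an event dictionary with `route`, `fallback_used`, `model_configured`, and `model_served` fields; those field names are hypothetical, so map them to whatever your gateway actually logs.

```python
from collections import defaultdict


def fallback_rate_by_route(events):
    """Fraction of requests per route that were served by a fallback target."""
    totals = defaultdict(int)
    fallbacks = defaultdict(int)
    for event in events:
        totals[event["route"]] += 1
        if event["fallback_used"]:
            fallbacks[event["route"]] += 1
    return {route: fallbacks[route] / totals[route] for route in totals}


def snapshot_mismatches(events):
    """Events where the model that served the request differs from the configured one."""
    return [e for e in events if e["model_served"] != e["model_configured"]]
```

A mismatch on a request that did not use a fallback is the case worth alerting on, since it suggests an alias or a stale configuration value somewhere in the path.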
How MetaCTO Approaches LLM Routing
We treat routing as part of the production system, not as a code-level afterthought. In our Operational AI engagements we wire model selection, provider fallback, and version pinning into the same governance loop as evals, observability, and cost attribution. The point is not picking the “best” gateway in the abstract — it is picking the one that fits your team’s operating model and integrates cleanly with the rest of your AI stack.
The teams that get this right share a habit: they treat every production model call as a policy decision, not an API call. Routing is what makes that policy explicit.
This is one layer of the system underneath the chat box — the gap between an impressive demo and production AI. The other layers we cover in this series include prompts as products and the work it takes to be production-ready. Routing is the one that quietly determines whether your AI bill, your reliability, and your model behavior stay under your control as you scale.
Frequently Asked Questions About LLM Routing
What is an LLM gateway and do I need one?
An LLM gateway is a proxy layer that sits between your application and model providers. It centralizes provider keys, routing rules, retries, fallbacks, caching, and observability behind a single API. You probably need one once you call more than one model in production, route traffic by tier or tenant, or need per-team budgets and audit logging. For a single-provider, single-model prototype, a gateway is overhead you do not need yet.
LiteLLM vs Portkey vs OpenRouter — which one should I pick?
LiteLLM is the typical choice when you want to self-host and keep traffic and keys inside your own infrastructure, with single-digit-millisecond added latency. Portkey fits teams that want production guardrails like PII redaction and jailbreak detection without building them, and as of March 2026 its core gateway is open-source under Apache 2.0. OpenRouter is the lowest-friction option (one key, many models, fully managed) at the cost of routing through a third party and adding 100–150ms per request. None of them is wrong; the right answer depends on your data posture, latency budget, and ops capacity.
Why do I need to pin model versions if the provider says the new one is better?
Two reasons. First, your prompts and evals are calibrated against the specific model behavior you tested. A new snapshot — even a minor revision — can shift output distribution enough to break downstream parsers, regress quality on edge cases, or change cost characteristics. Second, you control the timing. Pinning lets you evaluate the new model in shadow, compare it on your own eval suite, and roll it out behind a feature flag. Aliases take that control away by changing the underlying model without warning.
What should I fall back to when my primary model fails?
It depends on how much you care about consistent capability versus consistent prompts. Same-tier, different-provider fallback (Claude Sonnet to GPT-5) keeps capability close to constant and protects against provider-specific incidents, but requires that prompts work on both. Same-provider, lower-tier fallback (GPT-5 to GPT-5 mini) is easier to maintain but degrades quality. For high-stakes flows, the right fallback is often a templated response, a cached answer, or a human queue — a deliberate degradation is better than a confidently wrong answer.
How do I prevent my LLM bill from exploding when I use a routing layer?
Three controls. First, default to the cheapest model that passes your eval suite for the task and only escalate when needed — most production traffic does not need the frontier tier. Second, enforce per-tenant, per-feature token quotas at the gateway so a single user or runaway loop cannot consume the whole budget. Third, instrument cost attribution so you can see which feature, tenant, and route is driving spend before the invoice arrives. We cover the attribution side in detail in our LLM cost attribution guide.
How does LLM routing relate to caching?
Caching is what makes routing cheap. Every request that returns from cache is a request that did not need to be routed to a model at all. In a mature stack, the cache check happens inside the gateway before the routing decision is made, so a cache hit skips both the routing logic and the provider call. Routing and caching are best designed together — our LLM caching guide covers the exact-match, semantic, and provider-side prompt caching layers that stack on top of routing.
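The ordering described here (cache check first, routing and provider call only on a miss) can be sketched with an exact-match cache. The plain dictionary and the `route_and_call` callable are stand-ins for a real cache store and for your routing layer; semantic and provider-side caching are not shown.

```python
import hashlib


def cache_key(route: str, prompt: str) -> str:
    """Exact-match key: route name plus whitespace-normalized prompt."""
    normalized = " ".join(prompt.split())
    return hashlib.sha256(f"{route}\n{normalized}".encode("utf-8")).hexdigest()


def handle(route: str, prompt: str, cache: dict, route_and_call):
    """Return (output, cache_hit). Routing and the model call run only on a miss."""
    key = cache_key(route, prompt)
    if key in cache:
        return cache[key], True
    output = route_and_call(route, prompt)
    cache[key] = output
    return output, False
```

In a real deployment the key would usually also cover anything else that changes the output, such as the system prompt version and the pinned model snapshot.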