Building MCP Servers for Production AI Agents

The Model Context Protocol tutorials get you to a working MCP server in twenty minutes. A weather tool. A filesystem reader. The Claude desktop client picks it up, the agent calls it, and the demo lands. Ship that same server to production and you discover the parts the tutorial skipped: authentication for real users, input validation against adversarial agents, rate limits that survive a retry storm, tenant isolation, audit logs, and a deployment topology that does not collapse under concurrency.

This is the gap between an MCP demo and a production MCP server. It is also the gap where most teams quietly fail. A widely cited audit found that 53% of public MCP servers rely on static API keys and only 8.5% implement OAuth, and security researchers have already documented one-click account takeover vulnerabilities in remote MCP deployments that mishandled OAuth consent flows (Obsidian Security).

This article is not a “what is MCP” piece. Anthropic’s spec and the Model Context Protocol authorization documentation already cover that. This is a production guide for engineering teams shipping MCP servers that real agents call against real data, on behalf of real customers. It is also part of the larger question of why your AI experiments are failing before they ever generate measurable business value.

The Production Gap Most MCP Tutorials Skip

A tutorial MCP server does six things: starts up, advertises tools, accepts a stdio or HTTP request, calls a backend, returns a result, shuts down. A production MCP server has to do all of that, plus:

Authenticate the agent and the human user behind it
Authorize each tool call against per-user permissions
Validate every argument against a schema that hostile inputs cannot escape
Enforce per-tenant and per-tool rate limits
Survive partial failures from downstream APIs without poisoning the agent’s state
Emit traces, metrics, and audit logs detailed enough to debug a bad answer six hours after the fact
Run multi-tenant without leaking data between customers
Version its tool surface so upgrades do not silently break agents in the field

None of this is novel. It is the standard surface area of any production API. What is novel is that the caller is a non-deterministic LLM that will probe edge cases your QA team never imagined, retry on transient errors in ways that compound load, and occasionally hallucinate arguments that pass schema validation but make no semantic sense. That changes how you design every layer.

The Demo-to-Production Failure Mode

The most common MCP production failure is not a security breach. It is silent quality collapse. An agent calls a tool, the tool returns a 200 with a degraded result (truncated context, stale cache, partial rows), the agent treats it as authoritative, and the user gets a confidently wrong answer. Production MCP servers must distinguish “success” from “partial success” in their response shape, not just their HTTP status.
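
One way to make partial success visible is a result envelope that carries its own completeness metadata. The sketch below is illustrative only: `ToolResult`, `build_result`, and the field names are hypothetical and are not part of the MCP specification.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ToolResult:
    """Hypothetical envelope that makes degraded results visible to the agent."""
    status: str                      # "ok" or "partial"
    data: List[Any]
    warnings: List[str] = field(default_factory=list)
    total_available: Optional[int] = None  # None when the backend cannot say


def build_result(rows: List[Any], max_rows: int, stale_cache: bool = False) -> ToolResult:
    """Truncate to max_rows and say so explicitly instead of returning a bare list."""
    warnings = []
    if len(rows) > max_rows:
        warnings.append(f"truncated: returned {max_rows} of {len(rows)} rows")
    if stale_cache:
        warnings.append("stale: served from cache")
    return ToolResult(
        status="partial" if warnings else "ok",
        data=rows[:max_rows],
        warnings=warnings,
        total_available=len(rows),
    )
```

An agent that receives `status == "partial"` can then decide whether to page for more rows, retry later, or tell the user the answer is incomplete.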

Authentication: OAuth 2.1, Not API Keys

The MCP authorization specification builds on OAuth 2.1 with PKCE for remote (HTTP-based) servers. The spec itself leaves authorization optional, but for a production deployment it is not. Static API keys cannot represent a human end-user, cannot be scoped per-session, cannot be revoked without a deploy, and cannot be audited at the granularity regulators expect. They are fine for a local development scenario; they are a liability in production.

The minimum production authentication baseline:

OAuth 2.1 with mandatory PKCE for every flow, including confidential clients. PKCE was optional under OAuth 2.0 and is non-negotiable under 2.1.
Exact redirect URI matching. No wildcards, no prefix matches. The OAuth pitfalls that produced 2025’s MCP takeover vulnerabilities almost all involved loose redirect handling.
Dynamic Client Registration (DCR) if you expect third-party MCP clients to connect. This is part of the MCP spec for a reason: it is how a new client joins your server without a human registering it by hand.
Resource indicators and protected resource metadata so tokens are bound to the specific MCP server they were issued for, not reusable elsewhere in your fleet.
Short-lived access tokens (minutes, not days) paired with refresh tokens. Long-lived bearer tokens in an agent’s working memory are a disaster waiting for a prompt injection to trigger it.
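
To make two of these checks concrete, here is a minimal sketch of PKCE S256 verification (as defined in RFC 7636) and exact redirect URI matching. A managed identity provider would normally run these checks for you; the function names are hypothetical and the code is for illustration, not a replacement for a real authorization server.

```python
import base64
import hashlib
import hmac


def s256_challenge(code_verifier: str) -> str:
    """Compute the PKCE S256 challenge: base64url(sha256(verifier)), no padding."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def pkce_matches(code_verifier: str, stored_challenge: str) -> bool:
    """Compare in constant time against the challenge saved at authorization time."""
    return hmac.compare_digest(s256_challenge(code_verifier), stored_challenge)


def redirect_uri_allowed(requested: str, registered: frozenset) -> bool:
    """Exact string comparison only: no wildcards, no prefix matching."""
    return requested in registered
```

With exact matching, `https://app.example.com/callback/../evil` or `https://app.example.com/callback?next=...` is rejected even though it shares a prefix with a registered URI.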

OAuth 2.1 implementation is detailed in the Prefect MCP OAuth guide and the official Model Context Protocol authorization spec. The pattern most production teams converge on: do not implement OAuth yourself. Use a managed identity provider (Auth0, WorkOS, your existing OIDC provider) and let it handle the issuance flows. Implementing OAuth from scratch is how teams ship the vulnerability class researchers keep finding (WorkOS).

Distinguish Agent Identity from User Identity

A production MCP server has two identities on every request: the agent (a non-human identity, usually a workload credential) and the user the agent is acting on behalf of. Authorization decisions must consider both. An agent with admin scope should not be able to act on behalf of a user without that user’s grant, and an authorized user should not have their permissions silently elevated because the agent’s service account is more privileged.
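
A simple way to express this rule is to authorize against the intersection of what the agent may do and what the user has granted. The sketch below assumes scopes are plain strings; `RequestIdentity` and `is_authorized` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestIdentity:
    agent_id: str
    agent_scopes: frozenset   # what the agent's workload credential allows
    user_id: str
    user_grants: frozenset    # what this user delegated to the agent


def is_authorized(identity: RequestIdentity, required_scope: str) -> bool:
    """Allow a tool call only if BOTH the agent and the user hold the scope."""
    effective = identity.agent_scopes & identity.user_grants
    return required_scope in effective
```

Because the effective permission set is an intersection, a privileged service account never widens what a user can do, and a user grant never widens what the agent is allowed to do.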

This maps directly onto AI agent secrets management: the credentials the MCP server uses to call downstream APIs are not the same credentials the user delegated to the agent. We cover that distinction in depth in our piece on AI agent secrets management.

Input Validation: Assume the Caller Will Try Everything

A traditional API client is a piece of code written by a developer. It calls your endpoint the way the developer told it to. An LLM client is not. It will:

Pass arguments that match your schema but violate your invariants (a start_date after the end_date)
Pass strings that look like SQL, shell commands, or path traversals because the user asked it to “search for the file named ’../../../etc/passwd’”
Pass tool calls in rapid succession during a retry loop after a transient failure
Concatenate prior tool results into new tool calls, propagating earlier injection attempts into your server

Treat every MCP tool argument as untrusted user input even when the tool’s “user” is your own agent. The validation layer should:

Enforce a strict JSON Schema for every tool’s parameters. Use additionalProperties: false. Mark required fields explicitly. Use enum for closed sets.
Apply semantic validators after structural validation. Schema says “string”; semantic validator says “is a valid UUID for a record this tenant owns.”
Reject impossible inputs before they reach the backend. Negative quantities, dates in the year 9999, IDs that don’t exist. Do not pass these through and rely on the database to throw.
Bound payload sizes. A 10MB blob in a tool argument is almost certainly a mistake or an attack.
Sanitize values that flow to subprocesses. If a tool eventually shells out, the LLM’s argument is on the command line. Treat it like any other untrusted command-line input.
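
The structural and semantic layers can be sketched with the standard library alone. The example validates arguments for a hypothetical `export` tool; the field names, the size bound, and the `tenant_record_ids` lookup are assumptions for illustration. A real server would more likely use a JSON Schema library for the structural pass.

```python
import json
import re
from datetime import date

MAX_ARG_BYTES = 64 * 1024  # hypothetical payload bound
ALLOWED_KEYS = {"record_id", "start_date", "end_date", "format"}
REQUIRED_KEYS = {"record_id", "start_date", "end_date"}
FORMATS = {"csv", "json"}  # closed set, the equivalent of a schema enum
UUID_RE = re.compile(r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}")


class ValidationError(ValueError):
    """Raised for any argument the tool refuses to pass to the backend."""


def validate_export_args(args: dict, tenant_record_ids: set) -> dict:
    # Structural checks: size bound, no unknown keys, required keys, types, enum.
    if len(json.dumps(args).encode("utf-8")) > MAX_ARG_BYTES:
        raise ValidationError("payload too large")
    unknown = set(args) - ALLOWED_KEYS
    if unknown:
        raise ValidationError(f"unknown arguments: {sorted(unknown)}")
    missing = REQUIRED_KEYS - set(args)
    if missing:
        raise ValidationError(f"missing arguments: {sorted(missing)}")
    if not all(isinstance(v, str) for v in args.values()):
        raise ValidationError("all arguments must be strings")
    fmt = args.get("format", "json")
    if fmt not in FORMATS:
        raise ValidationError("format must be one of csv, json")
    if not UUID_RE.fullmatch(args["record_id"]):
        raise ValidationError("record_id is not a lowercase UUID")

    # Semantic checks: invariants a schema cannot express.
    try:
        start = date.fromisoformat(args["start_date"])
        end = date.fromisoformat(args["end_date"])
    except ValueError as exc:
        raise ValidationError(f"invalid date: {exc}") from None
    if start > end:
        raise ValidationError("start_date is after end_date")
    if args["record_id"] not in tenant_record_ids:
        raise ValidationError("record not found for this tenant")
    return {"record_id": args["record_id"], "start_date": start,
            "end_date": end, "format": fmt}
```

The function returns parsed values rather than the raw dictionary, so downstream code only ever handles arguments that passed both passes.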

The schema is also your contract with the model. A tight schema with short keys, clear enums, no optional fields unless you genuinely accept missing data, and concrete description strings produces dramatically more reliable tool calls than a sprawling, optional-everything schema. Schema design is the single highest-leverage reliability lever; we expand on it in AI agent tool calling in production.

Schema Strictness Is a Reliability Knob

Tool-use accuracy depends more on argument correctness and strict schema adherence than on the underlying model’s raw capability. Smaller, faster models paired with strict schemas frequently match the reliability of larger models with loose schemas, at a fraction of the cost. The schema is leverage.

Rate Limiting: Two Problems, Two Layers

Rate limiting an MCP server is two distinct problems that look like one.

Problem 1: Upstream protection. Your tool calls hit downstream APIs that have their own quotas. A misbehaving agent in a retry loop can burn a paid API budget in minutes. Rate limiting here protects your wallet and your downstream provider relationships.

Problem 2: Multi-tenant fairness. With multiple agents (and multiple tenants) calling the same MCP server, one bad actor can starve the others. A single agent stuck in a loop can consume all available connections and degrade every other tenant’s experience.

These two problems need different solutions:

Layer	Purpose	Key dimension	Action on breach
Per-tenant quota	Multi-tenant fairness, cost control	tenant_id	Reject with 429, surface to tenant dashboard
Per-agent quota	Loop protection	agent_id (session)	Backoff signal, optional circuit break
Per-tool quota	Protect expensive tools	tool_name + tenant_id	Reject specific tool, allow others
Upstream quota	Protect downstream APIs	upstream_endpoint	Queue or shed, depending on tool semantics

Implement primary rate limiting at the MCP server (or gateway) level, with backend services providing secondary protection. Use a token bucket per tenant, not just per IP — agents share IPs constantly. Surface rate-limit signals back to the agent in a structured form (Retry-After, error code, tool-specific budget remaining) so the agent’s retry logic can be intelligent rather than blind. The Fastio production rate limiting guide and MintMCP gateway analysis are useful operational references.
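
A per-tenant token bucket can be sketched in a few lines. This version keeps state in process memory and takes an injectable clock so it can be tested; a multi-instance deployment would need a shared store instead. `TenantRateLimiter` and its parameters are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Bucket:
    tokens: float
    updated_at: float


class TenantRateLimiter:
    """In-memory token bucket keyed on tenant_id rather than IP address."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.buckets: Dict[str, Bucket] = {}

    def check(self, tenant_id: str) -> Tuple[bool, float]:
        """Return (allowed, retry_after_seconds) and consume one token if allowed."""
        now = self.clock()
        bucket = self.buckets.setdefault(tenant_id, Bucket(self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - bucket.updated_at
        bucket.tokens = min(self.capacity, bucket.tokens + elapsed * self.refill_per_sec)
        bucket.updated_at = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True, 0.0
        # Time until one full token is available; surface this as Retry-After.
        return False, (1.0 - bucket.tokens) / self.refill_per_sec
```

The second element of the return value is what the server would surface to the agent as a structured retry hint, so its backoff is informed rather than blind.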

Multi-Tenant MCP: The Hard Part

Multi-tenant MCP is where most production deployments slow down. Two patterns dominate, with different tradeoffs.

Pattern A: Shared MCP server, tenant-scoped everything

One MCP server process handles all tenants. Every tool call carries a tenant context resolved from the OAuth token. Authorization, data access, rate limits, and logs are all keyed on tenant_id.

Pro: Single deployment, lowest infrastructure cost, easiest to operate.
Con: Cross-tenant data leak is one bug away. Noisy-neighbor problems are real. Tools that take a long time to execute create cross-tenant latency.
Use when: Tenant trust model is uniform, tool set is small and read-heavy, data access patterns are clearly tenant-scoped.
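
In Pattern A, the main defense against a cross-tenant leak is making unscoped access impossible to write by accident. The sketch below assumes the verified token carries a `tenant_id` claim (a hypothetical claim name) and uses an in-memory list as a stand-in for a real database.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str  # resolved from the verified token, never from tool arguments


def tenant_from_claims(claims: Dict) -> TenantContext:
    """Build the tenant context from already-verified token claims."""
    tenant_id = claims.get("tenant_id")
    if not isinstance(tenant_id, str) or not tenant_id:
        raise PermissionError("token carries no tenant_id claim")
    return TenantContext(tenant_id)


class TenantView:
    """Read access that is always filtered by one tenant."""

    def __init__(self, rows: List[Dict], tenant_id: str):
        self._rows = rows
        self._tenant_id = tenant_id

    def list(self) -> List[Dict]:
        return [r for r in self._rows if r["tenant_id"] == self._tenant_id]

    def get(self, record_id: str) -> Optional[Dict]:
        for row in self.list():
            if row["id"] == record_id:
                return row
        return None  # another tenant's record looks the same as a missing one


class TenantScopedStore:
    """Tool handlers never touch rows directly; they must ask for a view."""

    def __init__(self, rows: List[Dict]):
        self._rows = rows

    def for_tenant(self, ctx: TenantContext) -> TenantView:
        return TenantView(self._rows, ctx.tenant_id)
```

Returning `None` for another tenant's record, rather than a distinct "forbidden" error, avoids confirming that the record exists.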

Pattern B: Per-tenant MCP server instance (sandboxed)

A dedicated MCP server runtime per tenant (process, container, WASM sandbox). The tenant’s data and credentials never share a process with another tenant’s.

Pro: Hard isolation. Compliance-friendly. A compromised tool execution cannot reach another tenant.
Con: Higher infrastructure cost, cold-start latency on first tool call, more complex routing.
Use when: Regulatory requirements demand isolation (healthcare, finance), tools execute write operations, tools run untrusted code.

Published MCP best-practice guidance recommends sandboxing (containers, WebAssembly runtimes, similar isolation) for anything beyond read-only context tools (Nordic APIs MCP best practices). In practice we see hybrid deployments: a shared MCP server for low-risk read-only tools, per-tenant sandboxed runtimes for write operations or tools that touch sensitive data.

MCP vs Function Calling: When to Use Each

This is the question every team building production agents eventually asks, and search results are full of “MCP is the future” content that does not engage with the tradeoffs honestly. Both approaches expose tools to an LLM. They differ in where the tool surface lives and what that costs you operationally.

Dimension	Function calling	MCP
Tool location	Embedded in the agent application loop	Network-addressable, separate process
Model coupling	Schema is provider-specific (OpenAI vs Anthropic vs Google)	Provider-agnostic; tools work across MCP clients
Discovery	Static; agent knows tools at build time	Dynamic; agent discovers tools at runtime
Deployment	One process	Two (or more) processes, separate scaling
Auth surface	Inside the application	At the MCP server boundary, OAuth-native
Operational cost	Lowest. One service to deploy, monitor, debug	Higher. Two services, two failure modes, two latency budgets
Cross-team reuse	Hard. Each team rewrites the tool wiring	Strong. One MCP server, many agent consumers
Best for	Single team, small tool set, single model provider	Multiple consumers, multiple providers, governance requirements

Use function calling when

You are prototyping. The agent and the tools are the same codebase, the same team, the same deploy cycle.
You have fewer than ~5 tools and they will not be reused outside this agent.
You are committed to a single model provider and the lock-in is acceptable.
Operational simplicity matters more than reusability. One service to monitor beats two.

Use MCP when

Multiple agents (built by different teams, possibly powered by different model providers) need the same tools.
Tools represent governed enterprise resources (CRM access, database queries, internal APIs) where centralized auth and audit are non-negotiable.
You expect to swap or A/B model providers without rewriting your tool layer.
The same tool should appear in a chat agent, an autonomous workflow, and a developer’s IDE without re-implementation.

The honest synthesis: most teams should start with function calling and graduate to MCP when they have at least two consumers for the same tools or when a compliance requirement makes a centralized, audited tool boundary mandatory. Premature MCP adoption is a real anti-pattern — you take on the operational cost of a second service before you have the consumer count to justify it. Solid comparisons live at Prefect’s MCP vs function calling and Descope’s analysis.

The convergence trend matters too: OpenAI has added MCP support to its Agents SDK and Responses API, and Google has adopted the protocol. If you are building for the next three years, the long-term direction is clear. The question is whether you need to be there today.

Error Handling: What “Failure” Means to an LLM

A traditional API client treats a 500 as “retry once, then fail loud.” An LLM client treats a 500 as “try again, maybe with different arguments, maybe several times, maybe with a completely different approach.” This is good — agents that recover from transient failures are more useful. It is also dangerous: an agent that retries a non-idempotent write because the response timed out can double-charge a customer, double-send an email, or duplicate a database row.

Production MCP servers must:

Distinguish retryable from non-retryable errors in the response. Do not just return 500. Return a structured error object with an error_class (transient, permanent, validation, authorization, rate_limit) so the agent’s retry policy can be specific.
Make every write operation idempotent. Require an idempotency key on the tool call. Reject duplicates server-side. This is the single most important reliability practice for production agents.
Return partial results explicitly. If the tool succeeded on 8 of 10 records, the response shape should make that obvious. Returning 200 with a truncated list and no metadata is how silent quality failures happen.
Bound retry semantics. Communicate Retry-After clearly. Surface remaining budget. Cap the agent’s retry attempts at the server.
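
The first two practices can be sketched together. The error classes are the ones listed above; `error_response`, `IdempotentExecutor`, and the response field names are hypothetical. This sketch handles a duplicate idempotency key by replaying the stored response instead of running the write again, which is one common way to reject the duplicate side effect. State is held in memory and is not safe for concurrent use.

```python
from typing import Any, Callable, Dict, Optional, Tuple

ERROR_CLASSES = {"transient", "permanent", "validation", "authorization", "rate_limit"}
RETRYABLE = {"transient", "rate_limit"}


def error_response(error_class: str, message: str,
                   retry_after: Optional[float] = None) -> Dict[str, Any]:
    """Build a structured error the agent's retry policy can branch on."""
    if error_class not in ERROR_CLASSES:
        raise ValueError(f"unknown error_class: {error_class}")
    body: Dict[str, Any] = {
        "ok": False,
        "error_class": error_class,
        "message": message,
        "retryable": error_class in RETRYABLE,
    }
    if retry_after is not None:
        body["retry_after_seconds"] = retry_after
    return body


class IdempotentExecutor:
    """Runs each (tenant, idempotency key) write at most once."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def run(self, tenant_id: str, idempotency_key: str,
            operation: Callable[[], Any]) -> Dict[str, Any]:
        if not idempotency_key:
            return error_response("validation", "idempotency_key is required for write tools")
        cache_key = (tenant_id, idempotency_key)
        if cache_key in self._results:
            # Duplicate call: replay the first response, do not repeat the write.
            return self._results[cache_key]
        result = {"ok": True, "result": operation()}
        self._results[cache_key] = result
        return result
```

If the agent retries a timed-out charge with the same key, it gets the original response back and the customer is charged once.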

This is part of the broader topic of tool-calling reliability in production, which we cover in depth in AI agent tool calling in production, and connects directly to the orchestration question of where retries belong — at the tool layer, the agent layer, or the workflow layer. That is the subject of AI agent orchestration patterns.

Observability: Audit Logs Are Not Optional

Every production MCP server should emit, at minimum:

Per-call audit logs with tenant_id, user_id, agent_id, tool_name, argument hash (not raw arguments, which may contain PII), response status, latency, and tokens consumed downstream.
Distributed traces that connect the agent’s LLM call to the tool invocation to any downstream API calls, using OpenTelemetry GenAI semantic conventions so the traces are portable across observability backends.
Per-tool metrics (call rate, p95 latency, error rate, schema-validation failure rate) so quality regressions are visible before users complain.
Authorization decision logs — every “permitted” and every “denied” — because compliance auditors will ask, and so will your incident reviews.
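
The per-call audit record might look like the sketch below. It hashes a canonical JSON form of the arguments so that the same call always produces the same hash regardless of key order. The function names and the exact field set are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def argument_hash(arguments: dict) -> str:
    """SHA-256 over canonical JSON (sorted keys, no extra whitespace)."""
    canonical = json.dumps(arguments, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_record(*, tenant_id: str, user_id: str, agent_id: str, tool_name: str,
                 arguments: dict, status: str, latency_ms: float,
                 decision: str, now: Optional[datetime] = None) -> str:
    """Serialize one audit log line; raw arguments are never included."""
    record = {
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "tenant_id": tenant_id,
        "user_id": user_id,
        "agent_id": agent_id,
        "tool_name": tool_name,
        "argument_hash": argument_hash(arguments),
        "status": status,
        "latency_ms": latency_ms,
        "authorization_decision": decision,  # "permitted" or "denied"
    }
    return json.dumps(record, sort_keys=True)
```

A plain hash of a low-entropy value such as an email address can still be guessed by brute force, so a keyed hash (for example HMAC with a server-side secret) may be the safer choice where arguments carry PII.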

This is the difference between monitoring an MCP server (does it respond?) and giving it observability (can you explain any specific tool call after the fact?). The umbrella concept, and what production systems must expose, are covered in AI agent observability.

The Production MCP Checklist

If you are about to deploy an MCP server, this is the minimum bar:

Capability	Status
OAuth 2.1 with PKCE for all flows	Required
Exact-match redirect URIs	Required
Strict JSON Schema validation, `additionalProperties: false`	Required
Semantic validation after structural validation	Required
Per-tenant rate limits with surfaced quotas	Required
Idempotency keys on all write tools	Required
Structured error classes in responses	Required
Per-call audit logs with tenant/user/agent IDs	Required
Distributed traces using OTel GenAI conventions	Required
Tenant isolation (shared with scope, or per-tenant sandboxed)	Required
Tool versioning strategy for the surface	Required
Documented blast-radius analysis per write tool	Required

Most production MCP failures we see in incident reviews trace back to two or three rows of this table being skipped because “we’ll add it later.” Later is the incident.

Where MCP Fits in metacto’s Operational AI Work

MCP is one piece of the larger production AI architecture. The MCP server is where tools become governed, reusable surfaces. The agent layer above it decides which tools to call. The context layer feeds both. Getting MCP right is necessary; it is not sufficient.

We treat MCP infrastructure as part of broader Operational AI engagements — the surface a production agent uses to act on enterprise data. It is one layer of the system underneath the chat box, the gap between an impressive AI pilot and software your business can depend on. If your team is shipping MCP servers and the operational details in this article look like work you have not done yet, that gap is where we work.

Frequently Asked Questions

Do I need MCP, or is function calling enough for production?

Function calling is sufficient for production when you have a single agent, a single team, a single model provider, and fewer than about five tools that will not be reused elsewhere. MCP becomes the right answer when multiple agents or teams need the same tools, when governance and centralized audit are required, or when you want the option to swap model providers without rewriting your tool layer. Most teams should start with function calling and graduate to MCP when they have a second consumer for the same tools.

Is OAuth required for an MCP server?

The Model Context Protocol authorization specification for remote MCP servers is built on OAuth 2.1 with PKCE. Static API keys are acceptable only for local development. In production, OAuth is non-negotiable because it represents human identity, supports per-user scoping, allows revocation without redeployment, and produces the audit trail that compliance frameworks expect. Implement it through a managed identity provider rather than writing it yourself.

How do I prevent prompt injection through MCP tool arguments?

Treat every tool argument as untrusted input even when the caller is your own agent. Apply strict JSON Schema validation with additionalProperties set to false, follow it with semantic validators that check business invariants, and sanitize any value that flows to a subprocess or shell. Critically, never trust prior tool results as safe — an injection delivered through one tool can propagate into the arguments of the next tool the agent calls.

How should an MCP server handle multi-tenant isolation?

Two patterns dominate. A shared MCP server with tenant-scoped authorization, rate limits, and data access is operationally simpler and appropriate for read-heavy, low-risk tools. A per-tenant sandboxed instance using containers or WebAssembly runtimes provides hard isolation and is the right choice for write operations, regulated data, or tools that execute untrusted code. Many production deployments use both — a shared server for read tools, sandboxed instances for write tools.

What rate limits should an MCP server enforce?

Production MCP servers need at least four rate-limit layers: per-tenant quotas for fairness and cost control, per-agent quotas for loop protection, per-tool quotas for protecting expensive operations, and upstream quotas to respect downstream API budgets. Use token-bucket limits keyed on tenant and agent identifiers, not IP addresses. Surface remaining budget back to the caller so agent retry logic can be intelligent.

Where should rate-limit and retry logic live, the agent or the MCP server?

Both, with clear responsibilities. The MCP server enforces hard limits — it is the source of truth for quotas and the authority that rejects abusive calls. The agent should respect Retry-After signals and structured error classes the server returns. The pattern that fails is putting all retry intelligence in the agent and trusting it to behave; agents in retry loops are a common cause of production incidents.
