A team I worked with had a customer-support agent in production. The eval scores were good. The traces were clean. The agent picked the right tool 94% of the time. Then leadership got a Slack message from a customer whose refund had been processed three times. Then five. The agent was calling process_refund correctly. The tool itself was retrying on a transient timeout, and the agent’s retry loop was retrying on top of it. There was no idempotency key. Every retry was a real refund.
The model did nothing wrong. The reasoning was sound. The blast radius was just larger than anyone had designed for.
This is what production tool calling actually looks like. The hard part is not getting the LLM to pick the right tool. Modern models do that well. The hard part is everything around the tool: the schema the LLM has to satisfy, the auth scope the tool runs with, the error class the tool returns, the retry budget, the idempotency contract, and the bound on what the worst case can do.
This article is a production guide for that surface. It is part of the larger question of why your AI experiments are failing — specifically, why agents that pass evals still misbehave in the wild.
Where Tool Calling Actually Fails
The model layer is rarely the problem. The tool layer is. In production, the failure modes cluster:
- Wrong tool selected for the user’s intent. Often a schema-description problem, not a reasoning problem.
- Right tool, wrong arguments. The model passed a string where the contract expected a UUID, or a `start_date` after the `end_date`.
- Tool succeeded, agent misinterpreted the result. A 200 with a truncated list looks like complete success to a model.
- Tool failed, agent retried into a worse state. The classic refund-charged-three-times pattern.
- Tool succeeded, but with the wrong identity. The agent’s service account had more permission than the user delegating to it.
- Tool consumed a downstream budget the team didn’t know existed. A search tool quietly burning $4 of LLM tokens on every call.
A study of production agent failures put it bluntly: the tool layer is where incidents happen, not the model reasoning. The model picks the tool, passes malformed arguments, gets back an unhandled error, and confidently emits the wrong answer (BSWEN reliability guide). Error-handling middleware applied carefully has been reported to recover the majority of transient failures without human intervention, but only when error classes are surfaced explicitly.
Agents Fail Silently More Often Than Loudly
A traditional system failure is loud: a 500, a stack trace, a paged engineer. Agent tool-call failures are quiet: a 200 with a partial result the agent treats as authoritative, a retry that succeeded after a non-idempotent first attempt, a tool that returned stale cache because the upstream was rate-limited. Production tool calling has to make these failures visible.
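One way to make a quiet failure visible is to have every tool wrap its payload in an envelope that states completeness and freshness outright. The sketch below is illustrative only: the `wrap_result` helper and the field names (`complete`, `returned`, `total_available`, `stale`) are assumptions, not part of any particular framework.

```python
from typing import Any


def wrap_result(items: list[Any], total_available: int, served_from_cache: bool = False) -> dict:
    """Wrap a list result so truncation and staleness are explicit, not implied.

    All field names are hypothetical. The point is that the agent never has
    to infer completeness from an HTTP 200 alone.
    """
    return {
        "items": items,
        "returned": len(items),
        "total_available": total_available,
        "complete": len(items) >= total_available,
        "stale": served_from_cache,
    }


page = wrap_result(items=["order-1", "order-2"], total_available=5)
print(page["complete"])  # False: three of the five orders were not returned
```

With a field like `complete` in every response, the tool description can instruct the model to page or to tell the user the list is partial instead of treating it as authoritative.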
Schema Design Is the Highest-Leverage Reliability Knob
The schema you hand the model is your contract. A tight schema produces dramatically more reliable tool calls than a sprawling, optional-everything schema — even with the same underlying model. The practitioner consensus in 2026 is clear: tool-use accuracy depends more on argument correctness and strict schema adherence than on raw model capability, and small models paired with strict schemas frequently match larger models with loose ones (Agenta structured outputs guide).
The schema rules that hold up in production:
- Keep keys short and unambiguous. `customer_id`, not `the_id_of_the_customer_to_lookup`. Long keys waste tokens and add noise.
- No optional fields unless you genuinely accept missing data. Optionality is where models hallucinate.
- Use `enum` for closed sets. Status, type, region: anything with a finite set of valid values. Enums turn a string-validation problem into a single-token choice.
- Use `additionalProperties: false`. Schemas that silently accept extra keys silently accept hallucinated keys.
- Write descriptions for the model, not for humans. The description is a prompt. It should say what the tool does, what each field means, and when it is not the right tool to use. “Use `search_orders` only when the user provides a customer email; for order ID lookups, use `get_order`.”
- Bound types tightly. `integer` with `minimum` and `maximum`. `string` with `format: uuid` or `pattern`. `date-time` with explicit ISO-8601 examples.
Provider-side strict modes matter. OpenAI’s structured output mode and Anthropic’s tool-use enforcement both reduce schema-violation rates substantially through constrained decoding. They do not eliminate semantic errors, and Anthropic explicitly notes that Claude makes a best effort but does not guarantee schema compliance, so validation at your server, both structural and semantic, is still required (OpenTelemetry GenAI observability).
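To make the rules concrete, here is a minimal sketch of a strict tool schema plus the server-side check that backs it up. The tool, its fields, and its bounds are invented for illustration, and the validator is hand-rolled to cover only the keywords this one schema uses; a real service would more likely rely on a full JSON Schema validator.

```python
import re

# Hypothetical tool schema; field names and bounds are assumptions.
GET_ORDERS_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "required": ["customer_id", "status", "limit"],
    "properties": {
        "customer_id": {"type": "string", "format": "uuid"},
        "status": {"type": "string", "enum": ["open", "shipped", "refunded"]},
        "limit": {"type": "integer", "minimum": 1, "maximum": 100},
    },
}

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$", re.I
)


def validate_args(args: dict, schema: dict) -> list[str]:
    """Return a list of violations; an empty list means the call may proceed."""
    errors = []
    props = schema["properties"]
    for key in schema["required"]:
        if key not in args:
            errors.append(f"missing required field: {key}")
    for key, value in args.items():
        if key not in props:
            errors.append(f"unexpected field: {key}")  # additionalProperties: false
            continue
        spec = props[key]
        if spec["type"] == "integer":
            # bool is a subclass of int in Python, so exclude it explicitly
            if not isinstance(value, int) or isinstance(value, bool):
                errors.append(f"{key}: expected integer")
            elif not spec["minimum"] <= value <= spec["maximum"]:
                errors.append(f"{key}: out of range")
        elif spec["type"] == "string":
            if not isinstance(value, str):
                errors.append(f"{key}: expected string")
            elif "enum" in spec and value not in spec["enum"]:
                errors.append(f"{key}: not one of {spec['enum']}")
            elif spec.get("format") == "uuid" and not UUID_RE.match(value):
                errors.append(f"{key}: not a UUID")
    return errors
```

Calling `validate_args({"customer_id": "acme", "status": "open", "limit": 500, "note": "hi"}, GET_ORDERS_SCHEMA)` reports three violations: a malformed UUID, an out-of-range limit, and a hallucinated extra key.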
Decompose Schemas That Are Too Big
If a tool’s schema has more than 8–10 fields, or if fields have complex interdependencies, decompose. Two narrow tools with strict schemas outperform one wide tool with optional fields, even though the agent has to make more calls. This is the decompose-and-aggregate pattern, and it consistently improves reliability on complex extractions at the cost of more latency and more tokens. Most production agents err too far toward fat tools.
Error Handling: The Error Class Belongs in the Response
The biggest single reliability improvement you can make to a production tool layer is communicating failure semantically. A 500 tells the agent nothing useful. A response like this tells it everything:
{
  "error": {
    "class": "transient",
    "retryable": true,
    "retry_after_ms": 2000,
    "code": "upstream_timeout",
    "message": "CRM API timed out after 5s"
  }
}
The error class is the contract. The minimum vocabulary:
| Class | Retryable | What the agent should do |
|---|---|---|
| `validation` | No | Surface to the user; do not retry with the same arguments |
| `authorization` | No | Surface to the user; do not retry; do not escalate scope |
| `not_found` | No | Surface to the user; consider whether intent was misunderstood |
| `transient` | Yes | Retry with backoff up to a bounded count |
| `rate_limit` | Yes (delayed) | Respect `retry_after_ms`; do not loop |
| `permanent` | No | Surface to the user; do not retry; log for investigation |
| `partial_success` | Sometimes | Inspect `succeeded` and `failed` arrays; do not assume completion |
When a model sees structured error semantics, its retry behavior becomes intelligent rather than blind. When it sees a generic 500, it improvises — and improvisation in a retry loop is how production incidents happen.
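The same contract can be enforced in the harness around the model, so that retries are bounded even when the model improvises. The sketch below assumes a hypothetical `call_with_policy` wrapper and tools that return a dict with an optional `error` object shaped like the example above. It treats `partial_success` as non-retryable for simplicity, which a real harness may want to handle separately.

```python
import time

# Classes from the table above that may be retried at all.
RETRYABLE_CLASSES = {"transient", "rate_limit"}


def call_with_policy(tool, args: dict, max_attempts: int = 3, sleep=time.sleep) -> dict:
    """Call `tool` and retry only when the structured error says it is safe.

    `tool` is any callable returning a dict; an "error" key signals failure.
    """
    response: dict = {}
    for attempt in range(1, max_attempts + 1):
        response = tool(args)
        error = response.get("error")
        if error is None:
            return response
        if error["class"] not in RETRYABLE_CLASSES or not error.get("retryable", False):
            return response  # validation, authorization, etc.: surface, do not retry
        if attempt < max_attempts:
            # Prefer the server's hint; otherwise back off exponentially.
            delay_ms = error.get("retry_after_ms", 500 * 2 ** (attempt - 1))
            sleep(delay_ms / 1000)
    return response  # retry budget exhausted: surface the last error
```

The `sleep` parameter is injectable so the policy can be tested without waiting. Note that this wrapper is only safe for write tools when they also honor an idempotency key.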
Idempotency: The Single Most Important Practice
Every write operation an agent can call must have an idempotency key, and the tool must reject duplicates server-side. This is not negotiable. It is the only mechanism that makes agent retry behavior safe.
The pattern:
- The agent generates a stable idempotency key per logical operation (not per retry — per operation).
- The tool stores the key with the result for a bounded window (24 hours is typical).
- On a retry, the tool sees the key, returns the original result, and does not re-execute the side effect.
- The agent’s prompt and the tool’s description both make idempotency contractually explicit.
Without this, every retry path is a production incident waiting to happen. The team I opened with — three refunds — did not have it. Every team that has had an “agent did the same thing three times” incident did not have it. Build it in the first tool you ship.
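A minimal sketch of the server side of that pattern follows. The `IdempotentRefunds` class, the key format, and the in-memory dictionary are all assumptions made for illustration; a production tool would need a durable store with atomic writes, and would likely also check that a reused key arrives with the same payload.

```python
import time


class IdempotentRefunds:
    """In-memory sketch of a write tool that replays results for repeated keys."""

    def __init__(self, ttl_seconds: float = 24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._results: dict[str, tuple[float, dict]] = {}
        self.refunds_issued = 0  # counts real side effects, for illustration

    def process_refund(self, idempotency_key: str, order_id: str, amount_cents: int) -> dict:
        now = self._clock()
        cached = self._results.get(idempotency_key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # duplicate: replay the original result, no side effect
        self.refunds_issued += 1  # the side effect happens at most once per live key
        result = {"order_id": order_id, "refunded_cents": amount_cents, "status": "processed"}
        self._results[idempotency_key] = (now, result)
        return result


tool = IdempotentRefunds()
key = "refund:order-123:request-1"  # one key per logical operation, reused on retries
first = tool.process_refund(key, "order-123", 2500)
retry = tool.process_refund(key, "order-123", 2500)
print(tool.refunds_issued)  # 1
```

The key is derived from the logical operation, not from the attempt, which is what lets both the tool's internal retry and the agent's retry loop collapse into a single refund.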
Auth Scoping: The Principle of Least Privilege, Per Call
The blast radius of an agent’s mistake is bounded by the credentials it executed under. If the agent ran with full admin scope, the blast radius is full admin. If the agent ran with read-only scope on a single customer record, the blast radius is exactly that.
Production tool calling enforces least privilege at the per-call level, not at the agent-process level:
- Tokens minted per task. When a user delegates an action to an agent, the agent receives a token scoped to that action, that user, that resource, with a short lifetime. Just-in-time credentials.
- Two identities on every tool call. The agent’s workload identity (what the agent is allowed to do) and the user’s delegated identity (what this user asked the agent to do on their behalf). The tool authorizes against both.
- Write scope is granted explicitly per tool. Read tools never use write credentials. Write tools never see credentials they cannot use.
- No credential persistence past the task. When the task ends, the token is revoked, not just allowed to expire.
This is the discipline the agent blast radius analysis at Tian Pan describes as a precondition for shipping autonomous agents. We cover the credential side specifically in AI agent secrets management.
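The two-identity check can be sketched in a few lines. Everything named here (`TaskToken`, `mint_token`, `authorize`, `revoke`, the action strings, and the in-memory policy and revocation set) is hypothetical; a real deployment would use a token service and a policy engine in their place.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskToken:
    """Short-lived credential scoped to one user, one resource, one action."""
    user_id: str
    resource: str
    action: str
    expires_at: float


# Hypothetical policy: what each agent workload is allowed to do at all.
AGENT_ALLOWED_ACTIONS = {"support-agent": {"orders:read", "refunds:create"}}
REVOKED: set[TaskToken] = set()


def mint_token(user_id: str, resource: str, action: str,
               ttl_s: float = 300, now=time.time) -> TaskToken:
    return TaskToken(user_id, resource, action, now() + ttl_s)


def authorize(agent_id: str, token: TaskToken, resource: str, action: str,
              now=time.time) -> bool:
    """Allow the call only if both the agent identity and the delegated token permit it."""
    agent_ok = action in AGENT_ALLOWED_ACTIONS.get(agent_id, set())
    token_ok = (
        token not in REVOKED
        and token.resource == resource
        and token.action == action
        and now() < token.expires_at
    )
    return agent_ok and token_ok


def revoke(token: TaskToken) -> None:
    REVOKED.add(token)  # explicit revocation at task end, not just expiry
```

A token minted for `orders:read` on one order cannot be used to create a refund or to read a different order, even though the agent's workload identity permits both actions in general.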
A Cursor AI Agent Deleted a Production Database
The widely reported case where a Cursor coding agent deleted a production database — including volume-level backups — while attempting to fix a credential mismatch is the canonical blast-radius cautionary tale. The model did not malfunction. The tool had the permission to do it, with no human in the loop, no read-only equivalent, no dry-run mode, no irrevocable-action gate. The blast radius equaled the credentials.
Bounding Blast Radius Before the Agent Misfires
Blast-radius analysis happens at design time, per tool, before the agent is wired up to call it. The questions:
- What is the worst case if the LLM calls this tool with hallucinated arguments?
- Is the operation reversible? In what window?
- Could a single call affect more than one user, tenant, or record?
- Does a successful call cost money? How much? With what cap?
- What evidence would I need six months from now to investigate a bad call?
The mitigations are well-known but rarely applied consistently:
- Read-only tools are the default. Read tools are safer; require explicit justification for write tools.
- Irreversible writes gate behind a human approval step. Sending an email, processing a payment, deleting a record — the agent proposes, a human approves.
- Dry-run modes. A `simulate=true` parameter that returns what would happen without doing it.
- Cost caps per call. Tools that consume billable resources expose a `max_cost_usd` parameter, and the tool refuses to proceed past it.
- Resource bounds per call. “Update at most 100 records.” “Send to at most 50 recipients.”
- Audit logs that capture intent. Not just what the tool did, but what arguments the model supplied and the prior tool calls that led there.
These mitigations look like they slow agents down. In production, they are the difference between an autonomous agent your business trusts and a liability your CFO does not.
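Two of these mitigations, the dry-run mode and the per-call resource bound, fit naturally into the tool signature itself. The following is a sketch under stated assumptions: `bulk_update_status` is a hypothetical write tool, the cap of 100 is arbitrary, and the actual write is left as a comment. A cost cap would follow the same shape, with an estimate compared against a limit before any work is done.

```python
def bulk_update_status(record_ids: list[str], new_status: str, *,
                       simulate: bool = True, max_records: int = 100) -> dict:
    """Hypothetical write tool with a resource bound and a dry-run default."""
    if len(record_ids) > max_records:
        return {
            "error": {
                "class": "validation",
                "retryable": False,
                "code": "too_many_records",
                "message": f"{len(record_ids)} records exceeds cap of {max_records}",
            }
        }
    plan = [{"id": rid, "set_status": new_status} for rid in record_ids]
    if simulate:
        return {"simulated": True, "would_update": plan}  # nothing is written
    # A real implementation would perform the write here.
    return {"simulated": False, "updated": len(plan)}
```

Defaulting `simulate` to `True` means a hallucinated or incomplete call previews the change instead of making it; the agent has to ask for the write explicitly.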
Make Production Tool Calling Boring
A production AI agent is exciting. A production tool layer should be boring: tight schemas, explicit error classes, idempotency keys, scoped credentials, bounded blast radius.
Tool Selection: Help the Model Pick Right
When agents pick the wrong tool, the cause is almost always schema metadata, not model capability. Five rules that consistently improve tool selection:
- Each tool has a distinct name. `search_customers` and `find_customer` and `lookup_customer` should not all exist. Pick one shape.
- Descriptions disambiguate against neighbors. “Use `search_customers` for fuzzy lookups by name or email; for exact ID lookups, use `get_customer`.” The model needs to know when not to use this tool.
- Examples in the description are leverage. “Example: `search_customers(query='acme corp')` returns up to 10 matches.” One example beats a paragraph of prose.
- Cap the tool surface per task. An agent given 80 tools picks worse than the same agent given the 6 relevant to the current task. Where the surface is large, use a dispatcher pattern: a coarse “intent classification” step that narrows the tool set before tool-calling begins.
- The tool surface is a product. Treat it that way — versioned, documented, deprecated explicitly, with a changelog the agent can reason about.
The dispatcher pattern connects directly to orchestration. We cover the architectural choices in AI agent orchestration patterns.
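As a rough illustration of the dispatcher idea, the sketch below narrows the tool list with a keyword match before any tool-calling prompt is built. The intent names, keyword lists, and tool names are invented for the example, and in practice the classifier might be a small model and not string matching.

```python
# Hypothetical tool groups and keywords, for illustration only.
TOOLS_BY_INTENT = {
    "billing": ["get_invoice", "process_refund"],
    "orders": ["get_order", "search_orders"],
    "customers": ["get_customer", "search_customers"],
}

KEYWORDS = {
    "billing": ("refund", "invoice", "charge"),
    "orders": ("order", "shipment", "delivery"),
    "customers": ("customer", "account", "email"),
}


def select_tools(user_message: str) -> list[str]:
    """Return the narrowed tool list for the intents matched in the message."""
    text = user_message.lower()
    intents = [name for name, words in KEYWORDS.items() if any(w in text for w in words)]
    tools: list[str] = []
    for intent in intents:
        tools.extend(TOOLS_BY_INTENT[intent])
    return tools


print(select_tools("Where is my order?"))  # ['get_order', 'search_orders']
```

An empty result means no intent matched; a real dispatcher would need a fallback for that case, such as asking a clarifying question, so the model is never handed the full surface by default.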
When to Use Function Calling vs MCP
Most production teams will use both. Function calling is right when the tool, the agent, and the deploy cycle are all in one place. MCP is right when multiple agents need the same tools, when governance demands a centralized audit boundary, or when you want the option to swap model providers without rewriting your tool layer. The tradeoff is operational cost: function calling is one service; MCP is two. We expand the decision criteria in building MCP servers for production AI agents.
A useful rule of thumb: start with function calling. Graduate to MCP when you have a second consumer for the same tools, or when a compliance requirement makes it mandatory. Premature MCP is a real anti-pattern.
Observability for the Tool Layer
You cannot improve what you cannot see. Per-tool metrics that matter in production:
- Tool selection accuracy — when the agent had a choice, did it pick the right one? Requires ground truth from offline evals or sampled human review.
- First-attempt argument validity — what percentage of tool calls pass schema validation without retry? This is the cleanest leading indicator of schema or prompt quality.
- Error class distribution — are most errors transient, validation, or authorization? The shape tells you where to invest.
- Tool latency p95 by tool and by tenant — slow tools propagate into agent latency users feel.
- Idempotency hit rate — how often is a key seen twice? A rising rate is a retry-storm leading indicator.
- Downstream cost per call — surface this to the team owning the tool, not just to FinOps.
This is the tool-level slice of broader AI agent observability. Step-level traces — logs that capture each tool call, its arguments, its result, and the reasoning step that followed — are the prerequisite. Without them you cannot debug a bad answer six hours after the fact. The SOPHOS blast-radius analysis and the TianPan blast-radius framework both treat trace completeness as the bedrock of safe agent deployment.
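Several of the metrics above fall out of the step-level traces directly. The sketch below assumes each trace record is a dict with three hypothetical fields: `attempt` (1 for the first try), `error_class` (`None` on success), and `idempotency_key` (`None` for read tools). Selection accuracy is left out because it needs ground-truth labels.

```python
from collections import Counter


def tool_metrics(calls: list[dict]) -> dict:
    """Summarise tool-call records into a few per-tool reliability metrics."""
    first_attempts = [c for c in calls if c["attempt"] == 1]
    valid_first = [c for c in first_attempts if c["error_class"] != "validation"]
    error_classes = Counter(c["error_class"] for c in calls if c["error_class"])
    keys = [c["idempotency_key"] for c in calls if c["idempotency_key"]]
    repeated = len(keys) - len(set(keys))  # calls that reused an already-seen key
    return {
        "first_attempt_validity": (
            len(valid_first) / len(first_attempts) if first_attempts else None
        ),
        "error_class_distribution": dict(error_classes),
        "idempotency_hit_rate": repeated / len(keys) if keys else None,
    }
```

Computed per tool and per time window, a rising `idempotency_hit_rate` is the kind of signal that would have flagged the duplicate-refund pattern before a customer did.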
The Production Tool-Calling Checklist
If you are about to ship an agent that calls real tools against real data, this is the minimum bar:
| Practice | Status |
|---|---|
| Strict JSON Schema with `additionalProperties: false` for every tool | Required |
| Tool descriptions disambiguated against neighbors | Required |
| `enum` for all closed-set fields | Required |
| Structured error classes in every response | Required |
| Idempotency keys required for every write tool | Required |
| Just-in-time credentials, scoped per task | Required |
| Read-only by default; explicit justification for write tools | Required |
| Irreversible writes gated by human approval | Required |
| Cost caps and resource bounds per call | Required |
| Step-level traces with arguments and reasoning | Required |
| Per-tool error-class and selection-accuracy metrics | Required |
| Documented blast-radius analysis per tool | Required |
Every production AI incident I have seen in the last eighteen months traces back to two or three rows of this table being skipped.
Where Tool Calling Sits in the Production AI Stack
Tool calling is one layer of the system underneath the chat box. Above it is orchestration: which tools to call, in what order, with what state (AI agent orchestration patterns). Below it is the surface those tools expose to enterprise data (building MCP servers). Around it is the observability and evaluation discipline that catches regressions before customers do (AI agent observability).
Treat tool calling as a product surface, not a glue layer. The teams that ship reliable production agents do this. The teams that ship demos do not. metacto’s Operational AI work is, in large part, building this surface for clients who have already proven the AI use case and now need the engineering discipline to make it production-grade.
Frequently Asked Questions
What is the single most important practice for reliable AI agent tool calling?
Idempotency. Every write operation an agent can call must have an idempotency key, and the tool must reject duplicates server-side. Without it, every retry path is a production incident waiting to happen — the canonical example being agents that process the same refund or send the same email multiple times because a transient timeout triggered a retry loop. Idempotency makes agent retry behavior safe by construction.
How should an LLM tool return errors?
Return a structured error object that includes an error class (validation, authorization, not_found, transient, rate_limit, permanent, partial_success), a retryable flag, and a retry_after_ms hint when applicable. A bare 500 tells the agent nothing useful and produces blind retry loops. Structured error semantics turn agent retry behavior from improvisation into intelligent recovery and dramatically reduce production incidents.
How big should an LLM tool's parameter schema be?
Keep it under 8-10 fields per tool. If a single tool needs more, decompose into multiple narrower tools. Strict schemas with short keys, optional fields only where you genuinely accept missing data, enums for closed sets, and additionalProperties set to false produce dramatically more reliable tool calls than sprawling schemas. Schema strictness is one of the highest-leverage reliability knobs you have.
What credentials should an AI agent run with?
Just-in-time, per-task, least-privilege credentials. The agent receives a token scoped to the specific action, the specific user, and the specific resource, with a short lifetime, and that token is revoked when the task ends. The blast radius of an agent mistake is bounded exactly by the credentials it executed under, so admin-scoped agents have admin-scoped blast radius and read-only agents have read-only blast radius.
Should write operations require human approval?
Irreversible or high-impact write operations should, until your evals and audit logs give you the evidence to remove the gate. Sending external emails, processing payments, deleting records, and similar actions are the canonical examples. The pattern is: agent proposes, human approves, tool executes. The approval gate is removed only when production data demonstrates the agent's accuracy on this specific action class meets a defined bar.
How do I improve which tool the agent picks?
When agents pick wrong, the cause is almost always schema metadata, not model reasoning. Make every tool name distinct, write descriptions that explicitly disambiguate against neighboring tools (use this when X, use the other tool when Y), include concrete examples in the description, and cap the tool surface per task. An agent given 6 relevant tools picks better than the same agent given 80, even when the 6 are a subset of the 80.