AI Agent Tool Calling in Production: Reliability Checklist

A team I worked with had a customer-support agent in production. The eval scores were good. The traces looked clean. Then leadership got a Slack message from a customer whose refund had been processed three times. Then five.

The agent was calling process_refund for the right reason. The tool retried on a transient timeout, and the agent’s retry loop retried too. There was no idempotency key. Every retry became a real refund.

That was not a model hallucination. It was a production tool-calling failure: the schema, retry policy, permissions, observability, and blast-radius controls around the model were not designed as workflow infrastructure.

If you are searching for how production AI teams handle agent tool calls, this is the answer: treat every tool call as a governed action. LLM function calling is the interface. The production system is everything that decides whether the call is valid, authorized, idempotent, observable, reversible, and bounded.

This guide is for teams whose agents already call tools against CRM, support, billing, ticketing, code, documents, or internal systems. It is part of the larger question of why AI experiments fail after a promising pilot: agents can pass demos and evals while still being unsafe to run against real workflows.

Production Tool-Calling Checklist

Before an agent calls a tool that can read customer data, update a system, send a message, trigger a workflow, or spend money, the production checklist is short and unforgiving:

Layer	Production requirement	Failure it prevents
Tool contract	Narrow tool names, strict JSON Schema, `additionalProperties: false`, enums for closed sets	Wrong tool selection and malformed arguments
Validation	Server-side argument validation before execution	Model-generated values reaching real systems unchecked
Errors and retries	Structured error classes, bounded retry policy, explicit `retry_after_ms`	Retry loops, rate-limit loops, and ambiguous failures
Writes	Idempotency key for every side-effecting operation	Duplicate refunds, duplicate emails, duplicate updates
Auth	Per-task, least-privilege credentials checked against both user and agent identity	Tool calls running with broader authority than the user intended
Blast radius	Read-only default, dry-run mode, human approval for irreversible writes, cost and resource caps	One bad call changing too much at once
Logging	Step-level logs with tool name, schema version, arguments, result, error class, identity, and approval state	Incidents that cannot be reconstructed later
Operations	Tool-level evals, monitoring, runbooks, and sampled review	Silent degradation after launch

The Tool Call Is a Workflow Step

A function call is not just a JSON object leaving the model. In production, it is a workflow step with permissions, side effects, accountability, and a failure mode. Design it with the same care you would give any customer-facing operation.

Where Tool Calls Fail in Production

The model layer gets the attention. The tool layer usually decides the incident.

Production failures tend to cluster around a few patterns:

Wrong tool selected for the user’s intent. This is often a naming and description problem, not a reasoning problem.
Right tool, wrong arguments. The model passes a string where the system expects a UUID, sends start_date after end_date, or fills a required field with a plausible but invalid value.
Successful response, wrong interpretation. An HTTP 200 with a truncated list, partial result, stale cache, or warning field looks like success unless the response contract says otherwise.
Failed call, unsafe retry. The agent tries again after a timeout, but the first attempt already changed state.
Correct call, wrong identity. The agent uses a service credential with more authority than the user who delegated the task.
Hidden downstream cost. A search, retrieval, or enrichment tool quietly burns budget every time it runs.

The reliability work is not convincing the model to be careful. The reliability work is making the tool surface impossible to misuse silently. Agent reliability guides make the same point from different angles: malformed arguments, unhandled tool errors, and ambiguous execution results are where production behavior drifts from demo behavior (BSWEN reliability guide).

Schema Design for LLM Function Calling

The schema you hand the model is the contract. A tight schema gives the model fewer wrong shapes to invent and gives your server a concrete validation boundary. A loose schema pushes ambiguity into production.

The rules that hold up:

Use distinct tool names. search_customers, get_customer, and update_customer are easier to reason about than overlapping names like find_customer, lookup_customer, and customer_tool.
Write descriptions that disambiguate against neighbors. Say when to use the tool and when not to. Example: “Use search_orders only when the user provides a customer email. For exact order ID lookups, use get_order.”
Prefer required fields over optional sprawl. Optional fields are where models invent missing context. Make a field optional only when the tool genuinely supports missing data.
Use enum for closed sets. Status, region, action type, priority, and approval state should not be free-form strings.
Use additionalProperties: false. Extra keys should fail validation, not disappear silently.
Bound scalar values. Use minimum, maximum, format, pattern, and explicit date formats where possible.
Keep each tool narrow. If one schema has too many fields or many conditional dependencies, split it into smaller tools and aggregate the result at the workflow layer.

Provider-side structured-output and tool-use modes help with formatting. They do not replace validation. Your server should still reject malformed arguments before execution and return an error the agent can act on. The broader structured-outputs pattern is useful, but production reliability comes from pairing model constraints with server-side enforcement (Agenta structured outputs guide).
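As a rough illustration of the rules above, here is a minimal sketch of a strict contract and the server-side check that enforces it. The `process_refund` fields, the limits, and the `validate_arguments` helper are all hypothetical, and the validator covers only the handful of keywords this schema uses. In a real service you would more likely rely on a full JSON Schema validator.

```python
import re
from typing import Any

# Hypothetical contract for a refund tool; every field name and limit is illustrative.
PROCESS_REFUND_SCHEMA: dict[str, Any] = {
    "type": "object",
    "additionalProperties": False,
    "required": ["order_id", "amount_cents", "reason"],
    "properties": {
        "order_id": {"type": "string", "pattern": r"^ord_[0-9]+$"},
        "amount_cents": {"type": "integer", "minimum": 1, "maximum": 50000},
        "reason": {"type": "string", "enum": ["damaged", "late", "duplicate_charge"]},
    },
}

_PY_TYPES = {"string": str, "integer": int}


def validate_arguments(schema: dict[str, Any], args: Any) -> list[str]:
    """Return a list of problems; an empty list means the arguments are valid.

    Handles only the keywords used in the schema above, not all of JSON Schema.
    """
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    errors: list[str] = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    if schema.get("additionalProperties") is False:
        for name in args:
            if name not in props:
                errors.append(f"unexpected field: {name}")
    for name, rule in props.items():
        if name not in args:
            continue
        value = args[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[rule["type"]]):
            errors.append(f"{name}: expected {rule['type']}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: must be one of {rule['enum']}")
        if "pattern" in rule and not re.search(rule["pattern"], value):
            errors.append(f"{name}: does not match required pattern")
        if "minimum" in rule and value < rule["minimum"]:
            errors.append(f"{name}: below minimum {rule['minimum']}")
        if "maximum" in rule and value > rule["maximum"]:
            errors.append(f"{name}: above maximum {rule['maximum']}")
    return errors
```

The tool would run this check before touching any downstream system and hand a non-empty list back to the agent as a validation failure, so the model learns what to fix instead of retrying the same arguments.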

Error Handling: Stop Failure Loops Before They Start

The most dangerous tool-call failure is not a loud exception. It is a retry loop where the agent, the tool, and the downstream system all think someone else owns safety.

A bare HTTP 500 gives the agent almost no useful information. A production tool should return an error object with a class, retryability, and a bounded recovery path:

```json
{
  "error": {
    "class": "transient",
    "retryable": true,
    "retry_after_ms": 2000,
    "code": "upstream_timeout",
    "message": "CRM API timed out after 5s"
  }
}
```

The error class is the contract:

Class	Retryable	Agent behavior
`validation`	No	Ask for missing or corrected input; do not retry the same arguments
`authorization`	No	Explain the permission issue; do not escalate scope automatically
`not_found`	No	Ask for clarification or report the missing record
`transient`	Yes	Retry with backoff up to a fixed limit
`rate_limit`	Yes, delayed	Respect `retry_after_ms`; do not loop
`permanent`	No	Surface the failure and log for investigation
`partial_success`	Sometimes	Inspect successful and failed items separately

This is how you prevent the “relapse cycle” version of tool calling: the model hits an error, retries with slightly different wording, hits another error, retries again, and eventually produces a confident answer from incomplete state. Structured errors let the agent stop, wait, ask, or escalate instead of improvising.
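One way to make the error-class table executable is a small policy function that the agent runtime consults before deciding what to do next. This is a sketch under assumptions: the `next_action` name, the action labels, the three-attempt limit, and the fallback backoff schedule are illustrative choices, not part of any provider API.

```python
from typing import Any

MAX_ATTEMPTS = 3  # assumed fixed limit; tune per tool


def next_action(error: dict[str, Any], attempt: int) -> tuple[str, float]:
    """Map a structured tool error to (action, wait_seconds).

    `attempt` is the 1-based number of the attempt that just failed.
    """
    error_class = error.get("class")
    if error_class in ("transient", "rate_limit") and error.get("retryable"):
        if attempt >= MAX_ATTEMPTS:
            return ("escalate", 0.0)  # bounded: never loop forever
        # Prefer the tool's own hint; otherwise fall back to exponential backoff.
        fallback_ms = 500 * 2 ** (attempt - 1)
        wait_ms = error.get("retry_after_ms", fallback_ms)
        return ("retry", wait_ms / 1000)
    if error_class in ("validation", "not_found"):
        return ("ask_user", 0.0)  # same arguments would fail again
    if error_class == "authorization":
        return ("explain", 0.0)  # never widen scope automatically
    if error_class == "partial_success":
        return ("inspect_items", 0.0)
    # "permanent" and anything unrecognized: surface it and stop.
    return ("escalate", 0.0)
```

The point of the design is that an unknown or missing class falls through to escalation, so a tool that forgets to classify its error cannot trigger a retry loop.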

Idempotency for AI Agents

Every write operation an agent can call needs an idempotency key. The key should represent the logical operation, not the attempt.

For the refund example, the key should belong to “refund order 123 for customer 456 under approval 789.” If the first call times out and the agent retries, the tool should see the same key, return the original result, and refuse to execute the refund again.

The pattern:

The workflow creates or requests a stable idempotency key before the write.
The agent passes that key on every attempt for the same logical operation.
The tool stores the key, arguments, status, and result for a defined retention window.
A duplicate key returns the stored result or a clear in-progress status.
A duplicate key with different arguments fails loudly.

Without this, every retry policy is a bet that no transient failure will happen at exactly the wrong moment. In production, that is not a safe bet.
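The pattern can be sketched in a few lines. Everything here is an assumption made for illustration: the `IdempotencyStore` class, the in-memory dictionary, and the key format are hypothetical, and a production store would need durable storage, an atomic check-and-insert, and the retention window described above.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotencyConflict(Exception):
    """Raised when a key is reused with different arguments."""


class IdempotencyStore:
    """In-memory sketch; a real store would be durable, atomic, and expire records."""

    def __init__(self) -> None:
        self._records: dict[str, dict[str, Any]] = {}

    def execute(self, key: str, args: dict[str, Any],
                operation: Callable[[dict[str, Any]], Any]) -> Any:
        fingerprint = hashlib.sha256(
            json.dumps(args, sort_keys=True).encode("utf-8")
        ).hexdigest()
        record = self._records.get(key)
        if record is not None:
            if record["fingerprint"] != fingerprint:
                raise IdempotencyConflict(f"key {key!r} reused with different arguments")
            # Same logical operation: return the stored result or in-progress status.
            return record["result"]
        # Record the attempt before the side effect, so a retry that arrives
        # mid-flight sees "in_progress" instead of executing again.
        self._records[key] = {"fingerprint": fingerprint,
                              "result": {"status": "in_progress"}}
        result = operation(args)
        self._records[key]["result"] = result
        return result
```

Note one deliberate choice: if `operation` raises, the record stays at `in_progress`. After a timeout the real outcome is unknown, so the sketch leaves it for reconciliation rather than allowing a second execution.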

How Production Teams Handle Authentication for Agent Tool Calls

The practical answer is: authenticate the workload, authorize the delegated user, and scope the token to the task.

An agent should not run every tool call under one broad service account just because the model is the caller. The tool should know both identities:

Agent workload identity: what this deployed agent is allowed to do as a system.
Delegated user identity: what this user asked the agent to do and what the user is allowed to access.

The authorization decision should pass both checks. If the user cannot update a customer record manually, the agent should not update it on their behalf. If the agent is only approved for support workflows, it should not call billing write tools just because a prompt asks it to.

Production scoping usually includes:

Just-in-time credentials. Mint short-lived tokens for a task, tool, resource, and user.
Separate read and write scopes. A read tool should never receive credentials that can write.
No automatic privilege escalation. The agent can ask for approval or escalate to a human, but it should not expand its own scope.
Resource-level checks. Authorize the tenant, customer, record, ticket, document, or repository being touched.
Revocation at task end. The credential should die with the workflow, not linger for the next prompt.

This is the permission side of blast-radius control. If the agent misfires, the maximum damage is bounded by the credentials the tool accepted. The agent blast-radius analysis at Tian Pan frames this as a core design step for production agents. The same discipline shows up in AI agent secrets management.
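A minimal sketch of the two-identity check and task-scoped grant might look like the following. The `TaskGrant` shape, the two in-memory policy tables, and the five-minute default lifetime are assumptions for illustration. A real deployment would back these with its identity provider and policy engine.

```python
from dataclasses import dataclass

# Hypothetical policy tables; real systems would query an IdP or policy service.
AGENT_ALLOWED_TOOLS = {"support-agent": {"get_customer", "update_customer"}}
USER_PERMISSIONS = {
    "u_42": {("customer:17", "read"), ("customer:17", "write")},
    "u_7": {("customer:17", "read")},
}


@dataclass(frozen=True)
class TaskGrant:
    agent_id: str
    user_id: str
    tool: str
    resource: str
    scope: str          # "read" or "write", never both
    expires_at: float   # epoch seconds


def mint_grant(agent_id: str, user_id: str, tool: str, resource: str,
               scope: str, now: float, ttl_seconds: float = 300) -> TaskGrant:
    """Issue a short-lived grant only if BOTH identities pass."""
    if tool not in AGENT_ALLOWED_TOOLS.get(agent_id, set()):
        raise PermissionError("agent workload is not approved for this tool")
    if (resource, scope) not in USER_PERMISSIONS.get(user_id, set()):
        raise PermissionError("delegated user lacks this permission")
    return TaskGrant(agent_id, user_id, tool, resource, scope, now + ttl_seconds)


def grant_allows(grant: TaskGrant, tool: str, resource: str,
                 scope: str, now: float) -> bool:
    """The tool re-checks the grant on every call; expiry acts as revocation."""
    return (grant.tool == tool and grant.resource == resource
            and grant.scope == scope and now < grant.expires_at)
```

Passing `now` explicitly keeps the sketch easy to test. The tool, not the agent, calls `grant_allows`, which is what keeps a misfiring agent inside the grant it was given.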

Catching Wrong Tool Calls Before Customers Do

The best way to catch wrong tool calls is to test the decision boundary, not just the final answer.

Build a small eval set where the expected output is the tool choice and arguments:

Scenario	Expected tool behavior
“Refund order 123” without user approval	Do not call `process_refund`; request approval
“Find recent tickets for acme.com”	Call `search_tickets` with domain query and read-only scope
“Update every account in this CSV”	Refuse or route to review because the resource bound is too broad
“What happened with invoice INV-9?”	Call `get_invoice` rather than a broad customer search
“Send this email to the customer”	Draft only, unless this workflow has an approval gate

Then log and review three metrics:

Tool selection accuracy: did the agent pick the expected tool or safe refusal?
First-attempt argument validity: did the call pass schema validation without repair?
Unsafe-call interception: did policy, auth, approval, or blast-radius controls block the right calls?
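The three metrics above can be computed from a flat list of eval outcomes. The `EvalResult` fields and the `summarize` function below are hypothetical names for this sketch, and it assumes a safe refusal is recorded as a tool value of `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    expected_tool: Optional[str]    # None means a safe refusal was expected
    actual_tool: Optional[str]      # None means the agent made no tool call
    args_valid_first_try: Optional[bool]  # None when no call was made
    should_be_blocked: bool         # scenario is unsafe by design
    was_blocked: bool               # policy, auth, or approval stopped the call


def summarize(results: list[EvalResult]) -> dict[str, Optional[float]]:
    """Compute the three review metrics; None where a metric has no samples."""
    if not results:
        return {"tool_selection_accuracy": None,
                "first_attempt_validity": None,
                "unsafe_call_interception": None}
    selected = sum(r.expected_tool == r.actual_tool for r in results)
    calls = [r for r in results if r.args_valid_first_try is not None]
    unsafe = [r for r in results if r.should_be_blocked]
    return {
        "tool_selection_accuracy": selected / len(results),
        "first_attempt_validity": (
            sum(bool(r.args_valid_first_try) for r in calls) / len(calls)
            if calls else None),
        "unsafe_call_interception": (
            sum(r.was_blocked for r in unsafe) / len(unsafe)
            if unsafe else None),
    }
```

Keeping the denominators separate matters: argument validity is only meaningful over scenarios where a call happened, and interception only over scenarios that were supposed to be blocked.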

For large tool catalogs, do not hand the model everything. Use a dispatcher or workflow state machine to narrow the tool set before function calling begins. Six relevant tools are easier to govern than eighty vaguely related ones. We cover the orchestration side in AI agent orchestration patterns.
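The narrowing step can be as simple as a lookup keyed by workflow state. This is a sketch only: the state names, the `TOOLS_BY_STATE` table, and the shape of the catalog entries are assumptions, and a real dispatcher might use a classifier or a state machine instead of a static map.

```python
from typing import Any

# Hypothetical mapping from workflow state to the tools exposed at that step.
TOOLS_BY_STATE: dict[str, list[str]] = {
    "triage": ["search_tickets", "get_customer"],
    "refund_review": ["get_order", "process_refund"],
}


def tools_for_step(state: str, catalog: dict[str, dict[str, Any]]) -> list[dict[str, Any]]:
    """Return only the tool definitions allowed in this workflow state.

    Unknown states get an empty list, so the default is no tools at all.
    """
    names = TOOLS_BY_STATE.get(state, [])
    return [catalog[name] for name in names if name in catalog]
```

The returned list is what gets passed to the model for that step, which also gives the log a precise record of which tools were available when the choice was made.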

Tool Call Logging: What to Capture

Agent tool-call logging should let an engineer, product owner, or compliance reviewer reconstruct what happened without rerunning the model.

Capture:

workflow_run_id, request ID, tenant, and environment.
Tool name, tool version, schema version, and model version.
The tool set available to the model at that step.
Validated arguments, with secrets and sensitive fields redacted.
Agent workload identity and delegated user identity.
Idempotency key for write operations.
Approval state and approver ID when a human gate exists.
Result summary, error class, retry count, latency, and downstream cost.
The next agent step: stop, retry, ask the user, escalate, or continue.

The log should not become a second data leak. Redact secrets, avoid storing raw access tokens, and decide which fields need hashing or omission. The goal is trace completeness without unnecessary sensitive data retention.
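To make the capture list concrete, here is a sketch of a step-level record builder with redaction applied before anything is written. The field names follow the list above, but the `SENSITIVE_KEYS` and `HASHED_KEYS` sets, the `[REDACTED]` marker, and the truncated unsalted hash are illustrative assumptions. The sketch also redacts only top-level keys.

```python
import hashlib
import json
from typing import Any, Optional

SENSITIVE_KEYS = {"access_token", "password", "card_number"}  # assumed; dropped entirely
HASHED_KEYS = {"email"}  # assumed; kept joinable without storing the raw value


def redact(args: dict[str, Any]) -> dict[str, Any]:
    """Redact top-level keys only; nested payloads would need a recursive walk."""
    clean: dict[str, Any] = {}
    for key, value in args.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif key in HASHED_KEYS:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            clean[key] = "sha256:" + digest[:16]
        else:
            clean[key] = value
    return clean


def build_log_record(*, workflow_run_id: str, tool_name: str, schema_version: str,
                     arguments: dict[str, Any], agent_identity: str,
                     user_identity: str, idempotency_key: Optional[str],
                     approval_state: str, error_class: Optional[str],
                     retry_count: int, latency_ms: int, next_step: str) -> str:
    """Serialize one tool-call step as a single JSON line."""
    record = {
        "workflow_run_id": workflow_run_id,
        "tool_name": tool_name,
        "schema_version": schema_version,
        "arguments": redact(arguments),
        "agent_identity": agent_identity,
        "user_identity": user_identity,
        "idempotency_key": idempotency_key,
        "approval_state": approval_state,
        "error_class": error_class,
        "retry_count": retry_count,
        "latency_ms": latency_ms,
        "next_step": next_step,
    }
    return json.dumps(record, sort_keys=True)
```

Keyword-only parameters force every call site to state each field explicitly, which makes a missing identity or approval state a visible omission rather than a silent default.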

This is the tool-level slice of broader AI agent observability. It is also where Continuous AI Operations becomes practical: eval failures, incident reviews, schema drift, latency changes, and cost spikes all need a traceable tool-call record.

Bounding Blast Radius Before the Agent Misfires

Blast-radius analysis happens at design time, per tool, before the agent is wired up to call it.

Ask:

What is the worst case if the LLM calls this tool with plausible but wrong arguments?
Is the operation reversible? In what window?
Could a single call affect more than one user, tenant, or record?
Does a successful call cost money? What is the cap?
What evidence would we need later to investigate a bad call?

Then design controls into the tool:

Read-only by default. Write tools need a specific workflow reason.
Dry-run mode. Return what would happen before execution.
Human approval for irreversible writes. Payments, external emails, deletes, mass updates, and customer-visible changes should start with agent-proposes, human-approves.
Cost caps per call. Retrieval, enrichment, code execution, and external API calls should have budget limits.
Resource bounds. “Update at most 100 records” is a tool contract, not a prompt preference.
Audit logs that capture intent. Store the validated arguments and the workflow state that led to the call.
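Several of these controls can live in the tool's own signature. The sketch below assumes a hypothetical `bulk_update` tool where `apply_fn` stands in for the real write path. It shows a dry-run default, a hard resource bound, and an approval check that run in the tool, not in the prompt.

```python
from typing import Any, Callable, Optional, Sequence

MAX_RECORDS_PER_CALL = 100  # the resource bound is part of the tool contract


def bulk_update(record_ids: Sequence[str], changes: dict[str, Any],
                apply_fn: Callable[[Sequence[str], dict[str, Any]], None],
                *, dry_run: bool = True,
                approved_by: Optional[str] = None) -> dict[str, Any]:
    """Update records with a dry-run default, a hard cap, and an approval gate."""
    if len(record_ids) > MAX_RECORDS_PER_CALL:
        raise ValueError(
            f"refusing to touch {len(record_ids)} records; "
            f"limit is {MAX_RECORDS_PER_CALL}")
    plan = {"would_update": len(record_ids), "changes": changes}
    if dry_run:
        # Default path: describe the effect, change nothing.
        return {"status": "dry_run", **plan}
    if approved_by is None:
        # Agent proposes, human approves: no approver means no write.
        return {"status": "approval_required", **plan}
    apply_fn(record_ids, changes)
    return {"status": "applied", "updated": len(record_ids),
            "approved_by": approved_by}
```

Because `dry_run` defaults to `True`, a caller has to opt in to the write explicitly, and even then the call stops at `approval_required` until an approver is attached.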

The Sophos blast-radius analysis and the Tian Pan framework both treat the agent’s available tools, authority, and context as the real risk surface. That is the right framing: the model can only damage what the system lets the tool touch.

Make Production Tool Calling Boring

If your agent prototype already works, the next job is production discipline: strict tool contracts, scoped credentials, approval paths, logging, evals, and bounded write-backs.

When to Use Function Calling vs MCP

Most production teams will use both.

Function calling is right when the tool, agent, deploy cycle, and governance boundary live in one application. It is simpler to ship, easier to debug, and often the right starting point.

MCP is right when multiple agents or products need the same governed tool surface, when audit boundaries need to be centralized, or when you want model-provider portability without rewriting every integration. The tradeoff is operational cost: now the tool layer is its own service with its own auth, observability, versioning, and release process.

A useful rule of thumb: start with function calling. Move to MCP when you have a second serious consumer for the same tools or a governance requirement that demands a shared server. Premature MCP can turn a simple tool contract into another platform to operate. We expand the decision criteria in building MCP servers for production AI agents.

Make Tool Calling Part of the Production Workflow

Tool calling sits where context, workflow authority, and operations meet.

Define the job, the tool contract, the approval gate, the write-back path, and the human handoff together. Context Engineering makes sure the agent receives the right data, policies, source evidence, and permission boundaries before it acts. Continuous AI Operations keeps the tool calls visible after launch: argument validity, errors, latency, cost, incidents, and regression tests.

That is why tool calling should be treated as a product surface, not a glue layer. A production tool has a name, version, owner, schema, auth model, eval coverage, runbook, and deprecation path. If the team cannot explain those things, the agent is still a demo.

Frequently Asked Questions

What is the most important practice for reliable AI agent tool calling?

Idempotency for write operations. Every side-effecting tool call should carry a stable idempotency key for the logical operation, and the tool should reject duplicate execution server-side. That is what prevents retries from becoming duplicate refunds, duplicate emails, duplicate ticket updates, or duplicate payments.

How should an LLM tool return errors?

Return a structured error object with an error class, retryable flag, retry-after hint when needed, and a stable code. Useful classes include validation, authorization, not_found, transient, rate_limit, permanent, and partial_success. A generic 500 leaves the agent guessing; structured errors give it a safe recovery path.

How do production AI teams handle authentication for agent tool calls?

They check both the agent workload identity and the delegated user identity, then mint short-lived credentials scoped to the task, tool, resource, and user. Read and write scopes stay separate, scope escalation requires a human or policy path, and credentials are revoked when the workflow ends.

What should AI agent tool-call logs include?

Logs should capture the workflow run ID, tool name and version, schema version, available tool set, validated arguments with sensitive data redacted, delegated user identity, agent workload identity, idempotency key, approval state, result summary, error class, retry count, latency, and downstream cost.

How do I catch when an AI agent calls the wrong tool?

Create evals where the expected result is the tool choice and arguments, not just the final natural-language answer. Track tool selection accuracy, first-attempt argument validity, and unsafe-call interception. For large tool catalogs, narrow the available tool set with workflow state or a dispatcher before tool calling begins.

When should I use MCP instead of direct function calling?

Use direct function calling when one agent and one application own the tool lifecycle. Move to MCP when several agents need the same governed tools, when audit boundaries should be centralized, or when model-provider portability matters enough to operate a separate tool server.