An AI agent kicks off a 40-step workflow. Step 23 is a call to a flaky vendor API. The pod gets evicted mid-call. The agent has already spent $4 in LLM tokens, written half a record to your CRM, and sent one of two confirmation emails. The retry framework you bolted on top fires the whole workflow again from step one. The customer gets a second email. Finance gets a duplicate invoice. Your on-call gets paged at 2 a.m.
This is the failure mode that durable execution exists to eliminate. Not better prompts. Not bigger models. A runtime guarantee that the workflow will resume at the exact step it failed on, with completed work intact, even if every machine in your cluster restarts.
Durable execution is not a new idea - it is the same primitive that has powered Uber’s dispatch, Coinbase’s transfers, and Stripe’s webhooks for years. What changed is that AI agents broke every assumption stateless request-response infrastructure was built on. Agents are long-running. They are probabilistic. They produce external side effects. They cost real money per step. Production teams are now rediscovering durable execution because nothing else works.
This guide is for engineers and CTOs deciding how to build agents that survive contact with production. It is part of the larger question of why your AI experiments are failing - the gap between an impressive demo and a system that runs unattended for months.
What Durable Execution Actually Means
Durable execution is a programming model where the runtime, not your application code, guarantees that a function runs to completion exactly once, even across crashes, restarts, deploys, and network failures.
The mechanism is straightforward. Every meaningful step in your workflow - an LLM call, a tool invocation, a database write - is recorded as an event in a durable log before it returns. If the process dies, a new worker replays the log to reconstruct in-memory state, then resumes from the next un-executed step. Completed work is never repeated. Pending work is never lost.
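To make the mechanism concrete, here is a minimal sketch of record-then-replay. It assumes an in-memory dictionary standing in for the durable log, and the names `EventLog`, `run_step`, and `order_workflow` are hypothetical, not any runtime's API. A real runtime would persist each event before acknowledging the step.

```python
import json
from typing import Any, Callable, Dict, List, Optional


class EventLog:
    """Log of completed step results (in-memory stand-in for durable storage)."""

    def __init__(self) -> None:
        self._events: Dict[str, str] = {}

    def has(self, step_id: str) -> bool:
        return step_id in self._events

    def get(self, step_id: str) -> Any:
        return json.loads(self._events[step_id])

    def record(self, step_id: str, result: Any) -> None:
        # Results are serialized, as they would be when written to real storage.
        self._events[step_id] = json.dumps(result)


def run_step(log: EventLog, step_id: str, fn: Callable[[], Any]) -> Any:
    """Run fn once per step_id; on replay, return the recorded result instead."""
    if log.has(step_id):
        return log.get(step_id)
    result = fn()
    log.record(step_id, result)
    return result


def order_workflow(log: EventLog, calls: List[str], crash_before: Optional[str] = None) -> str:
    """Hypothetical three-step workflow; `calls` tracks which steps really executed."""

    def step(name: str, value: Any) -> Any:
        def fn() -> Any:
            if name == crash_before:
                raise RuntimeError("simulated crash")
            calls.append(name)
            return value

        return run_step(log, name, fn)

    plan = step("plan", "look up order")
    data = step("tool", {"order": 42})
    return step("synth", f"{plan}: {data['order']}")
```

Running `order_workflow` once with `crash_before="synth"` and then again with the same log executes `plan` and `tool` only once; the second run reads their results from the log and executes only `synth`.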
The Three Guarantees
A durable runtime provides three guarantees that ordinary code cannot: (1) State persistence - workflow variables survive process death; (2) Exactly-once side effects - external calls record their results so they are not repeated on replay; (3) Resumability - execution continues from the failure point, not from the start.
This is not the same as retry logic. Retry logic re-runs whatever unit it wraps, so a retry around a whole workflow repeats every step that already succeeded. Durable execution remembers everything that succeeded before the failure and resumes from there. The difference is the difference between a billing system that occasionally double-charges and one that does not.
Why AI Agents Forced the Rediscovery
Traditional web requests are short, stateless, and idempotent by convention. Durable execution was overkill for most of them. Two things broke that pattern.
Agents run for minutes to hours, not milliseconds. A research agent that browses 50 pages, synthesizes findings, drafts a document, and waits for human approval can easily live for an hour. Any pod restart in that window kills the work. According to Inngest’s analysis, AI agents introduce multiple points of failure that traditional retry logic cannot handle, and durable execution provides the automatic state persistence, automatic retries, and workflow resumption that make agents production-ready (Inngest, 2025).
Agents have expensive, non-replayable side effects. Every LLM call costs money and produces a non-deterministic output. You cannot replay a Claude or GPT call and get the same answer. You cannot send an email twice. You cannot charge a credit card again because the worker crashed. The output of every external action must be recorded the first time and reused on recovery.
The result is a growing consensus across the agent runtime community: checkpointing is now standard in production agent frameworks, with LangGraph, Temporal, and Dagster all shipping first-class checkpoint primitives (AppScale, 2026). The frameworks are converging because the failure modes are universal.
The Four Primitives You Cannot Skip
Whatever runtime you choose, four primitives are non-negotiable for production agents.
Checkpointing
Checkpointing is the act of saving execution state at well-defined boundaries - typically after each meaningful step. The checkpoint records what step ran, what arguments it received, and what it returned. On crash, a new worker reads the checkpoints in sequence to reconstruct workflow state.
The naive implementation - serialize the whole agent context and dump it to Redis - fails fast in production. Agent context windows can be hundreds of kilobytes. Checkpointing the full context on every step pushes you into payload size limits and makes replay slow. The right pattern is to checkpoint deltas, and to offload large payloads to external storage (S3, Postgres large objects), while keeping the event log lean.
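One way to keep the log lean is to store small results inline and replace large ones with a reference. The sketch below assumes a hypothetical `BlobStore` (a dictionary standing in for S3 or Postgres) and an arbitrary 1 KB threshold; both are illustrative choices, not values from any runtime.

```python
import hashlib
import json
from typing import Any, Dict

INLINE_LIMIT_BYTES = 1024  # hypothetical threshold; tune to your runtime's limits


class BlobStore:
    """In-memory stand-in for external storage such as S3 or Postgres."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Content-addressed key: identical payloads share one blob.
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def encode_result(result: Any, blobs: BlobStore) -> Dict[str, Any]:
    """Build the event-log entry: inline if small, otherwise a blob reference."""
    raw = json.dumps(result).encode("utf-8")
    if len(raw) <= INLINE_LIMIT_BYTES:
        return {"inline": result}
    return {"blob_ref": blobs.put(raw), "size": len(raw)}


def decode_result(entry: Dict[str, Any], blobs: BlobStore) -> Any:
    """Resolve a log entry back to the original result during replay."""
    if "inline" in entry:
        return entry["inline"]
    return json.loads(blobs.get(entry["blob_ref"]).decode("utf-8"))
```

With this shape, a multi-kilobyte transcript costs the event log only a hash and a size, and replay fetches the blob only when a step's result is actually needed.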
Replay
Replay is how a durable runtime reconstructs workflow state after a failure. The runtime walks the event log from the start, skipping any step that already recorded a result and re-running only the next pending step. Done correctly, replay is invisible to the workflow code - your function looks like it ran straight through, even if a dozen pods died along the way.
Replay imposes one rule that catches teams off guard: your workflow code must be deterministic, but your side effects need not be. The workflow function itself - the orchestration logic - must produce the same execution path given the same recorded events. Anything non-deterministic - LLM calls, current time, random IDs, HTTP requests - must be wrapped as a “step” or “activity” so its result is recorded once and reused on replay.
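The determinism rule is easier to see in code. This is a minimal sketch, assuming a plain dictionary as the recorded history and a hypothetical `recorded` helper; real runtimes expose their own step or activity wrappers for the same purpose.

```python
import uuid
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def recorded(history: Dict[str, Any], step_id: str, fn: Callable[[], Any]) -> Any:
    """Execute fn the first time; afterwards return the recorded value."""
    if step_id not in history:
        history[step_id] = fn()
    return history[step_id]


def intake_workflow(history: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical workflow: non-deterministic values enter only through recorded steps."""
    request_id = recorded(history, "request_id", lambda: str(uuid.uuid4()))
    started_at = recorded(
        history, "started_at", lambda: datetime.now(timezone.utc).isoformat()
    )
    # The branch depends only on recorded values, so replay takes the same path.
    branch = "even" if int(request_id[-1], 16) % 2 == 0 else "odd"
    return {"request_id": request_id, "started_at": started_at, "branch": branch}
```

Calling `intake_workflow` twice with the same history returns identical results. Calling `uuid.uuid4()` directly in the workflow body instead would produce a new ID, and possibly a different branch, on every replay.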
Idempotency
Idempotency is the property that a side effect can be invoked many times with the same result as invoking it once. Durable execution guarantees exactly-once at the workflow level, but only if the underlying side effects cooperate.
The classic example is a database write. If your “create order” step succeeds against the database but the process crashes before the result is written to the event log, replay will call “create order” again. Without idempotency keys, you now have two orders. With an idempotency key derived from the workflow run ID and step number, the second call is rejected by the database or by an idempotent middleware layer.
Every external action - HTTP POST, queue publish, payment charge, email send - needs an idempotency strategy. This is engineering work the runtime cannot do for you. Lock it down at the activity boundary.
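As an illustration, an idempotency key can be derived from the workflow run ID and step number and enforced by the receiver. The `OrderService` below is a hypothetical in-memory stand-in for the receiving system; in practice the same check is usually a unique constraint in the database or a feature of the vendor API.

```python
import hashlib
from typing import Any, Dict


def idempotency_key(run_id: str, step: int) -> str:
    """Stable key: the same run and step always map to the same key."""
    return hashlib.sha256(f"{run_id}:{step}".encode("utf-8")).hexdigest()


class OrderService:
    """Stand-in for a receiving system that enforces idempotency keys."""

    def __init__(self) -> None:
        self._by_key: Dict[str, Dict[str, Any]] = {}

    def create_order(self, key: str, sku: str) -> Dict[str, Any]:
        # A repeated key returns the original order instead of creating another.
        if key in self._by_key:
            return self._by_key[key]
        order = {"order_id": len(self._by_key) + 1, "sku": sku}
        self._by_key[key] = order
        return order

    def count(self) -> int:
        return len(self._by_key)
```

If replay re-issues step 3 of run `run-1`, the second `create_order` call carries the same key and returns the first order, so the crash window no longer produces a duplicate.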
Retries with Backoff
Durable runtimes handle retries declaratively. You annotate an activity with a retry policy (max attempts, initial interval, backoff multiplier, max interval, non-retryable error types), and the runtime executes it. Permanent failures - validation errors, auth errors - bubble up as workflow exceptions. Transient failures - timeouts, 5xx errors, rate limits - get retried with exponential backoff until they succeed or the policy gives up.
The combination matters: checkpointing makes retries cheap because they only re-run the failed step, not the whole workflow.
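A declarative retry policy can be sketched in a few lines. The field names mirror the ones listed above, but `RetryPolicy` and `run_with_retries` here are hypothetical and the defaults are illustrative; each runtime has its own policy object and its own way of classifying errors.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 5
    initial_interval: float = 1.0
    backoff_multiplier: float = 2.0
    max_interval: float = 30.0
    # Permanent failures that should bubble up immediately (illustrative choice).
    non_retryable: Tuple[Type[BaseException], ...] = (ValueError, PermissionError)

    def delay(self, attempt: int) -> float:
        """Seconds to wait after failed attempt number `attempt` (1-based)."""
        raw = self.initial_interval * self.backoff_multiplier ** (attempt - 1)
        return min(raw, self.max_interval)


def run_with_retries(
    fn: Callable[[], T],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run fn under the policy; `sleep` is injectable so tests need not wait."""
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except policy.non_retryable:
            raise  # permanent failure: do not retry
        except Exception:
            if attempt == policy.max_attempts:
                raise  # policy exhausted
            sleep(policy.delay(attempt))
    raise AssertionError("unreachable")
```

Because each activity is retried on its own, a transient failure at one step costs a few extra attempts at that step rather than a rerun of everything before it.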
flowchart LR
A[Step 1: LLM Plan] --> B[Step 2: Tool Call]
B --> C[Step 3: DB Write]
C --> D[Step 4: LLM Synth]
D --> E[Step 5: Email Send]
B -. crash .-> X[Worker Dies]
X -.->|new worker reads log| Y[Replay]
Y -->|step 1 done, skip| A
Y -->|step 2 done, skip| B
Y -->|resume here| C
Async Patterns for Long-Running Agents
Long-running agents force a shift in how you think about every part of the system. A workflow that lives for an hour cannot hold an HTTP connection open. A workflow that waits for human approval cannot occupy a thread for three days. The runtime, the queue, and the timeouts all have to cooperate.
Queues, Not Sync Calls
The first rule: every agent invocation goes through a durable queue. The HTTP request that triggers an agent should enqueue a workflow ID and return immediately. The client polls or subscribes for completion. Synchronous HTTP-bound agent calls are an anti-pattern - the first long LLM response or transient retry takes the connection past the load balancer timeout and the work is lost.
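The enqueue-and-return shape looks roughly like this. It is a sketch only: `AgentService` is hypothetical, and the standard-library `queue.Queue` used here is in-memory and therefore not durable. In production the queue and the status table would live in the durable runtime or a database.

```python
import queue
import uuid
from typing import Any, Callable, Dict


class AgentService:
    """Accepts work, returns an ID immediately, and lets clients poll for status."""

    def __init__(self) -> None:
        self._queue = queue.Queue()  # stand-in for a durable queue
        self._status: Dict[str, Dict[str, Any]] = {}

    def submit(self, payload: Dict[str, Any]) -> str:
        """What the HTTP handler does: enqueue and return without waiting."""
        workflow_id = str(uuid.uuid4())
        self._status[workflow_id] = {"state": "queued", "result": None}
        self._queue.put((workflow_id, payload))
        return workflow_id

    def status(self, workflow_id: str) -> Dict[str, Any]:
        """What the client polls."""
        return dict(self._status[workflow_id])

    def work_once(self, handler: Callable[[Dict[str, Any]], Any]) -> bool:
        """What a background worker does: take one item and run it."""
        try:
            workflow_id, payload = self._queue.get_nowait()
        except queue.Empty:
            return False
        self._status[workflow_id]["state"] = "running"
        try:
            result = handler(payload)
        except Exception as exc:
            self._status[workflow_id] = {"state": "failed", "result": repr(exc)}
        else:
            self._status[workflow_id] = {"state": "completed", "result": result}
        return True
```

The request path never waits on the agent, so a slow LLM response or a retry cannot outlive a load balancer timeout.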
For real-time UX (streaming chat), the front-end consumes events from a pub/sub channel while the durable workflow runs in the background. The workflow publishes streaming tokens as it produces them; the client receives them through the channel, decoupled from the workflow’s lifecycle.
Timeouts at Three Levels
Production agents need timeouts at three levels, and getting one wrong is a common cause of pages.
| Timeout Level | What It Bounds | Typical Range |
|---|---|---|
| Activity timeout | Single LLM call, tool invocation, DB query | 10s – 5 min |
| Heartbeat timeout | Liveness check for long activities | 30s |
| Workflow timeout | Total workflow runtime including waits | 5 min – 30 days |
Activity timeouts protect against a single hung call. Heartbeats let the runtime detect a dead worker before the activity timeout elapses, so retries can fire sooner. Workflow timeouts cap the total lifespan, including time spent waiting on humans or external events. Without explicit workflow timeouts, abandoned workflows accumulate in the runtime forever.
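Of the three levels, the heartbeat is the least familiar, so here is a small sketch of the idea. `HeartbeatMonitor` is a hypothetical class with an injectable clock; real runtimes implement this inside the server and expose only a heartbeat call and a timeout setting.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per running activity and flags silent ones."""

    def __init__(
        self,
        heartbeat_timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = heartbeat_timeout_s
        self._clock = clock
        self._last_beat: Dict[str, float] = {}

    def beat(self, activity_id: str) -> None:
        """Called periodically by the worker while the activity is running."""
        self._last_beat[activity_id] = self._clock()

    def complete(self, activity_id: str) -> None:
        """Stop tracking an activity that finished normally."""
        self._last_beat.pop(activity_id, None)

    def presumed_dead(self) -> List[str]:
        """Activities whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            activity_id
            for activity_id, last in self._last_beat.items()
            if now - last > self._timeout
        )
```

With a 30-second heartbeat timeout on an activity whose own timeout is 5 minutes, a dead worker is noticed after about 30 seconds of silence instead of after the full 5 minutes.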
Retries That Respect Cost
Retries are seductive but dangerous in agent contexts. Every retry of an LLM call costs real money. A naive “retry forever with 1-second backoff” policy on a Claude Opus call can burn through a budget in minutes when the upstream provider is degraded.
Production retry policies for LLM activities should:
- Use exponential backoff with jitter (initial 2s, multiplier 2.0, max interval 60s)
- Cap max attempts (5-10 is sane for most cases)
- Mark validation errors, auth errors, and content-filter errors as non-retryable
- Track cost per retry attempt and feed it into your LLM cost attribution system
- Fall back to a cheaper model after N transient failures, not the same model forever
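The checklist above can be sketched as a single function. Everything named here is an assumption for illustration: `TransientError`, `NonRetryableError`, `call_with_fallback`, and the `call_model` callback are hypothetical, and "full jitter" (a uniform draw between zero and the capped delay) is one common jitter strategy among several.

```python
import random
import time
from typing import Callable, List, Tuple


class TransientError(Exception):
    """Timeouts, 5xx responses, rate limits."""


class NonRetryableError(Exception):
    """Validation, auth, and content-filter errors."""


def backoff_with_jitter(
    attempt: int,
    initial: float = 2.0,
    multiplier: float = 2.0,
    max_interval: float = 60.0,
) -> float:
    """Full jitter: uniform between 0 and the capped exponential delay."""
    cap = min(initial * multiplier ** (attempt - 1), max_interval)
    return random.uniform(0.0, cap)


def call_with_fallback(
    call_model: Callable[[str, str], str],
    prompt: str,
    primary: str,
    fallback: str,
    max_attempts: int = 6,
    fallback_after: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, List[str]]:
    """Return (result, models tried); the second item feeds cost attribution."""
    attempts: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # After `fallback_after` failed attempts, switch to the cheaper model.
        model = primary if attempt <= fallback_after else fallback
        attempts.append(model)
        try:
            return call_model(model, prompt), attempts
        except NonRetryableError:
            raise
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(backoff_with_jitter(attempt))
    raise ValueError("max_attempts must be >= 1")
```

The returned list of attempted models is deliberately simple: it gives a cost attribution system one record per attempt without tying the sketch to any pricing table.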
Human-in-the-Loop as a First-Class Wait
The killer feature of durable execution for agents is the ability to suspend a workflow waiting for an external signal - a human approval, a callback, a webhook - for arbitrary durations without consuming compute. The workflow code looks like a blocking await human_approval(), but the runtime parks the workflow, frees the worker, and resumes when the signal arrives, even if days pass and every pod has restarted in between.
This is what makes durable execution the natural substrate for approval workflows and human oversight patterns. The workflow does not care if the approval arrives in 30 seconds or 30 days.
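Under the hood, "parking" means the workflow's progress and any received signals live in storage rather than in a thread. The sketch below shows that idea with a hypothetical `WorkflowStore` and a made-up refund workflow; a real runtime hides the re-invocation behind a blocking-looking await.

```python
from typing import Any, Dict, List


class WorkflowStore:
    """In-memory stand-in for the durable record of one workflow run."""

    def __init__(self) -> None:
        self.steps: Dict[str, Any] = {}    # results of completed steps
        self.signals: Dict[str, Any] = {}  # external signals received so far


def deliver_signal(store: WorkflowStore, name: str, payload: Any) -> None:
    """Called by the approval UI or webhook handler, possibly days later."""
    store.signals[name] = payload


def refund_workflow(store: WorkflowStore, outbox: List[str]) -> str:
    """Any worker can call this at any time; it picks up where the store says."""
    if "draft" not in store.steps:
        store.steps["draft"] = "Refund $40 to customer 17"
    if "approval" not in store.signals:
        return "parked"  # nothing is held in memory while waiting
    if not store.signals["approval"]["approved"]:
        return "rejected"
    if "send" not in store.steps:
        outbox.append(store.steps["draft"])
        store.steps["send"] = True
    return "completed"
```

The first invocation drafts the refund and returns `"parked"`. Once `deliver_signal` stores the approval, the next invocation, on any worker, sends the refund exactly once and returns `"completed"`.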
The Runtime Landscape, Without the Marketing
Four runtimes dominate production agent durability discussions in 2026: Temporal, Inngest, Restate, and DBOS. Each makes different tradeoffs. There is no universal best. There is a best fit for your team and architecture.
| Runtime | Model | Strength | Weakness |
|---|---|---|---|
| Temporal | Self-hosted or Cloud cluster; SDKs in Go, Java, Python, TS | Most mature, deepest control over retries and timeouts, battle-tested at scale | History saturation from large LLM payloads; ops overhead of running a cluster |
| Inngest | Serverless-native, function steps as primitives | Fast path from existing code to durable functions; AI-aware primitives like step.ai.infer | Vendor lock-in; less control for complex saga patterns |
| Restate | Lightweight sidecar; virtual objects with per-session state | Per-session state and exactly-once tool execution as first-class primitives; simple ops | Smaller ecosystem; newer with fewer war stories |
| DBOS | Library imported into your service; Postgres as the durable backend | Lowest operational footprint; runs in your existing Postgres; no separate cluster | Throughput ceiling is your Postgres; less suited for very high concurrency |
The honest summary from practitioner reviews: most backend services in 2026 can ship durable execution with DBOS and graduate to Temporal if and when they hit the wall (Dev Note, 2026). Temporal is the strongest choice for long multi-step workflows where the history of events needs to survive days or weeks, though teams typically need payload codecs to offload large LLM payloads to external storage (AppScale, 2026). Restate is the right choice when per-session state and exactly-once tool execution must be first-class primitives. Inngest is the fastest path from existing serverless code to durable workflows.
LangGraph also ships checkpoint primitives, but LangGraph is an agent framework with persistence, not a general durable runtime. For agents that live inside a broader transactional system - the more common case once Engine 2 buyers get involved - a runtime like Temporal or DBOS handling the whole workflow is the cleaner architecture.
What Breaks in Production
Pattern-recognition from the field. These are the failures we see most often when teams roll their own durability or use a runtime without understanding the rules.
Non-deterministic workflow code. Calling datetime.now() or uuid4() directly inside workflow logic (not inside an activity) means replay produces a different execution path than the original. Symptoms: workflows that complete on the first run but fail with confusing errors after a worker restart. Fix: wrap all non-deterministic calls as activities.
Missing idempotency keys on external calls. The crash window between “side effect succeeded” and “result written to event log” creates duplicate writes on replay. Symptoms: duplicate orders, double-charged customers, double-sent emails. Fix: derive idempotency keys from the workflow run ID and step name, and enforce them at the receiving system.
Oversized event history. Agent context payloads pushed directly into the event log eventually hit size ceilings and slow replays to a crawl. Symptoms: workflows that work fine for the first week and degrade as histories grow. Fix: payload codecs that offload large blobs to S3 or Postgres and store only references in history.
LLM call retries without cost guards. A 5-hour upstream outage with default retry policy can spend a month’s API budget overnight. Symptoms: a finance ticket. Fix: bounded retries, cost-aware fallbacks, and circuit breakers that escalate to a cheaper model or to humans after repeated failures.
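A circuit breaker is the piece of that fix that stops spending during a long outage. This is a minimal sketch with hypothetical names and thresholds and an injectable clock; it is not thread-safe and omits the per-provider bookkeeping a production breaker would need.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after repeated failures; allows a trial call after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may be attempted right now."""
        if self._opened_at is None:
            return True
        # Half-open: after the cooldown, let one trial call through.
        return self._clock() - self._opened_at >= self._cooldown

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

When `allow()` returns `False`, the activity can skip the expensive call entirely and route to a cheaper model or a human queue instead of paying for another failed attempt.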
Treating durable execution as a substitute for orchestration design. A durable runtime makes a bad workflow durable. It does not make it correct. The shape of the workflow - the orchestration pattern you choose and whether you model it as an explicit state machine or a prompt loop - is the prior decision. Durability comes after.
Where Durable Execution Fits in the Stack
Durable execution is the runtime layer of the production AI agent stack. It sits between your orchestration logic (the workflow shape) and your infrastructure (compute, queues, databases). Above it sits the agent framework. Below it sit your activities - the actual LLM calls, tool invocations, and database writes that do work.
It is a different concern from workflow orchestration shape. Sequential, parallel, and conditional workflow orchestration patterns describe how activities relate to each other. Durable execution describes how those activities survive failure. You need both. You choose them independently.
It is also a different concern from observability. Durable execution makes the system recoverable; observability makes the system legible. Most durable runtimes expose event histories that double as a tracing primitive, which is useful, but real observability requires more than what the runtime gives you out of the box.
The teams that ship production agents that actually work treat durable execution as a non-negotiable foundation, not an afterthought. The teams that treat it as something to bolt on later end up rebuilding it under pressure six months in. We have watched both. The second one is more expensive.
This is one layer of the system underneath the chat box - the gap between an impressive AI pilot and a production-ready system that survives the next year of operation. If you are deciding what that system should look like for your business, that is exactly the work of our Operational AI practice.
Frequently Asked Questions
What is durable execution for AI agents?
Durable execution is a programming model where the runtime guarantees that a workflow runs to completion exactly once, even across process crashes, deploys, and network failures. For AI agents, this means a long-running workflow can survive worker restarts, pod evictions, and infrastructure failures without losing state or re-running completed steps. It works by recording every meaningful step (LLM call, tool invocation, DB write) to a durable log, then replaying that log to resume from the failure point.
How is durable execution different from retry logic?
Retry logic re-runs whatever unit it wraps. If the retry wraps a whole 40-step workflow that fails at step 23, it restarts from step 1, repeating all 22 completed steps and their side effects. Durable execution remembers everything that succeeded and resumes from step 23, with completed work intact. The difference is the difference between a billing system that occasionally double-charges and one that does not.
Which durable execution platform should I use for AI agents?
It depends on your existing stack. Temporal is the most battle-tested choice for complex multi-step workflows but requires running a cluster. DBOS runs as a library on top of Postgres and is the lowest-friction option for teams that already use Postgres. Inngest is fastest for serverless-native teams. Restate excels when per-session state and exactly-once tool calls must be first-class primitives. Most teams in 2026 can ship with DBOS and graduate to Temporal only if they hit scale walls.
Do I need durable execution for short-running AI agents?
Probably not. If your agent is a single LLM call returning in under 10 seconds with no expensive side effects, a normal request-response handler is fine. Durable execution earns its keep when workflows run for minutes to hours, when they trigger multiple external side effects, when they wait for human approval, or when the cost of a duplicate or lost workflow is high. Most production agent systems eventually need it.
How does idempotency relate to durable execution?
Durable execution guarantees exactly-once execution at the workflow level, but only if the underlying side effects cooperate. There is always a small crash window between an external call succeeding and its result being recorded in the event log. If replay re-issues that call, the receiving system will see a duplicate unless it enforces an idempotency key. Every external action - HTTP POST, payment charge, email send - needs an idempotency strategy. This is engineering work the runtime cannot do for you.
Can durable execution handle human-in-the-loop steps?
Yes - this is one of its killer features. A durable workflow can suspend waiting for an external signal (human approval, webhook, scheduled event) for arbitrary durations without consuming compute. The workflow code looks like a blocking await, but the runtime parks the workflow, frees the worker, and resumes when the signal arrives, even if days pass and every pod has restarted in between.
What is the biggest mistake teams make with durable execution for agents?
Treating workflow code as if it were regular code. Durable runtimes require workflow logic to be deterministic - the same recorded events must produce the same execution path on replay. Calling datetime.now(), uuid4(), or any HTTP client directly from workflow logic breaks replay. All non-deterministic operations must be wrapped as activities so their results are recorded once and reused on replay. Teams that miss this rule see workflows that work on the first run and fail mysteriously after the first restart.