The model has a 1M-token window. The team interpreted that as a license to stop thinking about what goes into it. Three months later the agent is slow, expensive, and quietly wrong — wrong in the way that doesn’t trip evals because the answer is plausible, just not grounded in the right slice of the input.
This is the modal failure of production LLM systems in 2026. Not hallucination in the dramatic sense — the model confidently fabricating a court case — but a more subtle erosion: the answer is shaped by everything in the prompt, the prompt was stuffed with everything that might be relevant, and the signal got diluted. Context management is the discipline of preventing this. It is not prompt engineering. It is the system around the prompt: what gets written outside the window, what gets selected into it, what gets compressed when it’s too much, and what gets isolated into separate calls.
This guide is for engineering teams whose agents have moved past the prototype stage and started exhibiting the failure modes that don’t appear at low scale: rising token costs, declining answer quality on long sessions, brittle behavior when the input expands. It covers the write/select/compress/isolate framework that has become the de facto vocabulary of context engineering, the empirical reality of context rot, and the practices that hold up across model upgrades. It is part of the larger question of why AI experiments fail — and it is one of the clearest examples of the prompt not being the product.
Why “Bigger Window” Did Not Solve Context
The premise behind huge context windows was that they would make context engineering obsolete: dump everything in, let the model figure it out. The empirical record is unkind to this premise.
Research published by Chroma in 2025 (Context Rot: Why LLMs Degrade as Context Grows) tested 18 frontier models across eight input lengths and found that every single one degrades as you add tokens. The well-known “lost in the middle” effect, documented across GPT-4, Claude, and other model families, shows accuracy dropping more than 30% when relevant information sits in the middle of context rather than at the beginning or end. Even a model with a 1M-token window shows measurable degradation at 50K tokens, a small fraction of its advertised capacity.
Anthropic’s own evaluations, surfaced through their context editing and memory tooling, found that context editing alone delivered a 29% performance lift; combining it with a memory tool reached 39%. In a 100-turn web search evaluation, context editing reduced token consumption by 84% while enabling workflows that would otherwise have failed from context exhaustion.
The takeaway: large windows raised the ceiling on what is possible but did not change the economics or the quality dynamics. The systems that win are the ones that engineer the window deliberately. This is exactly the gap between an impressive AI demo and a production system that ships.
Context Engineering, Not Prompt Engineering
Prompt engineering is the craft of writing the instructions inside the window. Context engineering is the system that decides what is in the window in the first place: which retrieved chunks, which memories, which tool outputs, which prior turns. The prompt is one piece. The context is everything that surrounds it.
The Write / Select / Compress / Isolate Framework
A shared vocabulary has emerged across LangChain, the broader agent community, and increasingly the model vendors themselves. It names four operations a system can perform on context: write, select, compress, and isolate. Together they are the four levers of context engineering. (LangChain’s writeup is the canonical reference.)
Write: store context outside the prompt
Writing means persisting information somewhere other than the next inference call — scratchpads, memory stores, durable workflow state, intermediate artifacts. The agent does not need to drag everything forward in its context window if it can write down what it learned and retrieve it later.
In practice this means: every long-running agent has a scratchpad, every multi-session agent has a memory layer, and every workflow that spans more than a few turns has externalized state. The alternative — putting all of history in the next prompt — is how token bills explode. See AI agent memory architecture for the persistence patterns that sit underneath the write step.
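A minimal sketch of the write step, assuming an in-memory dictionary as a stand-in for a durable store. The `Scratchpad` class, its method names, and the example keys are hypothetical and not taken from any library:

```python
from collections import defaultdict


class Scratchpad:
    """Hypothetical run-scoped scratchpad; a dict stands in for a durable store."""

    def __init__(self) -> None:
        self._notes: dict[str, dict[str, str]] = defaultdict(dict)

    def write(self, run_id: str, key: str, value: str) -> None:
        # Persist a finding outside the prompt instead of carrying it forward.
        self._notes[run_id][key] = value

    def read(self, run_id: str, keys: list[str]) -> dict[str, str]:
        # Return only the requested notes that exist; the rest stay out of the window.
        notes = self._notes.get(run_id, {})
        return {k: notes[k] for k in keys if k in notes}


pad = Scratchpad()
pad.write("run-1", "customer_tier", "enterprise")
pad.write("run-1", "open_ticket_ids", "4412, 4590")
pad.write("run-1", "raw_crm_dump", "placeholder for a very large payload")

# A later step asks for one fact, not the whole history.
context_for_step = pad.read("run-1", ["customer_tier"])
```

The point of the sketch is the read side: the next prompt receives only the keys the step asks for, while everything else stays in storage.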
Select: retrieve only what is needed, when it is needed
Selecting is the inverse of writing: at inference time, pull the specific pieces of context the model needs for the current step. This is the operation everyone associates with RAG, but it generalizes: select the right tools to expose, the right memories to surface, the right prior turns to include, the right system prompt variant.
Done well, selection is invisible and correct. Done badly, it manifests as the agent missing the obvious detail because the retriever didn’t surface it. Tool selection is increasingly a first-class concern — exposing 40 tools to a model degrades tool-calling accuracy; selecting the 3 relevant tools per step keeps it sharp.
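One way to keep tool exposure narrow is to gate it on workflow state. The sketch below assumes a hand-written mapping; the state names, tool names, and the cap of three are illustrative assumptions, not a recommendation from any framework:

```python
# Hypothetical mapping from workflow state to the tools worth exposing there.
TOOLS_BY_STATE: dict[str, list[str]] = {
    "triage": ["search_tickets", "lookup_customer"],
    "refund": ["lookup_order", "issue_refund", "send_email"],
    "escalate": ["create_incident", "page_oncall"],
}

# The full catalogue, shown only to contrast with the per-step subset.
ALL_TOOLS = sorted({t for tools in TOOLS_BY_STATE.values() for t in tools})


def tools_for_step(state: str, max_tools: int = 3) -> list[str]:
    """Return the tools to expose for this step, capped at max_tools."""
    # Unknown states expose nothing rather than falling back to every tool.
    return TOOLS_BY_STATE.get(state, [])[:max_tools]


exposed = tools_for_step("refund")
```

Here the refund step sees three tools out of a catalogue of seven. Failing closed on unknown states is a design choice; a system could equally fall back to a small default set.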
Compress: reduce what must be carried forward
Compression is what you do when even the selected context is too much. Two compression patterns dominate:
- Summarization: replace a long transcript with a model-generated summary. Useful when the literal text is no longer needed but the gist is.
- Hierarchical compression: keep recent turns verbatim, summarize older ones, summarize summaries beyond a horizon. This is how long-running agents avoid linear growth in token spend.
Compression is lossy by definition; what you compress away is gone from the model’s view. The question is whether your compression policy is principled or accidental. Most teams discover the answer the first time the agent forgets a critical constraint that was stated in a turn that has since been summarized away.
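A one-level sketch of this policy: recent turns stay verbatim and older turns collapse into a summary. The `pinned` argument is an addition for illustration, one possible guard against losing a critical constraint. The summarizer is a stub standing in for a model call, and all names are hypothetical:

```python
from typing import Callable

Summarizer = Callable[[list[str]], str]


def compress_history(
    turns: list[str],
    summarize: Summarizer,
    keep_recent: int = 4,
    pinned: tuple[str, ...] = (),
) -> list[str]:
    """Keep the last keep_recent turns verbatim and summarize everything older.

    pinned holds constraints that must survive compression word for word.
    """
    if len(turns) <= keep_recent:
        return list(pinned) + turns
    cut = len(turns) - keep_recent
    older, recent = turns[:cut], turns[cut:]
    summary = "Summary of earlier turns: " + summarize(older)
    return list(pinned) + [summary] + recent


def stub_summarizer(older: list[str]) -> str:
    # Stand-in for a model call; it only reports how much was folded away.
    return f"{len(older)} earlier turns omitted"


turns = [f"turn {i}" for i in range(1, 11)]
compressed = compress_history(
    turns,
    stub_summarizer,
    keep_recent=3,
    pinned=("Constraint: never quote prices in EUR.",),
)
```

Ten turns become five entries: the pinned constraint, one summary line, and the last three turns. A hierarchical version would apply the same function again to older summaries once they accumulate.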
Isolate: split work across multiple contexts
Isolation moves problems out of a single mega-context and into separate, focused contexts. Sub-agents handle sub-tasks with their own small windows; tool outputs are processed by lightweight models before the result enters the primary agent’s context; long documents are read by a separate “reader” that returns only what is asked.
Isolation is how you keep the supervising agent’s context lean. It is also where the line between context engineering and agent orchestration blurs — splitting work across contexts often means splitting work across agents.
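The reader pattern can be sketched with the model call abstracted as a plain function from prompt to text, so the example runs without any provider SDK. The function names, prompt wording, and the fake reader are assumptions made for illustration:

```python
from typing import Callable

# A model call is abstracted as prompt -> text so the sketch runs without a provider.
ModelCall = Callable[[str], str]


def read_document(document: str, question: str, reader_model: ModelCall) -> str:
    """Run the long document through a separate context and return a short answer."""
    prompt = f"Answer using only this document.\n\n{document}\n\nQuestion: {question}"
    return reader_model(prompt)


def build_supervisor_context(task: str, findings: dict[str, str]) -> str:
    # The supervisor sees questions and answers, never the document itself.
    lines = [f"Task: {task}"]
    lines += [f"- {q} -> {a}" for q, a in findings.items()]
    return "\n".join(lines)


def fake_reader(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return "Termination requires 60 days notice."


contract = "clause text " * 5000  # placeholder for a long document
question = "What is the termination notice period?"
findings = {question: read_document(contract, question, fake_reader)}
supervisor_context = build_supervisor_context("Review the vendor contract", findings)
```

The 60,000-character placeholder document is only ever seen by the reader call; the supervisor context holds the task line and one question-answer pair.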
Context Rot: What “Long Context” Actually Looks Like Under Load
If you remember nothing else from this article, remember this: the model gets worse as the input gets longer, even when the input fits in the window. Plan for it.
The Chroma research above is the most comprehensive public benchmark, but the Understanding AI analysis makes the production implication concrete: degradation happens at every increment tested, and the U-shaped attention pattern documented across model families means information placed in the middle of a long context gets attended to less. Architecturally, the long-term decay property of RoPE positional encodings, under which attention between two tokens tends to weaken as the distance between them grows, offers a partial explanation for why tokens far from the end of the prompt carry less attention weight.
For production systems the implications are mechanical:
- Put the most important context near the start or end of the prompt, not buried in the middle.
- Treat the window as a budget, not a ceiling: have a target effective length and a policy for staying near it.
- Re-run quality evals at multiple input lengths, not just the median. Quality at 5K tokens does not predict quality at 50K.
- Measure the cost of context rot, not just the cost of tokens: a 1M-token call that produces a wrong answer is more expensive than two focused 20K-token calls that produce a right one.
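The first two points above can be combined into one assembly step: choose pieces by priority until a token budget is reached, then order them so the strongest sit at the edges. This is a sketch under stated assumptions: the `Piece` type, the priority scores, and the four-characters-per-token estimate are all illustrative, and a real system would use the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    text: str
    priority: int  # higher means more important


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def assemble(pieces: list[Piece], budget_tokens: int) -> list[str]:
    """Pick pieces by priority within the budget, then place the strongest at the edges."""
    chosen: list[Piece] = []
    used = 0
    for piece in sorted(pieces, key=lambda p: p.priority, reverse=True):
        cost = estimate_tokens(piece.text)
        if used + cost <= budget_tokens:
            chosen.append(piece)
            used += cost
    # chosen is in descending priority; alternate front and back so the
    # lowest-priority pieces end up in the middle.
    front: list[str] = []
    back: list[str] = []
    for i, piece in enumerate(chosen):
        (front if i % 2 == 0 else back).append(piece.text)
    return front + back[::-1]


pieces = [
    Piece("user question", 5),
    Piece("task instructions", 4),
    Piece("retrieved chunk", 3),
    Piece("older chat turn", 2),
    Piece("tool log", 1),
]
ordered = assemble(pieces, budget_tokens=1000)
tight = assemble(pieces, budget_tokens=10)
```

With a generous budget the two highest-priority pieces land first and last and the tool log lands in the middle. With a budget of 10 estimated tokens, only the top three pieces survive.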
This is the same dynamic that makes LLM tracing in production essential: without visibility into prompt size and structure, you cannot tell whether quality is dropping because of model regression, retrieval drift, or context bloat.
Practical Patterns for Each Lever
Frameworks are useful when they translate to specifics. Here is what each lever looks like in production code, not slideware.
Selection patterns that hold up
- Retrieval with a relevance threshold: don’t include the top-K if the top-K are weak matches. Empty retrieval results are sometimes the correct answer.
- Reranking after retrieval: a small reranker dramatically improves selection quality without changing the index. This is one of the highest-ROI changes in the typical agent.
- Tool selection by sub-task: rather than exposing every tool every turn, gate tool exposure on the current state of the workflow.
- Recency-weighted memory selection: when episodic memories tie on similarity, prefer the recent one. The user’s last week matters more than their first month.
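The threshold and recency rules from the list above can be sketched together. The similarity scores are assumed to come from whatever retriever is in use; the `Memory` type, the 0.75 cutoff, and the rounding used to detect ties are illustrative assumptions:

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    similarity: float  # assumed to come from a retriever, in [0, 1]
    age_days: int


def select_memories(
    candidates: list[Memory], min_similarity: float = 0.75, top_k: int = 3
) -> list[Memory]:
    """Drop weak matches, then rank by similarity with recency as the tie-breaker."""
    strong = [m for m in candidates if m.similarity >= min_similarity]
    # Round similarity so near-identical scores count as ties; newer wins a tie.
    strong.sort(key=lambda m: (-round(m.similarity, 2), m.age_days))
    return strong[:top_k]


candidates = [
    Memory("prefers email", 0.82, 90),
    Memory("prefers phone", 0.82, 3),
    Memory("likes hiking", 0.40, 1),
]
selected = [m.text for m in select_memories(candidates)]
nothing = select_memories([Memory("likes hiking", 0.40, 1)])
```

The weak match is dropped even though `top_k` would have room for it, and when every candidate is weak the function returns an empty list instead of padding the prompt.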
Compression patterns that hold up
- Turn-by-turn rolling summary: after N turns, replace turns 1..M with a summary; keep turns M+1..N verbatim. Cheap and effective for chat.
- Tool output compression: tool calls often return verbose payloads. A compression pass that extracts the fields the agent will actually read keeps the working window focused.
- Document chunk summarization at index time: store both the chunk and a summary; retrieve summaries first, then drill into the full chunk only if the summary is selected.
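Tool output compression is often the simplest of these to implement. A minimal sketch, assuming the tool returns a dictionary and the agent only needs a known set of fields; the payload and field names are made up for illustration:

```python
import json


def compress_tool_output(payload: dict, keep_fields: list[str]) -> str:
    """Keep only the fields the agent will read and serialize them compactly."""
    slim = {k: payload[k] for k in keep_fields if k in payload}
    return json.dumps(slim, separators=(",", ":"), sort_keys=True)


raw = {
    "order_id": "A-1001",
    "status": "shipped",
    "eta": "2026-03-02",
    "warehouse_audit_log": ["scan"] * 500,
    "internal_flags": {"fraud_score": 0.02, "batch": 17},
}
compact = compress_tool_output(raw, ["order_id", "status", "eta"])
```

An allowlist is used rather than a denylist so that new verbose fields added by the tool do not silently enter the window.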
Isolation patterns that hold up
- Reader sub-agents: a separate agent that reads a long document and returns answers to specific questions, never exposing the document to the primary agent’s context.
- Tool-call sub-contexts: tool execution happens in a sub-call with its own model and its own focused prompt; only the result enters the parent context.
- Multi-agent specialization: each specialist has a small, curated context. The supervisor stays small by delegating, not by reading everything itself.
Write patterns that hold up
- Durable scratchpads: agent thoughts and intermediate results land in a key/value store keyed by run ID. This is also what makes durable execution for AI agents recoverable when a step fails.
- Structured memory writes: extracted facts go to typed memory storage (see agent memory architecture), not free-form notes.
- Artifact externalization: large generated outputs (reports, code, plans) live as artifacts with URIs the agent can reference, not as text in the next prompt.
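Artifact externalization can be sketched as a content-addressed store that hands back a short reference. The `ArtifactStore` class and the `artifact://` URI scheme are invented for this example, and a dictionary stands in for real blob storage:

```python
import hashlib


class ArtifactStore:
    """Hypothetical content-addressed store; a dict stands in for blob storage."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}

    def put(self, content: str) -> str:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
        uri = f"artifact://{digest}"  # made-up URI scheme for illustration
        self._blobs[uri] = content
        return uri

    def get(self, uri: str) -> str:
        return self._blobs[uri]


def reference_line(uri: str, content: str, preview_chars: int = 60) -> str:
    # What enters the prompt: a pointer and a short preview, not the full text.
    preview = content[:preview_chars].replace("\n", " ")
    return f"[{uri}] {preview}"


store = ArtifactStore()
report = "Q3 incident report\n" + "detail line\n" * 2000
uri = store.put(report)
prompt_line = reference_line(uri, report)
```

The next prompt carries one line of under a hundred characters, and the agent can fetch the full report through a tool call only when a step needs it.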
Context as an Engineering Discipline, Not a Side Project
The teams shipping reliable AI in 2026 treat context as a first-class concern with the same rigor as data engineering. That means:
Context is versioned. The set of prompts, retrieval configurations, tool definitions, and assembly logic that produces a context is a versioned artifact, evaluated on a regression suite, and rolled out gradually — handled through prompt versioning for production LLM apps and the prompt registry it implies.
Context is observed. Every model call captures the full context that was sent, including the source of each piece, so failures can be traced to the input rather than blamed on the model. The OpenTelemetry GenAI semantic conventions, which the observability ecosystem is rallying around, give this prompt-and-context payload a standard shape.
Context is budgeted. Engineering targets are not “fit under the window” but “stay under N tokens on the P95 call.” Anything pushing the budget gets the compression/isolation treatment before it ships, not after the bill arrives.
Context is owned. Someone on the team owns the assembly pipeline: what gets written, what gets selected, what gets compressed, what gets isolated. Without an owner, every developer adds “just one more thing” to the prompt and the context drifts into the unmanageable.
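The four properties above suggest a per-call record that ties them together. The sketch below is one possible shape, not a schema from OpenTelemetry or any tracing library; the field names, the source labels, and the four-characters-per-token estimate are assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class ContextPiece:
    source: str  # e.g. "system_prompt", "retriever", "memory"
    text: str


@dataclass
class ContextRecord:
    route: str
    context_version: str  # identifies the assembly logic that built this context
    pieces: list[ContextPiece] = field(default_factory=list)

    def tokens_by_source(self) -> dict[str, int]:
        # Rough assumption of 4 characters per token; swap in a real tokenizer.
        totals: dict[str, int] = {}
        for piece in self.pieces:
            totals[piece.source] = totals.get(piece.source, 0) + len(piece.text) // 4
        return totals

    def over_budget(self, budget_tokens: int) -> bool:
        return sum(self.tokens_by_source().values()) > budget_tokens


record = ContextRecord(
    route="support_chat",
    context_version="v14",
    pieces=[
        ContextPiece("system_prompt", "x" * 400),
        ContextPiece("retriever", "y" * 2000),
        ContextPiece("retriever", "z" * 1200),
    ],
)
breakdown = record.tokens_by_source()
```

A breakdown by source makes ownership practical: when a route exceeds its budget, the record shows whether retrieval, memory, or the system prompt is responsible.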
This is the angle behind metacto’s Context Engineering solution — context as the durable infrastructure layer that makes the rest of the AI stack reliable. The same logic informs building context-rich environments for AI agents: the agent is only as smart as the context it can see.
What Context Management Connects To
Context management is the boundary between several adjacent disciplines. The clean separation that holds up under load:
- Memory decides what is persisted across calls; context management decides what is loaded into the current one. See agent memory architecture.
- Retrieval (RAG) is one mechanism for selection. Context management owns the policy; RAG executes part of it.
- Orchestration is the workflow logic; context management is the per-step input curation. They interlock but are not the same thing.
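- Caching turns context management into a cost lever: prompt caching pays off when calls share a stable prefix, and semantic caching pays off when similar requests recur, so how context is ordered and assembled determines the hit rate.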
- Cost attribution sees every byte of context as a billable line item; treating it as such drives the discipline. See LLM cost attribution per user and feature.
A Pragmatic Adoption Sequence
For teams whose agents are starting to exhibit the long-context failure modes:
- Measure the current prompt size distribution: P50, P95, P99 by route. If you don’t know the numbers, you don’t know what you’re fixing.
- Add a context budget per workflow: a target number, with alerts when calls exceed it.
- Introduce a rolling summary on chat-style flows: this alone often cuts P95 prompt size in half.
- Add reranking after retrieval: the easiest quality lift available to most teams.
- Tighten tool exposure: limit available tools to the ones relevant to the current step.
- Carve out the first isolation boundary: pick the workflow where a sub-agent or reader pattern makes the supervising context dramatically smaller.
- Then re-evaluate quality and cost together: the win is usually larger than the prediction.
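The first two steps of this sequence, measuring the size distribution and alerting on a budget, fit in a few lines. This sketch assumes prompt sizes have already been logged as `(route, prompt_tokens)` pairs; the route names and budget figure are made up, and percentiles use the nearest-rank method:

```python
from collections import defaultdict


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value with at least pct percent at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def size_report(calls: list[tuple[str, int]], budget: dict[str, int]) -> dict[str, dict]:
    by_route: dict[str, list[int]] = defaultdict(list)
    for route, prompt_tokens in calls:
        by_route[route].append(prompt_tokens)
    report = {}
    for route, sizes in by_route.items():
        p95 = percentile(sizes, 95)
        report[route] = {
            "p50": percentile(sizes, 50),
            "p95": p95,
            "p99": percentile(sizes, 99),
            # Routes without a configured budget are never flagged.
            "over_budget": p95 > budget.get(route, float("inf")),
        }
    return report


calls = [("chat", n * 1000) for n in range(1, 21)] + [("search", 2000)] * 5
report = size_report(calls, budget={"chat": 16000})
```

In this example the chat route has a P95 of 19,000 tokens against a 16,000-token budget and is flagged, while the search route has no budget configured and is reported without a flag.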
The pattern these steps share is that they are mechanical, measurable, and shippable in days, not quarters. Context engineering is not a research project. It is the difference between an agent that works on the demo dataset and one that holds up when the input is real, varied, and large.
If you’re standing this up across multiple agents, multiple teams, or multiple tenants, the discipline scales by becoming infrastructure. Our Operational AI solutions — and the Context Engineering practice specifically — exist to make context the durable layer it needs to be, rather than a property that every team rebuilds and rebreaks.
Frequently Asked Questions About LLM Context Management
What is the difference between context engineering and prompt engineering?
Prompt engineering is the craft of writing the instructions inside the model call. Context engineering is the system around the prompt — deciding which retrieved chunks, memories, tool outputs, and prior turns enter the window, and which are written outside it, compressed, or handled by separate calls. Prompt engineering optimizes one input. Context engineering optimizes the policy that produces every input.
What is the write/select/compress/isolate framework?
It's the four-lever vocabulary that has become standard for context engineering, popularized by LangChain. Write stores information outside the immediate prompt (scratchpads, memory, durable state). Select retrieves the most relevant pieces at inference time. Compress reduces context that must be carried forward (summarization, hierarchical compression). Isolate splits work across multiple smaller contexts (sub-agents, reader patterns, tool sub-calls).
What is context rot in LLMs?
Context rot is the empirical phenomenon that LLMs perform worse as input length grows, even within their advertised context window. Chroma's 2025 research tested 18 frontier models and found every one degrades as tokens are added. The well-known 'lost in the middle' effect compounds this: accuracy drops over 30% when relevant information sits mid-context. A model with a 1M-token window still degrades at 50K tokens. Bigger windows did not eliminate the need for context discipline.
How do I manage the context window in production LLM applications?
Treat the window as a budget, not a ceiling. Set a target token size per route. Put the most important context at the start or end (not the middle). Use retrieval with relevance thresholds and reranking. Add a rolling summary on chat flows. Externalize tool outputs and large artifacts. Capture full prompts in traces so failures can be diagnosed at the input, not blamed on the model. Re-evaluate quality at multiple input lengths.
Do large context windows eliminate the need for RAG?
No. Large windows make more workflows possible but they don't change the quality dynamics — context rot still degrades performance at long input lengths, and token costs scale linearly with prompt size. RAG remains the right mechanism for selection: pulling only the relevant pieces from a much larger corpus. The combination of large windows with disciplined selection is stronger than either alone.
How does context management connect to AI agent memory?
Memory decides what gets persisted across sessions; context management decides what gets loaded into the current model call. Memory is the storage layer that the 'write' lever populates and the 'select' lever queries. The two interlock: an agent without memory has no meaningful 'write' phase, and an agent without context management dumps memory into the window indiscriminately. They are separate disciplines that share an interface.
What does a context engineering practice look like in production?
Context is versioned — the prompts, retrieval configs, and assembly logic are tracked artifacts evaluated on a regression suite. Context is observed — every model call captures its full input via OpenTelemetry GenAI semantic conventions so failures can be traced to the prompt. Context is budgeted — engineering targets the P95 prompt size, not just 'fit under the window.' And context is owned — someone is responsible for the assembly pipeline rather than every developer adding 'one more thing' to the prompt.