Context Window Management - Designing Around the Constraint

Type
Article
Published
2026-04-06
Aliases
token budget, context management, context economics
Athanasius Kircher's 1646 engraving of a camera obscura — a darkened chamber with a small aperture projecting an inverted image of an outdoor scene onto an interior surface.
Camera obscura, from Athanasius Kircher's Ars Magna Lucis et Umbrae, 1646. Constraint as design language — the whole world rendered through a finite aperture. The context window is the same problem in software. Source: Wikimedia Commons · public domain.
Summary

Every design decision in Claude Code — concise CLAUDE.md files, on-demand skills, sub-agent isolation, LSP integration — traces back to one physical constraint: the context window is finite. Understanding token budgets as an architectural discipline turns a limitation into a design language.

Overview

A context window is the total number of tokens an LLM can consider at once — its working memory. Everything the model needs to reason about a task must fit inside this buffer: the system prompt, conversation history, file contents, tool results, and the response itself. Once the window fills, older material gets compressed or dropped.

This isn’t a minor implementation detail. It’s the single constraint that shapes how every layer of the 4-layer architecture is designed. As Prakash Sharma puts it, most developers “open the terminal, write a prompt, and expect magic” — they treat context as an infinite resource and wonder why sessions degrade. The practitioners who extract the most from Claude Code treat context not as a bucket to fill but as a budget to manage.

Key Concepts

  • Context Window — the fixed token buffer itself
  • CLAUDE.md — persistent memory that competes for context space
  • Skills — on-demand instruction loading
  • Sub-agents — context isolation through delegation
  • Hooks — deterministic automation that bypasses context entirely

The four strategies

Every context management technique in the Claude Code ecosystem falls into one of four categories: compress, defer, isolate, or bypass.

1. Compress — say more with fewer tokens

Shraddha Bharuka is explicit about this: CLAUDE.md is “not a knowledge dump” — it should contain purpose, repo map, and rules, nothing more. The reason is token economics. A CLAUDE.md file loads into every interaction. At roughly 1.3 tokens per word, a bloated 500-line file could consume 2,000-3,000 tokens before the conversation starts. Multiply that across a long session and the cost compounds.
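A back-of-the-envelope sketch of that arithmetic (the 1.3 tokens-per-word ratio is an approximation, the repeated rule lines are placeholders, and real counts depend on the model's tokenizer):

```python
# Rough estimate of what a CLAUDE.md costs on every interaction.
TOKENS_PER_WORD = 1.3  # approximation; actual counts depend on the tokenizer

def estimated_tokens(text: str) -> int:
    """Rough token count for a block of prose or config text."""
    return round(len(text.split()) * TOKENS_PER_WORD)

# A bloated 500-line file averaging ~4 words per line vs. a terse 40-line one.
bloated = "\n".join("never edit generated files" for _ in range(500))
concise = "\n".join("prefer composition over inheritance" for _ in range(40))

print(estimated_tokens(bloated))   # ~2,600 tokens, paid before the conversation starts
print(estimated_tokens(concise))   # ~200 tokens
```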

Bharuka extends this principle with local CLAUDE.md files in subdirectories (src/auth/CLAUDE.md, src/persistence/CLAUDE.md, infra/CLAUDE.md). Rather than loading module-specific rules into the root file where they’d consume tokens in every session, they sit dormant until Claude actually enters that directory. A 40-line auth-module CLAUDE.md costs nothing when you’re working on the frontend.
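A minimal sketch of the scoping idea, purely illustrative rather than a description of Claude Code's actual resolution logic: only the CLAUDE.md files on the path from the repo root down to the directory being worked in are candidates to load, so rules in sibling modules cost nothing.

```python
from pathlib import Path

def candidate_memory_files(repo_root: Path, working_dir: Path) -> list[Path]:
    """Walk from the working directory up to the repo root, collecting any
    CLAUDE.md along the way. Files in modules you never enter are never
    collected, so their rules consume no context this session."""
    files: list[Path] = []
    current = working_dir.resolve()
    root = repo_root.resolve()
    while True:
        candidate = current / "CLAUDE.md"
        if candidate.exists():
            files.append(candidate)
        if current == root or current == current.parent:
            break
        current = current.parent
    return list(reversed(files))  # root file first, most specific last

# Working in src/frontend never touches src/auth/CLAUDE.md:
# candidate_memory_files(Path("."), Path("src/frontend"))
```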

Kshitij Mishra describes the compounding effect: in week one you correct Claude often, by month three “it behaves like a dev who has worked on the project for a year.” But this only works if the CLAUDE.md stays concise — each correction becomes a terse rule, not a paragraph of explanation. The file grows in knowledge density, not in token count.

2. Defer — load on demand, not upfront

Skills exist precisely because of context pressure. Shraddha Bharuka lists code review checklists, refactor playbooks, release procedures, and debugging flows as examples — all reusable workflows that should be invoked, not permanently loaded.

Nick Spisak takes this further. Individual skills are “just fancy prompts.” The real leverage comes from connected skill systems where skills load and unload independently. His 4-layer production pipeline reduced manual orchestration from 30-40 minutes to 15-20 minutes of review — but critically, no single step loads the entire pipeline into context. Each skill brings its own instructions, does its work, and clears out.
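A sketch of why on-demand loading keeps peak usage low (the skill names and token figures are hypothetical): each stage loads its own instructions, hands a result forward, and releases its playbook before the next stage starts, so the window never holds the whole pipeline at once.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    instruction_tokens: int  # the skill's own playbook
    output_tokens: int       # what it hands to the next stage

# Hypothetical 4-skill pipeline; all figures are illustrative.
pipeline = [
    Skill("research", 800, 400),
    Skill("outline", 600, 300),
    Skill("draft", 1200, 900),
    Skill("review", 700, 200),
]

def peak_context(skills: list[Skill]) -> int:
    """Peak usage when each skill loads, runs, and unloads in turn."""
    carried, peak = 0, 0
    for skill in skills:
        peak = max(peak, carried + skill.instruction_tokens)
        carried += skill.output_tokens  # only results carry forward
    return peak

def monolithic_context(skills: list[Skill]) -> int:
    """Usage if every playbook sits in one giant prompt from the start."""
    return sum(s.instruction_tokens + s.output_tokens for s in skills)

print(peak_context(pipeline))        # 2300 tokens at the worst moment
print(monolithic_context(pipeline))  # 5100 tokens held simultaneously
```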

The same logic explains the ENABLE_LSP_TOOL flag that Om Patel highlights. Default file search uses text grep — it pulls entire files into context to find a function definition, takes 30-60 seconds, and "sometimes returns the wrong file." LSP-based navigation queries the language server for the exact symbol, returning only what’s needed in ~50ms. The context saving is dramatic: Patel notes it "saves tokens because Claude stops wasting context searching for the wrong files." A single LSP lookup might use 50 tokens where a grep-based search consumes 2,000.
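A toy cost model of the difference (the figures echo the estimates above and are illustrative, not measurements): a grep-style search pays for every file it pulls into context, while a symbol lookup pays only for the definition it returns.

```python
def grep_search_cost(candidate_file_tokens: list[int]) -> int:
    """Text search reads whole candidate files into context to find a symbol."""
    return sum(candidate_file_tokens)

def lsp_lookup_cost(definition_tokens: int = 50) -> int:
    """A language-server query returns just the symbol's definition site."""
    return definition_tokens

# Three files pulled in while grepping for a function definition:
print(grep_search_cost([900, 650, 480]))  # 2,030 tokens
print(lsp_lookup_cost())                  # 50 tokens
```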

3. Isolate — give subtasks their own budget

Sub-agents are context management tools disguised as parallelism tools. Kshitij Mishra describes how “large tasks get delegated to sub-agents, keeping the main context clean.” The parent agent specifies the work, the sub-agent completes it in its own context window, and only a compressed summary returns to the parent.

The parent might receive a 500-token summary from a sub-agent that consumed 50,000 tokens reading through a codebase. This is the delegation pattern applied to token economics — the main session stays focused on coordination while the heavy lifting happens in isolated buffers.
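A minimal sketch of the isolation pattern, assuming a hypothetical run_subagent helper that works inside its own fresh buffer and hands back only a compressed summary:

```python
from dataclasses import dataclass

@dataclass
class Context:
    """A token buffer. The parent session and each sub-agent get their own."""
    budget: int
    used: int = 0

    def add(self, tokens: int) -> None:
        self.used += tokens
        if self.used > self.budget:
            raise RuntimeError("context window exhausted")

def run_subagent(task: str, heavy_tokens: int, budget: int = 100_000) -> str:
    """Hypothetical delegate: burns its own context reading files and tool
    output, then returns only a short summary to the caller."""
    scratch = Context(budget)
    scratch.add(heavy_tokens)   # e.g. 50,000 tokens of raw file contents
    return f"~500-token summary of: {task}"

parent = Context(budget=200_000)
summary = run_subagent("map the persistence layer", heavy_tokens=50_000)
parent.add(500)   # only the compressed summary lands in the parent's window
```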

Prakash Sharma frames sub-agents as the fourth layer specifically because they enable parallel workflows. But the context benefit is arguably more important than the parallelism: without isolation, a single research task can fill the main context window with file contents, leaving no room for the actual implementation work.

4. Bypass — take work off the context entirely

Hooks represent the most aggressive context strategy: removing work from the LLM entirely. Shraddha Bharuka is direct — “models forget. Hooks don’t.” A pre-commit hook that runs the formatter or blocks changes to protected directories doesn’t consume any context tokens. It fires deterministically, outside the model’s reasoning loop, and only surfaces results if something fails.

This is qualitatively different from asking Claude to “always run the linter before committing.” That instruction lives in context, competes for tokens, and might get compressed away in a long session. A hook is infrastructure — it runs whether or not the model remembers.
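As a concrete illustration of the bypass idea, here is a minimal generic git pre-commit hook sketch (not Claude Code's own hook configuration; the protected paths and the choice of formatter are assumptions) that blocks edits to protected directories and runs a formatter, with none of that logic occupying the model's context:

```python
#!/usr/bin/env python3
"""Generic git pre-commit hook sketch: blocks commits that touch protected
directories and formats staged Python files. It fires on every commit,
whether or not the model remembers the rule."""
import subprocess
import sys

PROTECTED = ("infra/prod/", "migrations/")  # example paths, adjust per repo

def staged_files() -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def main() -> int:
    files = staged_files()
    blocked = [f for f in files if f.startswith(PROTECTED)]
    if blocked:
        print(f"Commit blocked: protected paths modified: {blocked}")
        return 1
    py_files = [f for f in files if f.endswith(".py")]
    if py_files:
        # Run the formatter; fail the commit if it errors.
        result = subprocess.run(["black", "--quiet", *py_files])
        if result.returncode != 0:
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(main())
```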

The insight is that hooks simultaneously solve two problems: they guarantee deterministic behaviour (safety) and they free context budget for actual reasoning (efficiency). Every rule you can encode as a hook is a rule you don’t need to spend tokens reminding the model about.

The retrieval problem

God of Prompt highlights research that complicates the “just compress everything” instinct. Across 1,540 questions and 9 memory systems, retrieval method drove 20-point accuracy swings while write strategy accounted for only 3-8 points. Raw conversation chunks with zero preprocessing “matched or beat fancy fact extraction and summarization.”

The implication for context management: the expensive summarisation everyone does at write time may be throwing away context the model could have used. What matters is surfacing the right information at the right time — hybrid retrieval (semantic + keyword + reranking) cut failures in half. The correlation between retrieval quality and accuracy was r=0.98.
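A sketch of what hybrid retrieval can look like in practice (the scoring functions and weights here are illustrative, not the ones used in the cited research): combine a lexical keyword score with an embedding-based semantic score, then take the top of the merged ranking, which a real system would pass on to a dedicated reranker.

```python
import math
from collections import Counter

def keyword_score(query: str, chunk: str) -> float:
    """Simple lexical overlap between query terms and the chunk."""
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    overlap = sum((q & c).values())
    return overlap / (len(query.split()) or 1)

def semantic_score(query_vec: list[float], chunk_vec: list[float]) -> float:
    """Cosine similarity between precomputed embeddings."""
    dot = sum(a * b for a, b in zip(query_vec, chunk_vec))
    norm = math.sqrt(sum(a * a for a in query_vec)) * math.sqrt(sum(b * b for b in chunk_vec))
    return dot / norm if norm else 0.0

def hybrid_rank(query, query_vec, chunks, k=5, w_sem=0.6, w_kw=0.4):
    """Merge both signals and return the top-k chunks. A production system
    would follow this with a reranker model over the shortlist."""
    scored = [
        (w_sem * semantic_score(query_vec, vec) + w_kw * keyword_score(query, text), text)
        for text, vec in chunks
    ]
    return [text for _, text in sorted(scored, reverse=True)[:k]]
```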

That finding maps directly onto the defer strategy. Tools like QMD (highlighted by Tom Crawshaw) make past sessions searchable, letting Claude pull specific context on demand rather than carrying everything in the active window. The architecture works best when retrieval is excellent — not when everything is pre-compressed into the smallest possible prompt.

How the layers compose

The four strategies form a hierarchy of context efficiency:

Strategy    Mechanism            Token cost              Example
Bypass      Hooks, automation    Zero                    Pre-commit linter, directory protection
Isolate     Sub-agents           Summary only            Codebase research, parallel analysis
Defer       Skills, LSP, QMD     On-demand               Deploy checklist, symbol lookup, session recall
Compress    Concise CLAUDE.md    Minimal, always loaded  Project rules, conventions

The most context-efficient systems push as much as possible toward the top of this table. Anything deterministic becomes a hook (zero cost). Anything exploratory goes to a sub-agent (summary cost). Anything procedural becomes a skill (on-demand cost). Only the irreducible core — project identity, key rules, active conventions — lives permanently in context via CLAUDE.md.

Practical applications

For researchers: Literature review is a classic context-overflow task. Reading ten papers in a single session will exhaust any context window. The sub-agent pattern — one agent per paper, summarising findings back to a coordinator — keeps the synthesis work manageable. The coordinator’s context holds structured summaries, not raw PDFs.

For teams: A shared CLAUDE.md that grows organically can become a token liability. Periodic compression passes — distilling verbose explanations into terse rules — pay dividends across every session for every team member. As Anish Moonka notes from the Boris Cherny interview, Anthropic has seen 200% productivity gains per engineer. But those gains assume the tooling is configured for efficiency — unlimited tokens without context discipline just means burning through budget faster.

For complex workflows: Nick Spisak’s connected skill systems are context management in disguise. A 4-skill pipeline where each skill loads and unloads independently uses far less peak context than a single monolithic prompt containing all four procedures.

Limitations and open questions

Context windows are growing — Claude’s has expanded significantly across model generations. Does aggressive context management still matter when the window is 200K tokens? The answer appears to be yes, for two reasons: larger windows increase cost per interaction (token pricing scales linearly), and empirical evidence suggests model attention degrades over very long contexts even when the technical limit isn’t reached.

The God of Prompt research raises an open question: if retrieval quality matters far more than write-time compression, should the emphasis shift from making CLAUDE.md files shorter to making retrieval tools like QMD better? The current ecosystem optimises heavily for compression. The research suggests the bigger gains may be in retrieval.

Sources

  • @BharukaShraddha — 4-layer architecture, CLAUDE.md as concise config, hooks as deterministic guardrails, local CLAUDE.md pattern
  • @PrakashS720 — 4-layer system framing, sub-agents for parallel workflows
  • @om_patel5 — LSP integration, ENABLE_LSP_TOOL flag, token savings from precise lookups
  • @NickSpisak_ — connected skill systems, on-demand loading, pipeline composition
  • @AnishA_Moonka — Boris Cherny interview, 200% productivity gains, unlimited tokens philosophy
  • @tomcrawshaw01 — QMD persistent memory, session searchability
  • @DAIEvolutionHub — CLAUDE.md compounding effect, sub-agent delegation for clean context
  • @godofprompt — memory retrieval vs write strategy research, hybrid retrieval findings