Before you blame the model, audit the harness.
A canary-analysis agent on my team started costing four times what it did at launch. Same model, same number of deployments judged, same logic. The change was that someone had wired it to pull the full Dynatrace problem payload into context on every judgment, and a noisy service was returning 6k tokens of problem detail per call. The agent re-read all of it on every turn of a multi-step judgment. Nobody touched the model. The bill quadrupled anyway.
That is the shape of the whole problem. The complaint everyone has right now is "we're burning tokens," and the reflex is to drop to a smaller model or cap the output. That treats a symptom. The token bill is what shows up in billing. What doesn't show up is that you're using the context window like scratch space you'll never have to clean, when it's a lossy working memory with a signal-to-noise ratio. The more junk you pack in, the worse the model reasons over it.
None of this is the weights' fault. Hand a frontier LLM 4k tokens of exactly the right context and it out-reasons the same model drowning in 90k tokens of scrollback, stale tool output, and dead branches. The intelligence is constant. What you control is the assembly layer: retrieval, compaction, state, ordering. That layer is the harness, and the harness is where wasted tokens come from. Context engineering is just the discipline of deciding what the model sees at each step, on purpose, with the smallest set of tokens that does the job.
So the real question is not which cheaper model to buy. It's why you're paying to re-read your own garbage on every turn. That recurring charge is the context tax. Here's where it comes from, and how to stop feeding it.
Why the bill goes quadratic when the work doesn't
Context growth is not inherently quadratic. It goes approximately quadratic when the entire growing conversation is retransmitted on every turn, which is exactly what a naive chat loop does. Here is the loop, mechanically. Turn one you send 2k tokens and get 500 back. Turn two the request carries the original 2k, the 500, and your new 2k. Turn three carries all of it again. The conversation doesn't cost the sum of the turns. It costs the sum of the running totals, because the model re-processes the whole prefix on every call.
That's why a chat that feels linear bills like something closer to quadratic. Ten turns that each add 1.5k tokens are not ten units of work. They are the tenth triangular number of units, 1 + 2 + ... + 10 = 55, or about 82k tokens processed for a conversation that ends at 15k. The fix is not magic, it's refusing to retransmit the whole history: prune it, summarize it, or cache it. Prompt caching takes the edge off, but only on the stable prefix, and only if you keep that prefix byte-identical across calls. Most harnesses don't.
The mechanism that bankrupts you
Cost per conversation tracks the sum of context sizes at each turn, not the size of any one message. A 50-turn agent that grows context by 1.5k tokens a turn doesn't pay for 50 units of work, it pays for 1,275 of them (50 x 51 / 2), roughly 1.9M input tokens. So the highest-leverage thing you can do is keep per-turn context flat instead of letting it climb. Everything below serves that one goal.
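The arithmetic is easy to check yourself. A minimal sketch, assuming every turn adds a fixed number of tokens, nothing is cached, and only input tokens are counted; the function names are illustrative, not from any library.

```python
def naive_loop_tokens(turns: int, added_per_turn: int) -> int:
    """Input tokens processed when every call resends the full history."""
    total = 0
    context = 0
    for _ in range(turns):
        context += added_per_turn  # the history grows by this turn's additions
        total += context           # the whole prefix is re-processed on this call
    return total


def flat_loop_tokens(turns: int, context_per_turn: int) -> int:
    """Input tokens processed when the harness holds per-turn context constant."""
    return turns * context_per_turn


print(naive_loop_tokens(10, 1500))  # ten turns, growing history
print(naive_loop_tokens(50, 1500))  # fifty turns, growing history
print(flat_loop_tokens(50, 6000))   # fifty turns, context held at 6k
```

The 6k flat budget in the last call is an arbitrary figure chosen for comparison. Swap in your own numbers; the gap between the second and third line is the context tax.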
The signal-to-noise problem most teams never measure
Cost is the boring half. The expensive half is that a long, polluted context makes the model dumber, and you should be able to name the two ways it happens in a design review without waving your hands.
Lost in the middle
Models do not weight every position in the context equally. Attention is strongest at the start and end of a long input and weakest across the middle. So this is not about whether a fact fits in the window, it's about whether the model actually uses it. Bury the one detail that matters at token 40k of an 80k dump and the model acts like it never saw it, even though it technically read it.
The consequence is the part teams miss: retrieval quality drops as the relevant information sinks deeper into the pile. A bigger window raises capacity but not utilization. You can double the context limit and still get worse answers, because you've given the model more room in which to lose the thing you needed. Pasting the whole doc didn't save you effort, it hid the answer inside the stretch of context the model skims. This is one of the better-replicated findings about long-context behavior. Confidence: high.
Context rot
As junk tokens pile up, retrieval and instruction-following degrade even when it all technically fits. Stale tool results, dead reasoning branches, three failed shots at the same function call, a log file pasted whole when four lines mattered. Every one of them competes with your actual instruction for the model's attention. The model can't tell which tokens are dead. You can. Pruning them is the harness's job, not the model's. Confidence: high on the direction, moderate on how fast it bites, since that depends on the model and how adversarial the noise is.
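Pruning can be as plain as a pass over the history before each call. A minimal sketch, assuming the harness tags each tool result with a `tool_key` (tool name plus arguments) and a `failed` flag; both fields and the `Message` type are hypothetical, not part of any particular SDK. Some chat APIs require every tool call to keep a matching result, so a real harness might swap stale content for a short stub instead of deleting the message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    role: str                       # "user", "assistant", or "tool"
    content: str
    tool_key: Optional[str] = None  # hypothetical: tool name plus arguments
    failed: bool = False            # hypothetical: set by the harness on error


def prune_dead_tokens(history: list[Message]) -> list[Message]:
    """Drop failed tool attempts and tool results superseded by a newer call."""
    newest: dict[str, int] = {}
    for index, message in enumerate(history):
        if message.role == "tool" and message.tool_key and not message.failed:
            newest[message.tool_key] = index  # later results overwrite earlier ones

    kept = []
    for index, message in enumerate(history):
        if message.role == "tool":
            if message.failed:
                continue  # a failed attempt carries no usable signal
            if message.tool_key and newest[message.tool_key] != index:
                continue  # stale: the same call was answered again later
        kept.append(message)
    return kept
```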
A bigger context window is not a bigger brain. It's a bigger desk. Pile enough on a big desk and you still can't find your pen. Window size is a ceiling, not a quality floor, and reading "it fits" as "it works" is how teams ship an agent that nails the demo and falls over at turn thirty.
Four common ways teams waste context
These aren't exotic. They're the defaults you get when nobody owns the assembly layer and "make it work" means "paste more in." Read the right column as the invoice: every habit on the left bills you for it.
Six interventions that usually matter more than model selection
Ordered by leverage, not by what's easiest to grab. "Use a smaller model" sits dead last on purpose. It's the first thing everyone reaches for and the weakest fix, because a small model fed a landfill still fails. Fix the context first. Then talk about the model.
Retrieve, don't paste
Stop putting the corpus in context. Put a retrieval step in front of it. Chunk, embed, and at query time pull the three-to-eight chunks that actually match. 80k tokens of "everything" becomes 3k of "the relevant parts," and quality goes up because the model isn't hunting through the middle.
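The shape of the retrieval step, as a toy. This sketch scores chunks with bag-of-words cosine similarity so it runs on the standard library alone; that scoring is a stand-in, and a real system would use an embedding model and a vector store or search index. All names here are illustrative.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    """Bag-of-words term counts; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return up to k chunks most similar to the query, best first."""
    q = _vector(query)
    scored = [(_cosine(q, _vector(chunk)), chunk) for chunk in chunks]
    scored = [pair for pair in scored if pair[0] > 0]  # no match means no tokens spent
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

Only the return value of `retrieve` goes into the prompt. The corpus itself never does.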
The corpus lives in a vector store or search index. The window holds only the hits. RAG isn't a feature you add later, it's the default posture for anything bigger than a few pages.
Compact tool output at the boundary
Never let a raw tool result hit the window unfiltered. The tool returns 4k tokens of JSON, your harness extracts the two or three fields the model needs and hands it forty tokens. The model never sees the rest. This is the single most ignored lever in agent design and the one that separates a demo from a system.
The tool does the work. The harness decides what the model sees about it. That canary agent from the top of the piece needed status, severity, and the affected services from the Dynatrace payload, not the other 6k tokens. Forty tokens would have done it.
Summarize and roll up history
When a run gets long, replace the old turns with a compact summary of what was decided and what state matters, then drop the verbatim transcript. Keep the last few turns raw for local coherence, compress everything older into a state block. This is how you hold per-turn context flat instead of letting it climb.
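A threshold-triggered compaction step might look like the sketch below. The assumptions are loud ones: `estimate_tokens` is a crude four-characters-per-token guess rather than a real tokenizer, `summarize` is a hypothetical callable (in practice probably a cheap model call), and the default thresholds are placeholders.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)


def compact_history(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    max_tokens: int = 8000,
    keep_recent: int = 4,
) -> list[str]:
    """Fold older turns into one state block once the history crosses max_tokens."""
    total = sum(estimate_tokens(turn) for turn in turns)
    if total <= max_tokens or len(turns) <= keep_recent:
        return turns  # under budget: leave the transcript alone

    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    state_block = "STATE SUMMARY: " + summarize(older)
    return [state_block] + recent  # conclusions first, raw recent turns after
```

Run it before every model call. Most turns it returns the history untouched; when the threshold trips, everything older than the last few turns collapses into a single state block.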
Compaction is a first-class step in the loop, triggered on a token threshold, not a vibe. The model keeps the conclusions and loses the noise that produced them.
Externalize state, don't carry it
The agent's scratchpad, task list, and intermediate artifacts go to a file or a store, not the window. The model reads them via a tool call when needed. This is the difference between an agent that carries its whole working memory in-context and dies on overflow, and one that uses context as a register and disk as memory.
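One way to picture it: a scratchpad backed by a JSON file, exposed to the model as two tools. This is a sketch under assumptions; the `Scratchpad` class, its method names, and the file layout are all hypothetical, and a real harness would add locking and size limits.

```python
import json
import tempfile
from pathlib import Path


class Scratchpad:
    """Task state kept on disk; the model reaches it through tool calls."""

    def __init__(self, path: Path) -> None:
        self.path = path
        if not self.path.exists():
            self.path.write_text(json.dumps({}))

    def write(self, key: str, value: str) -> str:
        state = json.loads(self.path.read_text())
        state[key] = value
        self.path.write_text(json.dumps(state, indent=2))
        return f"saved {key}"  # only this short acknowledgement enters context

    def read(self, key: str) -> str:
        state = json.loads(self.path.read_text())
        return state.get(key, f"no entry for {key}")


# Usage: the plan lives on disk and is fetched only when a step needs it.
pad = Scratchpad(Path(tempfile.mkdtemp()) / "scratchpad.json")
print(pad.write("plan", "1. add gate schema 2. wire judge loop"))
print(pad.read("plan"))
```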
Context is for what you need right now, not what you might need eventually. A persistent scratchpad the agent re-reads beats stuffing the whole plan into every turn.
Structure for prompt caching
Stable content at the front, variable content at the back. System prompt, tool definitions, fixed instructions: front, byte-identical every call, cacheable. The changing query: back. Shuffle the order or re-paste the persona mid-conversation and you bust the prefix, paying full price for tokens you could have had at a fraction.
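A cheap way to hold yourself to this is to assemble the prefix separately and log a fingerprint of it on every call. The sketch below assumes nothing about any provider's caching API, which varies and is not shown; `build_prompt` and `prefix_fingerprint` are illustrative names.

```python
import hashlib
import json


def build_prompt(
    system: str, tools: list[dict], history: list[str], query: str
) -> tuple[str, str]:
    """Return (stable_prefix, variable_suffix); only the suffix changes per call."""
    # sort_keys keeps tool definitions byte-identical regardless of dict key order
    prefix = system + "\n" + json.dumps(tools, sort_keys=True)
    suffix = "\n".join(history + [query])
    return prefix, suffix


def prefix_fingerprint(prefix: str) -> str:
    """Short hash to log per call; if it changes between calls, the cache was busted."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
```

If the fingerprint differs between two calls in the same session, something is interpolating variable data into the part that was supposed to stay fixed.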
A stable prefix is the cheapest win available, and most teams throw it away by interpolating a timestamp into the system prompt, which silently invalidates the cache on every call.
Right-size the model per step
Not every step needs your most expensive model. Classification, extraction, and routing run fine on a small fast one. Reserve the heavy model for actual reasoning. But the ordering is non-negotiable: fix context first, then route by difficulty. Cheaping out on the model while feeding it a landfill gets you a system that is both cheap and useless.
It's last because it's the only lever aimed straight at price, and it's the weakest. It does nothing about the quadratic growth or the rot underneath. Swapping models without fixing context just buys you a cheaper way to be wrong.
    # bad: the model pays to parse the whole blob, every turn it stays in context
    result = api.get_incident(incident_id)   # 4,000 tokens of JSON
    context.append(result)

    # good: the harness extracts, the model sees the answer in ~25 tokens
    result = api.get_incident(incident_id)
    context.append(
        f"incident {incident_id}: status={result['status']} "
        f"sev={result['severity']} owner={result['owner']} "
        f"affected={','.join(result['affected_services'])}"
    )
The tool still does full work. The harness decides what crosses into the window - a handful of fields, not the blob, and that saving is multiplied across every turn the result stays in scope.
Treat context as an asset, not a recurring expense
Most prompt advice stops at the single message. The bigger win is structural. When you work with a coding agent or anything that supports persistent instruction files, you stop re-pasting your conventions into every prompt, which is just the re-explaining habit wearing a different shirt. You write the context down once, in files scoped by how often they're needed, and each one loads only when it earns its place. They nest: one always-on, one on-demand, one per task. Same discipline as the levers above, pointed at your own workflow.
AGENTS.md - always loaded
Repo root, loaded at session start, paid for on every turn. Every agentic IDE reads its own always-on rules file - AGENTS.md (the cross-tool standard, read by Cursor, Codex, and others), CLAUDE.md for Claude Code, .cursor/rules for Cursor, .windsurf/rules for Windsurf. Same role in all of them. So it stays lean: stack, build commands, the conventions that are true every single time. This is your stable prefix - keep the order fixed so the cache holds. Anything procedural or occasional gets pushed down a layer.
SKILL.md - loaded on match
A folder of instructions pulled in only when the task matches its description. The trigger lives in the frontmatter; the body stays out of the window until needed. This is lever 1, retrieve don't paste, applied to your own instructions - "how we deploy" doesn't sit in context while you're writing tests.
context.md - per task
A short living brief for the current piece of work: goal, what's decided, what's open. You point the agent at it instead of re-narrating the state every turn. This is lever 4, externalized state - the brief lives on disk, the running history stays lean.
    # Sentinel SRE platform

    ## Stack
    Python 3.12, FastAPI, OpenTelemetry, Dynatrace, Kubernetes, ArgoCD.

    ## Commands
    test: `make test`                     # run before every commit
    lint: `make lint`
    deploy: `argocd app sync sentinel`    # never deploy by hand

    ## Conventions
    - Tool output is compacted before it hits the model. No raw blobs in context.
    - Every metric maps to a DURESS dimension or it doesn't ship.
    - No timelines or week estimates in any doc. Build it now.
    - INFO in prod, DEBUG by explicit toggle with a TTL.

    ## Don't
    - Don't paste full API responses. Extract the fields. See SKILL: tool-compaction.
    - Don't re-explain the stack every turn. It's here. Read it once.
The always-on layer. Every line is paid for on every turn, so it stays short and declarative. This is the stable prefix - hold the ordering fixed so the cache survives.
    ---
    name: tool-compaction
    description: "Use when adding or editing any tool that returns data to the model. How to compact tool output at the boundary so raw blobs never enter the window. Trigger on: new tool, API wrapper, agent tool result."
    ---
    # Compacting tool output at the boundary

    Rule: the model sees a summary, never the raw payload.

    1. Call the API and get the full result in your code.
    2. Extract only the fields the model needs to reason or act.
    3. Return a short, typed, human-readable string.
    4. If the full blob might be needed later, write it to a file and
       return the path, not the contents.  # see SKILL: externalized-state
The on-demand layer. The agent reads the description to decide whether to pull the body into context. Write the description for matching, the body for doing - the body stays out of the window until the task calls for it.
    # Task: canary DSL - add latency-budget gate

    ## Goal
    Fail a canary when p95 latency exceeds a per-service budget
    pulled from the SLO config, not a hardcoded number.

    ## Decided
    - Budget source: existing slo.yaml, keyed by service. No new file.
    - Runs in the existing Kayenta judge loop. No new service.
    - Failure blocks promotion, same path as the error-rate gate.

    ## Open
    - p95 at judge time: trace-derived SLI or metric? (Leaning trace. Confirm.)

    ## Done
    - Gate config schema (canary/dsl/gates.py)
    - Unit test for budget parsing
The per-task layer. Externalized working memory for one piece of work. Decided / Open / Done is enough structure. Update it as you go; the running history never has to carry the state.
Same discipline, three scopes
Your always-on rules file - AGENTS.md, CLAUDE.md, .cursor/rules, .windsurf/rules, whatever your IDE reads - is the prefix you always pay for, so keep it minimal. SKILL.md is on-demand retrieval, so the body stays out until the description matches. context.md is externalized state, so you point at it rather than re-narrate. None of these is a dumping ground. They're where you decide, deliberately, what the model sees and when - context engineering pointed at your own workflow, not just your product.
Match the model to the task, not the other way round
Lever 6 in the levers section, right-size the model per step, told you to match the model to the step. This is what that looks like in a development workflow. The mistake is not using expensive models, it's using them at the wrong frequency. A frontier model reviewing your PR once a day is a bargain. That same model firing on every keystroke is a budget fire you lit yourself.
Three tiers, three jobs. They're defined by task frequency and the nature of the reasoning required, not by how good you want the answer to be. All three tiers can produce a wrong answer; the question is what kind of wrong and how often the task demands judgment versus pattern completion.
Frontier - buy judgment, not throughput
Ideation, architecture decisions, PR review, ambiguous debugging, anything where the cost of a wrong answer is high or the problem genuinely requires reasoning at the edges. Low frequency, high per-call value. You're paying for judgment on tasks a mid-tier model would get subtly wrong in ways that compound. The cost is justified because the task is rare and the leverage is large.
Use cases: design review, PR critique, post-incident RCA, planning a multi-step refactor, evaluating a tricky tradeoff. Not: generating boilerplate, running repetitive agentic sub-tasks, inline suggestions.
Mid-tier - where the volume lives
Autonomous multi-step coding, agentic loops, multi-file edits, scaffold execution, routine generation tasks. This is where a well-engineered harness earns back its cost: mid-tier models with clean, lean context outperform frontier models drowning in a landfill, and they cost a fraction per call. The reasoning is good enough for defined, structured tasks. The savings are real at volume.
Use cases: feature implementation from a spec, test generation, refactoring passes, CI-triggered analysis, agent sub-steps that operate on structured inputs. The context discipline from the sections above matters most here, because more turns mean more exposure to bad harness design.
Cheap / free - completion, not reasoning
Keystroke-level tab completion, inline suggestions, single-token fills. The model is doing pattern completion against what's immediately visible, not reasoning over a problem. Cost per interaction should be nearly zero. Running anything more expensive here is the same category of mistake as pasting a full API response — you're paying for capability the task doesn't use.
Use cases: tab completion in an IDE, inline code suggestions, docstring generation from a function signature. If the model needs to "think," it's not a tab-completion task, so move it up a tier.
Ideation and review
Architecture exploration, PR review, post-incident analysis, high-stakes tradeoffs. Low frequency, high judgment requirement. Frontier tier. You're buying the model's ability to reason about things it hasn't seen exactly before.
Autonomous coding and agentic loops
Multi-file implementation, agentic scaffolding, iterative generation from a spec. Medium-to-high frequency. Mid-tier. Apply all six context levers here — these tasks run long, accumulate history fast, and are where bad harness design bills you hardest.
Inline and tab completion
Keystroke-level suggestions, single-line fills, autocomplete. Very high frequency, pattern-completion task. Cheap or free tier. If this line item is significant on your bill, you've wired the wrong model to it.
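In code, tiering is a lookup, not a debate. A minimal sketch of a router keyed on task type; the task labels are hypothetical names lifted from the use cases above, and defaulting unknown tasks to mid-tier is an assumption you may want to flip.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    MID = "mid"
    CHEAP = "cheap"


# Hypothetical task labels, taken from the use cases listed above.
ROUTES = {
    "design_review": Tier.FRONTIER,
    "pr_review": Tier.FRONTIER,
    "incident_rca": Tier.FRONTIER,
    "feature_implementation": Tier.MID,
    "test_generation": Tier.MID,
    "refactor_pass": Tier.MID,
    "tab_completion": Tier.CHEAP,
    "inline_suggestion": Tier.CHEAP,
}


def route(task_type: str) -> Tier:
    """Pick a tier by task type; unknown tasks fall back to mid-tier, not frontier."""
    return ROUTES.get(task_type, Tier.MID)
```

The point of making the table explicit is that someone has to add a line to send a new high-frequency task to the frontier tier. It stops happening by default.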
The mistake that looks like model loyalty
Running a frontier model in a tight agentic loop — hundreds of turns a day per developer — is the same class of error as pasting the full payload into context. You're paying for capability the task never calls on, at a frequency that turns a per-call rounding error into a line item nobody can explain. The fix is the same as everywhere else in this piece: match the resource to the actual requirement, not to the ceiling of what's available. Context discipline and model tiering are the same idea applied to two different levers.
Measure outcomes or you're optimizing the wrong thing
This is where AI feature work usually collapses. Someone wraps a model call in a UI, it demos well, nobody instruments it, and three months later the bill is five figures with no one able to say whether the thing is any good. If you can't answer the next question with numbers, it isn't a feature, it's a liability with a chat interface.
Here is the mistake almost every team makes, and it's worth being blunt about because it's the strongest argument in this whole piece. They optimize the metrics that are easy to read off a dashboard: token cost, model cost, p95 latency. Those feel like rigor. They aren't. They're inputs. The thing the business actually buys from an AI feature is resolved tasks: tickets closed, deployments correctly judged, documents correctly extracted, questions correctly answered without a human stepping in. None of the input metrics tell you whether that happened. You can cut token cost 40% by switching to a smaller model and quietly drop your resolution rate from 80% to 55%, and your cost dashboard will show a win while your feature gets materially worse. Every retry, every human escalation, every wrong answer someone has to catch and redo is a cost the per-call metric never sees.
So measure the thing you sell. Resolution rate, or successful automation rate, per task type. Then cost per completed task, which is total spend divided by the things that actually worked, retries and escalations included. That single number is the honest one, and it's usually the one nobody is tracking. A change that triples tokens per call but doubles resolution and halves retries is a large win that every input metric will tell you is a loss. If your reporting can't show that, your reporting is lying to you in the direction of looking frugal.
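Here is the smaller-model scenario from above worked through, as a sketch. The 80% and 55% resolution rates and the 40% cost cut come from the text; the task count, the per-task model cost, and the $5 of human handling per unresolved task are made-up figures for illustration.

```python
def cost_per_completed_task(
    tasks: int,
    resolution_rate: float,
    model_cost_per_task: float,
    escalation_cost: float,
) -> float:
    """Total spend, including human handling of failures, divided by resolved tasks."""
    if not 0 < resolution_rate <= 1:
        raise ValueError("resolution_rate must be in (0, 1]")
    resolved = tasks * resolution_rate
    unresolved = tasks - resolved
    total_spend = tasks * model_cost_per_task + unresolved * escalation_cost
    return total_spend / resolved


# Illustrative figures only: 1,000 tasks, $0.10 vs $0.06 model cost per task,
# $5.00 of human time for every task the feature fails to resolve.
before = cost_per_completed_task(1000, 0.80, 0.10, 5.00)
after = cost_per_completed_task(1000, 0.55, 0.06, 5.00)
print(f"before: ${before:.2f} per resolved task")
print(f"after:  ${after:.2f} per resolved task")
```

Under these assumptions the "40% cheaper" change takes cost per resolved task from roughly $1.40 to roughly $4.20. The token dashboard shows a win; this number shows the bill tripling.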
The trap to avoid
The cleanest tell that you're measuring the wrong layer: you can drive tokens per call to zero by making the feature do nothing. Any metric a broken feature can ace is not the metric. Instrument outcomes, not calls. A dashboard showing token counts but not resolution rate is a cost meter you called observability, the same mistake as 400 dashboards and no answer to the health question, pointed at the model instead of the fleet.
The minimum bar for a production AI feature
The slop pattern is simple: wrap a model call in a UI, call it a feature, ship it, never measure it. A real AI feature clears the same bar as anything else you run in production. Six checks. Miss one and it isn't done.
Defined task, defined success
Not "an AI assistant." A specific job with a measurable did-it-work signal you can put on a dashboard and hold someone to.
Context is curated, not dumped
Code decides what the model sees at each step, and you can point to it. "We paste everything and hope" is a funnel into the billing system, not a harness.
Tool output compacted at the boundary
The model never sees a raw blob a human wouldn't read. Every result is shaped before it crosses into the window.
Instrumented on cost per outcome
You answer "is it working and what does working cost" with a query, not a hunch. Tokens-per-call alone is a number that only feels like instrumentation.
Degrades honestly
On a retrieval miss or context overflow it says so or falls back. It does not confidently hallucinate over a context it never actually had.
Has a budget and an alarm
Cost is a reliability metric. A feature with no spend ceiling is an incident waiting for a billing cycle. Put a cost SLO on it and page on burn rate, exactly like an error budget.
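A burn-rate check on spend can be a few lines. This is a deliberately simplified sketch: a single window against a monthly budget, with a hypothetical paging threshold of 2x. Error-budget alerting in practice often combines several windows; that refinement is left out here.

```python
def burn_rate(
    spend_so_far: float, monthly_budget: float, fraction_of_month_elapsed: float
) -> float:
    """Actual spend divided by what a steady pace would have spent by now."""
    if not 0 < fraction_of_month_elapsed <= 1:
        raise ValueError("fraction_of_month_elapsed must be in (0, 1]")
    expected = monthly_budget * fraction_of_month_elapsed
    return spend_so_far / expected


def should_page(rate: float, threshold: float = 2.0) -> bool:
    """Page when spend runs at more than `threshold` times the budgeted pace."""
    return rate > threshold
```

A rate of 1.0 means the feature is on track to land exactly on budget. A rate of 3.0 three days into the month means the budget is gone around day ten unless someone intervenes.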
Most AI cost problems are not model problems. They're systems-design problems wearing a model-shaped mask. Teams blame the thing they bought because it's the visible part, and the harness is where the waste actually lives because it's the part nobody owns. Treat context as free storage and three things rise together: cost, latency, and the rate at which the answers quietly get worse. None of that needs a worse model. It only needs an undisciplined system wrapped around a good one. The window is your ceiling. What you put in it is the whole game.