Cut Your Claude Code and Codex Token Usage | Strake

On a metered API key, token waste shows up as a bill. On a subscription plan (Claude Code, Codex, the "5x" and "20x" tiers), it shows up as a lockout.

You are halfway through a real task. The agent has the repo in its head, a failing test in front of it, and enough context to make the next edit. Then it stops. No useful warning. No obvious culprit. Just a usage wall.

The frustrating part is that the thing that burned the window was probably not the work itself. It was repeated context, cache misses, tool schemas, oversized models, and long sessions carrying old state forward.

Most token waste in coding agents is mechanical. Fixing it starts with understanding how context gets sent, cached, repeated, and billed.

The mental model

Every turn, your agent sends a working context back to the model: system instructions, tool definitions, prior messages, file excerpts, tool results, and whatever else the agent believes is relevant.

The model does not continue from a private memory of the previous turn. It re-reads a prompt.

So the budget is not "how much did I ask it to do." The budget is:

tokens per turn * number of turns

Waste is the gap between that number and the amount of context the work actually needed.
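To make that arithmetic concrete, here is a minimal sketch with made-up numbers. The session shape (a 20,000-token base that grows by 4,000 tokens per turn) and the function names are hypothetical, chosen only to show how per-turn growth compounds.

```python
def session_input_tokens(per_turn_inputs):
    """Total input tokens sent over a session: one full prompt per turn."""
    return sum(per_turn_inputs)


def wasted_tokens(per_turn_inputs, needed_per_turn):
    """Gap between what was sent and what the work needed, summed over turns."""
    return sum(
        max(sent - needed, 0)
        for sent, needed in zip(per_turn_inputs, needed_per_turn)
    )


# Hypothetical session: 10 turns, context grows by 4,000 tokens each turn,
# while the work itself only ever needed the 20,000-token base.
sent = [20_000 + 4_000 * turn for turn in range(10)]
needed = [20_000] * 10

print(session_input_tokens(sent))   # 380000
print(wasted_tokens(sent, needed))  # 180000
```

In this toy session almost half of the input tokens are carried-over context rather than work.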

Here are the leaks I would look for first.

Leak 1: Prompt cache misses

Prompt caching is the highest-leverage optimization because it applies to repeated context, which is exactly what agent sessions are full of.

Providers can cache the stable prefix of a prompt. If the next request starts with the same prefix, the model can read that cached content at a much lower rate. On current Claude Opus pricing, standard input is $5 per million tokens and cache reads are $0.50 per million. Same text, 10x difference. OpenAI's cached-input pricing has the same shape: repeated prefix tokens are much cheaper than fresh input.

That makes cache hit rate one of the first numbers worth checking:

cache_hit_rate = cache_read / (cache_read + cache_creation + uncached_input)

In a long coding session, this should usually be high. The repo summary, system prompt, tool definitions, and earlier conversation are mostly repeated turn to turn. If the cache hit rate is low, you are paying full freight for context the model just saw.
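As a rough sketch, the ratio above can be computed from per-turn token counts like this. The `TurnUsage` record and its field names are illustrative, not any tool's actual log schema, and the savings figure uses only the two list prices quoted above (it ignores cache-write pricing entirely).

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    uncached_input: int
    cache_read: int
    cache_creation: int


def cache_hit_rate(turns):
    """cache_read / (cache_read + cache_creation + uncached_input) over all turns."""
    read = sum(t.cache_read for t in turns)
    total = read + sum(t.cache_creation + t.uncached_input for t in turns)
    return read / total if total else 0.0


def cache_read_savings_usd(turns, input_per_mtok=5.00, cache_read_per_mtok=0.50):
    """List-price difference between reading tokens from cache and re-sending them fresh."""
    read = sum(t.cache_read for t in turns)
    return read * (input_per_mtok - cache_read_per_mtok) / 1_000_000


# Hypothetical three-turn session: the first turn writes the cache, later turns read it.
turns = [
    TurnUsage(uncached_input=2_000, cache_read=0, cache_creation=50_000),
    TurnUsage(uncached_input=1_000, cache_read=50_000, cache_creation=2_000),
    TurnUsage(uncached_input=1_000, cache_read=52_000, cache_creation=1_000),
]
print(f"hit rate: {cache_hit_rate(turns):.2f}")
print(f"list-price savings from cache reads: ${cache_read_savings_usd(turns):.3f}")
```

The hit rate is low in the first few turns of any session because the cache has to be written before it can be read. Measure it over a window of recent turns rather than from turn one.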

Common ways teams break the cache:

Changing content near the front of the prompt.
Injecting volatile state, timestamps, or changing status text before stable context.
Reordering tool definitions or MCP tool schemas between turns.
Pasting large files into chat instead of letting the agent read them when needed.

The fix is not clever. Keep the front of the prompt stable. Put volatile material later. Do not reshuffle system context every turn. Treat tool definitions like part of your hot path, because they are.
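One way to picture the ordering rule is a prompt assembled from parts, with the stable parts first and anything volatile last. This is a simplified sketch, not how Claude Code or Codex build prompts internally: real caches match on token prefixes, and the part-level comparison here is only a proxy. The function names and the tool dictionaries are hypothetical.

```python
import json


def build_prompt_parts(system_prompt, tools, history, volatile_status):
    """Order parts so that everything stable comes before anything volatile."""
    # Sort tools by name and serialize with sorted keys so the text is
    # identical from turn to turn regardless of registration order.
    stable_tools = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    # History is append-only; the volatile status line goes last.
    return [system_prompt, stable_tools, *history, volatile_status]


def shared_prefix_parts(previous, current):
    """Count how many leading parts are identical between two turns."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


tool_a = {"name": "read_file", "description": "Read a file by path"}
tool_b = {"name": "run_tests", "description": "Run the test suite"}

turn_1 = build_prompt_parts("You are a coding agent.", [tool_a, tool_b],
                            ["user: fix the failing test"], "status: 10:01")
turn_2 = build_prompt_parts("You are a coding agent.", [tool_b, tool_a],
                            ["user: fix the failing test", "assistant: reading test file"],
                            "status: 10:02")

# System prompt, tool block, and the first message are all reusable.
print(shared_prefix_parts(turn_1, turn_2))  # 3
```

If the status line were placed first instead, the shared prefix would drop to zero parts even though nothing else changed.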

Leak 2: Context bloat

Long context is useful, but it is not free. Every extra chunk you keep in the session gets carried into later turns unless the agent compacts or drops it.

The bad pattern is easy to recognize: one task turns into three tasks, debugging turns into implementation, implementation turns into release notes, and the session still contains the whole archaeology dig from two hours ago.

Compare the average input size of the first five turns with the last five. If the last five average more than 2x the size of the first five, and the session is already past roughly 30K tokens, you are probably paying a long-context tax on every remaining turn.
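A minimal sketch of that check, assuming you already have a list of per-turn input sizes. "Past roughly 30K tokens" is interpreted here as the most recent turn's input size, which is one reading of the rule; the function names and defaults are illustrative.

```python
def context_growth(input_sizes, window=5):
    """Ratio of the average of the last `window` turns to the first `window` turns."""
    if len(input_sizes) < 2 * window:
        return None  # not enough turns to compare non-overlapping windows
    first = sum(input_sizes[:window]) / window
    last = sum(input_sizes[-window:]) / window
    return last / first if first else None


def paying_long_context_tax(input_sizes, ratio_limit=2.0, size_floor=30_000):
    """True when context has more than doubled and the session is already large."""
    ratio = context_growth(input_sizes)
    return ratio is not None and ratio > ratio_limit and input_sizes[-1] > size_floor


# Hypothetical session: input grows by 5,000 tokens per turn over 12 turns.
sizes = [10_000 + 5_000 * i for i in range(12)]
print(context_growth(sizes))           # 2.75
print(paying_long_context_tax(sizes))  # True
```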

Operational fixes:

Start a new session when the task boundary changes.
Compact before the session is huge, not after the model is already struggling.
Ask the agent to summarize decisions and open questions, then restart from that summary.
Reference files by path. Do not paste thousands of lines into the conversation unless you want to pay for those lines repeatedly.

This is the same discipline as log handling. You do not hand someone a 10,000-line log when grep ERROR would do.

Leak 3: MCP and tool overhead

MCP servers are useful, but they are not invisible. Each server exposes tool definitions. Those definitions sit in context. They get sent again and again.

One or two focused servers is usually fine. A pile of always-on servers can become a fixed tax on every turn before the user has asked for anything. If you connect five MCP servers and each one exposes a dozen tools, you can burn a surprising amount of input budget on tools the agent never calls.

The check is simple:

tool_schema_tokens / total_input_tokens

If tool schemas are a big percentage of the session, especially for tools you did not use, disconnect them. Bring them back when the task needs them.
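Here is a small sketch of the same check. The 150-tokens-per-tool figure is an invented placeholder (real schema sizes vary widely), and the tool names are hypothetical; substitute measured counts from your own sessions.

```python
def tool_schema_share(tool_schema_tokens, total_input_tokens):
    """Fraction of input tokens spent on tool definitions."""
    if total_input_tokens <= 0:
        return 0.0
    return tool_schema_tokens / total_input_tokens


def unused_tools(connected_tools, called_tools):
    """Tools that were in context but never called, sorted for stable output."""
    return sorted(set(connected_tools) - set(called_tools))


# Five servers with a dozen tools each, at an assumed 150 tokens per definition.
schema_tokens_per_turn = 5 * 12 * 150
print(schema_tokens_per_turn)                             # 9000
print(tool_schema_share(schema_tokens_per_turn, 40_000))  # 0.225

print(unused_tools(["search_issues", "read_file", "query_db"], ["read_file"]))
# ['query_db', 'search_issues']
```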

That is basic capacity management, not an argument against MCP.

Leak 4: Model and reasoning right-sizing

There are two separate problems here.

First, frontier models get used for low-variance work. Formatting JSON, renaming a variable, writing a commit message, and applying a mechanical refactor usually do not need the most expensive model in the stack.

Second, reasoning tokens are easy to ignore because you often do not see them. On models with explicit thinking or reasoning effort, those tokens are billed as output. Output is usually the expensive side of the request.

The answer is routing, but routing with guardrails:

Use the strongest model for ambiguous architecture, hard debugging, and edits where a bad answer is expensive.
Use cheaper models for formatting, extraction, summarization, and mechanical code changes.
Lower reasoning effort for simple tasks.
Track quality before and after the route change.

This is the one leak where the fix can hurt quality. Do not automate it blindly. Start with one workflow and measure whether the cheaper path holds up.
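A routing rule with a guardrail might look something like the sketch below. The task labels, the model names "strong-model" and "cheap-model", the effort levels, and the 2-point tolerance are all placeholders, not recommendations; the pass rates would come from whatever quality measure you already trust for the workflow.

```python
from dataclasses import dataclass

# Task kinds treated as low-variance, following the list above.
CHEAP_TASKS = {"format", "extract", "summarize", "mechanical_refactor"}


@dataclass(frozen=True)
class Route:
    model: str
    reasoning_effort: str


def route(task_kind, high_stakes):
    """Send only known low-variance, low-stakes work to the cheaper path."""
    if high_stakes or task_kind not in CHEAP_TASKS:
        return Route(model="strong-model", reasoning_effort="high")
    return Route(model="cheap-model", reasoning_effort="low")


def keep_cheap_route(baseline_pass_rate, cheap_pass_rate, max_drop=0.02):
    """Guardrail: keep the cheaper route only if quality dropped by at most max_drop."""
    return baseline_pass_rate - cheap_pass_rate <= max_drop


print(route("format", high_stakes=False))           # cheap path
print(route("architecture", high_stakes=False))     # unknown kind falls back to strong
print(keep_cheap_route(0.95, 0.94))                  # True
print(keep_cheap_route(0.95, 0.85))                  # False
```

Defaulting unknown task kinds to the strong model is deliberate: a misrouted hard task costs more than an over-provisioned easy one.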

Leak 5: No burn-rate visibility

The last leak is operational, not model-specific.

Most people only learn they were burning too fast when the subscription window closes on them. That is a terrible alert. It fires after the user has already lost the session.

You want rate, not just total:

Tokens this session.
Tokens this hour.
Cache hit rate over the last N turns.
Context size trend.
Reasoning/output ratio.
Tool schema overhead.

A session running at 3x its normal burn rate should be visible while there is still time to change behavior. That is the difference between an alert and a postmortem.
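As an illustration, a rate check can be as simple as summing tokens over a trailing window and comparing against a baseline. The event format (timestamp in seconds, token count) and the function names are assumptions for this sketch; the 3x multiplier comes from the example above.

```python
def tokens_in_window(events, now, window_seconds=3600):
    """Sum tokens for events in the trailing window (now - window, now]."""
    return sum(
        tokens
        for timestamp, tokens in events
        if now - window_seconds < timestamp <= now
    )


def burn_alert(current_rate, normal_rate, multiplier=3.0):
    """True when the current rate is at least `multiplier` times the normal rate."""
    return normal_rate > 0 and current_rate >= multiplier * normal_rate


# Hypothetical events: (seconds since session start, tokens used).
events = [(0, 1_000), (1_800, 2_000), (3_500, 3_000), (3_700, 4_000)]

last_hour = tokens_in_window(events, now=3_700)
print(last_hour)                                # 9000
print(burn_alert(last_hour, normal_rate=3_000))  # True
```

The baseline should come from your own history, for example a median hourly rate over recent sessions, not from a fixed constant.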

How to see this

The useful data is already local. Claude Code and Codex write per-session token breakdowns to disk: uncached input, cache reads, cache writes, output, and reasoning. I wrote up exactly where those files live and how to read them in a separate post: Read your coding agent token usage.

You do not need prompt text to find these leaks. Model names, timestamps, tool usage, and token counts are enough.
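To show that token counts alone are enough, here is a sketch that totals usage from JSON Lines text and never touches message content. The record layout is an assumption: it supposes each line is a JSON object with a top-level `usage` object using the field names of the Anthropic Messages API usage block. The files your agent actually writes may nest or name these fields differently, so check the real format before relying on this.

```python
import json

USAGE_FIELDS = (
    "input_tokens",
    "cache_read_input_tokens",
    "cache_creation_input_tokens",
    "output_tokens",
)


def sum_usage(jsonl_text):
    """Total the usage counters across JSON Lines records, skipping anything else."""
    totals = {field: 0 for field in USAGE_FIELDS}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partial or corrupt lines
        usage = record.get("usage") if isinstance(record, dict) else None
        if not isinstance(usage, dict):
            continue
        for field in USAGE_FIELDS:
            value = usage.get(field, 0)
            if isinstance(value, int):
                totals[field] += value
    return totals


# Hypothetical log text in the assumed layout.
sample = "\n".join([
    '{"usage": {"input_tokens": 1200, "cache_read_input_tokens": 0, '
    '"cache_creation_input_tokens": 48000, "output_tokens": 300}}',
    '{"type": "note"}',
    'not json',
    '{"usage": {"input_tokens": 800, "cache_read_input_tokens": 48000, '
    '"cache_creation_input_tokens": 1500, "output_tokens": 450}}',
])
print(sum_usage(sample))
```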

If you do not want to write the parser, that is what I built ModelMeter for. The collector reads local agent logs (token counts only, never prompts or keys) and surfaces the metrics that actually change behavior:

Cache hit rate.
Context growth.
MCP/tool overhead.
Reasoning token burn.
Model routing opportunities.
Session burn rate.

npx modelmeter-collect init <your-token>
npx modelmeter-collect

One important caveat: on a subscription plan, you are not paying per token at the margin. Any dollar figure is an API list-price equivalent, not your actual bill. It is useful for scale, but it is not the thing to optimize directly.

On a flat plan, the numbers that matter are token volume, cache hit rate, context growth, and whether you are pacing the usage window. Be suspicious of any tool that tells subscription users, with fake precision, that they "spent $X this month."

The point

Token waste in coding agents is rarely mysterious once you can see it. The usual causes are cache misses, context bloat, idle tool overhead, oversized models, and no burn-rate visibility.

None of those require a new philosophy of prompting. They require instrumentation and a few boring habits.

Start with cache hit rate. If it is bad, fix that first. It is usually the cheapest win in the whole stack.

Free to try at modelmeter.dev.
