Claude Code token usage: 75.5% cut with 17 MCPs

Q: How much does Claude Code cost per day for a heavy user?

A: The Russian-language VC.ru analysis cites roughly $6/day for an average developer, which is consistent with the $100–200/month band from English-language 2026 comparisons. Heavy use varies widely with codebase size and how much of your day is in the agent.

Q: Does prompt caching work with MCP tools?

A: Yes. Anthropic's Tool Search Tool was designed specifically not to break prompt caching — that is called out explicitly in the advanced tool use engineering post. The two stack cleanly.

Q: Is the 98.7% Anthropic claim real?

A: Yes. It is sourced to the official Anthropic engineering blog with the Google Drive → Salesforce example, and Cloudflare independently arrived at the same architectural conclusion (branded Code Mode). The 150K → 2K example is in both write-ups.

Q: What is the best free or open-source way to reduce Claude Code tokens?

A: The hwai-mcp-stack described on this page is one such option: MIT-licensed, no API key required for the full profile, local-only by default, one-command install. The official Anthropic cost-management documentation covers the model-side levers: context management, model selection, extended-thinking settings, and preprocessing hooks. The two are complementary.

Q: How does this compare to a paid product like Cursor's codebase index or Sourcegraph Cody?

A: Different scope. Cursor and Cody bundle a coding agent, a UI, and a hosted retrieval index into one product. This stack is the local prep layer your existing agent calls — you keep your editor and your model choice, and the prep happens on your machine before anything reaches the API.

Q: When should a team skip an MCP prep layer?

A: Skip it when the task is already small, exact, and cheap. The Codex preview cohort on this page (N=17) showed high variance: some tasks saved tokens, some were flat, and some got worse because MCP overhead exceeded the savings. Measure by task class instead of assuming an MCP stack is always cheaper.

The problem

How much does heavy Claude Code use actually cost in 2026?

The 2026 pricing for AI-coding tools converged on $20/month as the floor. The ceiling drifted up. April 2026 comparisons from agentdeals.dev, ijonis.com, and nxcode.io put heavy Claude Code use at $100–200/month, Cursor Ultra and Windsurf Max at the same tier, and Devin starting at $20/month with no real free tier.¹ The morphllm.com round-up of 15 coding agents from March 2026 puts it bluntly:⁴ "Heavy Claude Code usage hits $100–200/month. Cursor credits drain unpredictably. Token waste from hallucinations is real money."

Russian-language coverage is consistent. The VC.ru article "Saving tokens — tactics that cut Claude Code token spend" (in Russian) opens with the same number framed in rubles:⁵ an average developer spends roughly $6/day on Claude Code, which compounds to ~14,000–19,000 RUB/month at a 75–95 RUB/USD rate. None of these write-ups disagree about the bill. They disagree about what to do about it.

The "what to do about it" answers cluster into three categories: switch models, switch tools, and reduce what reaches the model. Switching models is brittle. Switching tools is a high-friction bet. The third option is the one almost nobody implements: change what the agent actually sends to the model on each turn. That third option is where the token-economy stack lives.

Mental model

What are the two layers of LLM token economy?

A useful model for LLM cost optimization has two layers that multiply each other's savings rather than substituting for one another.

Layer A — provider-native primitives. These ship inside the model API and require no infrastructure. Anthropic prompt caching reduces input cost on cache reads by roughly 90% (with a 5-minute default TTL and 1-hour extended TTL). Tool Search Tool with defer_loading cuts tool-definition token usage by 85% while improving accuracy: Opus 4 climbed from 49% to 74%, Opus 4.5 from 79.5% to 88.1%.⁶ Anthropic's context management features report 84% token reduction on a 100-turn web-search eval (context editing alone), and the memory tool combined with context editing delivers a 39% accuracy improvement over baseline.⁷ Anthropic's Message Batches API takes 50% off async work. OpenAI's prompt caching does the same job automatically on repeated prefixes, advertised as 80% latency and 90% input-cost reduction. Google Gemini ships context caching, both implicit (Gemini 2.5+) and explicit (guaranteed savings).

Layer B — pre-model prep. This sits upstream of the model. It reduces what reaches Layer A in the first place. A 30,000-token build log compacted to 3,000 tokens before the model ever sees it. A 100-megabyte screenshot reduced to a 30-kilobyte annotated crop. A repo-wide grep that returns five line-anchored files instead of forty-three full file bodies.

Most "10 tips to save Claude Code tokens" articles cover Layer A. Few cover Layer B, and almost nobody combines them on purpose. The interesting math is in the combination. If Layer B turns a 30k log into 3k, that is a 10× reduction. If Layer A then cache-hits that 3k across the next four iterations, you save another 90% on the cache reads. The two effects multiply.
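The multiplication can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes cache reads cost 10% of the base input price, ignores any cache-write surcharge, and assumes every turn after the first is a cache hit. The function and scenario names are made up for this example.

```python
# Token-equivalents billed for one payload that is re-sent on every turn.
# Assumptions (not from the source): cache reads cost 10% of base input,
# cache-write surcharges are ignored, every repeat turn hits the cache.

CACHE_READ_FACTOR = 0.10  # "roughly 90%" cheaper on cache reads


def billed_tokens(payload_tokens: int, turns: int, cached: bool) -> float:
    """Input token-equivalents paid for sending the same payload each turn."""
    if not cached:
        return float(payload_tokens * turns)
    first_turn = payload_tokens  # full price once
    repeat_turns = (turns - 1) * payload_tokens * CACHE_READ_FACTOR
    return first_turn + repeat_turns


RAW_LOG, COMPACTED_LOG, TURNS = 30_000, 3_000, 5  # first turn + four iterations

scenarios = {
    "neither layer": billed_tokens(RAW_LOG, TURNS, cached=False),
    "Layer A only": billed_tokens(RAW_LOG, TURNS, cached=True),
    "Layer B only": billed_tokens(COMPACTED_LOG, TURNS, cached=False),
    "both layers": billed_tokens(COMPACTED_LOG, TURNS, cached=True),
}

baseline = scenarios["neither layer"]
for name, tokens in scenarios.items():
    print(f"{name:14s} {tokens:>9,.0f} token-equivalents ({tokens / baseline:.1%} of baseline)")
```

Under these assumptions Layer A alone leaves 28% of the baseline, Layer B alone leaves 10%, and both together leave 2.8%, which is the product of the two fractions.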

That multiplication is the actual reason heavy users are surprised by their bills. They are paying for the uncached, unfiltered, full-fat payload their agent shipped to the model the first time around, on every turn, without any pre-model prep. A 17-MCP local stack is one way to fix the Layer B half of that equation.

Pattern · canonical primary source

Why does the Anthropic pattern matter? Code execution as a tool interface

The architectural pattern behind Layer B has a canonical primary source: Anthropic's engineering post Code execution with MCP: Building more efficient agents, published November 4, 2025, by Adam Jones and Conor Kelly.²

Their example is concrete. An agent is asked to download a meeting transcript from Google Drive and attach it to a Salesforce lead. The standard pattern loads both MCP tool definitions upfront, then the model calls each tool in sequence and the full transcript flows through the model context twice — once on read, once on write. For a 2-hour meeting, that is roughly 50,000 extra tokens passing through the context window for no reasoning benefit.

The alternative presents MCP servers as a code API rather than direct tool calls. The agent writes:

import * as gdrive from './servers/google-drive';
import * as salesforce from './servers/salesforce';

const transcript = (await gdrive.getDocument({ documentId: 'abc123' })).content;
await salesforce.updateRecord({
  objectType: 'SalesMeeting',
  recordId: '00Q5f000001abcXYZ',
  data: { Notes: transcript }
});

The tool definitions are discovered on-demand by reading the filesystem (one server per directory, one tool per file). The transcript itself never enters the model context — it lives in the execution environment, gets passed from one call to another in code, and only logs what the agent explicitly chooses to log. The measured result: 150,000 tokens reduced to 2,000 tokens — a 98.7% reduction.²

Cloudflare independently arrived at the same conclusion, branded Code Mode.⁸ Two implementations, different in detail. One shared pattern: agents are good at writing code, and code is a more efficient interface to a set of tools than direct tool-call syntax.
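The "one server per directory, one tool per file" discovery idea can be sketched in a few lines. This is a toy model, not Anthropic's or Cloudflare's implementation: the directory layout is built in a temporary folder, the placeholder file contents stand in for real tool definitions, and `listFiles` and `query` are made-up tool names.

```python
import tempfile
from pathlib import Path


def list_servers(root: Path) -> list[str]:
    """One server per directory."""
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def list_tools(root: Path, server: str) -> list[str]:
    """One tool per file; only file names are read here, not definitions."""
    return sorted(p.stem for p in (root / server).glob("*.ts"))


def load_definition(root: Path, server: str, tool: str) -> str:
    """Read a single tool definition only when the agent decides it needs it."""
    return (root / server / f"{tool}.ts").read_text(encoding="utf-8")


# Hypothetical layout; listFiles and query are invented for illustration.
LAYOUT = {
    "google-drive": ["getDocument", "listFiles"],
    "salesforce": ["updateRecord", "query"],
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "servers"
    for server, tools in LAYOUT.items():
        (root / server).mkdir(parents=True)
        for tool in tools:
            body = f"// placeholder definition for {server}.{tool}\n" * 40
            (root / server / f"{tool}.ts").write_text(body, encoding="utf-8")

    servers = list_servers(root)
    drive_tools = list_tools(root, "google-drive")
    # Cost of loading every definition upfront versus only the two needed.
    all_chars = sum(
        len(load_definition(root, s, t)) for s in servers for t in list_tools(root, s)
    )
    needed_chars = len(load_definition(root, "google-drive", "getDocument")) + len(
        load_definition(root, "salesforce", "updateRecord")
    )

print(servers, drive_tools)
print(f"loaded {needed_chars} of {all_chars} definition characters")
```

Listing names is cheap; the definitions themselves are read only for the tools the task actually uses.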

Anthropic's own internal anecdote, from the advanced tool use post: their tool definitions consumed 134K tokens before optimization.⁶ A five-server, 58-tool setup is already at roughly 55K. Adding more servers such as Jira pushes the total toward 100K. Those tokens are spent before the agent has read a single character of the user's request.

A 17-server stack inherits this problem by default. The interesting design question is what to do about it.

The implementation

What is in the 17-MCP local-first stack?

github.com/g-shevchenko/hwai-mcp-stack is one answer. It is a local-first, MIT-licensed Token Efficiency Platform that bundles 17 MCP servers behind a single install path and a shared MCP config that works across Claude Code, Codex, Cursor, and Windsurf.

The stack is organized by capability family:

Capability	MCPs
Routing (the meta-layer)	HWAI Context Router (router-lite-mcp)
Compact repo retrieval before edits	retrieval-mcp, context-prep-mcp
Code structure and history	language-graph-mcp, repo-history-mcp
Local static checks and quality gates	static-analysis-mcp, repo-quality-gate-mcp
Keeping a growing repo clean	repo-hygiene-mcp, docs-hygiene-mcp, docs-sync-mcp
Contracts and dependency risk	contract-schema-mcp, dependency-risk-mcp
Regression datasets from real misses	golden-dataset-mcp, agent-trace-mcp
Browser traces and visual changes	playwright-trace-mcp, vision-mcp, visual-baseline-mcp

Four install profiles let you scope to the work you actually do: core (6 MCPs, first install), repo (14, large codebase work), browser-debug (10, for Playwright/screenshot workflows), and full (all 17). The full profile is local-only and does not require any API keys — that distinguishes it from most of the commercial alternatives whose first user step is "paste your provider key here."

The routing layer is the conceptually load-bearing piece. The HWAI Context Router (router-lite-mcp) decides, before any frontier model token is spent, whether a prep MCP should be called at all. This is the local-first equivalent of Anthropic's search_tools server-side helper from the Code Execution post: it gives the agent a deterministic trigger gate so that simple questions stay simple and don't pay the tool-stuffing tax.
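A deterministic trigger gate of this kind can be pictured as a pure function over the request. The sketch below is a guess at the shape of such a gate, not the logic of router-lite-mcp: the `Request` type, the byte threshold, and the keyword list are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    prompt: str
    attached_bytes: int = 0  # size of logs, screenshots, or files the agent would paste


# Hypothetical threshold and keywords, chosen only for illustration.
MIN_ATTACHMENT_BYTES = 8_000
PREP_KEYWORDS = ("stack trace", "build log", "screenshot", "across the repo")


def needs_prep(request: Request) -> bool:
    """Return True only when a prep step is likely to pay for itself.

    The rules are deterministic: same request in, same decision out,
    and no model tokens are spent making the decision.
    """
    if request.attached_bytes >= MIN_ATTACHMENT_BYTES:
        return True
    prompt = request.prompt.lower()
    return any(keyword in prompt for keyword in PREP_KEYWORDS)


if __name__ == "__main__":
    for req in (
        Request("What does git rebase do?"),
        Request("Why does this fail?", attached_bytes=120_000),
        Request("Summarize this build log"),
    ):
        print(f"{req.prompt!r:32} -> prep: {needs_prep(req)}")
```

The design point is that a simple question falls through both checks and goes straight to the model with no prep call at all.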

Measured

What did the public dogfood eval actually measure?

The README publishes a single measured before/after, with caveats. On 30 Apr 2026, a local deterministic eval across 12 reviewed-public tasks compared the baseline path (no stack) against the same agent with the stack enabled:³

Aggregate context-token reduction: 75.5%
Aggregate total-token reduction: 70.5%
Baseline task success: 91.7% → stack task success: 100.0%
Critical false positives did not increase
Per-family context reduction ranged from 35.0% to 80.8% across repo hygiene, traces, screenshots, logs, retrieval, and compression

The README's own caveat is reprinted here because it matters: this is internal dogfood evidence on 12 tasks. It is not a leaderboard claim, and it is not a universal percentage. Your savings will depend on repo size, task type, agent behavior, and whether the agent would otherwise paste entire files, logs, or screenshots into context.

What it does suggest is that the prep-first approach can deliver Layer B savings that compound usefully with whatever Layer A primitive you already use. A 75% context reduction at the prep layer leaves 25% of the payload; a roughly 90% discount on cache reads of that reduced payload at the API layer then leaves 0.25 × 0.10 ≈ 2.5% of the original input token cost on a repeat turn. The exact arithmetic depends on the workload. The direction is consistent.

Cost model

When does the 75.5% reduction actually save money?

The 75.5% is a tokens-saved number. The dollars-saved number depends on three things the headline figure hides: the price tier in use, the wall-clock overhead the prep step adds, and how the reader values developer time. Writing it as a formula:

C = T_in × (1 − R) × price_in + T_out × price_out + (Overhead × wage) ⁄ (3.6 × 10⁶)

R is the context reduction (a fraction). price_in and price_out are the per-token input and output prices of the chosen Claude tier (the tables below quote them per million tokens). Overhead is the wall-clock cost of the prep step in milliseconds. wage is the value of developer time in dollars per hour. The 3.6 × 10⁶ divisor is the number of milliseconds in an hour, so it converts the hourly wage into dollars per millisecond of overhead. The token term shrinks with R; the time term grows with Overhead × wage. Whether MCP saves money comes down to whether the first term shrinks faster than the second term grows.

At the smoke-fixture parameters from the eval harness — R = 77%, T_in ≈ 18,300, T_out ≈ 815, Overhead ≈ 4 s — the per-task saving across the three Claude tiers and two wage assumptions is:

Tier (input/output $ per Mtok)	Saving at $0/hr (token cost only)	Saving at $100/hr (token + developer time)
Opus 4 ($15 / $75)	+$0.211 / task (+63%)	+$0.101 / task (+30%)
Sonnet 4 ($3 / $15)	+$0.042 / task (+63%)	−$0.068 / task (−101%)
Haiku 3.5 ($0.80 / $4)	+$0.011 / task (+63%)	−$0.099 / task (−554%)

The $0/hr column is what a context reduction of this size translates to in pure token-cost terms: a flat dollar saving of about 63% on all three tiers, identical because each tier prices output at five times input. The $100/hr column exposes the part the headline number hides. Once developer time has any value, the 4-second prep overhead becomes a real line item. On the cheap tiers, where each saved token is worth less, that time cost can exceed the token saving outright.
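The table can be approximately reproduced from the stated parameters. The sketch below assumes input and output tokens are priced separately, as the tier column lists them, and uses the rounded 4-second overhead; with that rounding the $100/hr figures land within roughly a tenth of a cent of the table. Function and class names are illustrative.

```python
from dataclasses import dataclass

MS_PER_HOUR = 3.6e6


@dataclass(frozen=True)
class Tier:
    name: str
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


TIERS = [
    Tier("Opus 4", 15.0, 75.0),
    Tier("Sonnet 4", 3.0, 15.0),
    Tier("Haiku 3.5", 0.80, 4.0),
]

# Smoke-fixture parameters quoted in the text (overhead rounded to 4 s).
R, T_IN, T_OUT, OVERHEAD_MS = 0.77, 18_300, 815, 4_000


def task_cost(tier: Tier, reduction: float, overhead_ms: float, wage_per_hour: float) -> float:
    """Per-task cost in USD: token spend plus the value of time spent waiting."""
    token_cost = (
        T_IN * (1 - reduction) * tier.input_per_mtok + T_OUT * tier.output_per_mtok
    ) / 1e6
    time_cost = overhead_ms * wage_per_hour / MS_PER_HOUR
    return token_cost + time_cost


def saving(tier: Tier, wage_per_hour: float) -> float:
    """Baseline cost minus stack-enabled cost; negative means the stack costs more."""
    baseline = task_cost(tier, reduction=0.0, overhead_ms=0.0, wage_per_hour=wage_per_hour)
    with_stack = task_cost(tier, reduction=R, overhead_ms=OVERHEAD_MS, wage_per_hour=wage_per_hour)
    return baseline - with_stack


for tier in TIERS:
    print(f"{tier.name:10s} $0/hr: {saving(tier, 0):+.3f}   $100/hr: {saving(tier, 100):+.3f}")
```

Swapping in a different R, overhead, or wage shows how quickly the sign of the $100/hr column flips on the cheaper tiers.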

Equivalently: there is a break-even input size below which the stack costs more than it saves. The break-even is small on Opus, larger on Sonnet, and large enough on Haiku to dominate most everyday tasks:

Tier	Break-even T_in at $50/hr	Break-even T_in at $100/hr	Break-even T_in at $200/hr
Opus 4	~4,800 tokens	~9,500 tokens	~19,000 tokens
Sonnet 4	~23,800 tokens	~47,600 tokens	~95,200 tokens
Haiku 3.5	~89,200 tokens	~178,400 tokens	~356,900 tokens
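The break-even input size follows from setting the token saving equal to the time cost: T_in × R × price_in = Overhead × wage ⁄ (3.6 × 10⁶). A minimal sketch, assuming R = 77% and the rounded 4-second overhead, lands within a couple of percent of the table. The small gap is consistent with the table using an unrounded overhead slightly under 4 s, which is an inference and not something the source states.

```python
MS_PER_HOUR = 3.6e6
R = 0.77             # context reduction from the smoke fixture
OVERHEAD_MS = 4_000  # rounded prep overhead

INPUT_PRICE_PER_MTOK = {"Opus 4": 15.0, "Sonnet 4": 3.0, "Haiku 3.5": 0.80}


def break_even_t_in(input_per_mtok: float, wage_per_hour: float) -> float:
    """Baseline input tokens at which saved token spend equals the time cost."""
    time_cost = OVERHEAD_MS * wage_per_hour / MS_PER_HOUR
    saving_per_input_token = R * input_per_mtok / 1e6
    return time_cost / saving_per_input_token


for tier, price in INPUT_PRICE_PER_MTOK.items():
    row = ", ".join(
        f"${wage}/hr: ~{break_even_t_in(price, wage):,.0f}" for wage in (50, 100, 200)
    )
    print(f"{tier:10s} {row}")
```

Because the break-even scales linearly with wage and inversely with the input price, doubling the wage or moving to a tier that is five times cheaper shifts the threshold by the same factor.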

The practical reading: the stack is a near-pure win at the Opus tier and at any tier when the baseline task is large. It is a defensible call on Sonnet for typical multi-file agentic tasks. It is usually a net loss on Haiku unless the baseline is unusually large or developer time is unusually cheap. The 75.5% context-reduction figure is real and does not change. What changes is the framing: which tier you are on, and how you value your own time, decide whether that 75.5% translates into dollars or into 4-second pauses.

Two honest caveats on this analysis. The R, T_in, and overhead values used in the tables come from a 3-paired-run smoke fixture in the eval harness, not from the full 12-task dogfood measurement; they illustrate the model and are not a substitute for the headline measurement. And the price constants are public 2026-05 Anthropic API rates; the same equation applies, with different constants, to other vendors. A reproducibility-grade open eval harness with bootstrap confidence intervals, per-MCP ablation, and a --cost-model projection mode for plugging in any reader's own R, Overhead, and wage values is in active development and will be published alongside the next measured iteration.

Comparison

How does it compare to the named alternatives?

There are eight named classes of tooling in the token-economy-adjacent space. The stack overlaps with each, and loses to each on a different axis. The table below positions it directly.

Class	Named products	Where this stack is different
Provider-native primitives	Anthropic (caching, Tool Search, code execution, context editing, memory), OpenAI (caching, Batch, Predicted Outputs), Gemini context caching	Runs upstream of the model. Compounds with these — does not replace them.
Prompt compression (LLM-based)	Microsoft LLMLingua / LongLLMLingua / LLMLingua-2¹¹, Spectyra (OSS)	Parser-first and deterministic, no LLM cost. Less aggressive on dense prose. A complementary trade-off.
AI-coding "save tokens" listicles	computingforgeeks (20–43% with 10 tools), mindstudio.ai (5 skills, 70%), Medium 10-tips posts, the Reddit I cut Claude Code's token usage by 68.5% thread	One-command integrated install with a shared MCP contract across four IDEs. Not a bundle of unrelated tips.
Repo retrieval / code search	Cursor codebase index ($20/mo), Sourcegraph Cody, Continue, Aider repo-map (OSS), Greptile, Sweep, Augment	retrieval-mcp is deterministic local (ripgrep + path scoring + line anchors), no embeddings in v1 by deliberate scope. Closest twin: Aider repo-map.
Web / page extraction	Firecrawl ($16/mo+), Jina Reader (r.jina.ai), Bright Data, Apify, ScrapFly, Browserbase + Stagehand, SerpAPI	context-prep-mcp runs locally, parser-first. Different cost structure.
Model routers	Martian (~$1.3B val), OpenRouter auto, RouteLLM (OSS)¹², Not Diamond, Unify, Aurelio semantic-router; gateways Portkey, Helicone, LiteLLM, Cloudflare AI Gateway, Vercel AI Gateway	router-lite-mcp is a prep-trigger gate (decide whether to call a prep MCP at all). Different, upstream layer.
LLM cost observability	Helicone, Langfuse, Portkey, Lunary, LangSmith, Braintrust	agent-trace-mcp and per-MCP metrics work for local self-host. The commercial edge is hosted dashboards.
Visual / browser-trace tooling	Jam, Marker.io, Applitools, Percy, Chromatic TurboSnap	vision-mcp and playwright-trace-mcp are agent-token-budget-optimized. Different consumer (the agent, not a human reviewer).

Two clarifications worth making for the technical reader. The first is that the stack does not compete with provider-native primitives. They occupy different layers. Pre-model prep reduces the payload before it reaches the model. Caching reduces what you pay for that payload on the second, third, and fourth turns. They are friends.

The second is that the stack does not replace a model router. If you need to dispatch a request to the cheapest sufficient model (Haiku vs Sonnet vs Opus, or routing between providers), that is a Martian-or-RouteLLM-shaped problem. The HWAI Context Router operates upstream of that decision: it decides whether the request needs prep tooling at all, before anything else happens.

A third clarification, on disclosure and reproducibility. The article's framing rests on two axes: byte saving and cache-friendliness (output byte-determinism across runs of the same input). How often do the named alternatives discuss both axes publicly? The table below summarises what each tool's own public documentation claims, as a defence against the “we cherry-picked the lens” critique.

Tool	Public byte-saving claim	Cache-friendliness discussed?	Reproducibility
Anthropic prompt caching	Up to 90% cost reduction on cached tokens (engineering blog).	Yes — this IS the cache layer; the stable-prefix rule is built into the docs.	Hosted; cannot self-verify, but the usage block exposes cache token counts per call.
Microsoft LLMLingua / LLMLingua-2¹¹	Up to 20× compression on QA datasets in the original papers.	Not in the original papers. LLM-in-the-loop compression is by design non-deterministic.	OSS, paper datasets, repeatable; but second-axis verification falls to the user.
Cursor codebase index	Marketing copy on token-waste reduction; no specific number disclosed.	Not in public docs.	Hosted, closed-source; cannot replicate measurements.
Sourcegraph Cody	Marketing copy on context relevance; no specific number disclosed.	Not in public docs.	Partially open-source, hosted index; partial replication possible.
Aider repo-map	Token-budget map; methodology in README + paper.	Not as a primary axis. Map output is deterministic by construction.	OSS, fully replicable on any repo.
Firecrawl / Jina Reader	Marketing-grade claims on cleaner markdown.	Not in public docs.	Mixed: Firecrawl has self-host, Jina Reader is hosted-only.
RouteLLM¹² / Martian	Cost reduction via model routing (different axis from compression).	N/A: orthogonal to caching; routes to cheaper models, doesn't reshape the prompt.	RouteLLM OSS + paper; Martian hosted/closed.
This stack	75.5% on a 12-task local dogfood eval (the article's headline figure, with the methodology disclosed in the linked sources).	Yes, as the second primary axis. Local-first compressors are deterministic by construction; one of them was a counter-example caught and fixed (see “Beyond byte savings” section).	OSS public benchmark tools at g-shevchenko/mcp-token-savers/benchmark; bring your own corpus.

The honest summary: the two-axis lens is not original to this article. It is canon in the agents-best-practices reference on GitHub. What is original here is running it through one specific stack and reporting both axes honestly, including the counter-example. Most of the named alternatives publish one axis; only a few publish both. A reader can apply the same two-axis test to any of them with the public benchmark tools linked above.

Benchmark framing

Where does Terminal-Bench fit? Quality and efficiency are different axes

The natural question after "does this save tokens" is "does it make my agent worse at completing tasks?" The answer to the second question lives on a different benchmark.

Terminal-Bench is a Stanford × Laude Institute collaboration.¹⁰ The v1.0 release ships 80 hand-crafted terminal tasks; v2.0 ships 89; v2.1 is current; v3.0 is in development. Tasks include building a Linux kernel from source, configuring a git server connected to a webserver, generating a self-signed OpenSSL certificate with strict requirements, training a fasttext model under accuracy and size constraints, and resharding a c4 data slice with rigorous correctness checks. Each task runs in a Docker container terminal sandbox.

The current leader on the Terminal-Bench 2.0 leaderboard is Claude Sonnet 4.5 at a 0.500 task-resolution score. Other 2026 frontier models cluster below.

What Terminal-Bench measures: whether the agent completed the task.

What it does not measure: how many tokens it spent doing so.

Those are orthogonal axes. A prep stack does not compete on Terminal-Bench, because it does not improve raw model capability. It is an efficiency multiplier applied on top of whatever Terminal-Bench score your chosen agent already has. You pick the agent for capability. You pick the prep layer for efficiency. If your agent already solves the task, the prep layer lets it solve the same task with fewer tokens on the next iteration.

That distinction is why "the 17-MCP stack" and "Claude Sonnet 4.5 leads Terminal-Bench" can both be true and both be useful, and neither one makes the other irrelevant.

How to try it

How do you actually install and run it?

Five commands. The order matters — the first three are the inspect-first trust path:

Clone the repository.

git clone https://github.com/g-shevchenko/hwai-mcp-stack.git

Enter the directory.
```
cd hwai-mcp-stack
```
Run the agent preinstall check — confirms expected write targets, scans for forbidden installer patterns, validates the trust manifest.
```
bash scripts/agent-preinstall-check.sh
```
Dry-run the installer — shows exactly what will be written, to where, without writing anything.
```
bash install.sh --dry-run
```
Install.
```
bash install.sh
```

The same install path is also available as a single shell command, as an alternative to the five steps above:

HWAI_MCP_PROFILE=full HWAI_MCP_CLIENTS=codex,cursor \
  /bin/bash -lc "$(curl -fsSL https://raw.githubusercontent.com/g-shevchenko/hwai-mcp-stack/main/install.sh)"

For repeatable, CI, or team-standardized installs, pin to a commit SHA (substitute the SHA you want). The README publishes pinned-install examples:

HWAI_MCP_BRANCH=76540dcfbcd12284fc2b783d22c5c091624eaf82 \
  /bin/bash -lc "$(curl -fsSL https://raw.githubusercontent.com/g-shevchenko/hwai-mcp-stack/76540dcfbcd12284fc2b783d22c5c091624eaf82/install.sh)"

Requirements: macOS or Linux shell, git, Node.js, and npm. After install, restart Claude Code, Codex, Cursor, or Windsurf, or open a fresh chat so the stdio MCP configs reload.

The first thing to do after install is run the bundled doctor to confirm all 17 services are healthy:

~/.hwai/hwai-mcp-stack/mcp/bin/hwai-mcp.mjs doctor \
  --manifest="$HOME/.hwai/hwai-mcp-stack/mcp/manifest.json" \
  --source-root="$HOME/.hwai/hwai-mcp-stack/mcp/source" \
  --profile=full

The expected output for the full profile is services: 17, ok: 17, needs_attention: 0, warnings: 0.

For trust verification before any install, read TRUST.md, VERIFY_BEFORE_INSTALL.md, and the machine-readable trust/hwai-mcp-stack.trust.json. The agent preinstall script encodes the same checks in executable form.

Reproducibility

Measure your own stack with the same harness

The strongest defence against the “your benchmark, your numbers, can't reproduce” critique is to ship the harness. The benchmark/ directory in the public stack repository contains three small Python primitives (stdlib only, no network, no external dependencies) that let any reader measure any compressor under the same two-axis lens used here:

c2_benchmark.py — byte-saving primitive with bar verdict (CV + ratio thresholds, tunable)
cache_metrics.py — cache-friendliness primitive (md5-stability across N runs)
anti_pattern_audit.py — 12-DSA-anti-pattern grep audit against any source tree

Plus a CLI runner (run_bench.py), 5 neutral example fixtures (prose, log, markdown, JSON, stack trace), and 23 unit tests. Wrap any compressor as a (str) -> str function and run:

git clone https://github.com/g-shevchenko/mcp-token-savers.git
cd mcp-token-savers/benchmark
python3 run_bench.py --compressor first200 --fixtures examples/fixtures.jsonl --repeat 5
python3 anti_pattern_audit.py --mcp path/to/your-mcp/src
python3 -m pytest tests/ -q

The harness is intentionally minimal: no proprietary fixtures, no measured numbers, no internal MCPs. Bring your own corpus and your own compressor; produce numbers comparable to the ones in this article. The two-axis framing is canon in DenisSergeevitch/agents-best-practices; the harness is one stripped-down implementation of it.
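To make the two axes concrete, here is a standalone sketch of a `(str) -> str` compressor together with a byte-saving measure and an md5-stability check. It does not use or reproduce the harness's own code: the behaviour of `first200` is guessed from its name, and `byte_saving`, `is_cache_stable`, and `unstable` are hypothetical helpers written for this example.

```python
import hashlib
import itertools
from typing import Callable

Compressor = Callable[[str], str]


def first200(text: str) -> str:
    """Toy compressor: keep the first 200 characters (behaviour assumed from the name)."""
    return text[:200]


def byte_saving(compress: Compressor, text: str) -> float:
    """Axis 1: fraction of UTF-8 bytes removed by the compressor."""
    original = len(text.encode("utf-8"))
    compressed = len(compress(text).encode("utf-8"))
    return 1 - compressed / original


def is_cache_stable(compress: Compressor, text: str, runs: int = 5) -> bool:
    """Axis 2: the output must be byte-identical across repeated runs."""
    digests = {
        hashlib.md5(compress(text).encode("utf-8")).hexdigest() for _ in range(runs)
    }
    return len(digests) == 1


_run_counter = itertools.count()


def unstable(text: str) -> str:
    """Counter-example: a per-run marker changes the bytes and defeats prefix caching."""
    return f"{text[:200]} #run{next(_run_counter)}"


sample = "ERROR build failed at step 7\n" * 100  # 2,900 bytes of repetitive log

for name, fn in (("first200", first200), ("unstable", unstable)):
    print(f"{name:9s} saving={byte_saving(fn, sample):.1%} stable={is_cache_stable(fn, sample)}")
```

Both toy compressors save a similar share of bytes, but only the deterministic one produces a prefix that a provider-side cache can reuse.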

Limits

What does this not do?

Five honest limits.

It is not a coding agent. Claude Code, Codex, Cursor, and Windsurf are the agents. The stack is the prep layer they call.
It does not replace prompt caching, the Batch API, or context editing. Those are Layer A. The stack is Layer B. They compound.
It does not replace human review on architecture-level decisions. Token savings are not a substitute for engineering judgment.
It does not claim to make any model smarter. Terminal-Bench scores depend on the agent and model you pick. The stack is orthogonal to that.
The 75.5% number is the README's own dogfood across 12 reviewed-public tasks. It is an internal eval, not an industry leaderboard. The README explicitly says "do not claim a universal percentage reduction from the public README." Savings depend on repo size, task type, and how much raw context your agent would otherwise paste in. The right next step is to install the stack and measure your own before / after on your own workload.

Methods upgrade

How is this being upgraded from engineering note to research paper?

The headline 75.5% comes from an internal dogfood eval (N=12, single rater = author, single codebase). That is honest engineering evidence, not a publishable measurement. Treating the article as a research paper means closing the gap between the two. Seven numbered gaps, plus a cost-and-latency item (G5b), are tracked explicitly, each with a concrete mitigation plan and the layer of methodology it changes:

Gap	What the current eval lacks	How the next iteration closes it
G1 — independence	The author measured the author's own stack. No blinding.	Every response scored by two LLM judges of different ancestry (Anthropic Claude Sonnet + an OpenAI-class model via the open gateway). Cohen's κ between the two judges reported. Human raters invited at replication (P4) rather than blocked on availability.
G2 — statistical power	N=12 paired tasks, no confidence intervals, no p-values.	Expansion to N=60 paired tasks. Bootstrap 95% CI on every aggregate metric (10 000 resamples). Welch's two-tailed t-test on per-task reductions. Target p < 0.05 on the primary metric.
G3 — ablation	The 12-task result conflates all 17 MCPs into one number. Per-MCP contribution unknown.	Per-profile ablation matrix: none / core / repo / full on every task, plus per-MCP toggle for the top suspected contributors. Output: minimum viable profile that preserves ≥90% of the headline reduction.
G4 — task selection bias	12 tasks chosen by the stack author. Plausibly favored the stack.	Tasks drawn from external public OSS codebases the author does not maintain. Acceptance criteria written before any agent runs. Held out from any prior tuning.
G5 — single codebase	All 12 tasks ran against the author's monorepo.	Golden task set covers three independent codebases (one Python web framework, one Python SDK, one JavaScript library) drawn from different ecosystems and authors.
G5b — cost & latency	Tokens were measured, dollars were not. Overhead was not folded into the headline claim.	Done — see the cost-model section above. Formal equation, three-tier price table, break-even-input-size table, projection mode for any reader's own parameters.
G6 — comparison baseline	Only "same agent, stack on vs off" was compared.	Three additional baselines on the same task set: no MCPs at all, web-fetch only, and a hand-crafted prompt without MCP context. Cursor and Windsurf as fourth and fifth baselines in a follow-up release.
G7 — methodology not published	Eval procedure described in prose, not reproducible from the article alone.	Eval harness, golden task set, judge prompts, statistical helpers, and price tables all open-sourced as one repository. A --cost-model projection mode lets any reader plug in their own R, Overhead, and wage assumptions without rerunning the full eval.

What is already shipped: the formal cost model and break-even analysis (G5b), and the harness itself with bootstrap CI, Welch's t-test, Cohen's κ, per-MCP ablation, LLM-judge integration, and the price-tier table. What remains for the next iteration: data collection on three external codebases, the per-MCP ablation run, two-judge agreement, and the publication of the harness as a standalone repository for independent replication. The reader-facing claim is unchanged: the prep-first approach measurably reduces input tokens. The framing around that claim is what is being upgraded.
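For readers unfamiliar with the bootstrap step named in G2, a percentile bootstrap on the mean can be sketched with the standard library alone. The per-task reductions below are invented placeholder values, not the eval's data, and `bootstrap_ci` is an illustrative helper, not the harness's function.

```python
import random
import statistics


def bootstrap_ci(values, resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(resamples)
    )
    tail = (1 - confidence) / 2
    lower = means[round(tail * resamples)]
    upper = means[round((1 - tail) * resamples) - 1]
    return lower, upper


# Hypothetical per-task context reductions for 12 tasks (placeholder values only).
reductions = [0.35, 0.62, 0.71, 0.78, 0.80, 0.81, 0.55, 0.74, 0.69, 0.77, 0.66, 0.79]

mean = statistics.fmean(reductions)
low, high = bootstrap_ci(reductions)
print(f"mean reduction {mean:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

With only 12 tasks the interval is wide, which is the practical argument for the planned expansion to N=60.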

Preview: the methodology mitigations have started. Three follow-up cohorts (cross-vendor Codex at N=17, gate-default Claude Code at N=3, and matched-conditions Claude Code at N=9, counted before runs that hit the execution cap are excluded) have now been measured on the same eval harness. The full per-cohort comparison, and what it means for the 75.5% headline, is in the Portability section below.

Honest disclosure on a downstream-cache analysis that did not reach significance. A second data layer was probed: provider-side cache_read_input_tokens from the same eval harness, with the question “is the prep layer also lifting downstream cache reuse, not just shrinking the input?” The descriptive means lean slightly toward the stack-enabled condition, but a Welch t-test (n=70 vs n=51, p ≈ 0.32) and a paired t-test on matched tasks (n=40, p ≈ 0.21) fail to reject the null at standard thresholds. Cohen's d is in the negligible range. A larger cohort — roughly n ≈ 500 per condition for 80% power at the observed effect size — would be needed to test the hypothesis with adequate power. The byte-saving finding (the article's headline) is independent of this and remains supported by the deterministic compressor measurements. The downstream-cache claim is, at present, an under-powered descriptive trend — not a measured effect.
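The sample-size figure can be sanity-checked with the usual normal-approximation formula for a two-sample comparison, n per group ≈ 2 (z₁₋α/₂ + z_power)² ⁄ d². The sketch assumes a two-sided α of 0.05 and equal group sizes; the source does not state the observed d, so the code only shows which effect size a cohort of about 500 per condition would correspond to.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate per-group sample size for a two-sided two-sample test."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return 2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2


def detectable_d(n: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest Cohen's d detectable with the given per-group n and power."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return sqrt(2 * (z_alpha + z_power) ** 2 / n)


print(f"n per group for d = 0.20: {n_per_group(0.20):.0f}")
print(f"d detectable with n = 500 per group: {detectable_d(500):.3f}")
```

Under this approximation, about 500 per condition corresponds to an effect size a little under 0.2, which matches the description of the observed effect as negligible.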

Portability

Why the 75.5% number is precise but the conclusion isn't yet portable

The 75.5% headline is a real measurement on a real task set: twelve tasks on this author's own codebase, Claude Code (Sonnet-class) as the agent, with tool calls allowed to fire. The arithmetic is reproducible from the eval harness in this repository. What it cannot do, at N=12, is generalize. Three cohorts of follow-up data, measured under explicit conditions, make the per-task variance visible:

Cohort	Agent & permission mode	N	Per-task mean	Pooled	Median
Original (README dogfood)	Claude Code, interactive (tools fire)	12	+75.5%	not re-measured	not re-measured
A — cross-vendor	Codex (OpenAI), `--full-auto`	17	−26.1%	−9.5%	−2.4%
B — gate-default	Claude Code, `--print` (asks)	2	−6.4%	0.0%	−6.4%
B′ — matched conditions	Claude Code, `--permission-mode bypassPermissions`	6	−80.8%	−2.4%	−16.6%

Read the robust columns, not the per-task mean. A mean of per-task percentage reductions is wildly outlier-sensitive: when the no-MCP agent answers a task in eighteen input tokens but the MCP agent first loads a multi-thousand-token tool catalog, that single task scores roughly −16,000% and swamps the average. That is why the per-task-mean column swings from −26% to −81% while the pooled figure (total tokens saved ÷ total baseline tokens, immune to tiny denominators) and the median task both sit near zero. Measured this way the honest result is that matched-conditions MCP prep changed input tokens by essentially nothing — cohort B′ is pooled −2.4%, median −16.6%, and −2 input tokens per task in absolute terms with a confidence interval that straddles zero. The cross-vendor Codex cohort paid a modest −9.5% pooled penalty, about +16,800 input tokens per task in absolute terms, because Codex loads the catalog into the input window rather than the cache. The N column is corrected too: runs that hit the ten-minute execution cap report zero tokens (a spurious “100% reduction”) and are excluded, which is why the matched cohort is six valid pairs, not nine.
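The outlier sensitivity is easy to reproduce. The sketch below uses five invented (baseline, with-stack) input-token pairs, one of which mimics the eighteen-token task described above; none of the numbers are from the eval.

```python
import statistics


def reduction(baseline_tokens: float, stack_tokens: float) -> float:
    """Fractional input-token reduction; negative means the stack used more."""
    return (baseline_tokens - stack_tokens) / baseline_tokens


# Hypothetical (baseline, with-stack) input tokens per task.
pairs = [
    (18, 2_900),       # tiny baseline, catalog loaded anyway
    (40_000, 38_000),
    (25_000, 26_000),
    (60_000, 57_000),
    (12_000, 12_500),
]

per_task = [reduction(b, s) for b, s in pairs]

mean_reduction = statistics.fmean(per_task)
median_reduction = statistics.median(per_task)
pooled_reduction = reduction(sum(b for b, _ in pairs), sum(s for _, s in pairs))

print(f"per-task mean: {mean_reduction:+.1%}")
print(f"median:        {median_reduction:+.1%}")
print(f"pooled:        {pooled_reduction:+.1%}")
```

The single tiny-denominator task drags the per-task mean to roughly −3,200%, while the median (−4%) and the pooled figure (about +0.5%) both stay near zero.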

The deeper reason the input-token metric reads near zero: on these tasks the prompt is largely served from Anthropic's prompt cache, so the non-cached input barely differs between conditions. The real token economics live in the cache — the catalog-overhead floor described next — which an input-token-reduction metric does not capture.

Two methodological findings emerged from this measurement.

1. Permission-mode asymmetry. Codex with --full-auto auto-approves tool calls; Claude Code with --print in its default permission mode does not. In a non-interactive runner, gate-default Claude Code simply asks for approval and stops — 8 of 9 cohort-B runs (3 tasks × 3 profiles) opened with “I need approval to fetch”. The R metric is therefore not a property of the MCP stack alone; it is a property of (stack × agent × permission mode × task set). The original 75.5% measurement was taken in an interactive Claude Code session where tool calls fired, which the cohort-B′ row matches by design.

2. A measurable cache-overhead floor exists. The MCP tool catalog itself loads into Anthropic's prompt cache before any tool fires. Per-task delta between profile=none and profile=full ranges from +212k to +638k cache tokens across measured tasks, scaling with task complexity. At Sonnet 3.5 cache-read pricing of $0.30/Mtok, this is on the order of $0.06–$0.19 per task of catalog overhead independent of tool-firing behavior. Per-MCP estimate: ~30k cache tokens, ~$0.009 per task per MCP. This floor is the empirical complement to the cost-model section above: at small task sizes, the catalog floor dominates whatever savings tool use would have produced.
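The dollar figures in the second finding follow directly from the quoted cache-read price. The sketch below redoes that arithmetic; the constants are the numbers quoted above, while the function names and the assumption that the floor scales linearly with the number of MCPs are illustrative.

```python
CACHE_READ_USD_PER_MTOK = 0.30  # cache-read price quoted in the text
CACHE_TOKENS_PER_MCP = 30_000   # per-MCP estimate quoted in the text


def cache_cost_usd(cache_tokens: int,
                   usd_per_mtok: float = CACHE_READ_USD_PER_MTOK) -> float:
    """Cost of reading `cache_tokens` tokens from the prompt cache."""
    return cache_tokens / 1_000_000 * usd_per_mtok


def catalog_floor_usd(mcp_count: int,
                      tokens_per_mcp: int = CACHE_TOKENS_PER_MCP) -> float:
    """Per-task catalog overhead, assuming it grows linearly with MCP count."""
    return cache_cost_usd(mcp_count * tokens_per_mcp)


low = cache_cost_usd(212_000)        # smallest measured per-task delta
high = cache_cost_usd(638_000)       # largest measured per-task delta
per_mcp = catalog_floor_usd(1)       # one MCP
full_stack = catalog_floor_usd(17)   # all 17 MCPs under the linear assumption

print(f"measured range: ${low:.3f} to ${high:.3f} per task")
print(f"per MCP: ${per_mcp:.3f}, 17 MCPs: ${full_stack:.3f}")
```

Under the linear assumption, 17 MCPs come to about 510k cache tokens per task, which falls inside the measured +212k to +638k range.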

What this means for the headline. The 75.5% measurement remains valid as what was measured. But the three follow-up cohorts (17, 2, and 6 valid pairs, against the original 12) show per-task variance from −355.6% to +58.2%, with confidence intervals that span both substantial savings and substantial regression. The honest conclusion is that N=12 was almost certainly too small for a portable aggregate, and the next iteration of this work focuses on getting N≥30 paired tasks measured under matched conditions, with the harness published for independent replication. The goal is not to defend the 75.5%; it is to give readers the data they need to decide whether the stack will save them tokens on their workload.

Reproducibility note. All eval data above is available in this repo at scripts/mcp-token-eval/results/ (Codex cohort A), scripts/mcp-token-eval/results-cc/ (gate-default cohort B), and scripts/mcp-token-eval/results-cc-bypass/ (matched-conditions cohort B′). The harness is at scripts/mcp-token-eval/harness.mjs and the 60-task golden set at scripts/mcp-token-eval/tasks/golden-v0.2.json. Confidence intervals are bootstrap estimates (10,000 resamples) computed with a fixed seed, so they reproduce exactly; re-run with a different MCPEVAL_BOOTSTRAP_SEED to confirm an interval is stable rather than an artifact of one resample stream.
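For readers unfamiliar with seeded bootstrap intervals, a minimal sketch follows. It is not the code in harness.mjs: it assumes a simple percentile bootstrap over the mean, the per-task values are hypothetical, and the way the seed is read from MCPEVAL_BOOTSTRAP_SEED here is a guess at the mechanism rather than the harness's actual behavior.

```python
import os
import random
from statistics import mean


def bootstrap_ci(values, resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean; a fixed seed makes it repeatable."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the result
    n = len(values)
    estimates = sorted(
        mean(values[rng.randrange(n)] for _ in range(n))
        for _ in range(resamples)
    )
    lo_idx = int(resamples * alpha / 2)
    hi_idx = min(resamples - 1, int(resamples * (1 - alpha / 2)))
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical per-task reductions in percent, not rows from the results files.
per_task = [4.1, -16.6, 0.0, -2.4, 12.3, -8.0]

# Assumed convention: take the seed from the environment, else use a default.
seed = int(os.environ.get("MCPEVAL_BOOTSTRAP_SEED", "12345"))

first = bootstrap_ci(per_task, seed=seed)
second = bootstrap_ci(per_task, seed=seed)
print(first, first == second)
```

Running with the same seed twice returns the same interval. Running with a different seed should move the bounds only slightly if the interval is stable.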

Second primary axis

Beyond byte savings: cache-friendliness

Byte savings is necessary but not sufficient. The second metric that decides production cost is whether the compressor's output is byte-identical across runs of the same input. Identical output lets the downstream provider's prefix cache reuse work from prior turns — turning a measured byte saving into a real cost reduction. Non-deterministic output defeats the cache: every turn looks like a fresh prompt to the provider and pays the full prefill again, often eating the byte saving outright.

This framing is standard in the agents-best-practices reference on GitHub (MIT, 1k+ stars, a provider-neutral synthesis of OpenAI / Anthropic / MCP guidance). Its core rule: stable prefix, dynamic suffix. Tool definitions and static instructions appear first in deterministic order; dynamic runtime state (current timestamp, request ID, fresh observations) appears at the end. Any volatile value injected before a stable block destroys the cache for every downstream turn that shares the prefix.
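The stable-prefix, dynamic-suffix rule can be sketched as a prompt builder. This is an illustration under assumptions, not code from the stack or from the referenced guide: build_prompt, the tool dictionaries, and the prompt layout are all hypothetical.

```python
import json
from datetime import datetime, timezone


def build_prompt(static_instructions, tools, observations, now=None):
    """Return (stable_prefix, dynamic_suffix) for one turn."""
    # Stable prefix: instructions plus tool definitions in a deterministic
    # order and canonical JSON formatting, so the bytes never vary.
    ordered_tools = sorted(tools, key=lambda tool: tool["name"])
    tool_block = json.dumps(ordered_tools, sort_keys=True, separators=(",", ":"))
    stable_prefix = f"{static_instructions}\n{tool_block}"

    # Dynamic suffix: volatile values go last, after everything cacheable.
    now = now or datetime.now(timezone.utc)
    dynamic_suffix = "\n".join([f"timestamp: {now.isoformat()}", *observations])
    return stable_prefix, dynamic_suffix


tools_a = [
    {"name": "search", "description": "Find files"},
    {"name": "compact", "description": "Shrink context"},
]
tools_b = list(reversed(tools_a))  # same tools, different arrival order

prefix_1, suffix_1 = build_prompt(
    "You are a coding agent.", tools_a, ["observation one"],
    now=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
prefix_2, suffix_2 = build_prompt(
    "You are a coding agent.", tools_b, ["observation two"],
    now=datetime(2026, 1, 2, tzinfo=timezone.utc),
)

print(prefix_1 == prefix_2)  # the prefix is unaffected by tool order or time
print(suffix_1 == suffix_2)  # only the suffix changes between turns
```

Sorting the tools and pinning the JSON formatting means the prefix depends only on the content of the instructions and tool definitions, which is the property a prefix cache needs.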

The local-first compressors used here are deterministic by construction. Same input + same parameters → byte-identical output, every run. That is the necessary condition for downstream cache reuse, and it composes with the byte-saving measurement rather than trading against it. Implementations that satisfy this include parser-first text compaction, regex-driven fact extraction, and extractive section selection — anything that does not place an LLM in the compression path. Stochastic compressors that re-rank or re-summarise via a model call can hit higher peak byte savings on a single run, but break the second axis: their output varies across runs and the cache reuse drops to zero.

Twelve well-documented anti-patterns can destroy cache hit rate even when the compressor itself is deterministic. They include placing timestamps or request IDs in the stable prefix, randomising tool order, rewriting conversation history every turn, re-summarising the whole session, and changing schema formatting without versioning. The full enumerated list lives in the cache & cost reference on GitHub; audit any tool surface that ships a system prompt or stable tool description against it.

An example from our own stack made this concrete enough to fix. One of the local-first compressors we measured turned out to be a counter-example to its own thesis: it was the highest-leverage byte-saver of every candidate we tested, by a wide margin, but two runs of the same query on the same workspace produced different bytes. The culprit was unsurprising once we localised it: an underlying file-discovery tool didn't guarantee a stable output order, and that order leaked through a Map insertion sequence into the final compact context, changing both the set and the order of references between runs. A surgical two-line patch (sort the file list before slicing; sort the Map entries by path before the final slice) produced byte-identical output across runs without changing the byte saving. The compressor moved from “biggest saver, least cache-friendly” to “biggest saver and fully cache-friendly.” The lesson generalises: the right test for any compressor in your stack is the same two-axis measurement, applied first to your own components.
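The failure mode and the shape of the fix can be shown in a toy reconstruction. This is not the patched compressor itself, which used a JavaScript Map: a Python dict stands in here because it also preserves insertion order, and the function names, file names, and reference strings are hypothetical.

```python
import hashlib


def compact_unpatched(discovered_paths, refs_by_path, limit):
    """Discovery order leaks into both the selected set and the output order."""
    files = discovered_paths[:limit]
    by_path = {path: refs_by_path[path] for path in files}  # insertion order = discovery order
    entries = list(by_path.items())[:limit]
    return "\n".join(f"{path}: {ref}" for path, ref in entries)


def compact_patched(discovered_paths, refs_by_path, limit):
    """Same logic with the two sorts applied before each slice."""
    files = sorted(discovered_paths)[:limit]                 # sort before slicing
    by_path = {path: refs_by_path[path] for path in files}
    # Redundant in this toy, kept to mirror the second line of the patch.
    entries = sorted(by_path.items())[:limit]                # sort entries by path
    return "\n".join(f"{path}: {ref}" for path, ref in entries)


def digest(text):
    """SHA-256 of the output, used as the byte-identity check."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


refs = {"a.py": "def load", "b.py": "def parse", "c.py": "def emit"}
run_1 = ["b.py", "a.py", "c.py"]  # discovery order on one run
run_2 = ["c.py", "a.py", "b.py"]  # a different order on the next run

unstable = digest(compact_unpatched(run_1, refs, 2)) == digest(compact_unpatched(run_2, refs, 2))
stable = digest(compact_patched(run_1, refs, 2)) == digest(compact_patched(run_2, refs, 2))
print(unstable, stable)
```

Comparing digests across repeated runs of the same input is a cheap determinism test to put in front of any compressor before measuring its byte savings.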

The honest claim therefore has two axes, not one: byte savings (Layer B compaction) and output determinism (the necessary condition for cache reuse). A compressor that wins one but loses the other is not a production win — it is a single-axis benchmark that hides the other half of the cost.

The formal definitions of those two axes, the cost-model theorem deriving the exact condition under which a moderate-byte-saving high-cache-friendly compressor beats a high-byte-saving low-cache-friendly one, and the statistical analysis with proper confidence intervals on both axes live in the methods/ directory of the public benchmark repo on GitHub. That directory includes the Wilson 95% CI on the cache-friendly score (which is wider than the point estimate suggests at this sample size), a cluster-bootstrap CI on byte-saving, a pre-registered protocol for the next-round expansion, a Gebru et al. datasheet for the corpus, and a Dockerfile for turnkey reproduction. Anyone evaluating these claims rigorously, or running the benchmark on their own stack, should start there.
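To see why a Wilson interval is wider than the point estimate suggests at small N, here is a minimal sketch using the standard Wilson score formula. The run counts are hypothetical, not taken from the benchmark, and the function name is invented for this example.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials
                          + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical: 10 of 10 repeated runs produced byte-identical output.
lower, upper = wilson_interval(10, 10)
print(f"point estimate 1.00, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Even with a perfect 10 out of 10, the lower bound sits near 0.72, so a small determinism sample cannot by itself support a claim of "always cache-friendly".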

FAQ

Frequently asked questions

Q: How much does Claude Code cost per day for a heavy user?

A: The Russian-language VC.ru analysis cites roughly $6/day for an average developer,⁵ which is consistent with the $100–200/month band from English-language 2026 comparisons.¹ Heavy use varies widely with codebase size and how much of your day is in the agent.

Q: Does prompt caching work with MCP tools?

A: Yes. Anthropic's Tool Search Tool was designed specifically not to break prompt caching — that is called out explicitly in the advanced tool use engineering post.⁶ The two stack cleanly.

Q: Is the 98.7% Anthropic claim real?

A: Yes. It is sourced to the official Anthropic engineering blog with the Google Drive → Salesforce example,² and Cloudflare independently arrived at the same architectural conclusion (branded Code Mode).⁸ The 150K → 2K example is in both write-ups.

Q: What is the best free or open-source way to reduce Claude Code tokens?

A: The stack is one such option: MIT-licensed, no API key required for the full profile, local-only by default, one-command install. The official Anthropic cost-management documentation covers the model-side levers — context management, model selection, extended-thinking settings, preprocessing hooks.⁹ The two are complementary.

Q: How does this compare to a paid product like Cursor's codebase index or Sourcegraph Cody?

A: Different scope. Cursor and Cody bundle a coding agent, a UI, and a hosted retrieval index into one product. This stack is the local prep layer your existing agent calls — you keep your editor and your model choice, and the prep happens on your machine before anything reaches the API.

Q: When should a team skip an MCP prep layer?

A: Skip it when the task is already small, exact, and cheap. The Codex N=9 preview on this page showed high variance: some tasks saved tokens, some were flat, and some got worse because MCP overhead exceeded the savings. Measure by task class instead of assuming an MCP stack is always cheaper.

The invitation

Try the stack, measure your own before and after

If your numbers diverge from the README's, the right next step is an issue on the repository — the measure-before-deploy discipline is built into the project posture.

The repository: github.com/g-shevchenko/hwai-mcp-stack.

Sources

References and source notes

Source 1 · agentdeals.dev · 2026 pricing landscape

AI Coding Tools Pricing — comparative aggregates.

April 2026 cross-tool comparison. Cited together with ijonis.com and nxcode.io for the $100–200/month heavy-use band.

Source 2 · Anthropic engineering · 4 Nov 2025

Code execution with MCP: Building more efficient agents.

Canonical primary source for the 98.7% reduction number and the Google Drive → Salesforce example. Authors: Adam Jones and Conor Kelly.

Source 3 · hwai-mcp-stack public README · 30 Apr 2026 dogfood

The 17-MCP local-first Token Efficiency Platform under discussion.

MIT-licensed. The 75.5% / 70.5% / 91.7% → 100.0% numbers come from docs/local-dogfood-eval-2026-04-30.md. Caveats and the no-universal-claim language are reprinted from the README itself.

Source 4 · morphllm.com · 1 Mar 2026

We Tested 15 AI Coding Agents (2026).

Industry round-up cited for the "$100–200/month + token waste from hallucinations" framing.

Source 5 · VC.ru · Russian-language landscape

Saving tokens — tactics that cut Claude Code token spend (in Russian).

Cited for the ~$6/day per-developer Claude Code spend reference and the RU-market framing.

Source 6 · Anthropic engineering · 24 Nov 2025

Advanced tool use.

Tool Search Tool 85% reduction, Opus 4 49→74% and Opus 4.5 79.5→88.1% accuracy gains, the 134K tokens-of-tool-defs internal anecdote.

Source 7 · Anthropic · 29 Sep 2025

Context Management.

Context editing 84% reduction on 100-turn web-search eval; memory tool combined with context editing +39% accuracy over baseline.

Source 8 · Cloudflare

Code Mode.

Cloudflare's independent arrival at the same code-execution-with-MCP pattern. Architectural confirmation alongside Source 2.

Source 9 · Claude Code docs

Manage costs effectively.

Anthropic's canonical reference for model-side cost levers — context management, model selection, extended-thinking settings, preprocessing hooks.

Source 10 · Stanford × Laude Institute

Terminal-Bench.

The agent quality benchmark for terminal tasks. v1.0 (80 tasks), v2.0 (89 tasks), v2.1 current, v3.0 in development. Claude Sonnet 4.5 currently leads at 0.500 task resolution.

Source 11 · Microsoft Research

LLMLingua / LongLLMLingua / LLMLingua-2.

The academic / OSS analog to Layer B prompt compression — model-scored token salience. Complementary trade-off vs the parser-first deterministic approach in context-prep-mcp.

Source 12 · lm-sys

RouteLLM.

OSS model router. Cited as the reference open-source alternative in the model-router class of comparison.

Republished on Medium

Read and share the Medium.com version

Discuss on LinkedIn

Read the LinkedIn.com cross-post and join the thread

Read on VC.ru

Russian version and discussion on VC.ru

Discuss on r/ClaudeAI

Join the thread on r/ClaudeAI

How I cut my Claude Code token usage by 75.5% with 17 local MCPs

What this page covers

What to cite from this page

How much does heavy Claude Code use actually cost in 2026?

What are the two layers of LLM token economy?

Why does the Anthropic pattern matter? Code execution as a tool interface

What is in the 17-MCP local-first stack?

What did the public dogfood eval actually measure?

When does the 75.5% reduction actually save money?

How does it compare to the named alternatives?

Where does Terminal-Bench fit? Quality and efficiency are different axes

How do you actually install and run it?

Measure your own stack with the same harness

What does this not do?

How is this being upgraded from engineering note to research paper?

Why the 75.5% number is precise but the conclusion isn't yet portable

Beyond byte savings: cache-friendliness

Frequently asked questions

Q: How much does Claude Code cost per day for a heavy user?

Q: Does prompt caching work with MCP tools?

Q: Is the 98.7% Anthropic claim real?

Q: What is the best free or open-source way to reduce Claude Code tokens?

Q: How does this compare to a paid product like Cursor's codebase index or Sourcegraph Cody?

Q: When should a team skip an MCP prep layer?

Try the stack, measure your own before and after

References and source notes

Where this page connects inside the site