The problem
How much does heavy Claude Code use actually cost in 2026?
The 2026 pricing for AI-coding tools converged on $20/month as the floor. The ceiling drifted up. April 2026 comparisons from agentdeals.dev, ijonis.com, and nxcode.io put heavy Claude Code use at $100–200/month, Cursor Ultra and Windsurf Max at the same tier, and Devin starting at $20/month with no real free tier.1 The morphllm.com round-up of 15 coding agents from March 2026 puts it bluntly:4 "Heavy Claude Code usage hits $100–200/month. Cursor credits drain unpredictably. Token waste from hallucinations is real money."
Russian-language coverage is consistent. The VC.ru article "Saving tokens — tactics that cut Claude Code token spend" (in Russian) opens with the same number framed in rubles:5 an average developer spends roughly $6/day on Claude Code, which compounds to ~14,000–19,000 RUB/month at a 75–95 RUB/USD rate. None of these write-ups disagree about the bill. They disagree about what to do about it.
The "what to do about it" answers cluster into three categories: switch models, switch tools, and reduce what reaches the model. Switching models is brittle. Switching tools is a high-friction bet. The third option is the one almost nobody implements: change what the agent actually sends to the model on each turn. That third option is where the token-economy stack lives.
Mental model
What are the two layers of LLM token economy?
A useful model for LLM cost optimization has two layers that compound multiplicatively, not interchangeably.
Layer A — provider-native primitives. These ship inside the model API and require no infrastructure. Anthropic prompt caching reduces input cost on cache reads by roughly 90% (with a 5-minute default TTL and 1-hour extended TTL). Tool Search Tool with defer_loading cuts tool-definition token usage by 85% while improving accuracy: Opus 4 climbed from 49% to 74%, Opus 4.5 from 79.5% to 88.1%.6 Anthropic's context management features report 84% token reduction on a 100-turn web-search eval (context editing alone), and the memory tool combined with context editing delivers a 39% accuracy improvement over baseline.7 Anthropic's Message Batches API takes 50% off async work. OpenAI's prompt caching does the same job automatically on repeated prefixes, advertised as 80% latency and 90% input-cost reduction. Google Gemini ships context caching, both implicit (Gemini 2.5+) and explicit (guaranteed savings).
Layer B — pre-model prep. This sits upstream of the model. It reduces what reaches Layer A in the first place. A 30,000-token build log compacted to 3,000 tokens before the model ever sees it. A 100-megabyte screenshot reduced to a 30-kilobyte annotated crop. A repo-wide grep that returns five line-anchored files instead of forty-three full file bodies.
Most "10 tips to save Claude Code tokens" articles cover Layer A. Few cover Layer B, and almost nobody combines them on purpose. The interesting math is in the combination. If Layer B turns a 30k log into 3k, that is a 10× reduction. If Layer A then cache-hits that 3k across the next four iterations, you save another 90% on the cache reads. The two effects multiply.
That multiplication is the actual reason heavy users are surprised by their bills. They are paying for the uncached, unfiltered, full-fat payload their agent shipped to the model the first time around, on every turn, without any pre-model prep. A 17-MCP local stack is one way to fix the Layer B half of that equation.
Pattern · canonical primary source
Why does the Anthropic pattern matter? Code execution as a tool interface
The architectural pattern behind Layer B has a canonical primary source: Anthropic's engineering post Code execution with MCP: Building more efficient agents, published November 4, 2025, by Adam Jones and Conor Kelly.2
Their example is concrete. An agent is asked to download a meeting transcript from Google Drive and attach it to a Salesforce lead. The standard pattern loads both MCP tool definitions upfront, then the model calls each tool in sequence and the full transcript flows through the model context twice — once on read, once on write. For a 2-hour meeting, that is roughly 50,000 extra tokens passing through the context window for no reasoning benefit.
The alternative presents MCP servers as a code API rather than direct tool calls. The agent writes:
import * as gdrive from './servers/google-drive';
import * as salesforce from './servers/salesforce';
const transcript = (await gdrive.getDocument({ documentId: 'abc123' })).content;
await salesforce.updateRecord({
objectType: 'SalesMeeting',
recordId: '00Q5f000001abcXYZ',
data: { Notes: transcript }
});
The tool definitions are discovered on-demand by reading the filesystem (one server per directory, one tool per file). The transcript itself never enters the model context — it lives in the execution environment, gets passed from one call to another in code, and only logs what the agent explicitly chooses to log. The measured result: 150,000 tokens reduced to 2,000 tokens — a 98.7% reduction.2
Cloudflare independently arrived at the same conclusion, branded Code Mode.8 Two implementations, different in detail. One shared pattern: agents are good at writing code, and code is a more efficient interface to a set of tools than direct tool-call syntax.
Anthropic's own internal anecdote, cited in the same engineering posts: their tool definitions consumed 134K tokens before optimization.6 A five-server, 58-tool setup is already at 55K. Adding Jira pushes past 100K. Those tokens are spent before the agent has read a single character of the user's request.
A 17-server stack inherits this problem by default. The interesting design question is what to do about it.
The implementation
What is in the 17-MCP local-first stack?
github.com/g-shevchenko/hwai-mcp-stack is one answer. It is a local-first, MIT-licensed Token Efficiency Platform that bundles 17 MCP servers behind a single install path and a shared MCP config that works across Claude Code, Codex, Cursor, and Windsurf.
The stack is organized by capability family:
| Capability | MCPs |
|---|---|
| Routing (the meta-layer) | HWAI Context Router (router-lite-mcp) |
| Compact repo retrieval before edits | retrieval-mcp, context-prep-mcp |
| Code structure and history | language-graph-mcp, repo-history-mcp |
| Local static checks and quality gates | static-analysis-mcp, repo-quality-gate-mcp |
| Keeping a growing repo clean | repo-hygiene-mcp, docs-hygiene-mcp, docs-sync-mcp |
| Contracts and dependency risk | contract-schema-mcp, dependency-risk-mcp |
| Regression datasets from real misses | golden-dataset-mcp, agent-trace-mcp |
| Browser traces and visual changes | playwright-trace-mcp, vision-mcp, visual-baseline-mcp |
Four install profiles let you scope to the work you actually do: core (6 MCPs, first install), repo (14, large codebase work), browser-debug (10, for Playwright/screenshot workflows), and full (all 17). The full profile is local-only and does not require any API keys — that distinguishes it from most of the commercial alternatives whose first user step is "paste your provider key here."
The routing layer is the conceptually load-bearing piece. The HWAI Context Router (router-lite-mcp) decides, before any frontier model token is spent, whether a prep MCP should be called at all. This is the local-first equivalent of Anthropic's search_tools server-side helper from the Code Execution post: it gives the agent a deterministic trigger gate so that simple questions stay simple and don't pay the tool-stuffing tax.
Measured
What did the public dogfood eval actually measure?
The README publishes a single measured before/after, with caveats. On , a local deterministic eval across 12 reviewed-public tasks compared the baseline path (no stack) against the same agent with the stack enabled:3
- Aggregate context-token reduction: 75.5%
- Aggregate total-token reduction: 70.5%
- Baseline task success: 91.7% → stack task success: 100.0%
- Critical false positives did not increase
- Per-family context reduction ranged from 35.0% to 80.8% across repo hygiene, traces, screenshots, logs, retrieval, and compression
The README's own caveat is reprinted here because it matters: this is internal dogfood evidence on 12 tasks. It is not a leaderboard claim, and it is not a universal percentage. Your savings will depend on repo size, task type, agent behavior, and whether the agent would otherwise paste entire files, logs, or screenshots into context.
What it does suggest is that the prep-first approach can deliver Layer B savings that compound usefully with whatever Layer A primitive you already use. A 75% context reduction at the prep layer, followed by a 90% cache-hit on the reduced payload at the API layer, leaves about 2.5% of the original input token cost on a repeat turn. The exact arithmetic depends on the workload. The direction is consistent.
Cost model
When does the 75.5% reduction actually save money?
The 75.5% is a tokens-saved number. The dollars-saved number depends on three things the headline figure hides: the price tier in use, the wall-clock overhead the prep step adds, and how the reader values developer time. Writing it as a formula:
C = ( Tin × (1 − R) + Tout ) × price + Overhead × wage ⁄ 3.6 × 106
R is the context reduction (a fraction). price is the per-token cost of the chosen Claude tier. Overhead is the wall-clock millisecond cost of the prep step. wage is the value of developer time in dollars per hour. The 3.6 × 106 divisor converts hours-per-millisecond into the per-task denomination of the rest of the equation. The token term shrinks with R; the time term grows with Overhead × wage. Whether MCP saves money is whether the first term shrinks faster than the second term grows.
At the smoke-fixture parameters from the eval harness — R = 77%, Tin ≈ 18,300, Tout ≈ 815, Overhead ≈ 4 s — the per-task saving across the three Claude tiers and two wage assumptions is:
| Tier (input/output $ per Mtok) | Saving at $0/hr (token cost only) | Saving at $100/hr (token + developer time) |
|---|---|---|
| Opus 4 ($15 / $75) | +$0.211 / task (+63%) | +$0.101 / task (+30%) |
| Sonnet 4 ($3 / $15) | +$0.042 / task (+63%) | −$0.068 / task (−101%) |
| Haiku 3.5 ($0.80 / $4) | +$0.011 / task (+63%) | −$0.099 / task (−554%) |
The left column is what the 75.5% reduction translates to in pure token-cost terms: a flat 60–63% dollar saving across all three tiers. The right column exposes the part the headline number hides. Once developer time has any value, the 4-second prep overhead becomes a real line item. On the cheap tiers — where each saved token is worth less — that time cost can exceed the token saving outright.
Equivalently: there is a break-even input size below which the stack costs more than it saves. The break-even is small on Opus, larger on Sonnet, and large enough on Haiku to dominate most everyday tasks:
| Tier | Break-even Tin at $50/hr | Break-even Tin at $100/hr | Break-even Tin at $200/hr |
|---|---|---|---|
| Opus 4 | ~4,800 tokens | ~9,500 tokens | ~19,000 tokens |
| Sonnet 4 | ~23,800 tokens | ~47,600 tokens | ~95,200 tokens |
| Haiku 3.5 | ~89,200 tokens | ~178,400 tokens | ~356,900 tokens |
The practical reading: the stack is a near-pure win at the Opus tier and at any tier when the baseline task is large. It is a defensible call on Sonnet for typical multi-file agentic tasks. It is usually a net loss on Haiku unless the baseline is unusually large or developer time is unusually cheap. The 75.5% context-reduction figure is real and does not change. What changes is the framing: which tier you are on, and how you value your own time, decide whether that 75.5% translates into dollars or into 4-second pauses.
Two honest caveats on this analysis. The R, Tin, and overhead values used in the tables come from a 3-paired-run smoke fixture in the eval harness, not from the full 12-task dogfood measurement — they illustrate the model, not a substitute for the headline measurement. And the price constants are public 2026-05 Anthropic API rates; the same equation applies, with different constants, to other vendors. A reproducibility-grade open eval harness with bootstrap confidence intervals, per-MCP ablation, and a --cost-model projection mode for plugging in any reader's own R, Overhead, and wage values is in active development and will publish alongside the next measured iteration.
Comparison
How does it compare to the named alternatives?
There are eight named classes of tooling in the token-economy adjacent space. The stack overlaps with each, and loses to each on a different axis. The table below positions it directly.
| Class | Named products | Where this stack is different |
|---|---|---|
| Provider-native primitives | Anthropic (caching, Tool Search, code execution, context editing, memory), OpenAI (caching, Batch, Predicted Outputs), Gemini context caching | Runs upstream of the model. Compounds with these — does not replace them. |
| Prompt compression (LLM-based) | Microsoft LLMLingua / LongLLMLingua / LLMLingua-211, Spectyra (OSS) | Parser-first and deterministic, no LLM cost. Less aggressive on dense prose. A complementary trade-off. |
| AI-coding "save tokens" listicles | computingforgeeks (20–43% with 10 tools), mindstudio.ai (5 skills, 70%), Medium 10-tips posts, the Reddit I cut Claude Code's token usage by 68.5% thread | One-command integrated install with a shared MCP contract across four IDEs. Not a bundle of unrelated tips. |
| Repo retrieval / code search | Cursor codebase index ($20/mo), Sourcegraph Cody, Continue, Aider repo-map (OSS), Greptile, Sweep, Augment | retrieval-mcp is deterministic local (ripgrep + path scoring + line anchors), no embeddings in v1 by deliberate scope. Closest twin: Aider repo-map. |
| Web / page extraction | Firecrawl ($16/mo+), Jina Reader (r.jina.ai), Bright Data, Apify, ScrapFly, Browserbase + Stagehand, SerpAPI | context-prep-mcp runs locally, parser-first. Different cost structure. |
| Model routers | Martian (~$1.3B val), OpenRouter auto, RouteLLM (OSS)12, Not Diamond, Unify, Aurelio semantic-router; gateways Portkey, Helicone, LiteLLM, Cloudflare AI Gateway, Vercel AI Gateway | router-lite-mcp is a prep-trigger gate (decide whether to call a prep MCP at all). Different, upstream layer. |
| LLM cost observability | Helicone, Langfuse, Portkey, Lunary, LangSmith, Braintrust | agent-trace-mcp and per-MCP metrics work for local self-host. The commercial edge is hosted dashboards. |
| Visual / browser-trace tooling | Jam, Marker.io, Applitools, Percy, Chromatic TurboSnap | vision-mcp and playwright-trace-mcp are agent-token-budget-optimized. Different consumer (the agent, not a human reviewer). |
Two clarifications worth making for the technical reader. The first is that the stack does not compete with provider-native primitives. They occupy different layers. Pre-model prep reduces the payload before it reaches the model. Caching reduces what you pay for that payload on the second, third, and fourth turns. They are friends.
The second is that the stack does not replace a model router. If you need to dispatch a request to the cheapest sufficient model (Haiku vs Sonnet vs Opus, or routing between providers), that is a Martian-or-RouteLLM-shaped problem. The HWAI Context Router operates upstream of that decision: it decides whether the request needs prep tooling at all, before anything else happens.
Benchmark framing
Where does Terminal-Bench fit? Quality and efficiency are different axes
The natural question after "does this save tokens" is "does it make my agent worse at completing tasks?" The answer to the second question lives on a different benchmark.
Terminal-Bench is a Stanford × Laude Institute collaboration.10 The v1.0 release ships 80 hand-crafted terminal tasks; v2.0 ships 89; v2.1 is current; v3.0 is in development. Tasks include building a Linux kernel from source, configuring a git server connected to a webserver, generating a self-signed OpenSSL certificate with strict requirements, training a fasttext model under accuracy and size constraints, and resharding a c4 data slice with rigorous correctness checks. Each task runs in a Docker container terminal sandbox.
The current leader on the Terminal-Bench 2.0 leaderboard is Claude Sonnet 4.5 at a 0.500 task-resolution score. Other 2026 frontier models cluster below.
What Terminal-Bench measures: whether the agent completed the task.
What it does not measure: how many tokens it spent doing so.
Those are orthogonal axes. A coding-agent stack is not in the same league as Terminal-Bench — it does not improve raw model capability. It is a multiplier on whatever Terminal-Bench score your chosen agent already has. You pick the agent for capability. You pick the prep layer for efficiency. If your agent already solves the task, the prep layer lets it solve the same task with fewer tokens on the next iteration.
That distinction is why "the 17-MCP stack" and "Claude Sonnet 4.5 leads Terminal-Bench" can both be true and both be useful, and neither one makes the other irrelevant.
How to try it
How do you actually install and run it?
Five commands. The order matters — the first three are the inspect-first trust path:
- Clone the repository.
git clone https://github.com/g-shevchenko/hwai-mcp-stack.git - Enter the directory.
cd hwai-mcp-stack - Run the agent preinstall check — confirms expected write targets, scans for forbidden installer patterns, validates the trust manifest.
bash scripts/agent-preinstall-check.sh - Dry-run the installer — shows exactly what will be written, to where, without writing anything.
bash install.sh --dry-run - Install.
bash install.sh
Once that completes, the same install path is also available as a single-line shell command:
HWAI_MCP_PROFILE=full HWAI_MCP_CLIENTS=codex,cursor \
/bin/bash -lc "$(curl -fsSL https://raw.githubusercontent.com/g-shevchenko/hwai-mcp-stack/main/install.sh)"
For repeatable / CI / team-standardized installs, pin to a commit SHA — substitute the SHA you want — the README publishes pinned-install examples:
HWAI_MCP_BRANCH=76540dcfbcd12284fc2b783d22c5c091624eaf82 \
/bin/bash -lc "$(curl -fsSL https://raw.githubusercontent.com/g-shevchenko/hwai-mcp-stack/76540dcfbcd12284fc2b783d22c5c091624eaf82/install.sh)"
Requirements: macOS or Linux shell, git, Node.js, and npm. After install, restart Claude Code, Codex, Cursor, or Windsurf, or open a fresh chat so the stdio MCP configs reload.
The first thing to do after install is run the bundled doctor to confirm all 17 services are healthy:
~/.hwai/hwai-mcp-stack/mcp/bin/hwai-mcp.mjs doctor \
--manifest=~/.hwai/hwai-mcp-stack/mcp/manifest.json \
--source-root=~/.hwai/hwai-mcp-stack/mcp/source \
--profile=full
The expected output for the full profile is services: 17, ok: 17, needs_attention: 0, warnings: 0.
For trust verification before any install, read TRUST.md, VERIFY_BEFORE_INSTALL.md, and the machine-readable trust/hwai-mcp-stack.trust.json. The agent preinstall script encodes the same checks in executable form.
Limits
What does this not do?
Five honest limits.
- It is not a coding agent. Claude Code, Codex, Cursor, and Windsurf are the agents. The stack is the prep layer they call.
- It does not replace prompt caching, the Batch API, or context editing. Those are Layer A. The stack is Layer B. They compound.
- It does not replace human review on architecture-level decisions. Token savings are not a substitute for engineering judgment.
- It does not claim to make any model smarter. Terminal-Bench scores depend on the agent and model you pick. The stack is orthogonal to that.
- The 75.5% number is the README's own dogfood across 12 reviewed-public tasks. It is an internal eval, not an industry leaderboard. The README explicitly says "do not claim a universal percentage reduction from the public README." Savings depend on repo size, task type, and how much raw context your agent would otherwise paste in. The right next step is to install the stack and measure your own before / after on your own workload.
Methods upgrade
How is this being upgraded from engineering note to research paper?
The headline 75.5% comes from an internal dogfood eval (N=12, single rater = author, single codebase). That is honest engineering evidence, not a publishable measurement. Treating the article as a research paper means closing the gap between the two. Seven gaps are tracked explicitly, each with a concrete mitigation plan and the layer of methodology it changes:
| Gap | What the current eval lacks | How the next iteration closes it |
|---|---|---|
| G1 — independence | The author measured the author's own stack. No blinding. | Every response scored by two LLM judges of different ancestry (Anthropic Claude Sonnet + an OpenAI-class model via the open gateway). Cohen's κ between the two judges reported. Human raters invited at replication (P4) rather than blocked on availability. |
| G2 — statistical power | N=12 paired tasks, no confidence intervals, no p-values. | Expansion to N=60 paired tasks. Bootstrap 95% CI on every aggregate metric (10 000 resamples). Welch's two-tailed t-test on per-task reductions. Target p < 0.05 on the primary metric. |
| G3 — ablation | The 12-task result conflates all 17 MCPs into one number. Per-MCP contribution unknown. | Per-profile ablation matrix: none / core / repo / full on every task, plus per-MCP toggle for the top suspected contributors. Output: minimum viable profile that preserves ≥90% of the headline reduction. |
| G4 — task selection bias | 12 tasks chosen by the stack author. Plausibly favored the stack. | Tasks drawn from external public OSS codebases the author does not maintain. Acceptance criteria written before any agent runs. Held out from any prior tuning. |
| G5 — single codebase | All 12 tasks ran against the author's monorepo. | Golden task set covers three independent codebases (one Python web framework, one Python SDK, one JavaScript library) drawn from different ecosystems and authors. |
| G5b — cost & latency | Tokens were measured, dollars were not. Overhead was not folded into the headline claim. | Done — see the cost-model section above. Formal equation, three-tier price table, break-even-input-size table, projection mode for any reader's own parameters. |
| G6 — comparison baseline | Only "same agent, stack on vs off" was compared. | Three additional baselines on the same task set: no MCPs at all, web-fetch only, and a hand-crafted prompt without MCP context. Cursor and Windsurf as fourth and fifth baselines in a follow-up release. |
| G7 — methodology not published | Eval procedure described in prose, not reproducible from the article alone. | Eval harness, golden task set, judge prompts, statistical helpers, and price tables all open-sourced as one repository. A --cost-model projection mode lets any reader plug in their own R, Overhead, and wage assumptions without rerunning the full eval. |
What is already shipped: the formal cost model and break-even analysis (G5b), and the harness itself with bootstrap CI, Welch's t-test, Cohen's κ, per-MCP ablation, LLM-judge integration, and the price-tier table. What remains for the next iteration: data collection on three external codebases, the per-MCP ablation run, two-judge agreement, and the publication of the harness as a standalone repository for independent replication. The reader-facing claim is unchanged: the prep-first approach measurably reduces input tokens. The framing around that claim is what is being upgraded.
Preview — first-iteration pilot data (2026-05-20, N=17, cross-vendor). The harness ran an ablation on a non-Anthropic agent (Codex with the same MCP stack vs the same agent without it) on seventeen external-codebase tasks — sample size now comparable to the article's 12-task internal dogfood. The aggregate context-token reduction was −26.09% with a 95% confidence interval of [−53.20%, −2.24%]. The interval is now entirely negative: on this cross-vendor sample MCPs added roughly 26% to the input-token budget on average, not removed. Per-task distribution: five tasks showed real reduction (E001 +58%, D008 +16%, D003 +15%, E003 +13%, E012 +7%); four were flat (±5%); eight showed regression, with one extreme outlier at −181%. Variance is enormous, and the aggregate sign depends entirely on which tasks are sampled. Two caveats stop this from being a refutation of the headline 75.5%: (1) Codex is OpenAI GPT-class, not Claude Code Sonnet-class — different agent, different reasoning style; (2) the baseline here is “same Codex with --ignore-user-config” (no MCPs loaded), not the article's “same Claude Code without the stack” baseline. The 75.5% remains its specific measurement on its specific agent; this preview shows the result is highly task-set-dependent and does not transfer cleanly cross-vendor. Claude Code-specific replication is in progress and will land in a separate update.
FAQ
Frequently asked questions
Q: How much does Claude Code cost per day for a heavy user?
A: The Russian-language VC.ru analysis cites roughly $6/day for an average developer,5 which is consistent with the $100–200/month band from English-language 2026 comparisons.1 Heavy use varies widely with codebase size and how much of your day is in the agent.
Q: Does prompt caching work with MCP tools?
A: Yes. Anthropic's Tool Search Tool was designed specifically not to break prompt caching — that is called out explicitly in the advanced tool use engineering post.6 The two stack cleanly.
Q: Is the 98.7% Anthropic claim real?
A: Yes. It is sourced to the official Anthropic engineering blog with the Google Drive → Salesforce example,2 and Cloudflare independently arrived at the same architectural conclusion (branded Code Mode).8 The 150K → 2K example is in both write-ups.
Q: What is the best free or open-source way to reduce Claude Code tokens?
A: The stack is one such option: MIT-licensed, no API key required for the full profile, local-only by default, one-command install. The official Anthropic cost-management documentation covers the model-side levers — context management, model selection, extended-thinking settings, preprocessing hooks.9 The two are complementary.
Q: How does this compare to a paid product like Cursor's codebase index or Sourcegraph Cody?
A: Different scope. Cursor and Cody bundle a coding agent, a UI, and a hosted retrieval index into one product. This stack is the local prep layer your existing agent calls — you keep your editor and your model choice, and the prep happens on your machine before anything reaches the API.
Q: When should a team skip an MCP prep layer?
A: Skip it when the task is already small, exact, and cheap. The Codex N=9 preview on this page showed high variance: some tasks saved tokens, some were flat, and some got worse because MCP overhead exceeded the savings. Measure by task class instead of assuming an MCP stack is always cheaper.
The invitation
Try the stack, measure your own before and after
If your numbers diverge from the README's, the right next step is an issue on the repository — the measure-before-deploy discipline is built into the project posture.
The repository: github.com/g-shevchenko/hwai-mcp-stack.
Sources
References and source notes
AI Coding Tools Pricing — comparative aggregates.
April 2026 cross-tool comparison. Cited together with ijonis.com and nxcode.io for the $100–200/month heavy-use band.
Source 2 · Anthropic engineering ·Code execution with MCP: Building more efficient agents.
Canonical primary source for the 98.7% reduction number and the Google Drive → Salesforce example. Authors: Adam Jones and Conor Kelly.
Source 3 · hwai-mcp-stack public README · dogfoodThe 17-MCP local-first Token Efficiency Platform under discussion.
MIT-licensed. The 75.5% / 70.5% / 91.7% → 100.0% numbers come from docs/local-dogfood-eval-2026-04-30.md. Caveats and the no-universal-claim language are reprinted from the README itself.
Source 4 · morphllm.com ·We Tested 15 AI Coding Agents (2026).
Industry round-up cited for the "$100–200/month + token waste from hallucinations" framing.
Source 5 · VC.ru · Russian-language landscapeSaving tokens — tactics that cut Claude Code token spend (in Russian).
Cited for the ~$6/day per-developer Claude Code spend reference and the RU-market framing.
Source 6 · Anthropic engineering ·Advanced tool use.
Tool Search Tool 85% reduction, Opus 4 49→74% and Opus 4.5 79.5→88.1% accuracy gains, the 134K tokens-of-tool-defs internal anecdote.
Source 7 · Anthropic ·Context Management.
Context editing 84% reduction on 100-turn web-search eval; memory tool combined with context editing +39% accuracy over baseline.
Source 8 · CloudflareCode Mode.
Cloudflare's independent arrival at the same code-execution-with-MCP pattern. Architectural confirmation alongside Source 2.
Source 9 · Claude Code docsManage costs effectively.
Anthropic's canonical reference for model-side cost levers — context management, model selection, extended-thinking settings, preprocessing hooks.
Source 10 · Stanford × Laude InstituteTerminal-Bench.
The agent quality benchmark for terminal tasks. v1.0 (80 tasks), v2.0 (89 tasks), v2.1 current, v3.0 in development. Claude Sonnet 4.5 currently leads at 0.500 task resolution.
Source 11 · Microsoft ResearchLLMLingua / LongLLMLingua / LLMLingua-2.
The academic / OSS analog to Layer B prompt compression — model-scored token salience. Complementary trade-off vs the parser-first deterministic approach in context-prep-mcp.
Source 12 · lm-sysRouteLLM.
OSS model router. Cited as the reference open-source alternative in the model-router class of comparison.
Related pages