Research synthesis · Published 27 May 2026 · Builds on part 1 and part 2

When MCPs save tokens (and when they don't): a measurement framework for agentic stacks.

On large agentic tasks — sessions where the baseline output runs 5,000 tokens or more — MCPs save 40-55% of tokens on the cases we measured. On small tasks under 2,000 tokens, the same MCPs add overhead. The break-even point isn't where most teams' workloads live, and the cost of getting this wrong scales with how many sessions you run per day. This article distills N=100 measured (task, profile) cells across 4 MCP profiles into three reusable frameworks plus a recommendation table you can apply tomorrow.

Author: Gregory Shevchenko
Subject: Three named frameworks for MCP-stack routing (task-size threshold, profile-task fit over profile size, multi-axis evaluation), plus the retraction discipline that earned them.
Headline measurement: N=100 unique (task, profile) cells · 4 profiles (none/core/repo/full) · 26-task golden set · two large-task cases at 5,000+ baseline tokens both saved 40-55%
Best use: A reference for engineers and operators running agentic stacks who need to route sessions by task class and decide which MCPs belong in the default-loaded set

What you'll learn

  1. The task-size threshold rule: why MCPs amortize on large tasks (5,000+ baseline tokens) and add overhead on small tasks, with the per-task table that surfaces the pattern the means hide.
  2. The profile-task fit framework: why a 9-MCP code-tuned profile times out 7/7 on docs-heavy work while completing 7/7 on code work, and how to classify your tasks before picking your profile.
  3. The multi-axis MCP evaluation: three axes (token cost, capability, frequency × alternative) that decide whether to trim an MCP, and why a null token result alone never justifies removal.
  4. The retraction discipline: how a "provisional finding" with a wide 95% CI shipped as a directional claim it could not support, and the polarity-guard rule that catches the class.

Lede

Most teams run MCP stacks in the wrong regime.

Most teams running stacks of Model Context Protocol servers (MCPs) are running them in the wrong regime.

On large agentic tasks — sessions where the baseline output runs 5,000 tokens or more — MCPs save 40-55% of tokens on the cases we measured. On small tasks under 2,000 tokens, the same MCPs add overhead. The break-even point isn't where most teams' workloads live, and the cost of getting this wrong scales with how many sessions you run per day.

This article distills what we learned from N=100 measured (task, profile) cells across 4 MCP profiles into three frameworks practitioners can apply to their own stacks. The data came out of a benchmark we ran on our internal agentic infrastructure (cc-eval-runner on a Mac Mini, $0 marginal cost via Pro/Max OAuth). The frameworks are reusable beyond our specific stack.

The headline finding is the task-size threshold. The headline framework is the multi-axis evaluation. We'll get to the methodology, the retraction we had to walk through to earn this, and the practical recommendation table — but the data point worth your time is this:

MCPs don't pay off on average. They pay off on the right kind of task.

Finding

The task-size threshold

When you pair the same task across two MCP profiles and look at token usage on completed runs, a pattern emerges that the averages hide.

At N=17 paired tasks comparing the core profile (6 MCPs — retrieval, context-prep, vision, and three others) vs none (0 MCPs, built-in tools only), the mean difference is +41 tokens. That's a null result. The 95% confidence interval crosses zero. By itself it suggests MCPs neither save nor cost meaningfully.

Sort the same data by the baseline (control-arm) task size:

Task | Baseline tokens (none) | core tokens | Δ      | %
D005 | 7,295                  | 4,243       | −3,052 | −42%
D001 | 5,726                  | 2,548       | −3,178 | −55%
D013 | 2,954                  | 4,079       | +1,125 | +38%
E034 | 2,543                  | 2,812       | +269   | +11%
G001 | 1,702                  | 2,048       | +346   | +20%
G021 | 1,393                  | 1,623       | +230   | +17%
D014 | 1,216                  | 918         | −298   | −25%
E031 | 1,264                  | 2,192       | +928   | +73%
G024 | 1,082                  | 1,978       | +896   | +83%
… plus 8 more tasks under 1,000 baseline, mostly small overhead

The picture is stark. The two tasks where baseline crossed 5,000 tokens both showed roughly half-the-token reductions — D005 by 42%, D001 by 55%. Of the 15 tasks under 5,000 baseline, only one (D014) saved tokens, and that one was a small absolute amount. The other 14 either matched or added overhead.

The same pattern holds against the repo profile (9 MCPs, code-tuned): D001 at 5,726 baseline went to 3,384 with repo (−41%); D005 went to 5,987 (−18%). Small tasks: overhead.

Why does the mean wash this out? Two large savers, averaged with fifteen small tasks of which fourteen add overhead, net out to roughly zero. The signal is in the tail, not the center.
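To make the split concrete, here is a minimal sketch that groups the nine rows printed in the table above by the 5,000-token anchor and compares the group means. It covers only those nine rows (the eight sub-1,000 tasks are not listed individually), so its pooled mean will not match the +41 reported for all 17 pairs. The names ROWS and split_by_baseline are illustrative and are not part of our eval_rigor module.

```python
from statistics import mean

# (task, baseline tokens under `none`, tokens under `core`) for the nine rows printed above.
ROWS = [
    ("D005", 7295, 4243), ("D001", 5726, 2548), ("D013", 2954, 4079),
    ("E034", 2543, 2812), ("G001", 1702, 2048), ("G021", 1393, 1623),
    ("D014", 1216, 918), ("E031", 1264, 2192), ("G024", 1082, 1978),
]

def split_by_baseline(rows, threshold=5000):
    """Return (deltas for tasks at or above threshold, deltas for tasks below it)."""
    large = [core - base for _, base, core in rows if base >= threshold]
    small = [core - base for _, base, core in rows if base < threshold]
    return large, small

large, small = split_by_baseline(ROWS)
print(f"large tasks (n={len(large)}): mean delta {mean(large):+.0f} tokens")
print(f"small tasks (n={len(small)}): mean delta {mean(small):+.0f} tokens")
print(f"pooled      (n={len(ROWS)}): mean delta {mean(large + small):+.0f} tokens")
```

The two groups point in opposite directions, which is why a single pooled mean says little about either regime.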

This pattern is anecdotal at N=2 large tasks — we cannot claim the 40-55% number with statistical confidence yet. What we can claim with confidence is the direction: large agentic sessions are where MCPs earn their token budget, and we need to keep measuring to tighten the bound.

The first framework follows directly.

Methodology

Methodology (in brief)

This is Part 3 of a measurement series. The full methodology lives in part 1 [1] (original framing and infrastructure) and part 2 [2] (bootstrap CI methodology, scraper-stack AB tests, and the eval_rigor stdlib-only Python module we use for the statistics). This part extends the methodology with a 4-profile ablation on 26 unique golden tasks.

The bullet-summary of the setup:

  • Profiles tested: none (0 MCPs), core (6), repo (9, code-tuned), full (17)
  • Tasks: 26-task golden set covering document reads (D-tier), structured extraction (E-tier), shell/git operations (G-tier)
  • Per task: one claude --print call. Tokens from Anthropic API response. 1,200-second timeout. No retries.
  • Cost: $0 marginal via Pro/Max OAuth on a Mac Mini (the cc-eval-runner infrastructure described in part 1)
  • Total measurements: 100 unique (task, profile) cells; some tasks ran in only a subset of profiles
  • Statistics: stdlib-only Python (eval_rigor module). Wilson 95% CIs per arm; bootstrap CIs (5,000 iterations, seed 42 for determinism); Welch's t-test + Cohen's d for between-group continuous metrics
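For readers who want to see what a seeded bootstrap CI looks like in stdlib-only Python, here is a minimal sketch. The eval_rigor API is not shown in this article, so the function name, the percentile method, and the index arithmetic below are assumptions for illustration, not the module's actual implementation. The sample data are the nine core-vs-none deltas printed in the task-size table, not the full 17 pairs.

```python
import random
from statistics import mean

def bootstrap_mean_ci(values, iterations=5000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for the mean of `values` (for example, paired token deltas)."""
    rng = random.Random(seed)  # local RNG, so the result is deterministic for a given seed
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(iterations))
    lo_idx = int(iterations * alpha / 2)
    hi_idx = min(iterations - 1, int(iterations * (1 - alpha / 2)))
    return means[lo_idx], means[hi_idx]

# Core-minus-none token deltas for the nine tasks listed in the task-size table.
deltas = [-3052, -3178, 1125, 269, 346, 230, -298, 928, 896]
lo, hi = bootstrap_mean_ci(deltas)
print(f"mean delta {mean(deltas):+.0f}, 95% bootstrap CI [{lo:+.0f}, {hi:+.0f}]")
```

Fixing the seed trades a little statistical purity for reruns that produce identical intervals, which makes published numbers easier to check.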

The methodology is reproducible with the public tools in our repo. The clone commands appear in the reproducibility section at the bottom of this article.

The frameworks

Three frameworks practitioners can adopt

The data above isn't reusable until it's wrapped in decision frameworks. Three named frameworks survived the larger-N expansion. Each is independently useful even if you never run our specific benchmark.

Framework 1 — The task-size threshold rule

Statement: MCP value is non-linear with respect to task size. Below approximately 2,000 baseline tokens, MCPs mostly add overhead (description cost plus startup latency exceed any saving). Above approximately 5,000 baseline tokens, MCPs saved 40-55% of tokens in the two cases we measured. Between 2,000 and 5,000 the data is noisy.

Why this happens: the per-session cost of loading an MCP into context is roughly constant (the tool description tokens). The per-session benefit scales with how much work the agent ends up doing — more retrieval lookups, more file reads, more synthesis steps. Small tasks finish before the agent invokes more than 1-2 tools; the description cost dominates. Large tasks invoke tools repeatedly; the description cost amortizes and the per-call savings accumulate.
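The amortization argument can be written as a toy model: a fixed description cost paid once per session, offset by a saving on each tool call. The sketch below assumes the saving per call is constant, which is a simplification, and every number in it is hypothetical rather than measured.

```python
import math

def net_token_delta(description_tokens, saving_per_call, tool_calls):
    """Net change in session tokens from loading an MCP: fixed cost minus accumulated savings."""
    return description_tokens - saving_per_call * tool_calls

def break_even_calls(description_tokens, saving_per_call):
    """Smallest number of tool calls at which the MCP stops costing tokens overall."""
    if saving_per_call <= 0:
        return math.inf  # the MCP never pays back its description cost
    return math.ceil(description_tokens / saving_per_call)

# Hypothetical numbers, chosen only to illustrate the shape of the curve.
DESCRIPTION_TOKENS = 600
SAVING_PER_CALL = 150
for calls in (1, 2, 4, 8, 16):
    delta = net_token_delta(DESCRIPTION_TOKENS, SAVING_PER_CALL, calls)
    print(f"{calls:>2} calls: net {delta:+d} tokens")
print("break-even at", break_even_calls(DESCRIPTION_TOKENS, SAVING_PER_CALL), "calls")
```

Under these made-up inputs a session that stops after one or two calls is a net loss, and one that makes eight or more is a clear win, which is the shape the per-task table shows.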

How to apply: route sessions to MCP-heavy profiles only when you have a prior expectation that the task will be large. For short Q&A, simple lookups, or single-file edits, the lean none or core profile is faster and cheaper. Build the routing into your agent harness or, at minimum, into your operational playbook.

The honest caveat: the 40-55% savings number is from N=2 tasks crossing the 5,000-baseline threshold in our sample. The direction is consistent; the exact threshold and percentage will move with broader measurement. Treat 5,000 as a working anchor, not a precise cutoff.

Framework 2 — Profile-task fit > profile size

Statement: matching the MCP toolbox to the task class matters more than the total MCP count. A profile that excels on one task class can lose 100% of attempts on another.

The data: our 9-MCP repo profile, which bundles retrieval-mcp and context-prep-mcp prominently (both tuned for codebase queries), completed 7/7 code-focused tasks in our golden set. The same repo profile timed out at 1,200 seconds on 7/7 docs-heavy tasks. Not "performed worse" — timed out completely.

We don't have per-tool-call timing data to confirm the mechanism, but the hypothesis with the most support is that retrieval-mcp rabbit-holes on docs queries — chasing ever-broader context windows trying to ground an answer that requires reading rather than retrieving. On code work, where queries are bounded by file or symbol scope, the same MCP delivers value cleanly.

How to apply: classify your task before picking your profile. Even a rough taxonomy (code-narrow, docs-heavy, shell-ops, browser-automation, mixed) is enough to make the choice meaningful. The default profile when task class is unknown should be the lean one (core-equivalent), not the maximalist one. Maximalist profiles concentrate failure modes; lean profiles distribute them.

This generalizes beyond MCPs: it's the same principle as choosing the right tool for the job in any toolchain. The MCP context just makes the cost of getting it wrong measurable.

Framework 3 — Multi-axis MCP evaluation

Statement: the decision to load an MCP into your stack should be evaluated on three axes, not one. Trim only when all three axes say drop. Default = keep.

Axis | The question | What measurement tells us
Token cost | Does this MCP measurably add session tokens? | Measurable at scale (N≥60). Most MCPs in our sample don't measurably cost on the mean.
Capability enabled | What workflow becomes impossible or much harder without it? | NOT measurable from token data alone. User-judgment per MCP.
Frequency × alternative | Used regularly enough AND has cheap alternative? | NOT measurable from token data alone. User-judgment per session pattern.

Why three axes: because two of them are unmeasurable from any token benchmark. If you trim MCPs based on token cost alone, you'll inadvertently remove tools whose capability is irreplaceable (you don't notice the cost until the workflow that needed them fails). If you trim based on frequency alone, you'll remove the safety-net MCPs that fire 5% of the time but unblock high-value work when they do.

The discipline: before removing any MCP from your default-loaded stack, answer:

  1. Is its measurable token cost dominant? (Usually no: in our data, profiles carrying 6 to 17 MCPs don't measurably cost more on the mean.)
  2. Can you do without the capability it enables, because a workable alternative exists? (For most specialty MCPs in our stack, no: vision-mcp for screenshot analysis, scraper-stack for URL fetch on hostile sites, etc.)
  3. Have you genuinely not used it in 90 days, AND can you predict not needing it for edge cases?

If you answer "no" to any of these, the safer default is to keep the MCP. The cost of running it is small. The cost of removing it and then needing it is large.
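The keep-by-default rule is small enough to encode directly. In this sketch the class and field names are hypothetical; only the first field can come from a token benchmark, and the other two are judgment calls that a person has to fill in.

```python
from dataclasses import dataclass

@dataclass
class McpAssessment:
    name: str
    token_cost_dominant: bool    # axis 1: measured, and only trustworthy at sufficient N
    unique_capability: bool      # axis 2: judgment call, not visible in token data
    needed_for_edge_cases: bool  # axis 3: used in the last 90 days, or plausibly needed

def decide(a: McpAssessment) -> str:
    """Return 'drop' only when all three axes agree; otherwise 'keep'."""
    all_axes_say_drop = (
        a.token_cost_dominant
        and not a.unique_capability
        and not a.needed_for_edge_cases
    )
    return "drop" if all_axes_say_drop else "keep"

# Illustrative assessment: a specialty MCP with a null token result stays loaded.
print(decide(McpAssessment("vision-mcp", False, True, True)))
```

Writing the rule as a conjunction makes the asymmetry explicit: any single axis can veto a removal, but no single axis can justify one.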

This framework prevented us from making a mistake we almost made — and that brings us to the retraction.

Discipline

What we measured wrong (and why the retraction matters)

The frameworks above didn't emerge cleanly. We had to retract a public claim to earn the third one.

Two weeks before this article, we published a smaller version of this benchmark at N=9 tasks. The headline finding then: "the repo profile has the highest task completion rate at 88.9%."

That claim came with a caveat — Wilson 95% CI [57%, 98%], with the line "needs N≥20 to confirm." Sufficient academic hedging, we thought. Not sufficient in practice.

At N=28 on the same profile, the completion rate landed at 67.9% — Wilson CI [49.3%, 82.1%]. That's the lowest completion rate among the four profiles. The provisional claim was wrong by direction, not just by precision.

The error was structural. A 95% CI of [57%, 98%] admits both "highest of four profiles" (near the upper bound of 98%) and "lowest of four profiles" (near the lower bound of 57%); the interval rules out neither. A claim that the data cannot distinguish from its opposite isn't a finding. It's a measurement gap.
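The two intervals quoted above can be checked with a standard Wilson score interval. This is a stdlib-only sketch, not the eval_rigor implementation, and it assumes the 88.9% figure at N=9 corresponds to 8 completions out of 9.

```python
from math import sqrt
from statistics import NormalDist

def wilson_ci(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Completions for the repo profile at the two sample sizes discussed above.
for label, k, n in [("N=9", 8, 9), ("N=28", 19, 28)]:
    lo, hi = wilson_ci(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

The printed intervals should land close to the [57%, 98%] and [49.3%, 82.1%] bounds quoted in the text. The N=9 interval spans about 40 percentage points, which is the width that let it contain both readings.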

We codified the discipline into our engineering protocol: if the 95% CI for a comparative claim crosses the claim's own polarity, don't publish the claim — even provisionally. Publish only the bound.

The required disclosure format for under-evidenced data:

Measurement at N=<n>: [point estimate] [Wilson|bootstrap] 95% CI [lower, upper].
CI is wide enough that BOTH [direction A] and [direction B] remain admissible.
NO directional claim.
Expansion to N≥<target> required to discriminate.

This is more cumbersome prose than "provisional finding: X is highest." It doesn't propagate as a misleading anchor when readers summarize it.
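One way to keep the format from drifting is to generate it rather than type it. The helper below is a hypothetical sketch (the function name and its parameters are not part of our tooling) that fills the four-line template from its parts.

```python
def format_disclosure(n, point, lower, upper, method, direction_a, direction_b, target_n):
    """Render the four-line disclosure used when a CI cannot separate two directions."""
    if method not in ("Wilson", "bootstrap"):
        raise ValueError("method must be 'Wilson' or 'bootstrap'")
    return "\n".join([
        f"Measurement at N={n}: {point} {method} 95% CI [{lower}, {upper}].",
        f"CI is wide enough that BOTH {direction_a} and {direction_b} remain admissible.",
        "NO directional claim.",
        f"Expansion to N≥{target_n} required to discriminate.",
    ])

# The retracted N=9 claim, restated in the required format.
text = format_disclosure(
    9, "88.9%", "57%", "98%", "Wilson",
    "highest of four profiles", "lowest of four profiles", 20,
)
print(text)
```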

The retraction matters here because it's the same kind of mistake we almost made on the multi-axis evaluation framework. From the same N=100 data, the easy story would have been "Codex stack at 33 MCPs is way past the measured 'full hurts' threshold; trim it." That story is wrong in the same way the original "repo highest" claim was wrong: it draws a multi-axis prescription (trim) from a single-axis null result (token cost not measurably high). The fix is the multi-axis framework, not the trim.

Discipline like this earns the right to make claims that hold up. We'd rather publish narrower claims that survive than headline claims we have to walk back. The asymmetry favors caution.

Apply tomorrow

Practical recommendation table

The three frameworks compress into a recommendation table you can apply tomorrow.

Anticipated task size | Recommended profile | Why
<2,000 baseline tokens (simple Q&A, single-file edits, status checks) | none or lean core (0-6 MCPs) | MCP description cost dominates; you finish before MCPs amortize
2,000-5,000 baseline (multi-step but bounded) | core (6 MCPs) or class-matched profile | Noisy regime; default to lean unless task class is clearly suited to a specific toolbox
>5,000 baseline (complex multi-step, multi-file synthesis, research) | Class-matched specialty profile (e.g. repo for code, doc-loading for docs) | MCPs amortize their description cost and save 40-55% on the tasks we measured
Unknown or mixed class | core (6 MCPs, general-purpose) | Lean default with best balance of completion rate (82.6%) and token cost (essentially tie with none)

Question about a specific MCP | Decision criterion
Add to baseline-loaded stack? | Used ≥30% of sessions AND no functional overlap with existing
Move to opt-in (load per task)? | Used <30% of sessions OR specialty domain (figma, gmail, browser tools)
Remove entirely? | All THREE axes say drop: measurable token cost dominant + no unique capability + no edge-case need

The defaults across the table: lean profile when in doubt; default to keep when evaluating removal; route by task class when you can predict it.
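If you want the routing in a harness rather than a playbook, the first table reduces to a few comparisons. This is a sketch under stated assumptions: the function name, the "code" and "docs" class labels, and the choice to resolve the 2,000-5,000 band to core are all illustrative simplifications, and the thresholds are the working anchors from this article rather than precise cutoffs.

```python
def recommend_profile(estimated_baseline_tokens=None, task_class=None):
    """Map an anticipated task size and optional class to a profile name from the table."""
    # Hypothetical mapping from task class to a class-matched specialty profile.
    class_matched = {"code": "repo", "docs": "doc-loading"}
    if estimated_baseline_tokens is None:
        return "core"  # unknown size: lean general-purpose default
    if estimated_baseline_tokens < 2000:
        return "none"  # description cost dominates on small tasks
    if estimated_baseline_tokens <= 5000:
        return "core"  # noisy regime: default to lean
    # Large task: use the class-matched profile if the class is known, else stay lean.
    return class_matched.get(task_class, "core")

print(recommend_profile(800))
print(recommend_profile(7000, "code"))
print(recommend_profile(7000))
```

The hard part in practice is the estimate itself. A rough prior from the task type (status check versus multi-file synthesis) is usually all that is available before the session starts.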

Limits

Honest limits and what's next

This work has measurement gaps we want to flag rather than minimize.

Single-shot vs multi-turn. Every measurement here is one claude --print call per task. Real agentic work is multi-turn with prompt caching across turns within Anthropic's 5-minute window [3]. Multi-turn cost dynamics could plausibly differ: MCPs that add a stable preamble would have their description cost cached across turns, making the per-turn cost much lower than the cold-start cost we measured. We haven't measured this. The benchmark would need a multi-turn driver and per-turn cache-hit attribution.

Tool counts, not tool calls. We know which profile had MCPs available. We don't yet know which MCPs Claude used during each session. A core profile run might use 0-1 of the 6 available MCPs; a repo run on docs-heavy work might invoke retrieval-mcp 15 times before timing out. Those are different cost profiles even at the same final token total. We shipped a tracing primitive in our cc-eval-runner (per-MCP-call timing via --trace-tools) to collect this data in the next eval cycle.

Latency-vs-tokens isn't statistically compared. Median latency by profile (24.4s for none, 28.6s for core, 31.0s for repo, 30.1s for full) suggests repo is slowest. We haven't run paired latency comparisons with bootstrap CIs, so the eyeball ranking might be within noise.

Task class taxonomy is coarse. Our D-tier / E-tier / G-tier labels are useful but not granular. A finer taxonomy (e.g. by domain, by required tool family) would likely refine the profile-task fit framework further.

The next batch of work targets the first two gaps. Per-MCP-call tracing on the seven repo-profile timeout tasks will tell us which specific MCP consumes the 1,200-second budget on docs-heavy work. A small multi-turn driver will tell us whether cache amortization changes the threshold.

Reproducibility

The public-facing measurement tools (harness scaffolding, golden-task fixtures, and the eval_rigor stdlib-only Python module with Wilson CI, bootstrap CI, and Welch's t-test) live in the mcp-token-savers repo [4]. Run them on your own stack, or clone the bench harness and substitute your own golden tasks. No paid API calls; everything runs at $0 marginal cost via Pro/Max OAuth on a Mac Mini.

git clone https://github.com/g-shevchenko/mcp-token-savers
cd mcp-token-savers
# See README.md for harness usage + the eval_rigor module API.
# Aggregation + pairwise comparisons + bootstrap CI helpers live under scripts/.

The exact N=100 (task, profile) raw JSONs and ablation scripts behind this article remain in the source-of-truth eval harness on the Mac Mini that produced them; the public tools above are sufficient to reproduce the methodology on your own golden set. If you replicate or refute the findings, share the results — the discipline of independent measurement is exactly what this article is about.

Closing

Three frameworks earned the right to be in this article.

The task-size threshold rule, profile-task fit over profile size, and multi-axis MCP evaluation. Each was forced by the data and the retraction discipline we built around it. They're reusable beyond our specific stack.

The practical recommendations compress into a table you can apply this week. The honest limits give you a roadmap for what your own measurements should refine.

If you're running an agentic stack and you've never measured your profile choices against task class, the cost of not measuring scales with how many sessions you run per day. Even a small benchmark on your own golden set — five tasks, two profiles, one weekend — gives you better routing than picking by intuition.

Measuring isn't free. Not measuring isn't free either. The difference is which cost you pay knowingly.

Honest framing note. This article is written by the same person who built the cc-eval-runner harness and ran the measurements. That combination of roles is a known source of bias. The harness scaffolding, golden-task fixtures, and eval_rigor module live in the public repo so any reader can rerun the methodology and test the findings on their own infrastructure. Where I report "40-55% savings" or "67.9% completion rate," I'm reporting an effect measured on one harness against one golden set at the stated N. The retraction in this article is exactly that risk firing. Replicating with other golden sets, other harnesses, larger N, and ideally another evaluator is the work that turns these into general claims rather than my own claims about my own infrastructure.

Appendix A

Full per-profile data table

Profile       | n (completed) | Mean tokens_in | Mean tokens_out | Mean tokens_total | Median latency (s) | Completion rate
none (0 MCPs) | 21            | 88.5           | 1,921.8         | 2,010.3           | 24.4               | 80.8% (21/26)
core (6)      | 19            | 82.7           | 1,858.7         | 1,941.4           | 28.6               | 82.6% (19/23)
repo (9)      | 19            | 71.4           | 2,355.7         | 2,427.1           | 31.0               | 67.9% (19/28)
full (17)     | 17            | 46.5           | 1,998.5         | 2,044.9           | 30.1               | 73.9% (17/23)

n = completed task runs per profile. Mean tokens include only completed (non-timeout) runs. Completion rate = (completed) / (total attempts including timeouts and errors).

Appendix B

All six pairwise paired comparisons

Comparison   | n_paired | Mean Δtokens_total | Stdev | Direction
repo vs none | 16       | −57                | 863   | 7 saved / 9 used more
core vs none | 17       | +41                | 1,249 | 3 saved / 14 used more
full vs none | 16       | +112               | 788   | 6 saved / 10 used more
repo vs core | 14       | −63                | 706   | 10 saved / 4 used more (soft directional signal)
full vs core | 15       | +164               | 1,662 | 5 saved / 10 used more (full hurts)
full vs repo | 14       | +300               | 1,309 | 6 saved / 8 used more

All 95% CIs include zero. We publish the bounds and direction-split per the polarity-guard rule.
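As a quick sanity check on the table, a normal-approximation interval can be computed from the summary statistics alone. This is not how the published CIs were produced (those are bootstrap intervals on the raw deltas), so the bounds will differ somewhat; the sketch only shows that the same conclusion follows from the means and standard deviations printed above.

```python
from math import sqrt
from statistics import NormalDist

# (comparison, n_paired, mean delta, stdev) copied from the table above.
PAIRS = [
    ("repo vs none", 16, -57, 863),
    ("core vs none", 17, 41, 1249),
    ("full vs none", 16, 112, 788),
    ("repo vs core", 14, -63, 706),
    ("full vs core", 15, 164, 1662),
    ("full vs repo", 14, 300, 1309),
]

def normal_ci(mean_delta, stdev, n, confidence=0.95):
    """Normal-approximation CI for a paired mean difference from summary statistics."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half = z * stdev / sqrt(n)
    return mean_delta - half, mean_delta + half

for name, n, mean_delta, stdev in PAIRS:
    lo, hi = normal_ci(mean_delta, stdev, n)
    verdict = "includes zero" if lo <= 0 <= hi else "excludes zero"
    print(f"{name}: [{lo:+.0f}, {hi:+.0f}] {verdict}")
```

At n of 14 to 17 a t-based interval would be slightly wider than this one, so it would include zero as well.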

Appendix C

The polarity-guard rule

The decision tree, codified as the engineering rule that survived this article's near-miss:

1. Compute the 95% CI for the claim's quantity.
2. Are both "X is best" AND "X is worst" admissible under the CI?
   ├── YES → BLOCK: publish only the CI bound. No directional claim.
   └── NO  → Continue.
3. Does the CI cross zero (for diff claims) or 0.5 (for indicator claims)?
   ├── YES → Caveat heavily; consider re-running before publishing.
   └── NO  → Safe to publish with the bound stated.
4. Is the effect size large enough to be operationally meaningful?
   ├── NO  → Note as "statistically detectable but operationally negligible".
   └── YES → Publish.

Lives at .claude/rules/eval-discipline-polarity-guard.md in our engineering rule set. Mirrored to the other three agentic IDEs we run (Cursor, Codex, Windsurf) for cross-tool consistency.
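The tree translates into a short function. In this sketch the function name and return strings are illustrative, and step 2 is passed in as a boolean because deciding whether both "best" and "worst" are admissible requires comparing against the other arms' intervals, which happens outside this function.

```python
def polarity_guard(ci_lower, ci_upper, best_and_worst_admissible,
                   null_value=0.0, meaningful=True):
    """Walk steps 2 to 4 of the rule above and return the publishing verdict."""
    if ci_lower > ci_upper:
        raise ValueError("ci_lower must not exceed ci_upper")
    # Step 2: the interval cannot separate "X is best" from "X is worst".
    if best_and_worst_admissible:
        return "BLOCK: publish only the CI bound"
    # Step 3: the interval contains the no-effect value
    # (0 for difference claims, 0.5 for indicator claims).
    if ci_lower <= null_value <= ci_upper:
        return "CAVEAT: consider re-running before publishing"
    # Step 4: detectable, but is the effect large enough to matter operationally?
    if not meaningful:
        return "NOTE: statistically detectable but operationally negligible"
    return "PUBLISH: state the bound"

# The retracted N=9 completion-rate claim, with the CI from the text.
print(polarity_guard(0.57, 0.98, best_and_worst_admissible=True, null_value=0.5))
```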

Sources
