Action receipts measured: +80pp controlled, mixed prod

Q: Why not just compress the response?

Compression does not solve cache-key jitter. Two responses can carry identical content yet differ on a wrapper field such as timing or session ID, and that differing field is part of what the prompt cache hashes. Compression shrinks both responses but leaves the difference in place.

Q: Should I default-on the receipt path in my own stack?

Not without measurement. Run a controlled-jitter fixture against your stack, then test real production targets to bound wall-time overhead and confirm your noise pattern set catches actual jitter.

Q: How does the +80pp number compare to part 1's 75.5%?

They measure different things. Part 1 measures total-token reduction across a 12-task agent-loop benchmark, while this article measures cache-friendliness on one controlled tool-call fixture.

Q: Where do I get the public benchmark tools?

The public harness, fixtures, and canonical-bytes implementation live at github.com/g-shevchenko/mcp-token-savers.

Recap

What was the two-axis framework, in one paragraph?

If you read part 1, you can skip this section. An action receipt is a small, byte-stable JSON record of one browser action. It captures pre/post page state, the action itself, errors, and observability, in a schema that strips known jitter before hashing. The framework part 1 introduced measures any MCP stack on two axes. Byte saving: does the tool return fewer raw tokens than the alternative? Fewer tokens is better. Cache-friendliness: does the same semantic call return byte-identical output across runs? Higher is better. It is measured as the fraction of N identical-input calls whose canonical hash equals the modal hash. 100% means every call hits Anthropic's 5-minute prompt cache perfectly;⁴ lower means jitter is leaking and the cache rebuilds.

The piece that drew engagement after part 1 was the cache-friendliness axis. The 5-minute TTL is easy to miss in local testing. You don't notice it; you just pay 10× more tokens in production than you expected. Lakshman Turlapati's FSB project on GitHub validated the shape for browser MCPs.² And a Reddit commenter, pquattro, confirmed they'd seen the same pattern in self-hosted retrieval-layer caches: unordered filesystem traversals were costing them cache hits until they sorted the output. In their words, a one-liner worth a 2% perf hit.⁵

So the framework resonates. This piece is what happened when I tried to apply it to my own browser-MCP code.

The pattern

What is an action receipt, and why does it need a schema?

The browser-MCP layer in our scraper-stack is one endpoint, POST /interact. It takes a list of actions (goto, wait, click, type, and so on) plus a capture spec, runs them against a persistent Camoufox session, and returns the result. Before this work, the response was a fairly large object. Full HTML, all extracted text, links, attributes, screenshot bytes, plus a wrapper carrying engine, duration_ms, session_id, tier_attempts, and a few other fields that change every call.

That wrapper is the jitter source the cache-friendliness axis cares about. Even when the page is byte-stable, the wrapper isn't. The agent's next prompt contains the wrapper. So the cache misses on every retry.

The fix borrows from FSB's architectural shape: separate the agent loop into three distinct tools.

Role	What it does	Cache property
`read`	Get page state without side effects.	Cacheable by URL + args.
`act`	Single state-mutating action.	Rarely cacheable (each action is unique).
`verify`	Inspect post-action state via a small byte-stable receipt.	Cacheable by design — the receipt schema strips known jitter sources before hashing.

The receipt — scraper-mcp.act_receipt.v1 — looks like this:

{
  "schema_version": "scraper-mcp.act_receipt.v1",
  "action": {"type": "click", "selector": "button.submit"},
  "pre_state":  {"url": "...", "dom_region_hash": "sha256:abc..."},
  "post_state": {"url": "...", "dom_region_hash": "sha256:def...",
                 "changed": true, "stable": true, "navigated": false},
  "errors":     {"console": [], "network": [], "selector_not_found": false,
                 "timeout": false, "action_failed": null},
  "observability": {"tier_used": "camoufox", "duration_ms": 234,
                    "timing_breakdown": {"...": 0}}
}

The canonical-bytes algorithm strips the observability subtree entirely, along with the dom_region_size_bytes field from both states. Everything else is serialized as JSON with sorted keys and UTF-8 encoding, and the SHA-256 of those bytes becomes the cache key. Two semantically identical receipts therefore produce byte-identical canonical bytes, hash the same, and hit the prompt cache.
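The strip-then-hash step can be sketched in a few lines of Python. This is a minimal illustration, not the repo's implementation: the function names, the sample values, and the choice of compact separators with unescaped UTF-8 are assumptions, since the text above only specifies sorted keys and UTF-8 encoding.

```python
import copy
import hashlib
import json


def canonical_bytes(receipt: dict) -> bytes:
    """Serialize a receipt after removing the fields that jitter between calls."""
    stripped = copy.deepcopy(receipt)    # never mutate the caller's receipt
    stripped.pop("observability", None)  # timings and tier change on every call
    for state_key in ("pre_state", "post_state"):
        state = stripped.get(state_key)
        if isinstance(state, dict):
            state.pop("dom_region_size_bytes", None)
    # Assumption: compact separators and unescaped UTF-8 output.
    text = json.dumps(stripped, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def receipt_cache_key(receipt: dict) -> str:
    """SHA-256 hex digest of the canonical bytes (hypothetical helper name)."""
    return hashlib.sha256(canonical_bytes(receipt)).hexdigest()


def sample_receipt(duration_ms: int) -> dict:
    """Build a made-up receipt in which only the timing varies."""
    return {
        "schema_version": "scraper-mcp.act_receipt.v1",
        "action": {"type": "click", "selector": "button.submit"},
        "pre_state": {"url": "https://example.com/",
                      "dom_region_hash": "sha256:abc",
                      "dom_region_size_bytes": 1200},
        "post_state": {"url": "https://example.com/",
                       "dom_region_hash": "sha256:def",
                       "dom_region_size_bytes": 1200,
                       "changed": True, "stable": True, "navigated": False},
        "errors": {"console": [], "network": [], "selector_not_found": False,
                   "timeout": False, "action_failed": None},
        "observability": {"tier_used": "camoufox", "duration_ms": duration_ms},
    }


first = sample_receipt(duration_ms=234)
second = sample_receipt(duration_ms=911)
print(first == second)                                        # False
print(receipt_cache_key(first) == receipt_cache_key(second))  # True
```

The two receipts differ as Python objects because of duration_ms, but they share one cache key once the observability subtree is gone.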

I ported the algorithm to three runtimes: server Python (the FastAPI handler), client Python (the harness that builds fixtures), and JavaScript (the Node.js A/B runner). All three must agree on the byte representation, or the cache key drifts. The cross-runtime equivalence test runs three fixtures through all three implementations and asserts byte-identical output. It's the cheapest reliability gate the whole design depends on, and it caught two real bugs during development. Sorted-keys output differs subtly across Python json.dumps separators and JavaScript JSON.stringify defaults; I had to write a custom recursive stringifier.
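To see why the runtimes drift, compare Python's json.dumps defaults with the compact, unescaped shape that JavaScript's JSON.stringify emits (JSON.stringify also has no built-in key sorting, which is why a custom stringifier is needed on that side). The sketch below stays in Python and uses a made-up fixture; digests_agree is a hypothetical stand-in for the equivalence gate, not the repo's test.

```python
import hashlib
import json

fixture = {"b": [1, 2], "a": "café"}  # made-up fixture with a non-ASCII value


def python_default(obj) -> bytes:
    # json.dumps defaults: ", " and ": " separators, non-ASCII escaped.
    return json.dumps(obj, sort_keys=True).encode("utf-8")


def compact_utf8(obj) -> bytes:
    # Compact separators and raw UTF-8, the shape JSON.stringify produces.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def digests_agree(serializers, obj) -> bool:
    """True when every serializer yields the same SHA-256 for obj."""
    digests = {hashlib.sha256(s(obj)).hexdigest() for s in serializers}
    return len(digests) == 1


print(python_default(fixture).decode("utf-8"))  # {"a": "caf\u00e9", "b": [1, 2]}
print(compact_utf8(fixture).decode("utf-8"))    # {"a":"café","b":[1,2]}
print(digests_agree([python_default, compact_utf8], fixture))  # False
print(digests_agree([compact_utf8, compact_utf8], fixture))    # True
```

Both outputs are valid JSON for the same object, yet they differ in whitespace and escaping, so their digests differ. A gate of this shape only passes once every runtime pins the same separators and escaping rules.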

Measurement · Scenario 1

AB1 — does the receipt break anything on a stable target?

The harness fires the same scenario twice per run: once with capture.act_receipt=false (control — the existing behavior, capture as before) and once with capture.act_receipt=true (treatment — additional receipt path, hash the receipt instead of the raw response). N runs per arm, cold sessions, identical inputs.

The first scenario points at iana.org/help/example-domains, an RFC 2606 reference page that doesn't change minute-to-minute. I picked it because part 1's baseline showed it was 100% cache-friendly on the simpler /fetch endpoint. If the design adds nothing on a target where the existing strip is already perfect, that's fine. What I needed to rule out was regression.

Arm	N	Score	Unique hashes
Control	5	100.0%	1
Treatment	5	100.0%	1
Delta		+0pp

No regression. Both arms tied at the ceiling. The receipt path doesn't break anything on already-cache-friendly content. That's the answer AB1 was designed to give. What it doesn't show, and couldn't on this target, is whether the receipt adds anything when jitter is actually present. That needed a different scenario.

Measurement · Scenario 2

AB2 — does the receipt work on controlled jitter?

I added a small HTTP endpoint to scraper-core specifically for this measurement, GET /test/jitter. It returns a minimal HTML document whose only varying element across requests is a <div data-time="<milliseconds-since-epoch>"> attribute. Every other byte is identical.

The data-time attribute lives in the receipt's DOM_NOISE_ATTR_PATTERNS set, the regex list that dom_region_hash strips before SHA-256. So this fixture isolates exactly one variable: does the strip actually work?

Arm	N	Score	Unique hashes
Control (capture.full_html=true)	5	20.0%	5
Treatment (capture.act_receipt=true)	5	100.0%	1
Delta		+80pp

The control arm captures full_html=true, so the raw HTML, including the varying data-time bytes, goes into the response. Five identical requests, five distinct hashes. The treatment arm gets the receipt, whose dom_region_hash strips data-time per noise pattern. Five identical requests, one stable hash.
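The difference between the two arms can be reproduced offline. The sketch below is an assumption-heavy stand-in: NOISE_ATTR_PATTERNS is a hypothetical one-entry version of DOM_NOISE_ATTR_PATTERNS, jitter_page imitates the /test/jitter document, and regex-based stripping is a simplification of whatever the real dom_region_hash does.

```python
import hashlib
import re

# Hypothetical one-entry stand-in for the receipt's noise pattern set.
NOISE_ATTR_PATTERNS = [
    re.compile(r'\sdata-time="[^"]*"'),
]


def raw_hash(html: str) -> str:
    """Control arm: hash the HTML exactly as returned."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def region_hash(html: str) -> str:
    """Treatment arm: strip known noise attributes, then hash."""
    for pattern in NOISE_ATTR_PATTERNS:
        html = pattern.sub("", html)
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def jitter_page(epoch_ms: int) -> str:
    """Imitation of the fixture: only the data-time value varies."""
    return f'<html><body><div data-time="{epoch_ms}">hello</div></body></html>'


pages = [jitter_page(1_700_000_000_000 + i) for i in range(5)]
print(len({raw_hash(p) for p in pages}))     # 5
print(len({region_hash(p) for p in pages}))  # 1
```

Five distinct raw hashes against one stripped hash is the same 20% versus 100% shape as the table above.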

This is the cleanest possible measurement of the design value. Where its noise patterns hit, it works. The cross-runtime invariant held: all three runtimes hashed identically across the five runs.

Measurement · Scenario 3

AB3 — does the receipt work on a real public target?

The honest question AB2 doesn't answer: do real-world pages have jitter that the current pattern set actually catches? Or are real-world pages full of text-content jitter (visible timestamps, vote counts, dynamic IDs) that the data-* attribute-only patterns miss?

I ran AB3 against news.ycombinator.com. A real public site I don't operate, no anti-bot wall, with predictable but multi-source jitter ("X hours ago" text in spans, vote counts, occasional story rotation on the front page). N=20 cold Camoufox sessions per arm.

Arm	N	Score (modal)	Unique hashes	Mean wall	p95 wall
Control	20	75%	3	8.1s	14.1s
Treatment	20	50%	2	11.1s	21.8s
Delta		−25pp (modal)	−1 (better)	+3.0s	+7.7s

The modal-fraction metric reads as a 25-point regression. The unique-hash count reads as a 1-hash improvement. They disagree because treatment's distribution is bimodal at this N. HN had one substantive content change between runs 10 and 11: a story slipped off the front page, and dom_region_size_bytes dropped from 34,685 to 34,670, a 15-byte difference. Treatment captured that one true change cleanly, splitting 10 + 10 across two hashes. Control's smaller per-run jitter (vote counts, "X minutes ago" text) split 15 + 4 + 1 across three hashes, with most runs clustering on the modal hash. The modal-fraction metric rewards clustering regardless of unique count.

If we report by unique-hash count: control 1 − 3/20 = 85%; treatment 1 − 2/20 = 90%. Delta +5pp. If we report by modal fraction: control 75%; treatment 50%. Delta −25pp. Same data, opposite conclusions. The honest answer is both metrics are partial and need to be reported together, or N needs to grow until modal-fraction stabilizes (probably N ≥ 50 for jitter as continuous as HN's).
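Both metrics are a few lines each, which makes it easy to report them side by side. The sketch below uses placeholder hash labels arranged in the splits reported above (15 + 4 + 1 for control, 10 + 10 for treatment); the function names are illustrative, not the harness's.

```python
from collections import Counter


def modal_fraction(hashes):
    """Share of runs whose hash equals the most common hash."""
    if not hashes:
        raise ValueError("need at least one run")
    return Counter(hashes).most_common(1)[0][1] / len(hashes)


def unique_hash_score(hashes):
    """1 - unique/N, the unique-count score used in the text above."""
    if not hashes:
        raise ValueError("need at least one run")
    return 1 - len(set(hashes)) / len(hashes)


# Placeholder labels; only the split between hashes matters.
control = ["c1"] * 15 + ["c2"] * 4 + ["c3"] * 1
treatment = ["t1"] * 10 + ["t2"] * 10

for name, runs in (("control", control), ("treatment", treatment)):
    print(f"{name}: modal={modal_fraction(runs):.0%} "
          f"unique={unique_hash_score(runs):.0%}")
# control: modal=75% unique=85%
# treatment: modal=50% unique=90%
```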

And the wall-time. The receipt path adds two extra page.evaluate("body.outerHTML") calls per /interact request, one before the action sequence and one after. On the HN page (about 35 KB of body HTML), that's ~3 seconds mean and up to ~8 seconds at p95. Hash computation is microseconds. The cost is the browser serializing 35 KB of DOM twice.

On a Hermes scanner running 100 calls per day, that's an extra 5 minutes of wall-time per day. Manageable. On a hot path running thousands of calls per day, that's an hour or more. Not free.
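The arithmetic behind those figures, as a small sketch. The 3.0-second default is the mean overhead measured on HN above; the call volumes other than 100 per day are illustrative.

```python
def extra_minutes_per_day(calls_per_day: int,
                          added_seconds_per_call: float = 3.0) -> float:
    """Added wall-time per day, in minutes, at a given mean overhead."""
    return calls_per_day * added_seconds_per_call / 60


for calls in (100, 1200, 5000):
    print(calls, extra_minutes_per_day(calls))
# 100 5.0
# 1200 60.0
# 5000 250.0
```

At the mean overhead, the one-hour mark arrives at about 1,200 calls per day. Budgeting against the p95 figure instead would bring it sooner.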

The story

How did a +77.8pp delta on a real target turn out to be a measurement artifact?

The first AB3 run didn't return the numbers above. It returned this:

Control:   22.2% (n=18/20, unique_hashes=6)
Treatment: 100.0% (n=18/20, unique_hashes=1)
Delta:     +77.8pp

That's a beautiful number on a real prod target. It would have been the headline of this article. It's also wrong.

I wrote the AB3 scenario with act_receipt_region: "table.itemlist", a CSS selector I'd lifted from a stale Hacker News HTML reference. Modern HN uses table#hnmain for its outer table and tr.athing for story rows. There is no .itemlist anywhere. When the receipt's dom_region_hash function called document.querySelector("table.itemlist"), the result was null, and the function returned the hash of an empty string.

Every one of the 18 successful treatment runs hashed the same empty-region sentinel. Treatment "100% byte-stable" meant "stably measures nothing." The control arm captures full_html=true regardless of selector. So it saw the real page jitter and registered six distinct hashes. The +77.8pp delta was the gap between a meaningful real measurement and a meaningless empty one.

I caught it the same way I'd want anyone reviewing this article's numbers to catch them: by deterministically reproducing the result outside the system that produced it.

from scraper_core.act_receipt_assembly import build_act_receipt

empty_receipt = build_act_receipt(
    pre_html="",   # selector missed → empty string
    post_html="",  # selector missed → empty string
    pre_url="about:blank",
    post_url="https://news.ycombinator.com/",
    # ...other fields default
)
empty_receipt.canonical_sha256()
# → "a753a0fd8b1fcaeda0d002aaf5b088812a9672700ba97eb7f186b4bd5172d621"

The observed treatment hash across all 18 runs: a753a0fd8b1fcaeda0d002aaf5b088812a9672700ba97eb7f186b4bd5172d621. Byte-identical match.

I'm the author of this design and the author of these measurements. When those two roles overlap, you don't get to grade your own output without a deterministic re-verification step. The Python reproduction is that step.

Without it I would have shipped this article with a fake +77.8pp number. The most damaging part is I'd have believed it. The number had survived the unit tests, the cross-runtime equivalence check, and a smoke run on a different target. Three green signals, one false headline.

Why did each green signal miss? Unit tests were green because the receipt assembly was correct given the inputs. The cross-runtime check was green because all three runtimes hash empty strings to the same canonical bytes. The smoke target wasn't HN. The bug lived in the seam between scenario configuration and the real DOM, in a place no isolated test would touch.

I shipped a permanent guard. The harness now records post_region_size_bytes per run and flags any treatment arm where all successful runs fall below a small threshold (currently 200 bytes — well below the smallest plausible real DOM region):

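SELECTOR_MISS_THRESHOLD_BYTES = 200  # well below the smallest plausible real DOM region


def detect_selector_miss_artifact(treatment_runs):
    """Return (is_artifact, reason) for a treatment arm's successful runs."""
    # Ignore runs that did not record a region size.
    sizes = [r["post_region_size_bytes"] for r in treatment_runs
             if r["post_region_size_bytes"] is not None]
    if not sizes:
        return False, None
    # Every recorded region is tiny: the selector most likely matched nothing.
    if max(sizes) < SELECTOR_MISS_THRESHOLD_BYTES:
        return True, "selector likely missed; treatment is hashing empty content"
    return False, None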

When this fires, the harness writes INVALID MEASUREMENT to its output and refuses to compute a delta. Future selector misses cannot silently produce false positives.
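A sketch of how a report step might refuse the delta. Everything here is illustrative: report_delta, modal_score, and the "hash" field are hypothetical names, the runs are synthetic but shaped like the first AB3 run (18 successful runs per arm, six control hashes, one treatment hash), and a region size of 0 for a missed selector is an assumption.

```python
from collections import Counter

MIN_REGION_BYTES = 200  # same threshold as the guard above


def modal_score(runs):
    """Modal-hash share of the runs, in percent."""
    counts = Counter(r["hash"] for r in runs)
    return 100.0 * counts.most_common(1)[0][1] / len(runs)


def report_delta(control_runs, treatment_runs):
    """Return a delta line, or an INVALID MEASUREMENT line if the guard fires."""
    sizes = [r["post_region_size_bytes"] for r in treatment_runs
             if r["post_region_size_bytes"] is not None]
    if sizes and max(sizes) < MIN_REGION_BYTES:
        return ("INVALID MEASUREMENT: selector likely missed; "
                "treatment is hashing empty content")
    delta = modal_score(treatment_runs) - modal_score(control_runs)
    return f"Delta: {delta:+.1f}pp"


# Synthetic control arm: 18 runs over six hashes, modal hash seen 4 times.
control = [{"hash": f"h{i}", "post_region_size_bytes": None}
           for i, n in enumerate((4, 4, 3, 3, 2, 2)) for _ in range(n)]
# Synthetic treatment arm: 18 runs, one hash, empty region every time.
treatment = [{"hash": "empty", "post_region_size_bytes": 0} for _ in range(18)]

print(report_delta(control, treatment))
# INVALID MEASUREMENT: selector likely missed; treatment is hashing empty content
```

Without the size check, the same inputs would print a +77.8pp delta, which is exactly the false headline the guard exists to block.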

I also re-ran the AB2 fixture through the same Python repro: the observed treatment hash 51e0d228... does not match the empty-region sentinel for the /test/jitter URL (390419e7...). AB2 measured real <body> content. The +80pp result stands.

The engineering call

Why is the default-on flag staying off?

Production changes require measurement evidence, not theoretical support. Going into this work, I expected to flip the receipt's server-side feature flag to default-on once the design was proven. The data, taken honestly, doesn't support that flip.

Three factors:

Mechanism is proven. AB1 invariant + AB2 controlled gain show the design is correct.
Real-prod gain is modest at current pattern coverage. Treatment had one fewer unique hash than control on HN at N=20. That's real but not the +80pp the controlled fixture promises.
Wall-time overhead is substantial: +3.0 seconds on the mean (+37%) and +7.7 seconds at p95. Not free.

So the receipt is an opt-in discipline, not a universal one. The flag (SCRAPER_ACT_RECEIPT_ENABLED) stays off by default. Callers who know their target has jitter the receipt's noise patterns will catch — and who care about cache key stability for multi-turn flows — opt in per request. Callers running hot single-shot loops don't, because the overhead isn't worth it.

This is less satisfying than "we flipped a flag and saved 80% of tokens." It's the right answer given the data.

Where companies go wrong

Three patterns explain why the receipt design is correct in principle yet narrow in default coverage.

Schema-bounded jitter strip beats ad-hoc cleanup. Most cache-friendliness issues in MCP responses come from response wrappers carrying transient fields (timing, IDs, tier metadata). Where teams typically go wrong: they .delete() fields ad-hoc in each caller, drift across callers, then wonder why cache hit rate is uneven. A schema with an explicit "strip these subtrees" contract gives a single source of truth.

Cross-runtime byte equivalence is non-negotiable. If your server builds the cache key one way and your client builds it another (whitespace, key order, separators), the cache misses forever and you don't notice. The cheap fix is a shared fixture file plus a per-runtime test that hashes the fixture and asserts the same hex digest. Teams that skip this gate spend weeks chasing "the cache works in dev, not in prod" mysteries.

The author can't validate their own benchmark alone. When the same person designs the measurement and runs it, an independent verification step is required. The deterministic-repro pattern (build the failing receipt by hand, hash it, compare bytes to the observed output) is the cheapest such step. Teams that skip this ship false positives like the +77.8pp one above.

What's next

How do per-target noise patterns extend this design?

The piece AB3 made obvious is that the receipt's default noise pattern set is narrow. It strips data-time, data-timestamp, data-frame, data-focused attributes. Those are common on JS-heavy SPAs (the original FSB targets) but not the dominant jitter source on a server-rendered page like HN. HN's jitter is in the visible text — "5 minutes ago" inside a <span class="age">, vote counts in <span class="score">.

The natural next layer is per-target noise pattern extension. A site_guide.v1 schema that says "for news.ycombinator.com, additionally strip the text content of any <span class="age">; for gemini.google.com, additionally strip the streaming-cursor positions." Per-target measurement, per-target gain. This is what FSB does implicitly through its per-site action recipes; making it a structured schema lets the receipt's canonical-bytes contract extend cleanly.
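What such a guide could look like, as a purely hypothetical sketch: site_guide.v1 does not exist yet, so the dictionary layout, the strip_text_of key, and the regex-based stripping below are all assumptions made for illustration. The HTML snippets are simplified stand-ins, not real HN markup.

```python
import re

# Hypothetical per-target guide; the schema is not defined anywhere yet.
SITE_GUIDES = {
    "news.ycombinator.com": {
        "schema_version": "site_guide.v1",
        # (tag, class) pairs whose text content is treated as noise
        "strip_text_of": [("span", "age")],
    },
}


def strip_site_noise(host: str, html: str) -> str:
    """Blank the text of elements the host's guide marks as noisy."""
    guide = SITE_GUIDES.get(host)
    if guide is None:
        return html  # no guide: leave the region untouched
    for tag, css_class in guide["strip_text_of"]:
        pattern = re.compile(
            rf'(<{tag}\b[^>]*\bclass="{re.escape(css_class)}"[^>]*>)'
            rf'.*?(</{tag}>)',
            re.DOTALL,
        )
        html = pattern.sub(r"\1\2", html)
    return html


earlier = '<td><span class="age">5 minutes ago</span></td>'
later = '<td><span class="age">7 minutes ago</span></td>'
host = "news.ycombinator.com"
print(strip_site_noise(host, earlier) == strip_site_noise(host, later))  # True
print(strip_site_noise(host, earlier))  # <td><span class="age"></span></td>
```

A real implementation would more likely walk the DOM in the browser than run regexes over serialized HTML. The point is only that the strip list becomes data keyed by host, so the canonical-bytes contract stays the same.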

I audited the last 30 days of /interact traffic on the scraper-core host. Three targets clear the volume threshold for per-site work (≥ 50 requests per month), and two of them also escalate to a slow tier: zhihu.com (117 hits, 100% on the slow Patchright tier, plus a separately worrying 77% error rate that's a reliability problem regardless of cache-friendliness) and reddit.com (80 hits, 100% on Camoufox, no errors but slow). The third, humanswith.ai (60 hits), is already 92% on the cheap firstparty tier, so I'm skipping it.

The phase-2 build is queued. I'll write about it when there's data, not before.

Open questions

What still don't I have answers for?

Multi-turn cache compounding. pquattro's question in the v1 thread: does fixing cache-friendliness in tool call N actually compound into measurable savings at turn N+1, N+2, N+k? I have a paired-session harness that logs Anthropic's cache_creation_input_tokens and cache_read_input_tokens per turn, but at the current N (paired ~17 Codex + 9 Claude Code on a 60-task golden set) the confidence intervals are too wide to assert a tight effect size. N ≥ 60 is queued.
Per-MCP cache-output-survival attribution. Anthropic's API returns aggregate cached-vs-creation token counts, not per-content-block survival. So "did the vision-mcp's prep output specifically survive 3 turns and save K tokens downstream" isn't directly observable. The cleanest approach I can see is surrogate logging (track per-turn prompt overlap in the harness) or one-MCP-at-a-time ablation. I've done the bundle-level version of the second. The within-bundle version is still TBD.
Metric stability at small N. AB3 surfaced this: modal-fraction and unique-hash-count can disagree under bimodal real-world distributions at N=20. The honest fix is N ≥ 50 + reporting both metrics. The cheaper fix is to graph the empirical distribution and let readers see the shape. I'll do both in the next iteration.
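For the multi-turn compounding question, the per-turn signal can be read from the usage counters named above. The sketch below works on plain dictionaries with made-up token counts and assumes total input per turn is the sum of input_tokens, cache_creation_input_tokens, and cache_read_input_tokens; the function name is hypothetical and no API call is made.

```python
def cache_read_share(turn_usages):
    """Per-turn fraction of input tokens that were served from the cache."""
    shares = []
    for usage in turn_usages:
        read = usage.get("cache_read_input_tokens", 0)
        created = usage.get("cache_creation_input_tokens", 0)
        fresh = usage.get("input_tokens", 0)
        total = read + created + fresh
        shares.append(read / total if total else 0.0)
    return shares


# Made-up usage records for a three-turn session.
session = [
    {"input_tokens": 200, "cache_creation_input_tokens": 1800,
     "cache_read_input_tokens": 0},
    {"input_tokens": 200, "cache_creation_input_tokens": 300,
     "cache_read_input_tokens": 1500},
    {"input_tokens": 200, "cache_creation_input_tokens": 0,
     "cache_read_input_tokens": 1800},
]
print(cache_read_share(session))  # [0.0, 0.75, 0.9]
```

If a byte-stable receipt compounds, paired sessions should show this share climbing faster in the treatment arm. Whether it does at a useful effect size is exactly the open question.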

FAQ

Frequently asked questions

What is an action receipt, in one sentence?

A small, byte-stable JSON record of one browser action that captures pre/post page state, the action, errors, and observability in a schema that strips known jitter before hashing, so identical semantic actions produce identical cache keys.

Why not just compress the response?

Compression doesn't solve cache-key jitter. Two responses can carry identical content yet differ on a wrapper field (timing, session ID), and that differing field is part of what the prompt cache hashes. Compression shrinks both responses but leaves the difference in place. The receipt's contribution is structural: strip the jitter, then hash, then cache.

Is this only for browser MCPs?

The pattern generalizes to any MCP whose response carries transient fields the LLM doesn't need but the cache hashes. Retrieval MCPs, scraper MCPs, anything that talks to live systems. FSB was the first one to formalize it for browser actions; the schema in this article is one possible canonical form.²

Should I default-on the receipt path in my own stack?

Not without measurement. Run the controlled-jitter fixture (/test/jitter) against your stack to confirm the cross-runtime invariant holds. Then run against your real prod targets to bound wall-time overhead and confirm the noise pattern set catches your actual jitter. If real-prod gain is modest, keep it opt-in.

How does the +80pp number compare to part 1's 75.5%?

They measure different things. Part 1's 75.5% is total-token reduction across a 12-task agent-loop benchmark (raw bytes saved + cache hits combined).¹ This article's +80pp is cache-friendliness specifically, measured on one tool call against one controlled fixture.³ The numbers aren't comparable; they're complementary.

Where do I get the public benchmark tools?

github.com/g-shevchenko/mcp-token-savers. Harness, fixtures, canonical-bytes implementation in JavaScript, all reproducible.³

Sources

Which sources back this research page?

[1] Gregory Shevchenko — Code Execution with MCP: when mcp-token-savers reduce AI agent loop cost

The first part of this research series; use it for the 12-task loop benchmark, the 75.5% total-token reduction context, and the two-axis MCP evaluation frame.

gregshevchenko.com/research/mcp-stack-token-economy/

[2] Lakshman Turlapati — FSB browser-agent project

The browser-MCP design reference that shaped the action-receipt approach discussed here.

github.com/LakshmanTurlapati/FSB

[3] g-shevchenko — mcp-token-savers public benchmark repo

The public reproducibility surface for the harness, fixtures, canonical-byte implementation, and issue tracking.

github.com/g-shevchenko/mcp-token-savers

[4] Anthropic documentation — Prompt caching

The product documentation behind the cache behavior this research tries to preserve.

docs.claude.com/en/docs/build-with-claude/prompt-caching

[5] Reddit discussion — r/ClaudeAI thread on part 1

The community discussion that surfaced the repeated-cache-output concern and helped motivate this second measurement pass.

reddit.com/r/ClaudeAI/comments/1tn6cey/

What did this work actually change?

I started this work expecting to publish a "+80pp on real prod" headline. I'm publishing a "the design works where its noise patterns hit; here's what the data actually shows and here's the bug that almost made it through" piece instead. The piece is less attention-getting and more useful, which is a reasonable trade.

The public benchmark tools — the harness, the fixtures, the canonical-bytes implementation in JavaScript — live at github.com/g-shevchenko/mcp-token-savers. Run them on your own stack. If the receipt design doesn't hit on your targets, the path forward is per-target noise patterns, not a wider universal default.

Honest framing note. This article is written by the same person who proposed the receipt design and ran the measurements. That combination of roles is a known source of bias. The deterministic Python reproduction in the artifact section is the cheapest blind-validation step I had available. The published harness and fixtures at the repo above let anyone else reproduce the AB1, AB2, AB3 measurements on their own infrastructure — that's the second layer of validation, and the one I'd actually trust if you replicated my numbers. Where I report "+80pp" or "+3 seconds" or "+5pp by unique count," I'm reporting an effect measured on one target at one N. I've tried to make the limits explicit. Replicating with other targets, larger N, and ideally another evaluator is the work that turns these into general claims rather than my own claims about my own code.

Discussion: the r/ClaudeAI thread on part 1 is where this work began. The GitHub.com issues are where reproductions and corrections land.

Republished on Medium

Read and share the Medium.com version

Discuss on LinkedIn

Read the LinkedIn.com cross-post and join the thread

We measured our own scraper-stack. The receipt design works on controlled jitter, but real prod is harder.

What you'll learn

What was the two-axis framework, in one paragraph?

What is an action receipt, and why does it need a schema?

AB1 — does the receipt break anything on a stable target?

AB2 — does the receipt work on controlled jitter?

AB3 — does the receipt work on a real public target?

How did a +77.8pp delta on a real target turn out to be a measurement artifact?

Why is the default-on flag staying off?

Where companies go wrong

How do per-target noise patterns extend this design?

What still don't I have answers for?

Frequently asked questions

What is an action receipt, in one sentence?

Why not just compress the response?

Is this only for browser MCPs?

Should I default-on the receipt path in my own stack?

How does the +80pp number compare to part 1's 75.5%?

Where do I get the public benchmark tools?

Which sources back this research page?

What did this work actually change?
