Original Title: Why My OpenClaw Sessions Burned 21.5M Tokens in a Day (And What Actually Fixed It)
Original Author: MOSHIII
Translation: Peggy, BlockBeats
Editor’s Note: As Agent applications rapidly proliferate, many teams have noticed a seemingly paradoxical phenomenon: the system runs fine, yet token costs keep climbing unnoticed. This article analyzes a real OpenClaw workload and finds that cost explosions often stem not from user inputs or model outputs, but from overlooked context cache replay: the model re-reads a massive historical context on every call, driving enormous token consumption.
The article combines specific session data to demonstrate how tool outputs, browser snapshots, JSON logs, and other large intermediate products are constantly written into historical context and repeatedly read in the agent loop.
Through this case, the author lays out a clear optimization approach spanning context structure design, tool output management, and compaction configuration. For developers building Agent systems, this is both a technical troubleshooting record and a money-saving playbook.
The following is the original text:
I analyzed a real OpenClaw workload and discovered a pattern that I believe many Agent users would recognize:
Token usage looks very "active"
Responses also appear quite normal
But token consumption suddenly skyrockets
Below are the structural breakdown, root causes, and feasible repair paths from this analysis.
TL;DR
The biggest cost driver is not that user messages are too long. Rather, it's the massive cached prefix being repeatedly replayed.
From session data:
Total tokens: 21,543,714
cacheRead: 17,105,970 (79.40%)
input: 4,345,264 (20.17%)
output: 92,480 (0.43%)
In other words: the cost of most calls comes not from processing new user intent, but from repeatedly re-reading a vast historical context.
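As a quick sanity check, the breakdown above can be reproduced with a few lines of arithmetic (the figures are copied straight from the session data reported above):

```python
# Token breakdown reported for the analyzed day (numbers from the session data).
total = 21_543_714
parts = {"cacheRead": 17_105_970, "input": 4_345_264, "output": 92_480}

# The three components account for the full total, and cacheRead dominates.
assert sum(parts.values()) == total
shares = {name: tokens / total for name, tokens in parts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

Running this prints cacheRead at roughly 79.40%, matching the breakdown: nearly four out of every five tokens billed were prefix re-reads, not new work.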
"Wait, how could this happen?" moment
I initially thought the high token usage came from: very long user prompts, a large amount of output generation, or expensive tool calls.
But the real dominant pattern is:
input: hundreds to thousands of tokens
cacheRead: 170,000 to 180,000 tokens per call
That is to say, the model re-reads the same massive, stable prefix on every round.
Data Scope
I analyzed data from two levels:
1. Runtime logs
2. Session transcripts
It should be noted that:
Runtime logs are primarily used to observe behavior signals (such as restarts, errors, configuration issues)
Accurate token statistics come from the usage field in session JSONL
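A minimal version of that aggregation might look like the following sketch. It assumes each JSONL line is an event that may carry a usage object with input, output, and cacheRead counts; the exact field names are an assumption modeled on the breakdown in this article, not a documented OpenClaw schema.

```python
import json
from collections import Counter

def aggregate_usage(jsonl_text):
    """Sum token-usage fields across all events in a session JSONL string.

    Assumes each line is a JSON object that may carry a `usage` dict with
    numeric `input`, `output`, and `cacheRead` fields (field names are an
    assumption modeled on the article's breakdown).
    """
    totals = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        usage = event.get("usage")
        if not usage:
            continue
        for key in ("input", "output", "cacheRead"):
            totals[key] += usage.get(key, 0)
    totals["total"] = sum(totals.values())
    return dict(totals)
```

Pointing a loop like this at every session file gives per-session totals that can then be ranked, which is essentially what the breakdown script below produces.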
Scripts used:
scripts/session_token_breakdown.py
scripts/session_duplicate_waste_analysis.py
Generated analysis files:
tmp/session_token_stats_v2.txt
tmp/session_token_stats_v2.json
tmp/session_duplicate_waste.txt
tmp/session_duplicate_waste.json
tmp/session_duplicate_waste.png
Where is the actual token consumption?
1) Session Concentration
There is one session that consumed far more than the others:
570587c3-dc42-47e4-9dd4-985c2a50af86: 19,204,645 tokens
Then there is a significant drop:
ef42abbb-d8a1-48d8-9924-2f869dea6d4a: 1,505,038
ea880b13-f97f-4d45-ba8c-a236cf6f2bb5: 649,584
2) Behavior Concentration
Tokens primarily come from:
toolUse: 16,372,294
stop: 5,171,420
This indicates that the problem lies mainly in tool-call loops, not in ordinary conversation.
3) Time Concentration
Token peaks are not random but are concentrated in several time periods:
2026-03-08 16:00: 4,105,105
2026-03-08 09:00: 4,036,070
2026-03-08 07:00: 2,793,648
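Spotting this kind of hourly concentration only takes a simple bucketing pass. A sketch, assuming per-call records reduced to (timestamp, total tokens) pairs (a simplified shape for illustration, not OpenClaw's actual log format):

```python
from collections import defaultdict
from datetime import datetime

def tokens_by_hour(calls):
    """Bucket per-call token totals into hourly bins so peaks stand out.

    `calls` is a list of (iso_timestamp, total_tokens) pairs -- a simplified
    shape assumed for illustration.
    """
    buckets = defaultdict(int)
    for ts, tokens in calls:
        hour = datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00")
        buckets[hour] += tokens
    # Highest-volume hours first.
    return sorted(buckets.items(), key=lambda kv: kv[1], reverse=True)
```

If the top few buckets dwarf the rest, as they do here, the cost spike is tied to specific work sessions rather than background noise.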
What's inside the massive cached prefix?
It is not the dialogue content, but mainly large intermediate products:
Huge toolResult data blocks
Long reasoning/thinking traces
Large JSON snapshots
File lists
Browser scraping data
Dialogue records of sub-Agents
In the largest session, the character counts are approximately:
toolResult:text: 366,469 characters
assistant:thinking: 331,494 characters
assistant:toolCall: 53,039 characters
Once these contents are retained in the historical context, subsequent calls may read them again through the cache prefix.
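Per-category character counts like the ones above can be tallied with a short pass over the transcript. The message shape below ({'role', 'blocks': [...]}) is a simplified assumption for illustration, not OpenClaw's actual transcript schema:

```python
def char_counts_by_block(messages):
    """Tally character volume per role:block-type pair, mirroring the
    per-category counts discussed above.

    `messages` uses a simplified shape assumed for illustration:
    [{"role": ..., "blocks": [{"type": ..., "text": ...}, ...]}, ...]
    """
    counts = {}
    for msg in messages:
        for block in msg.get("blocks", []):
            key = f"{msg['role']}:{block['type']}"
            counts[key] = counts.get(key, 0) + len(block.get("text", ""))
    return counts
```

Whichever categories dominate this tally are the ones worth targeting first with truncation or artifacting.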
Specific Example (from session files)
Large chunks of context repeatedly appeared in the following locations:
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:70
Large gateway JSON logs (about 37,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:134
Browser snapshots + Secure encapsulation (about 29,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:219
Huge file list outputs (about 41,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:311
session/status state snapshots + large prompt structures (about 30,000 characters)
"Duplicate Content Waste" vs "Cache Replay Burden"
I also measured the proportion of duplicate content within single calls:
Duplicate ratio approximately: 1.72%
There is indeed some, but it is not the main issue.
The real problem is the absolute size of the cached prefix.
The structure is: a huge historical context, re-read on every call, with only a small amount of new input layered on top.
Thus, the focus of optimization should not be deduplication, but context structure design.
Why is this problem particularly easy to occur in Agent loops?
Three mechanisms overlap:
1. Large tool outputs are written into the historical context
2. Tool loops generate many calls in quick succession
3. The prefix barely changes, so the cache re-reads it on every call
If context compaction is not triggered stably, the problem will amplify rapidly.
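The amplification is easy to quantify with a back-of-the-envelope model: a loop making N calls over a prefix of P tokens replays roughly N × P read tokens, no matter how small the new input is. A sketch, with illustrative numbers chosen to resemble the per-call figures above:

```python
def replay_cost(prefix_tokens, calls, new_input_per_call):
    """Back-of-the-envelope agent-loop cost model: every call re-reads the
    full prefix, so replayed tokens scale with prefix size times call count,
    while fresh input stays a sliver of the bill."""
    cache_read = prefix_tokens * calls
    fresh = new_input_per_call * calls
    return {
        "cacheRead": cache_read,
        "input": fresh,
        "cacheRead_share": cache_read / (cache_read + fresh),
    }

# A ~175k-token prefix replayed over 100 tool-loop calls, ~1k new tokens each.
cost = replay_cost(175_000, 100, 1_000)
```

With these inputs the model yields 17.5M replayed tokens against 100k of fresh input, a cacheRead share above 99% before output tokens are even counted, which is the same order of magnitude as the session analyzed here.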
Most Important Repair Strategies (sorted by impact)
P0—Do not shove large tool outputs into long-term context
For super-large tool outputs:
· Keep summaries + reference paths/IDs
· Write the original payload into file artifacts
· Do not keep full originals in chat history
Prioritize the limitation of these categories:
· Large JSON
· Long directory lists
· Full browser snapshots
· Full transcripts of sub-Agents
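One hedged sketch of the P0 pattern: spill oversized tool outputs to a file artifact and keep only a short head plus a reference path in the chat history. The size threshold, artifact directory, and naming scheme below are illustrative choices, not OpenClaw settings:

```python
import hashlib
import os

def artifact_tool_output(output, threshold=4_000, artifact_dir="tmp/artifacts"):
    """Keep small tool outputs inline; spill oversized ones to a file and
    return only a summary line plus the first 500 chars. Threshold and
    directory are illustrative, not OpenClaw defaults."""
    if len(output) <= threshold:
        return output
    os.makedirs(artifact_dir, exist_ok=True)
    digest = hashlib.sha256(output.encode("utf-8")).hexdigest()[:12]
    path = os.path.join(artifact_dir, f"{digest}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(output)  # full payload lives on disk, not in context
    return (f"[tool output truncated: {len(output)} chars; "
            f"full payload at {path}]\n{output[:500]}")
```

The agent can always re-open the artifact by path if it genuinely needs the full payload, but the payload no longer rides along in every subsequent call.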
P1—Ensure the compaction mechanism is truly effective
In this dataset, configuration compatibility problems surfaced repeatedly as "invalid compaction key" errors.
Errors like this can quietly disable the optimization mechanism.
The correct approach: use only configuration keys compatible with your version.
Then verify:
openclaw doctor --fix
And check the startup logs to confirm compaction was accepted.
P1—Reduce reasoning text persistence
Avoid letting long reasoning traces be replayed round after round
In production: persist short summaries, not the full reasoning text
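A hedged sketch of what "short summaries, not full reasoning" could mean in practice. Naive head-truncation stands in here for a real summarizer; a production system would more likely generate a model-written abstract:

```python
def persist_reasoning(thinking_text, max_chars=280):
    """P1 sketch: store only a compact stand-in for a long reasoning trace.

    The 'summary' here is naive head-truncation at a word boundary, used
    purely for illustration; a real system would generate an abstract."""
    if len(thinking_text) <= max_chars:
        return thinking_text
    head = thinking_text[:max_chars].rsplit(" ", 1)[0]
    return head + " ...[full reasoning elided]"
```

The point is not this particular truncation rule but the contract: what gets written back into long-lived context is bounded, regardless of how long the trace was.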
P2—Improve prompt caching design
The goal is not to maximize cacheRead. The goal is to use the cache on a compact, stable, high-value prefix.
Recommendations:
· Place stable rules in the system prompt
· Do not place unstable data in the stable prefix
· Avoid injecting large amounts of debug data every round
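The three recommendations above amount to a layout rule for the message list. A minimal sketch, using a generic chat-message shape rather than any specific OpenClaw API:

```python
def build_messages(stable_rules, volatile_context, user_message):
    """P2 sketch: keep stable rules in a fixed system prompt at the front
    (the cacheable prefix) and append volatile data after it, so the prefix
    stays compact and byte-identical across calls."""
    messages = [{"role": "system", "content": stable_rules}]
    for chunk in volatile_context:  # debug data, snapshots, etc. go late
        messages.append({"role": "user", "content": chunk})
    messages.append({"role": "user", "content": user_message})
    return messages
```

Because the system prompt never changes between calls, it is the part a prompt cache can reuse cheaply; anything that changes every round is kept out of that prefix so it cannot bloat or invalidate it.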
Operational Stop-Loss Plan (if I had to deal with it tomorrow)
1. Identify sessions with the highest cacheRead ratios
2. Execute /compact on runaway sessions
3. Add truncation + artifacting to tool outputs
4. Re-run token statistics after each modification
Focus on tracking four KPIs:
cacheRead / totalTokens
toolUse avgTotal/call
Number of calls at or above 100k tokens
Percentage of maximum session
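All four KPIs fall out of the same usage records. A sketch, assuming per-call usage dicts tagged with the call kind and a map of per-session totals (both shapes are assumptions for illustration, and inputs are assumed non-empty):

```python
def session_kpis(calls, session_totals):
    """Compute the four tracking KPIs listed above.

    `calls`: per-call usage dicts with input/output/cacheRead and a `kind`
    tag; `session_totals`: session id -> total tokens. Shapes are assumed
    for illustration; inputs are assumed non-empty."""
    def call_total(c):
        return c["input"] + c["output"] + c["cacheRead"]

    total = sum(call_total(c) for c in calls)
    tool_calls = [c for c in calls if c.get("kind") == "toolUse"]
    return {
        "cacheRead_ratio": sum(c["cacheRead"] for c in calls) / total,
        "toolUse_avg_total_per_call":
            sum(call_total(c) for c in tool_calls) / len(tool_calls),
        "calls_at_or_over_100k":
            sum(1 for c in calls if call_total(c) >= 100_000),
        "max_session_share":
            max(session_totals.values()) / sum(session_totals.values()),
    }
```

Recomputing these after each change makes it obvious whether a given fix actually moved the needle or merely shifted cost between categories.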
Successful Signals
If the optimizations work, you should see:
A significant reduction in calls of 100k+ tokens
A decrease in cacheRead ratio
A decrease in toolUse call weight
A reduction in the dominance of individual sessions
If these metrics do not change, it indicates that your context strategy is still too loose.
Replication Experiment Command
python3 scripts/session_token_breakdown.py 'sessions' \
--include-deleted \
--top 20 \
--outlier-threshold 120000 \
--json-out tmp/session_token_stats_v2.json \
> tmp/session_token_stats_v2.txt
python3 scripts/session_duplicate_waste_analysis.py 'sessions' \
--include-deleted \
--top 20 \
--png-out tmp/session_duplicate_waste.png \
--json-out tmp/session_duplicate_waste.json \
> tmp/session_duplicate_waste.txt
Conclusion
If your Agent system seems to be running normally, but costs are continually rising, you might want to check one question: Are you paying for new reasoning, or for the large-scale replay of old contexts?
In my case, the vast majority of costs actually stem from context replay.
Once you realize this, the solution becomes clear: strictly control the data entering long-term context.