Original Title: Why My OpenClaw Sessions Burned 21.5M Tokens in a Day (And What Actually Fixed It)
Original Author: MOSHIII
Translation: Peggy, BlockBeats
Editor’s Note: As Agent applications rapidly proliferate, many teams have noticed a seemingly paradoxical phenomenon: the system runs fine, yet token costs keep climbing unnoticed. This article analyzes a real OpenClaw workload and finds that cost explosions often stem not from user inputs or model outputs, but from overlooked context cache replay: the model re-reads a massive historical context on every call, driving enormous token consumption.
The article combines specific session data to demonstrate how tool outputs, browser snapshots, JSON logs, and other large intermediate products are constantly written into historical context and repeatedly read in the agent loop.
Through this case, the author lays out a clear optimization approach spanning context structure design, tool output management, and compaction configuration. For developers building Agent systems, this is both a technical troubleshooting record and a money-saving playbook.
The following is the original text:
I analyzed a real OpenClaw workload and discovered a pattern that I believe many Agent users would recognize:
Token usage looks very "active"
Responses also appear quite normal
But token consumption suddenly skyrockets
Below are the structural breakdown, root causes, and feasible repair paths from this analysis.
TL;DR
The biggest cost driver is not that user messages are too long. Rather, it's the massive cached prefix being repeatedly replayed.
From session data:
Total tokens: 21,543,714
cacheRead: 17,105,970 (79.40%)
input: 4,345,264 (20.17%)
output: 92,480 (0.43%)
In other words: the cost of most calls comes not from processing new user intent, but from repeatedly re-reading a vast historical context.
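As a quick sanity check, the breakdown above can be reproduced with a few lines of arithmetic (the figures are copied straight from the session data reported above):

```python
# Token breakdown reported for the analyzed day (numbers from the session data).
total = 21_543_714
parts = {"cacheRead": 17_105_970, "input": 4_345_264, "output": 92_480}

# The three components account for the full total, and cacheRead dominates.
assert sum(parts.values()) == total
shares = {name: tokens / total for name, tokens in parts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

Running this prints cacheRead at roughly 79.40%, matching the breakdown: nearly four out of every five tokens billed were prefix re-reads, not new work.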
"Wait, how could this happen?" moment
I initially thought the high token usage came from: very long user prompts, a large amount of output generation, or expensive tool calls.
But the real dominant pattern is:
input: hundreds to thousands of tokens
cacheRead: 170,000 to 180,000 tokens per call
That is to say, the model re-reads the same massive, stable prefix on every round.
Data Scope
I analyzed data from two levels:
1. Runtime logs
2. Session transcripts
It should be noted that:
Runtime logs are primarily used to observe behavior signals (such as restarts, errors, configuration issues)
Accurate token statistics come from the usage field in session JSONL
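A minimal version of that aggregation might look like the following sketch. It assumes each JSONL line is an event that may carry a usage object with input, output, and cacheRead counts; the exact field names are an assumption modeled on the breakdown in this article, not a documented OpenClaw schema.

```python
import json
from collections import Counter

def aggregate_usage(jsonl_text):
    """Sum token-usage fields across all events in a session JSONL string.

    Assumes each line is a JSON object that may carry a `usage` dict with
    numeric `input`, `output`, and `cacheRead` fields (field names are an
    assumption modeled on the article's breakdown).
    """
    totals = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        usage = event.get("usage")
        if not usage:
            continue
        for key in ("input", "output", "cacheRead"):
            totals[key] += usage.get(key, 0)
    totals["total"] = sum(totals.values())
    return dict(totals)
```

Pointing a loop like this at every session file gives per-session totals that can then be ranked, which is essentially what the breakdown script below produces.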
Scripts used:
scripts/session_token_breakdown.py
scripts/session_duplicate_waste_analysis.py
Generated analysis files:
tmp/session_token_stats_v2.txt
tmp/session_token_stats_v2.json
tmp/session_duplicate_waste.txt
tmp/session_duplicate_waste.json
tmp/session_duplicate_waste.png
Where is the actual token consumption?
1) Session Concentration
There is one session that consumed far more than the others:
570587c3-dc42-47e4-9dd4-985c2a50af86: 19,204,645 tokens
Then there is a significant drop:
ef42abbb-d8a1-48d8-9924-2f869dea6d4a: 1,505,038
ea880b13-f97f-4d45-ba8c-a236cf6f2bb5: 649,584
2) Behavior Concentration
Tokens primarily come from:
toolUse: 16,372,294
stop: 5,171,420
This indicates that the problem lies mainly in tool-call loops, not in ordinary conversation.
3) Time Concentration
Token peaks are not random but are concentrated in several time periods:
2026-03-08 16:00: 4,105,105
2026-03-08 09:00: 4,036,070
2026-03-08 07:00: 2,793,648
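Spotting this kind of hourly concentration only takes a simple bucketing pass. A sketch, assuming per-call records reduced to (timestamp, total tokens) pairs (a simplified shape for illustration, not OpenClaw's actual log format):

```python
from collections import defaultdict
from datetime import datetime

def tokens_by_hour(calls):
    """Bucket per-call token totals into hourly bins so peaks stand out.

    `calls` is a list of (iso_timestamp, total_tokens) pairs -- a simplified
    shape assumed for illustration.
    """
    buckets = defaultdict(int)
    for ts, tokens in calls:
        hour = datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00")
        buckets[hour] += tokens
    # Highest-volume hours first.
    return sorted(buckets.items(), key=lambda kv: kv[1], reverse=True)
```

If the top few buckets dwarf the rest, as they do here, the cost spike is tied to specific work sessions rather than background noise.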
What's inside the massive cached prefix?
It is not the dialogue content, but mainly large intermediate products:
Huge toolResult data blocks
Long reasoning/thinking traces
Large JSON snapshots
File lists
Browser scraping data
Dialogue records of sub-Agents
In the largest session, the character counts are approximately:
toolResult:text: 366,469 characters
assistant:thinking: 331,494 characters
assistant:toolCall: 53,039 characters
Once these contents are retained in the historical context, subsequent calls may read them again through the cache prefix.
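Per-category character counts like the ones above can be tallied with a short pass over the transcript. The message shape below ({'role', 'blocks': [...]}) is a simplified assumption for illustration, not OpenClaw's actual transcript schema:

```python
def char_counts_by_block(messages):
    """Tally character volume per role:block-type pair, mirroring the
    per-category counts discussed above.

    `messages` uses a simplified shape assumed for illustration:
    [{"role": ..., "blocks": [{"type": ..., "text": ...}, ...]}, ...]
    """
    counts = {}
    for msg in messages:
        for block in msg.get("blocks", []):
            key = f"{msg['role']}:{block['type']}"
            counts[key] = counts.get(key, 0) + len(block.get("text", ""))
    return counts
```

Whichever categories dominate this tally are the ones worth targeting first with truncation or artifacting.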
Specific Example (from session files)
Large chunks of context repeatedly appeared in the following locations:
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:70
Large gateway JSON logs (about 37,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:134
Browser snapshots + Secure encapsulation (about 29,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:219
Huge file list outputs (about 41,000 characters)
sessions/570587c3-dc42-47e4-9dd4-985c2a50af86.jsonl:311
session/status state snapshots + large prompt structures (about 30,000 characters)
"Duplicate Content Waste" vs "Cache Replay Burden"
I also measured the proportion of duplicate content within single calls:
Duplicate ratio approximately: 1.72%
There is indeed some, but it is not the main issue.
The real problem is the absolute size of the cached prefix.
The structure is: a huge historical context, re-read on every call, with only a small amount of new input layered on top.
Thus, the focus of optimization should not be deduplication, but context structure design.
Why is this problem particularly easy to occur in Agent loops?
Three mechanisms overlap:
1. Large tool outputs are written into the historical context
2. Tool loops generate many calls in quick succession
3. The prefix barely changes, so the cache re-reads it on every call
If context compaction is not triggered stably, the problem will amplify rapidly.
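The amplification is easy to quantify with a back-of-the-envelope model: a loop making N calls over a prefix of P tokens replays roughly N × P read tokens, no matter how small the new input is. A sketch, with illustrative numbers chosen to resemble the per-call figures above:

```python
def replay_cost(prefix_tokens, calls, new_input_per_call):
    """Back-of-the-envelope agent-loop cost model: every call re-reads the
    full prefix, so replayed tokens scale with prefix size times call count,
    while fresh input stays a sliver of the bill."""
    cache_read = prefix_tokens * calls
    fresh = new_input_per_call * calls
    return {
        "cacheRead": cache_read,
        "input": fresh,
        "cacheRead_share": cache_read / (cache_read + fresh),
    }

# A ~175k-token prefix replayed over 100 tool-loop calls, ~1k new tokens each.
cost = replay_cost(175_000, 100, 1_000)
```

With these inputs the model yields 17.5M replayed tokens against 100k of fresh input, a cacheRead share above 99% before output tokens are even counted, which is the same order of magnitude as the session analyzed here.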
Most Important Repair Strategies (sorted by impact)
P0—Do not shove large tool outputs into long-term context
For super-large tool outputs:
· Keep summaries + reference paths/IDs
· Write the original payload into file artifacts
· Do not keep full originals in chat history
Prioritize the limitation of these categories:
· Large JSON
· Long directory lists
· Full browser snapshots
· Full transcripts of sub-Agents
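One hedged sketch of the P0 pattern: spill oversized tool outputs to a file artifact and keep only a short head plus a reference path in the chat history. The size threshold, artifact directory, and naming scheme below are illustrative choices, not OpenClaw settings:

```python
import hashlib
import os

def artifact_tool_output(output, threshold=4_000, artifact_dir="tmp/artifacts"):
    """Keep small tool outputs inline; spill oversized ones to a file and
    return only a summary line plus the first 500 chars. Threshold and
    directory are illustrative, not OpenClaw defaults."""
    if len(output) <= threshold:
        return output
    os.makedirs(artifact_dir, exist_ok=True)
    digest = hashlib.sha256(output.encode("utf-8")).hexdigest()[:12]
    path = os.path.join(artifact_dir, f"{digest}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(output)  # full payload lives on disk, not in context
    return (f"[tool output truncated: {len(output)} chars; "
            f"full payload at {path}]\n{output[:500]}")
```

The agent can always re-open the artifact by path if it genuinely needs the full payload, but the payload no longer rides along in every subsequent call.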
P1—Ensure the compaction mechanism is truly effective
In this dataset, configuration compatibility problems surfaced repeatedly as "invalid compaction key" errors.
Errors like this can quietly disable the optimization mechanism.
The correct approach: use only configuration keys compatible with your version.
Then verify:
openclaw doctor --fix
And check the startup logs to confirm compaction was accepted.
P1—Reduce reasoning text persistence
Avoid letting long reasoning traces be replayed round after round
In production: persist short summaries, not the full reasoning text
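A hedged sketch of what "short summaries, not full reasoning" could mean in practice. Naive head-truncation stands in here for a real summarizer; a production system would more likely generate a model-written abstract:

```python
def persist_reasoning(thinking_text, max_chars=280):
    """P1 sketch: store only a compact stand-in for a long reasoning trace.

    The 'summary' here is naive head-truncation at a word boundary, used
    purely for illustration; a real system would generate an abstract."""
    if len(thinking_text) <= max_chars:
        return thinking_text
    head = thinking_text[:max_chars].rsplit(" ", 1)[0]
    return head + " ...[full reasoning elided]"
```

The point is not this particular truncation rule but the contract: what gets written back into long-lived context is bounded, regardless of how long the trace was.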
P2—Improve prompt caching design
The goal is not to maximize cacheRead. The goal is to use the cache on a compact, stable, high-value prefix.
Recommendations:
· Place stable rules in the system prompt
· Do not place unstable data in the stable prefix
· Avoid injecting large amounts of debug data every round
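The three recommendations above amount to a layout rule for the message list. A minimal sketch, using a generic chat-message shape rather than any specific OpenClaw API:

```python
def build_messages(stable_rules, volatile_context, user_message):
    """P2 sketch: keep stable rules in a fixed system prompt at the front
    (the cacheable prefix) and append volatile data after it, so the prefix
    stays compact and byte-identical across calls."""
    messages = [{"role": "system", "content": stable_rules}]
    for chunk in volatile_context:  # debug data, snapshots, etc. go late
        messages.append({"role": "user", "content": chunk})
    messages.append({"role": "user", "content": user_message})
    return messages
```

Because the system prompt never changes between calls, it is the part a prompt cache can reuse cheaply; anything that changes every round is kept out of that prefix so it cannot bloat or invalidate it.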
Operational Stop-Loss Plan (if I had to deal with it tomorrow)
1. Identify sessions with the highest cacheRead ratios
2. Execute /compact on runaway sessions
3. Add truncation + artifacting to tool outputs
4. Re-run token statistics after each modification
Focus on tracking four KPIs:
cacheRead / totalTokens
toolUse avgTotal/call
Number of calls at or above 100k tokens
Percentage of maximum session
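All four KPIs fall out of the same usage records. A sketch, assuming per-call usage dicts tagged with the call kind and a map of per-session totals (both shapes are assumptions for illustration, and inputs are assumed non-empty):

```python
def session_kpis(calls, session_totals):
    """Compute the four tracking KPIs listed above.

    `calls`: per-call usage dicts with input/output/cacheRead and a `kind`
    tag; `session_totals`: session id -> total tokens. Shapes are assumed
    for illustration; inputs are assumed non-empty."""
    def call_total(c):
        return c["input"] + c["output"] + c["cacheRead"]

    total = sum(call_total(c) for c in calls)
    tool_calls = [c for c in calls if c.get("kind") == "toolUse"]
    return {
        "cacheRead_ratio": sum(c["cacheRead"] for c in calls) / total,
        "toolUse_avg_total_per_call":
            sum(call_total(c) for c in tool_calls) / len(tool_calls),
        "calls_at_or_over_100k":
            sum(1 for c in calls if call_total(c) >= 100_000),
        "max_session_share":
            max(session_totals.values()) / sum(session_totals.values()),
    }
```

Recomputing these after each change makes it obvious whether a given fix actually moved the needle or merely shifted cost between categories.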
Successful Signals
If the optimizations work, you should see:
A significant reduction in calls of 100k+ tokens
A decrease in cacheRead ratio
A decrease in toolUse call weight
A reduction in the dominance of individual sessions
If these metrics do not change, it indicates that your context strategy is still too loose.
Replication Experiment Command
python3 scripts/session_token_breakdown.py 'sessions' \
--include-deleted \
--top 20 \
--outlier-threshold 120000 \
--json-out tmp/session_token_stats_v2.json \
> tmp/session_token_stats_v2.txt
python3 scripts/session_duplicate_waste_analysis.py 'sessions' \
--include-deleted \
--top 20 \
--png-out tmp/session_duplicate_waste.png \
--json-out tmp/session_duplicate_waste.json \
> tmp/session_duplicate_waste.txt
Conclusion
If your Agent system seems to be running normally, but costs are continually rising, you might want to check one question: Are you paying for new reasoning, or for the large-scale replay of old contexts?
In my case, the vast majority of costs actually stem from context replay.
Once you realize this, the solution becomes clear: strictly control the data entering long-term context.