Author: Systematic Long Short
Translated by: Deep Tide TechFlow
Deep Tide Introduction: The core argument of this article can be summed up in one sentence: the quality of an AI agent's output is proportional to the number of tokens you invest.
The author is not theorizing vaguely, but offers two concrete methods you can use starting today, and clearly marks the boundary where piling on tokens stops working: the "novelty problem."
For readers currently using agents to write code or run workflows, this article is dense with information and immediately actionable.
Introduction
Well, you have to admit this title is indeed eye-catching — but seriously, this is not a joke.
In 2023, when we were already using LLMs to produce production code, people around us were shocked, because the common perception at the time was that LLMs could only produce worthless garbage. But we knew something others had not yet realized: an agent's output quality is a function of the number of tokens you invest. It's as simple as that.
You can see this for yourself by running a few experiments. Give an agent a complex and somewhat obscure programming task, for example implementing a constrained convex optimization algorithm from scratch. Start at the lowest reasoning tier, then repeat at the middle and highest tiers, each time letting the agent review its own code and counting how many bugs it discovers. You will observe it directly: the number of bugs decreases monotonically with the number of tokens invested.
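A minimal harness for this experiment might look like the sketch below. `call_agent` and `count_bugs` are hypothetical stand-ins for your provider's API and a self-review pass; here they are stubbed with fixed values so the harness runs end to end.

```python
def call_agent(task: str, reasoning_tier: str) -> str:
    """Hypothetical agent call; replace with your provider's actual API."""
    return f"<code produced at tier={reasoning_tier}>"

def count_bugs(code: str, reasoning_tier: str) -> int:
    """Hypothetical self-review pass; stubbed with illustrative counts."""
    return {"low": 9, "medium": 4, "high": 1}[reasoning_tier]

task = "Implement a constrained convex optimization algorithm from scratch."
results = {}
for tier in ("low", "medium", "high"):
    code = call_agent(task, reasoning_tier=tier)
    results[tier] = count_bugs(code, reasoning_tier=tier)

# The claim under test: bug count decreases monotonically with token spend.
assert results["low"] >= results["medium"] >= results["high"]
print(results)
```

Swap in real API calls and a real bug-counting step (e.g. parsing the agent's review output) to run the experiment for yourself.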
This is not hard to understand, right?
More tokens = fewer errors. You can push this logic a step further; it is basically the (simplified) core idea behind code-review products. In a completely fresh context, invest massive numbers of tokens (for example, have the model go through the code line by line and judge whether each line has a bug); this catches most, if not all, bugs. Repeat the process ten times, a hundred times, examining the codebase from different angles, and you will eventually unearth all of them.
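The "one full pass per angle" idea can be sketched as follows. `review_line` is a hypothetical stand-in for a token-heavy LLM judgment call; here it uses trivial string checks so the sketch is runnable.

```python
def review_line(line: str, angle: str) -> bool:
    """Hypothetical: ask the model whether this line is buggy, from one angle."""
    if angle == "off-by-one":
        return "range(len(" in line
    if angle == "resource-leak":
        return "open(" in line and "with " not in line
    return False

def exhaustive_review(source: str, angles: list[str]) -> set[int]:
    """Scan the codebase line by line, once per review angle; union the findings."""
    flagged = set()
    for angle in angles:                      # one full pass per angle
        for i, line in enumerate(source.splitlines(), start=1):
            if review_line(line, angle):      # one token-heavy call per line
                flagged.add(i)
    return flagged

code = "f = open('data.txt')\nfor i in range(len(xs)):\n    pass\n"
print(exhaustive_review(code, ["off-by-one", "resource-leak"]))  # → {1, 2}
```

The token cost scales as lines × angles, which is exactly the "burn more to find more" trade the author describes.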
The viewpoint that "burning more tokens can improve agent quality" has empirical support as well: teams that claim to write code entirely with agents and directly push it to production are either the foundational model providers themselves or extremely well-funded companies.
So, if you are still troubled by the fact that agents cannot produce production-level code — to put it bluntly, the problem lies with you. Or rather, with your wallet.
How to Determine if I am Burning Enough Tokens
I have written an entire article arguing that the problem is definitely not the framework you are using, and that "keeping it simple" can still yield excellent results; I stand by that view. You read that article, did what it said, and are still deeply disappointed with your agent's output. You sent me a DM; I saw it but didn't reply.
This article is my reply.
In most cases, when your agent performs poorly and cannot solve the problem, it is because you are not burning enough tokens.
The number of tokens required to solve a problem entirely depends on the problem's scale, complexity, and novelty.
"What is 2+2?" does not require many tokens.
"Help me write a bot that can scan all markets between Polymarket and Kalshi, find semantically similar markets that should settle before and after the same event, set no-arbitrage boundaries, and automatically trade with low latency once arbitrage opportunities arise" — this requires burning a ton of tokens.
We discovered something interesting in practice.
If you invest enough tokens to handle problems caused by scale and complexity, the agent can solve them no matter what. In other words, if you want to build something extremely complex with many components and lines of code, as long as you throw enough tokens at these problems, they can eventually be thoroughly resolved.
There is one small but important exception.
Your problem cannot be too novel. At this stage, no amount of tokens can solve the "novelty" problem. While enough tokens can reduce errors caused by complexity to zero, they cannot enable an agent to invent something it does not know out of thin air.
This conclusion is actually a relief for us.
We have spent a tremendous amount of effort, and burned a very large number of tokens, trying to see whether we could recreate an institutional investment process with almost no guidance. Part of the motivation was to figure out how many years we (as quant researchers) are from being completely replaced by AI. The result: agents cannot come close to a decent institutional investment process. We believe part of the reason is that they have never seen one; institutional investment processes simply do not exist in the training data.
So, if your problem is novel, do not expect to solve it just by piling on tokens. You need to guide the exploration process yourself. But once you have determined a solution, you can safely pile tokens on to execute — no matter how large the codebase is or how complex the components are, it won't be a problem.
Here is a simple heuristic: the token budget should grow in proportion to the number of lines of code.
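Made concrete, the heuristic is a one-liner. The constant below is an illustrative assumption, not a measured value; tune it against your own stack and problem domain.

```python
TOKENS_PER_LOC = 500  # assumed tokens of reasoning/review per line of code

def token_budget(lines_of_code: int, review_passes: int = 3) -> int:
    """Scale the token budget linearly with codebase size and review depth."""
    return lines_of_code * TOKENS_PER_LOC * review_passes

print(token_budget(2_000))       # → 3000000 for a 2k-line project
print(token_budget(2_000, 10))   # → 10000000 with ten review passes
```

The exact constant matters less than the shape: if the codebase doubles, so should the budget.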
What Burning More Tokens Actually Does
In practice, additional tokens typically enhance the engineering quality of the agent in several ways:
Allow it to spend more time reasoning in the same attempt, giving it the opportunity to discover erroneous logic on its own. Deeper reasoning = better planning = higher probability of hitting it on the first attempt.
Allow it to make multiple independent attempts, taking different problem-solving paths. Some paths are better than others. Allowing more than one attempt enables it to select the optimal one.
Similarly, more independent planning attempts allow it to discard weak directions and retain the most promising ones.
More tokens allow it to critique its previous work in a brand new context, giving it an opportunity for improvement instead of being stuck in some "reasoning inertia."
Of course, there's my favorite point: more tokens mean it can use tests and tools for verification. Running the actual code to see if it executes correctly is the most reliable way to confirm the answer is correct.
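The mechanisms above compose naturally into one loop: spend tokens on N independent attempts, verify each by actually running its tests, and keep the best. `attempt` and `run_tests` below are hypothetical stand-ins for an agent call and a test run, stubbed deterministically so the sketch executes.

```python
import random

def attempt(task: str, seed: int) -> str:
    """Hypothetical: one independent agent attempt at the task."""
    return f"solution-{seed}"

def run_tests(solution: str) -> int:
    """Hypothetical: run the test suite, return the number of failures."""
    random.seed(solution)            # stubbed deterministic "test result"
    return random.randint(0, 5)

def best_of_n(task: str, n: int) -> str:
    """Multiple independent attempts; verification selects the winner."""
    candidates = [attempt(task, seed=i) for i in range(n)]
    return min(candidates, key=run_tests)    # fewest test failures wins

print(best_of_n("implement the arbitrage scanner", n=8))
```

Each extra attempt is pure token cost, but because selection is driven by verification rather than vibes, quality rises with N.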
This logic works because the engineering failures of agents are not random. They are almost always caused by selecting the wrong path too early, not checking if this path is genuinely viable (in the early stages), or not having enough budget to recover and backtrack after discovering errors.
That's the story. Tokens are, quite literally, the price you pay for decision quality. Think of it as research work: ask a person to answer a hard question on the spot, and the quality of the answer drops as the time pressure rises.
Research, after all, is the process of producing answers that are not yet known. Humans spend biological time to produce better answers; agents spend computational time.
How to Improve Your Agent
You may still be skeptical, but there are many papers that support this. To be honest, the existence of the "reasoning" adjustment knob itself is all the proof you need.
One paper I particularly liked fine-tuned a model on a small set of carefully curated reasoning samples, then forced it to keep thinking whenever it wanted to stop, specifically by appending "Wait" at the point where it tried to terminate. This one change alone raised a benchmark score from 50% to 57%.
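The budget-forcing trick described above can be sketched as a wrapper around a raw completion call. `generate` is a hypothetical stand-in that echoes a reasoning trace ending in a stop marker; the wrapper strips the premature stop and appends "Wait" to push the model onward.

```python
def generate(prompt: str) -> str:
    """Hypothetical completion call; returns reasoning ending in a stop marker."""
    return prompt + " ...therefore the answer is 42. </think>"

def force_thinking(prompt: str, extra_rounds: int = 2) -> str:
    """Budget forcing: override each attempted stop with 'Wait' and continue."""
    text = generate(prompt)
    for _ in range(extra_rounds):
        # Remove the premature stop marker and nudge the model to keep going.
        text = text.replace("</think>", "").rstrip() + " Wait,"
        text = generate(text)
    return text

out = force_thinking("Solve: ...")
print(out.count("Wait,"))  # → 2: the model was pushed past two stop attempts
```

With a real API you would implement this via stop sequences or continuation prompts; the stub only shows the control flow.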
I want to put it as plainly as possible: If you’ve been complaining about the code written by the agent being mediocre, the highest single reasoning tier is likely still not enough for you.
Here are two very simple solutions.
Simple Approach One: WAIT
The simplest thing you can start doing today: set up an automatic loop in which, after the agent builds something, it reviews the work N times, each time in a brand-new context, fixing whatever problems each review discovers.
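The loop can be sketched as follows. `fresh_agent_review` and `fresh_agent_fix` are hypothetical stand-ins for agent calls that each start with a clean context (no chat history carried over); here they are stubbed on a "TODO" marker so the sketch runs.

```python
def fresh_agent_review(code: str) -> list[str]:
    """Hypothetical: new-context review pass; returns a list of issues found."""
    return ["unhandled edge case"] if "TODO" in code else []

def fresh_agent_fix(code: str, issues: list[str]) -> str:
    """Hypothetical: new-context fix pass addressing the reported issues."""
    return code.replace("TODO", "handled")

def review_loop(code: str, n_rounds: int = 5) -> str:
    """Build, then review-and-fix up to n_rounds times in fresh contexts."""
    for _ in range(n_rounds):
        issues = fresh_agent_review(code)
        if not issues:            # a clean review: stop early
            break
        code = fresh_agent_fix(code, issues)
    return code

print(review_loop("def f(x):\n    pass  # TODO edge cases\n"))
```

The fresh context per round is the point: it prevents the reviewer from inheriting the builder's blind spots.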
If you find that this simple technique improves your agent's engineering performance, then at least you understand that your problem is merely the number of tokens — then come join the token-burning club.
Simple Approach Two: VERIFY
Have the agent validate its work as early and frequently as possible. Write tests to prove that the chosen path indeed works. This is especially useful for highly complex and deeply nested projects — one function may be called by many other functions downstream. Catching errors upstream can save you a lot of subsequent computational time (tokens). So if possible, set up "verification checkpoints" throughout the entire build process.
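Verification checkpoints can be sketched as a fail-fast build pipeline: after each stage, run that stage's checks before spending any tokens downstream. Stage names and checks below are illustrative stand-ins for real test runs.

```python
from typing import Callable

def build_and_verify(stages: list[tuple[str, Callable[[], bool]]]) -> None:
    """Run each stage's check; stop immediately on the first failure."""
    for name, check in stages:
        if not check():
            # Fail fast: fixing this now is far cheaper (in tokens) than
            # letting downstream stages build on a broken foundation.
            raise RuntimeError(f"checkpoint failed at stage: {name}")
        print(f"checkpoint passed: {name}")

def parser_ok() -> bool:
    return True    # stand-in for: run the parser module's unit tests

def pricing_ok() -> bool:
    return True    # stand-in for: run the pricing module's unit tests

build_and_verify([("parser", parser_ok), ("pricing", pricing_ok)])
```

Wiring each `check` to an actual test command (e.g. a `pytest` invocation on that module) turns this into a real token-saving gate.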
Once you've completed a segment and the main agent says it's done, have a second agent validate it again. An unrelated stream of thought can catch sources of systematic bias that the first one missed.
That's basically it. I could write much more on this topic, but I believe that just realizing these two things and executing them well can help you solve 95% of your problems. I firmly believe in perfecting simple things to the extreme and then adding complexity as needed.
I mentioned that "novelty" is a problem that cannot be solved with tokens, and I want to emphasize this again because you will inevitably encounter this pitfall and then come to me crying that piling on tokens is useless.
When the problem you are trying to solve is not in the training set, you are the one who really needs to provide the solution. Therefore, domain expertise remains extremely important.