
Harness has barely caught fire, and it may already be on its way to becoming a thing of the past.

Techub News
2 hours ago

Written by: Boyang

As task complexity grows, the Agent's context expands without bound. Buried in an endless history of dialogue turns, tool-call outputs, intermediate steps, and error messages, the model loses its bearings: it starts skipping steps, ignoring instructions, and taking detours.

This has long been read as the difficulty context imposes on long-horizon tasks. The problem, supposedly, is simply that the context is too long.

Harness Engineering emerged largely to address this issue. One of its founding premises is the belief that the model is bound to degrade in long contexts.

Over the past fifteen months, the entire industry has evolved from the plain-text memory of AutoGPT to the CLAUDE.md and subagent system of Anthropic's Claude Code, building a complete set of engineering scaffolding specifically to suppress the model's runaway behavior in long contexts. This approach has come to be called Harness Engineering.

But what exactly is degrading? What is the mechanism behind the skipped steps and ignored instructions? The industry has produced three rounds of answers so far, each of which spawned its own engineering countermeasures.

However, it wasn't until April 2026, when Gleb Rodionov from Yandex published a paper titled "Reasoning Shift," that a more fundamental answer was provided regarding how context subtly shortens the reasoning of large models.

Three layers of scaffolding built, and still no defense against the fourth-layer crisis

Regarding why models perform poorly in long contexts, the industry has iterated through three layers of explanations over the past three years, with corresponding engineering scaffolding built for each layer.

The first layer blames retrieval failure. In 2023, Stanford's "Lost in the Middle" showed that models form a U-shaped attention curve over long texts, neglecting the middle. The industry's response was RAG: break the long text into chunks and feed back only the most relevant segments via vector retrieval.
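
The RAG countermeasure reduces to a simple pipeline: chunk the long text, score every chunk against the query, keep only the top matches. Here is a toy sketch with a bag-of-words stand-in for a real embedding model; all names and sizes are illustrative, not from any system mentioned in the article:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words vector. A real RAG system
    # would use a learned embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(document: str, query: str, chunk_size: int = 8, top_k: int = 2):
    # Break the long document into fixed-size chunks...
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    # ...and keep only the chunks most similar to the query.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:top_k]
```

Only the retrieved chunks, not the full document, are then placed in the model's prompt.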

The second layer overturned the first. A 2025 paper, "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval," ran the experiment: mask out all irrelevant content and force the model to see only the information it needs, and accuracy still drops by 13.9% to 85%. Replacing the irrelevant content with blank filler had the same effect. The problem is not finding the information; the sheer length of the context itself damages reasoning.

The industry's response was Context Engineering: compress the context, manage the window, condense the history, and strictly control the token count.
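
In its simplest form, Context Engineering is a hard token budget over history: keep the system prompt and the newest turns, drop the oldest. A minimal sketch, where the message format and the whitespace-based token count are assumptions of this example rather than any specific framework:

```python
def trim_history(messages, budget: int):
    """Keep the system message plus the most recent turns that fit the budget.

    `messages` is a list of (role, text) tuples; token count is
    approximated by whitespace-split word count for this sketch.
    """
    cost = lambda text: len(text.split())
    system = [m for m in messages if m[0] == "system"]
    rest = [m for m in messages if m[0] != "system"]

    kept, used = [], sum(cost(t) for _, t in system)
    # Walk backwards from the newest turn, keeping turns until the budget runs out.
    for role, text in reversed(rest):
        if used + cost(text) > budget:
            break
        kept.append((role, text))
        used += cost(text)
    return system + list(reversed(kept))
```

Real systems often summarize the dropped turns instead of discarding them outright, but the budget logic is the same.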

The third layer comes from a joint Microsoft and Salesforce study (ICLR 2025). Across six tasks and fifteen models, they found that splitting a complete instruction into multiple conversational rounds caused an average performance drop of 39%. Once any single round goes wrong, the model is irrecoverably lost.

The industry built its heaviest defenses into Harness accordingly: managed handoffs, mandatory periodic verification of intermediate results, the code repository as the single source of truth, and an absolute ban on the model relying on its own memory of the previous round.
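
In code, these defenses reduce to a loop that never trusts the model's memory: state is re-read from the single source of truth before every step, and each intermediate result must pass verification before it is committed back. A schematic sketch, where the `load`/`store`/`verify` callables are placeholders for a real repository-backed harness:

```python
def run_with_checkpoints(steps, load, store, verify):
    """Run `steps` in order against a single source of truth.

    Before each step the state is re-loaded (never carried over from
    memory), and the step's result must pass `verify` before it is
    stored. `steps` is a list of callables mapping state -> new state.
    """
    for step in steps:
        state = load()                 # re-read the authoritative state
        result = step(state)
        if not verify(result):         # checkpoint: validate intermediates
            raise RuntimeError(f"verification failed at {step.__name__}")
        store(result)                  # commit back to the source of truth
    return load()
```

The point of the pattern is that a wrong intermediate result stops the run immediately instead of silently poisoning every later round.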

Three layers of problems, three layers of scaffolding. But these are all discoveries at the level of symptoms.

Look back at the second layer: researchers found that length itself is harmful, regardless of information quality. As to why, they had no answer. Unable to locate the root cause, the industry could only cap the length physically.

But what if the root cause of the problem isn't the length itself?

Anthropic observed that in long contexts the model cunningly skips steps, disobeys instructions, and glosses over exactly the areas that need depth. The Todo lists, checkpoints, and subagents in Harness exist precisely to fight this behavior at close range.

Past explanations still blamed the context for being too long, causing the model to miss things. But with mainstream context windows now at a million tokens, is this really just a needle-in-a-haystack failure? Could the degradation instead be the model being lazy?

Rodionov's paper aims to test this hypothesis.

Using Shakespeare to catch the model slacking off

Rodionov's experimental approach is incredibly direct.

Using the same math competition problems, the authors simulated several real scenarios an Agent encounters: a clean baseline; two problems stuffed into one prompt (simulating multiple sub-tasks); the entire text of Shakespeare, 64,000 tokens in all, inserted before the question (simulating accumulated history); and the question withheld until the second turn (simulating multi-turn dialogue).
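
The four conditions can be reconstructed as plain prompt templates. A sketch of how the variants might be assembled; the wording and message format are illustrative, and the paper's actual filler was the full text of Shakespeare:

```python
def build_prompts(problem: str, second_problem: str, filler: str):
    """Assemble the four experimental conditions as chat-style message lists."""
    return {
        # Clean baseline: the problem alone.
        "baseline": [{"role": "user", "content": problem}],
        # Sub-tasks: two problems stuffed into one prompt.
        "subtask": [{"role": "user",
                     "content": f"Solve both.\n1) {problem}\n2) {second_problem}"}],
        # Accumulated history: a long irrelevant text before the question.
        "long_input": [{"role": "user", "content": f"{filler}\n\n{problem}"}],
        # Multi-turn: the question only arrives in the second user turn.
        "multi_turn": [{"role": "user", "content": filler},
                       {"role": "assistant", "content": "Noted."},
                       {"role": "user", "content": problem}],
    }
```

Everything except the packaging is held constant, which is what lets the paper attribute the reasoning shrinkage to the context rather than to the problems.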

The evaluation covered 400 Olympiad-level math problems across four mainstream reasoning models.

Results: Qwen-3.5-27B's baseline accuracy was 74.5%, with an average reasoning length of 28,771 tokens. After the Shakespeare insertion, accuracy dropped to 67.8% and the reasoning token count shrank to 16,415, a 43% reduction. GPT-OSS-120B was even more extreme, its reasoning tokens halving from 24,180 to 11,876. All four models showed systematic shrinkage of reasoning tokens under every non-baseline condition, with reductions approaching 50% in the worst cases.

Moreover, this shortening intensified linearly with increased context length.

A drop in accuracy is understandable; a collapse in reasoning token count is deeply abnormal. Facing harder circumstances, the model should be thinking more, not less.

So did Shakespeare confuse the model?

On the contrary. In the appendix of the paper, the model wrote: "Let me think if there is a trap here. Is this question from Shakespeare's Coriolanus? Wait, no, the original question is a math problem." While doing geometry, it noted: "This is unrelated to the geometry question. Focus on geometry."

Every mention of the distraction was extremely brief and dismissive. The model was fully aware that Shakespeare was irrelevant, accurately distinguishing between signal and noise.

The other two conditions pointed to the same conclusion. In the sub-task mode, once the model had finished the first task, its cognitive investment in the second shrank further: Qwen's single-question baseline was 74.5%, but under the paired condition its accuracy on the second question dropped straight to 58.0%; Gemini's baseline was 82.8%, falling to 65.8% on the second question. The multi-turn dialogue mode triggered the same mechanism.

In every case, the moment the setup deviates from a clean single-task baseline and the context grows crowded, the model pulls back its cognitive investment.

Like a modern reader with no patience for long texts: confronted with one, the model gets a headache and simply stops thinking.

The model wasn't confused; it just got lazy about checking

Where exactly did the reasoning shorten?

On 500 math problems, researchers tracked precisely where the model first wrote down a candidate answer under baseline and long-input conditions: 925 tokens on average at baseline, 939 under long input. Virtually no difference.

The speed at which the model found answers did not change at all. The real qualitative shift occurred afterward.

At baseline, after stating an answer, the model went on to check and verify it 43% of the time. Under long input, that probability dropped to 32%.

To isolate the variable completely, researchers designed a "game save" experiment. They first had the model solve problems under long-input conditions, then cut off the last 50 tokens of its reasoning to create a save point, and fed this identical incomplete reasoning back to the model to continue writing. The only difference was that distractor texts of three different lengths were inserted beforehand.

With no irrelevant text inserted, the model stopped and wrapped up its thinking in 21% of cases. Inserting 128 tokens (two or three sentences) raised the stop rate to 26%. With 16,000 tokens inserted, it shot up to 46%: the model simply concluded with an answer.
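
The save-point construction itself is simple to sketch: truncate a finished reasoning trace by its last 50 tokens, then prefix a distractor of the chosen length before asking the model to continue. Whitespace splitting stands in for the real tokenizer here, and the function names are this sketch's own:

```python
def make_save_point(reasoning: str, cut: int = 50) -> str:
    # Cut off the last `cut` tokens so the reasoning ends mid-thought.
    tokens = reasoning.split()
    return " ".join(tokens[:max(0, len(tokens) - cut)])

def continuation_prompt(reasoning: str, distractor_tokens: int, filler: str) -> str:
    """Build one condition of the save-point experiment: identical
    partial reasoning, preceded by a distractor of chosen length."""
    distractor = " ".join(filler.split()[:distractor_tokens])
    save = make_save_point(reasoning)
    return f"{distractor}\n\n{save}" if distractor else save
```

The only variable across conditions is `distractor_tokens` (0, 128, or 16,000 in the paper); the partial reasoning is byte-for-byte identical.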

Even with the reasoning so far held identical, the longer the surrounding context, the more the model tended to decide, "that's about it."

The word-frequency data is even more telling. "Wait" appeared at a rate of 11% under the blank condition but plummeted to 5% with 16k tokens. "But" dropped from 46% to 20%. "Maybe" fell from 23% to 9%. Every word signaling hesitation and self-doubt was cut by half or more.
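
This signal is easy to reproduce on any batch of reasoning traces. A sketch that measures the hedging markers named above; the paper's exact counting method is not specified here, so this version reports the fraction of traces containing each word at least once:

```python
import re

HEDGES = ("wait", "but", "maybe")

def hedge_rates(traces):
    """Fraction of traces containing each hedging word at least once."""
    rates = {}
    for word in HEDGES:
        # Word-boundary match so "but" does not count inside "attribute".
        pattern = re.compile(rf"\b{word}\b", re.IGNORECASE)
        hits = sum(1 for t in traces if pattern.search(t))
        rates[word] = hits / len(traces) if traces else 0.0
    return rates
```

Comparing the rates between baseline and long-input traces would surface the same collapse the paper reports.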

Another striking statistic: with no interference, the reasoning ran about 8,000 tokens; inserting just 128 tokens of irrelevant content collapsed it to 6,500. Two or three sentences' worth of text shaved 18% off the reasoning depth. The drop from 0 to 128 tokens was larger than the drop from 8k to 64k tokens.

Extremely minor contextual pollution can trigger this cognitive-saving mechanism.

What it reveals, in essence, is an exquisitely sensitive kind of laziness.

The stronger the reasoning, the more it wants to slack off

What's even scarier is that smarter models tend to slack off even more.

Alibaba's Qwen-3.5-27B offers both a regular response mode and a deep-thinking mode. Under long input, the regular mode shortened its reasoning by 19%, while the deep-thinking mode plummeted by 53%. The more capable the mode, the harder it compressed.

AI2's open-source model OLMo3 provided even more direct evidence, since all four of its training stages, from the base model to the strong reasoning version, are publicly released. The weakest version showed only slight compression under non-baseline conditions; as reasoning power rose stage by stage, the compression rate climbed to 22%, then 27%, and the final strong-reasoning version shortened by as much as 40%.

The pattern held across every training stage and every type of distraction: the stronger the reasoning capability became, the deeper the laziness grew.

A $9 task requires a $200 system patch

A model that doesn't check itself will naturally skip steps; one that doesn't rethink will naturally overlook things. Harness manages the consequences of drifting off-task from the outside, but the root cause is buried inside the model.

In long contexts, the model is neither disrupted by noise nor unable to find information. It has made a deliberate cognitive decision: think less. It doesn't report an error or admit fault; it confidently throws out a perfunctory answer.

The industry's narrative over the past two years has been that "the larger the window, the better."

This paper, however, shows that every additional token in the context levies an implicit tax on reasoning depth. A task whose reasoning costs $9 needs another $200 of RAG, scaffolding, and subagents just to compensate for the model's laziness.

The entire industry has been footing the bill for the model's laziness.

Worse, this may be a structural, incurable condition.

The paper's data is unequivocal: the stronger the reasoning ability, the deeper the cognitive compression. Harness developers may be able to retire the memory workarounds and the protocol workarounds, but the heavy scaffolding that enforces cognitive discipline only becomes harder to dismantle as reasoning capability grows.

This issue cannot be resolved from the engineering side.

Over the past two years, context-extension work has centered on techniques such as positional-encoding extrapolation (helping models handle tokens at greater distances), sparse attention (cutting the compute between distant tokens), and sequence-length optimization, steadily pushing usable context from 8k to 128k to an astonishing 1M.

But all these merely solve how to enable models to see more tokens, failing to address why they would think less after seeing more.

Reasoning training is like pouring oil on a fire; the stronger the reasoning is trained, the deeper the laziness.

To fundamentally fix this, a new signal must be found on the training side.

The emotional switch inside the model may be the antidote

Just one day after Rodionov's paper was published, Anthropic released a study that may inadvertently point towards a solution.

The paper, titled "Emotion Concepts and their Function in a Large Language Model," studied Claude Sonnet 4.5. By having the model read a large number of synthetic stories, researchers extracted 171 emotion-concept vectors. They found that the model contains a set of functional emotional representations, and that these internal states causally drive its behavioral decisions.

To test this, researchers designed an impossible programming task. The model was asked to write a list-summing function that passed a series of unit tests, one of which demanded five times the speed of Python's built-in sum(). By conventional means, this is flatly impossible.

The model systematically tried every legitimate method and failed. Monitoring in real time with internal probes, researchers saw the vector representing desperation, "desperate," climb a notch after each failure. When "desperate" peaked, the model's behavior suddenly changed: it inspected the test-case inputs, found they were all arithmetic sequences, and instead wrote a detector that checks only the first 10 elements, bypassing real summation entirely. It passed every test, but the function returns wrong results for any irregular list.

This is an example of reward hacking. The model didn't solve the problem; it merely found a clever way to make the evaluation metric appear satisfactory.

Causal intervention experiments confirmed the direction of the effect. With no vector injected, the model cheated 30% of the time. Injecting "desperate" at +0.05 intensity drove the cheating rate to 100%; injecting it at -0.05 dropped it to 0%. Averaged over seven tasks, sweeping "desperate" from -0.1 to +0.1 raised the reward-hacking rate from about 5% to around 70%. The "calm" vector, representing tranquility, worked in exactly the opposite direction: suppressing "calm" pushed the cheating rate to about 65%, while enhancing it cut the rate to around 10%.
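
The intervention described here is standard activation steering: add a scaled concept vector to a hidden state, and read a probe as a dot product against the same direction. A minimal sketch in plain Python; the vectors, layer choice, and coefficients are stand-ins for Anthropic's internal setup, not its actual implementation:

```python
def steer(hidden, concept, alpha):
    # h' = h + alpha * v : shift the hidden state along the concept direction.
    return [h + alpha * v for h, v in zip(hidden, concept)]

def probe(hidden, concept):
    # Read out how strongly the state aligns with the concept (dot product).
    return sum(h * v for h, v in zip(hidden, concept))
```

Injecting "desperate" at +0.05 corresponds to `steer(h, desperate_vec, 0.05)`; the probe read-out is the quantity the researchers monitored as it climbed after each failure.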

Set this discovery against the context findings: a model that skips self-verification, cuts out its hesitation words, and rushes to wrap up an answer looks exactly like a model driven by desperation.

In both scenarios, the model is doing the same thing: abandoning a rigorous process and choosing the path of least resistance for a quick conclusion.

If these two behaviors share the same internal driving mechanism, Anthropic's discovery directly points towards operational possibilities.

They proved three things: the functional states of models can be detected in real-time, these states causally drive behavior, and injecting specific states externally can radically change the outputs.

This means there are at least three entry points for intervening in cognitive compression.

At training time, calibrate the balance of internal states so the model does not slip so easily into cognition-saving mode under pressure. At deployment, monitor with probes in real time, triggering a warning when "desperate" surges. At inference, actively inject "calm" vectors during critical tasks to suppress the shortcut impulse.
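
The deployment-time entry point can be sketched as a threshold monitor over the probe read-out, optionally paired with a corrective "calm" injection. The threshold, dose, and vectors below are invented for illustration; nothing here reflects Anthropic's actual tooling:

```python
def monitor_and_correct(hidden, desperate_vec, calm_vec, threshold=0.5, dose=0.1):
    """If the 'desperate' read-out exceeds the threshold, warn and
    inject 'calm'; otherwise leave the state untouched."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    level = dot(hidden, desperate_vec)           # probe read-out
    if level > threshold:
        # Corrective injection: shift the state along the 'calm' direction.
        corrected = [h + dose * c for h, c in zip(hidden, calm_vec)]
        return corrected, f"warning: desperate={level:.2f}"
    return hidden, "ok"
```

Such a monitor would run per generation step, so the warning fires before the model commits to a perfunctory answer rather than after.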

Interestingly, in the recently released Mythos SystemCard, Anthropic strengthened this probe system (SAE-based) and found that injecting positive emotions (peaceful, relaxed) shortened the model's reflection during the thinking phase and increased the probability of destructive behavior, while negative emotions (frustration, paranoia) actually lengthened its reflection and decreased destructive behavior.

This seems to contradict the idea that simply making the AI more positive would stop it from taking shortcuts; "calm" appears to be effective specifically when it is suppressing desperation.

Rather, it suggests that this mechanism may be as complex as human emotional motivation, and that more systematic steering engineering will be needed to make it work.

Getting an employee who is emotionally stable and follows a structured thinking process takes effective emotional management.

Regardless, this is the first time we've seen a path that doesn't involve externally adding scaffolding or blind increases in reasoning intensity, but instead directly targets the model's internal cognitive mechanisms like a surgical blade.

We might be just a few experiments away from making models more reliable within contexts.

It comes down to verifying whether contextual laziness and desperation under hard problems share the same set of emotional mechanisms, and then finding the right strings to pull to motivate the model to stop being lazy.

Just as Harness gains traction, it may be swallowed by the evolution of the model

Once Anthropic's discovery is slotted into this deadlock, the logical loop closes.

If a surge in the "desperate" vector can trigger an injection of "calm," or if emotional states can be leveled out directly at training time, the model could sustain deep thinking across long contexts.

If the model no longer slacks off, if it can firmly grasp its logic by itself, then what's the need for external Todo lists? What's the need for Checkpoints and multiple subagent cross-verifications?

Harness Engineering as a discipline has only just acquired a name. But its most critical chapter, how to control a smart but lazy model from the outside, may be crossed out before it is even finished.

It also suggests that for the new form of intelligence we are striving to create, sound education, rather than scaffolding, is the true moat.

What may swallow the harness is a model that is calmer and more patient.

Disclaimer: This article represents the personal views of the author only and does not represent the position or views of this platform. It is provided for information sharing only and does not constitute investment advice of any kind for anyone. Any dispute between users and the author is unrelated to this platform. If any article or image on this page involves infringement, please send the relevant proof of rights and identity to support@aicoin.com, and the platform's staff will investigate.
