Thin Harness, Fat Skill: The True Source of 100 Times AI Productivity

律动BlockBeats
4 hours ago
Original Title: Thin Harness, Fat Skills
Original Author: Garry Tan
Translated by: Peggy, BlockBeats

Editor's Note: When "stronger models" becomes the industry's default answer, this article offers a different judgment: what truly creates a 10x, 100x, or even 1000x productivity gap is not the model itself, but the whole system designed around the model.

The author of this article, Garry Tan, is currently the president and CEO of Y Combinator, with a long-standing focus on AI and the early startup ecosystem. He proposes the framework of "fat skills + thin harness," breaking down AI applications into key components such as skills, operational frameworks, context routing, task division, and knowledge compression.

Within this system, models are no longer the entirety of capabilities but merely execution units within the system; the real determinants of output quality are how you organize context, solidify processes, and clarify the boundaries between "judgment" and "computation."

More importantly, this approach is not merely conceptual; it has been validated in real scenarios. Faced with data-processing and matching tasks for thousands of founders, the system achieves capabilities close to those of human analysts through a read-aggregate-judge-write-back cycle, continuously optimizing itself without anyone rewriting code. This "learning system" transforms AI from a one-time tool into foundational infrastructure with compound returns.

Thus, the core reminder given in the article becomes clear: in the age of AI, the efficiency gap no longer depends on whether you use the most advanced model but on whether you build a system capable of continuously accumulating capabilities and evolving automatically.

Below is the original text:

Steve Yegge said that people using AI coding agents have "efficiency 10 to 100 times that of engineers who only use Cursor and chat tools, and roughly 1000 times that of a Google engineer in 2005."

Note: Steve Yegge is a highly influential software engineer, technical blogger, and engineering culture commentator in Silicon Valley, known for his sharp, lengthy, and highly personal style of technical writing. He has served as a senior engineer at companies like Amazon and Google, later joining Salesforce, and moving on to startups related to AI; he is also one of the early promoters of the Dart project.

This is not an exaggeration. I have seen it firsthand and experienced it myself. However, when people hear about such disparities, they often mistakenly attribute it to the wrong factors: stronger models, smarter Claude, more parameters.

In reality, the person achieving a 2x gain and the person achieving a 100x gain are using the same model. The difference lies not in "intelligence" but in "architecture," and that architecture is simple enough to fit on an index card.

The harness is the product itself.

On March 31, 2026, an accident at Anthropic resulted in the complete source code for Claude Code being released on npm—a total of 512,000 lines. I read through it. This confirmed what I have been saying in YC (Y Combinator): the real secret is not in the model but in the "layer that wraps the model."

Real-time code repository context, prompt caching, tools designed for specific tasks, compressing redundant context as much as possible, structured conversation memory, and parallel running sub-agents—none of these make the model smarter. But they provide the model with the "right context at the right time," while preventing it from being overwhelmed by irrelevant information.

This layer of "wrapping" is called harness. And the real question every AI builder should ask is: what should be put into the harness, and what should be left outside?

This question actually has a very specific answer—I call it: thin harness, fat skills.

Five definitions

The bottleneck was never the model's intelligence. Models have long known how to reason, synthesize information, and write code.

They fail because they do not understand your data: your schema, your conventions, the shape your specific problem takes. The five definitions below exist precisely to solve this problem.

1. Skill file

A skill file is a reusable markdown document that teaches the model "how to do something." Note, it does not tell it "what to do"—that part is provided by the user. The skill file provides the process.

The key point that most people overlook is that a skill file is actually like a method call. It can accept parameters. You can call it with different parameters. The same process can exhibit entirely different capabilities based on the parameters provided.

For example, there is a skill called /investigate. It walks through seven steps, among them: define the data scope, build a timeline, diarize each document, synthesize, argue both sides, and cite sources. It accepts three parameters: TARGET, QUESTION, and DATASET.

If you point it to a security scientist and 2.1 million forensic emails, it will become a medical research analyst, judging whether a whistleblower has been suppressed.

If you point it to a shell company and the Federal Election Commission (FEC) filing documents, it will turn into a legal forensic investigator tracking collusive political donations.

It's still the same skill. The same seven steps. The same markdown file. The skill describes a judgment process, and what truly brings it to reality is the parameters passed at invocation.
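The "skill as method call" idea can be sketched as a tiny loader: a markdown process template plus a parameter dictionary. The file body, step wording, and helper names below are illustrative stand-ins, not the actual /investigate skill:

```python
from string import Template

# A hypothetical skill file: a markdown process with named parameter slots.
# The step list paraphrases the steps named in the text above.
INVESTIGATE_SKILL = Template("""\
# /investigate
Target: $TARGET
Question: $QUESTION
Dataset: $DATASET

1. Define the data scope
2. Build a timeline
3. Diarize each document
4. Synthesize
5. Argue both sides
6. Cite sources
""")

def call_skill(template: Template, **params: str) -> str:
    """Invoke a skill like a method call: same process, different arguments."""
    return template.substitute(params)

# The same skill file becomes two very different investigators.
medical = call_skill(INVESTIGATE_SKILL,
                     TARGET="a security scientist",
                     QUESTION="was the whistleblower suppressed?",
                     DATASET="2.1M forensic emails")
legal = call_skill(INVESTIGATE_SKILL,
                   TARGET="a shell company",
                   QUESTION="were the donations coordinated?",
                   DATASET="FEC filings")
```

The process text never changes; only the bindings do, which is exactly how a method call behaves.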

This is not prompt engineering but software design—only here, markdown is the programming language and human judgment is the runtime. In fact, markdown is even better suited to encapsulating capability than rigid source code: it describes processes, judgments, and context, which is precisely the language models "understand" best.

2. Harness

The harness is the layer of programs that drives the LLM to operate. It does four things: runs the model in a loop, reads and writes your files, manages context, and enforces safety constraints.

That's it. This is "thin."

The reverse pattern is: fat harness, thin skills.

You must have seen such things: over 40 tool definitions, half of the context window consumed just by explanations; an all-encompassing God-tool that takes 2 to 5 seconds for an MCP round trip; or packing each REST API endpoint into a separate tool. The result is that token usage triples, latency triples, and failure rates also triple.

The truly ideal approach is to use tools that are purpose-built, fast, and narrowly functional.

For example, use a Playwright CLI where each browser operation takes about 100 milliseconds, instead of a Chrome MCP that needs 15 seconds for a single screenshot → find → click → wait → read sequence. The former is 75 times faster.

Modern software does not need to be "meticulously crafted to the point of bloat" anymore. What you should do is: build only what you really need, and nothing more.
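The four responsibilities of a thin harness (loop the model, read/write files, manage context, enforce safety constraints) fit in a couple of dozen lines. `call_model` and `tools` below are placeholders for whatever model API and tool set you actually use; the point is the shape of the loop, not the names:

```python
import json

def minimal_harness(call_model, tools, task, max_turns=10):
    """A thin harness: loop the model, route tool calls, enforce an allowlist.
    `call_model(messages) -> dict` and `tools` are supplied by the caller;
    both are stand-ins for a real model API and real tool implementations."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(messages)
        if reply.get("tool") is None:           # no tool requested: we are done
            return reply["content"]
        name, args = reply["tool"], reply.get("args", {})
        if name in tools:
            result = tools[name](**args)
        else:                                   # safety constraint: allowlist only
            result = f"error: unknown tool {name!r}"
        # write the observation back into context and loop again
        messages.append({"role": "tool", "content": json.dumps({name: result})})
    return "stopped: turn limit reached"
```

Everything else—the judgment, the process, the domain knowledge—belongs in the skill files, not here.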

3. Resolver

A resolver is essentially a context routing table. When task type X appears, prioritize loading document Y. Skills tell the model "how to do"; resolvers tell the model "when to load what."

For instance, suppose a developer modifies a prompt. Without a resolver, they might push the change directly. With one, the model first reads docs/EVALS.md, which says: run the evaluation suite and compare before/after scores; if accuracy drops by more than 2%, roll back and investigate. The developer may not even have known an evaluation suite existed. The resolver loaded the right context at the right moment.

Claude Code has a built-in resolver. Each skill has a description field, and the model automatically matches user intent with the skill description. You do not need to remember whether the /ship skill exists—the description itself acts as a resolver.

To be frank: my previous CLAUDE.md had ballooned to 20,000 lines. Every quirk, every pattern, every lesson I had run into was stuffed in there. Ridiculous. The model's attention quality clearly declined. Claude Code itself flat-out told me to delete it.

The final fix needed only about 200 lines—just a few document pointers. Whichever document is actually needed gets loaded by the resolver at the critical moment. The 20,000 lines of knowledge remain accessible on demand, without polluting the context window.
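A minimal resolver is little more than a routing table plus a description matcher. The file paths, skill names, and word-overlap heuristic below are illustrative (a real system like Claude Code lets the model itself do the matching):

```python
# Hypothetical routing table: task type -> documents to load first.
RESOLVER = {
    "edit_prompt":  ["docs/EVALS.md"],
    "touch_schema": ["docs/MIGRATIONS.md"],
    "ship_release": ["docs/SHIP_CHECKLIST.md"],
}

def resolve(task_type: str) -> list:
    """Tell the model *when to load what*; skills tell it *how to do*."""
    return RESOLVER.get(task_type, [])

# Description-based matching, in the spirit of skill description fields:
SKILLS = {
    "/ship":  "package, test, and release the current branch",
    "/evals": "run the evaluation suite and compare before and after scores",
}

def match_skill(user_intent: str):
    """Naive matcher: pick the skill whose description shares the most words
    with the user's request."""
    words = set(user_intent.lower().split())
    best, score = None, 0
    for name, desc in SKILLS.items():
        overlap = len(words & set(desc.lower().split()))
        if overlap > score:
            best, score = name, overlap
    return best
```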

4. Latent and deterministic

Every step in your system belongs to one of these two types, and confusing them is the most common mistake in agent design.

·Latent space is where intelligence resides. The model reads, understands, judges, and decides here. What is processed here is: judgment, synthesis, pattern recognition.

·Deterministic space is where reliability resides. The same input always yields the same output. SQL queries, compiled code, and arithmetic all belong on this side.

An LLM can seat 8 people at dinner, weighing each person's character and relationships. Ask it to seat 800, and it will earnestly fabricate a chart that "looks reasonable but is completely wrong"—because that is no longer a latent-space problem but a combinatorial optimization problem awkwardly forced into latent space.

The worst systems keep putting work on the wrong side of this line. The best ones draw the boundary ruthlessly.
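The seating example makes the split concrete: judging who gets along is latent-space work, while actually packing people into tables is a deterministic assignment problem. A greedy sketch, with the LLM judgment stubbed out as a callback:

```python
def seat_tables(people, table_size, compatible):
    """Deterministic side: greedy table packing. The latent-space judgment is
    injected as `compatible(a, b) -> bool` (in practice, an LLM call per pair)."""
    tables = []
    for person in people:
        for table in tables:
            if len(table) < table_size and all(compatible(person, p) for p in table):
                table.append(person)
                break
        else:
            tables.append([person])   # open a new table if no existing one fits
    return tables
```

The model supplies O(n²) small judgments; the algorithm guarantees that no table ever overflows, no matter whether n is 8 or 800.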

5. Diarization

The diarization step is the key that truly allows AI to generate value from real-world knowledge work.

It means that the model reads all materials related to a topic and then writes out a structured picture. On one page, it condenses judgments from dozens or even hundreds of documents.

This is not something that can be produced by an SQL query. Nor can a RAG pipeline produce this. The model must truly read, hold contradictory information in mind simultaneously, notice what changes have occurred and when, and then synthesize this content into structured intelligence.

This is the difference between a database query and an analyst's briefing.
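The read-and-condense loop can be sketched as follows; `summarize` stands in for the model, and the briefing structure (a "claim" field plus a contradiction check) is an assumption for illustration, not the author's actual format:

```python
def diarize(docs, summarize):
    """Latent-space step, sketched: read every document on a topic and emit one
    structured page. `summarize(text) -> dict` stands in for the model."""
    timeline = []
    for doc in sorted(docs, key=lambda d: d["date"]):
        note = summarize(doc["text"])
        timeline.append({"date": doc["date"], **note})
    return {
        "entries": timeline,
        # flag adjacent entries whose claims disagree: what changed, and when
        "contradictions": [
            (a["date"], b["date"])
            for a, b in zip(timeline, timeline[1:])
            if a.get("claim") != b.get("claim")
        ],
    }
```

The deterministic shell (sorting, structuring, diffing) is trivial; the value is in the model's reading, which no query engine can replace.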

This architecture

These five concepts can form a very simple three-layer architecture.

·The top layer is fat skills: a process written in markdown that carries judgment, methodology, and domain knowledge. 90% of the value lies in this layer.
·The middle layer is a thin CLI harness: about 200 lines of code, input JSON, output text, default read-only.
·The bottom layer is your application system: QueryDB, ReadDoc, Search, Timeline—these are deterministic infrastructures.

The core principle is directional: push "intelligence" as far up to skills as possible; push "execution" as far down to deterministic tools; keep the harness light and thin.

The result of this is: whenever model capabilities improve, all skills automatically become more powerful; while the underlying deterministic system remains stable and reliable.

A learning system

Next, I will use a real system we are building in YC to demonstrate how these five definitions work together.

July 2026, Chase Center: Startup School had 6,000 founders participating. Each had structured application materials, questionnaire responses, transcripts of 1:1 conversations with mentors, and public signals: posts on X, GitHub commit history, and Claude Code usage logs (which show their development speed).

The traditional approach would involve a project team of 15 people individually reading applications, judging instinctively, and then updating a spreadsheet.

This works at a scale of 200 but completely breaks down at 6,000. No human can hold that many profiles in mind at once and realize that the three best candidates in AI agent infrastructure are a devtools founder in Lagos, a compliance founder in Singapore, and a CLI-tool developer in Brooklyn—who, in separate 1:1 conversations, described the same pain point in completely different words.

The model can do it. Here’s how:

Enrichment

There is a skill called /enrich-founder that pulls every data source to enrich and diarize each founder's profile, and highlights the gap between "what the founder says" and "what they actually do."

The underlying deterministic system handles SQL queries, GitHub data, browser tests against demo URLs, social-signal scraping, CrustData queries, and so on. A scheduled task runs once a day, keeping the profiles of all 6,000 founders up to date.

The output of diarization can capture information that keyword searches completely fail to discover:

Founder: Maria Santos
Company: Contrail (contrail.dev)
Self-description: "Datadog for AI agents"
Actually doing: 80% of code commits focused on the billing module → essentially a FinOps tool disguised as observability

This "statement vs. actual behavior" discrepancy requires reading GitHub commit history, application materials, and conversation transcripts simultaneously and integrating them. No embedding similarity search can do this, nor can keyword filtering. The model must read everything and then judge. (This is exactly the kind of task that belongs in latent space!)
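The judgment itself is latent-space work, but the signal it rests on has a deterministic core that is easy to sketch. The 80% threshold echoes the Santos example; the function name and data shape are assumptions:

```python
from collections import Counter

def flag_discrepancy(claimed, commits, threshold=0.8):
    """Compare what a founder says with where their commits actually go.
    `commits` is a list of (module, count) pairs; names are illustrative."""
    totals = Counter()
    for module, n in commits:
        totals[module] += n
    top_module, top_count = totals.most_common(1)[0]
    share = top_count / sum(totals.values())
    if share >= threshold and top_module != claimed:
        return f"claims {claimed!r}, but {share:.0%} of commits are {top_module!r}"
    return None
```

The deterministic side counts; the model then reads the flagged profile in full and decides whether the mismatch actually means anything.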

Matching

This is where the "skill = method call" exerts its power.

The same matching skill, when called three times, can yield completely different strategies:

/match-breakout: processes 1200 people, clustering by field, with 30 people per group (embedding + deterministic allocation)

/match-lunch: processes 600 people, "random matching" across fields, 8 people per table without repetition—LLM generates the topic first, then a deterministic algorithm arranges the seating

/match-live: processes live participants, matching 1 to 1 based on nearest neighbor embedding within 200ms, excluding people already seen
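Of the three, /match-live has the simplest deterministic core: a nearest-neighbour lookup with an exclusion set. A sketch under the assumption that embeddings arrive as plain float lists:

```python
def match_live(candidate, pool, embed, seen):
    """Nearest-neighbour live matching over embeddings, excluding people the
    candidate has already met. `embed(person) -> list[float]` stands in for a
    real embedding model."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    target = embed(candidate)
    eligible = [p for p in pool if p != candidate and p not in seen]
    return min(eligible, key=lambda p: sq_dist(target, embed(p)), default=None)
```

This is pure deterministic space, which is why it can answer within 200ms; the skill's latent-space contribution happens earlier, when the profiles that feed the embeddings are written.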

And the model can make judgments that traditional clustering algorithms cannot accomplish:

"Santos and Oram both belong to AI infrastructure, but they are not competitors—Santos does cost attribution, while Oram does orchestration. They should be placed in the same group."
"Kim wrote developer tools in his application, but 1:1 conversation shows he is doing SOC2 compliance automation. Should be reclassified to FinTech / RegTech."

This reclassification is something embeddings simply cannot catch. The model must read the entire profile.

Learning loop

After the event concludes, an /improve skill reads the NPS survey results, diarizes the "okay" feedback—not the bad reviews, but the "almost good" ones—and extracts patterns.

Then, it proposes new rules and writes them back into the matching skill:

When participants say "AI infrastructure," but over 80% of their code is for the billing module:
→ classify as FinTech, not AI Infra

When two people in the same group already know each other:
→ lower matching weight
Prioritize introducing new relationships

These rules are written back into the skill file and take effect automatically on the next run. The skill is rewriting itself. At July's event, "okay" ratings were 12%; at the next one, 4%.

The skill file learned what "okay" means, and the system became better without anyone rewriting code.
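The write-back step can be sketched as an idempotent append to the skill file; the path, section marker, and rule text below are illustrative:

```python
from pathlib import Path

def write_back_rule(skill_path, rule, marker="## Learned rules"):
    """Append a learned rule to a markdown skill file so it applies on the
    next run. File path, section marker, and rule text are placeholders."""
    path = Path(skill_path)
    text = path.read_text() if path.exists() else "# /match\n"
    if marker not in text:
        text += f"\n{marker}\n"
    if rule not in text:              # idempotent: never duplicate a rule
        text += f"- {rule}\n"
    path.write_text(text)
```

Because the rule lives in markdown rather than code, the next model generation reads it with better judgment for free.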

This pattern can be transferred to any field:

Retrieve → Read → Diarize → Count → Synthesize

Then: Survey → Investigate → Diarize → Rewrite skill

If you want to ask what the most valuable loop in 2026 is, it is this set. It can be applied to almost all knowledge work scenarios.

Skills are permanently upgraded

Recently, I posted a directive to OpenClaw on X, which received more response than expected:

Prompt: You are not allowed to do one-time work. If I ask you to do something that will be repeated in the future, you must:
·First manually process 3 to 10 samples and show me the results;
·If I approve, write it into a skill file;
·If it should run automatically, add it to the scheduled tasks.
The criterion is: if I need to ask again, it means you have failed.

This content received thousands of likes and over two thousand saves. Many people thought this was a trick of prompt engineering.

In fact, it is not; it is the architecture described above. Every skill you write is a permanent upgrade to the system. It does not degrade and does not forget. It runs automatically at three in the morning. And when the next generation of models ships, every skill instantly gets stronger: the judgment in the latent part improves while the deterministic part stays stable and reliable.

This is the source of the 100 times efficiency that Yegge spoke of.

It is not a smarter model, but rather: thin harness, fat skills, and a discipline that solidifies everything into capability.

The system will grow exponentially. Build once, run for the long term.

[Original Link]

Disclaimer: This article represents only the personal views of the author, not the position or views of this platform. It is shared for informational purposes only and does not constitute investment advice to anyone. Any dispute between users and the author is unrelated to this platform. If any article or image on this page involves infringement, please email the relevant proof of rights and identity to support@aicoin.com, and platform staff will verify it.
