
a16z: Large model deployment leads to forgetting; can "continuous learning" break this deadlock?

深潮TechFlow
11 hours ago
Summary: The breakthrough lies in letting the model keep doing, after deployment, the powerful thing it does during training: compress, abstract, and learn.

Authors: Malika Aubakirova, Matt Bornstein

Compiled by: 深潮TechFlow

深潮TechFlow Editor's Note: Large language models are "frozen" after training; once deployed, they can only rely on external patches like context windows and RAG to keep functioning, essentially akin to the amnesiac protagonist of "Memento": able to retrieve but incapable of genuinely learning anything new. Two a16z partners systematically map out the cutting-edge research direction of "continuous learning," breaking down a technical track that could redefine the ceiling of AI capabilities along three paths: context, modules, and weight updates.

In Christopher Nolan's "Memento," the protagonist Leonard Shelby lives in a fractured present. Brain damage has left him with anterograde amnesia, making it impossible for him to form new memories. Every few minutes, his world resets, trapped in an eternal "now," unable to recall what just happened and unaware of what is to come. To survive, he tattoos information on his body and uses Polaroids, relying on these external props to replace the memory functions his brain cannot fulfill.

Large language models also exist in a similar eternal present. After training, vast amounts of knowledge are frozen in parameters; the models cannot form new memories or update their parameters based on new experiences. To compensate for this defect, we have built them a scaffolding: chat history acts as short-term notes, retrieval systems as external notebooks, and system prompts akin to tattoos on their surface. Yet the models themselves have never truly internalized this new information.

Increasingly, researchers believe this is insufficient. The problems solved by in-context learning (ICL) presuppose that the answers (or fragments of answers) already exist somewhere in the world. However, for those questions that require genuine discovery (such as new mathematical proofs), adversarial scenarios (such as security offense and defense), or knowledge that is too implicit to be expressed in words, there is ample reason to believe: the model needs a way to directly write new knowledge and experience into parameters after deployment.

In-context learning is temporary. Real learning requires compression. Until we let models compress continuously, we may all remain trapped in the eternal present of "Memento." Conversely, if we could train the model to learn its own memory architecture instead of relying on external custom tools, we might unlock an entirely new dimension of scaling.

This research field is called continuous learning. The concept is not new (see the paper by McCloskey and Cohen in 1989), but we believe it is one of the most important research directions in the current AI landscape. The explosive growth in model capabilities over the past two to three years has made the gap between what models "know" and "can know" increasingly apparent. The purpose of this article is to share what we have learned from leading researchers in this field, clarify different paths of continuous learning, and promote the development of this topic within the entrepreneurial ecosystem.

Note: This article's formulation has benefited from deep conversations with a group of outstanding researchers, doctoral students, and entrepreneurs who generously shared their work and insights in the field of continuous learning. Their insights made this article much more robust than if we had written it alone. Thank you for your time and ideas!

Let's Talk Context

Before defending parameter-level learning (i.e., learning that updates model weights), it is necessary to acknowledge a fact: in-context learning does indeed work. Moreover, there is a compelling argument that it will continue to win.

At its core, the Transformer is a next-token predictor conditioned on a sequence. Provide it with the right sequence, and you can observe surprisingly rich behaviors without ever touching the weights. This is why methods like context management, prompt engineering, instruction fine-tuning, and few-shot examples are so powerful. Intelligence is encapsulated in static parameters, while the exhibited capabilities vary dramatically with the content fed into the window.

Cursor recently published a deep article on scaling autonomous programming agents, which is a good example: the model weights are fixed, and what truly makes the system run is the careful orchestration of context—what gets fed in, when to summarize, and how to maintain coherence over hours of autonomous operation.

OpenClaw is another good example. Its popularity is not due to any special model permissions (the underlying model is accessible to all), but because it efficiently transforms context and tools into working states: tracking what you are doing, structuring intermediate outputs, deciding when to re-inject prompts, and maintaining a persistent memory of previous work. OpenClaw has elevated the "shell design" of agents to the level of an independent discipline.

When prompt engineering first emerged, many researchers were skeptical that "just using prompts" could be a serious interface. It seemed like a hack. Yet it is the native interface of the Transformer architecture: it requires no retraining, and it automatically improves as the model advances. As the model becomes stronger, prompts become stronger. A "crude but native" interface often wins because it is directly coupled to the underlying system rather than working against it. So far, this has been the trajectory of LLM development.

State Space Models: Context on Steroids

As mainstream workloads shift from raw LLM calls to agent loops, the pressure on in-context learning has intensified. In the past, completely filling a context window was relatively rare: an LLM was typically asked to handle a list of discrete tasks, and the application layer could simply prune and compress chat history. For agents, however, a single task can consume a significant portion of the total available context. Every step of the agent loop depends on the context passed forward from previous iterations, and agents often fail after 20 to 100 steps because they derail: the context fills up, coherence deteriorates, and convergence becomes impossible.
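
The compression that agent shells perform can be sketched concretely. Below is a minimal, hypothetical loop (the tokenizer is a crude word count and `summarize` stands in for an LLM summarization call; both are assumptions for illustration, not any specific product's implementation): when the running context nears its budget, older steps collapse into a summary so the loop can continue.

```python
# Hypothetical agent-shell sketch, not a real product's implementation.

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-split word.
    return len(text.split())

def summarize(steps: list[str]) -> str:
    # Stand-in for an LLM summarization call: keeps only each step's header.
    return "SUMMARY: " + "; ".join(s.split(":")[0] for s in steps)

def run_agent(tasks: list[str], budget: int = 50) -> list[str]:
    context: list[str] = []
    for i, task in enumerate(tasks):
        context.append(f"step{i}: {task}")
        # When the window fills, compress everything except the last two steps.
        while sum(count_tokens(s) for s in context) > budget and len(context) > 2:
            context = [summarize(context[:-2])] + context[-2:]
    return context

ctx = run_agent([f"do subtask number {i} with some detail" for i in range(20)])
# The context stays within budget no matter how many steps have run.
```

The design point: today this compression policy lives in the application layer; the continuous-learning argument is that the model itself should eventually learn it.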

Consequently, major AI labs are now investing massive resources (i.e., large-scale training runs) to develop models with ultra-long context windows. This is a natural path: it builds on already effective methods (in-context learning) and aligns with the industry's shift toward inference-time computation. The most common architectures interleave fixed-size memory layers, namely state space models and linear attention variants (collectively referred to here as SSMs), with ordinary attention layers. SSMs provide fundamentally better scaling curves in long-context scenarios.
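
The scaling difference is visible even in a toy recurrence. The sketch below uses random, untrained matrices to show the shape of the computation (an illustration, not a real SSM parameterization): it processes an arbitrarily long sequence while carrying only a fixed-size state, whereas attention would have to cache keys and values for every past token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 8, 4
# Toy, untrained parameters, for illustration only.
A = 0.9 * np.eye(d_state)             # decaying state transition
B = rng.normal(size=(d_state, d_in))  # input projection
C = rng.normal(size=(d_in, d_state))  # output projection

def ssm_scan(xs: np.ndarray) -> np.ndarray:
    # O(1) memory per step: only the fixed-size state h persists.
    h = np.zeros(d_state)
    ys = []
    for x in xs:
        h = A @ h + B @ x  # state stays the same size regardless of sequence length
        ys.append(C @ h)
    return np.stack(ys)

seq = rng.normal(size=(1000, d_in))
out = ssm_scan(seq)
# Attention over this sequence would cache 1000 key/value vectors;
# the scan carried only h, a vector of 8 floats.
```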

[Figure: Scaling comparison between SSMs and traditional attention mechanisms]

The aim is to raise the number of coherent steps an agent can take from roughly 20 to roughly 20,000, without losing the broad skills and knowledge provided by traditional Transformers. If successful, this would be a significant breakthrough for long-running agents. You could even view this approach as a form of continuous learning: although the model weights never update, an external memory layer is introduced that almost never needs resetting.

Thus, these non-parametric methods are real and powerful, and any evaluation of continuous learning must begin here. The question is not whether today's context systems are useful; they clearly are. The question is whether we have hit their ceiling, and whether new methods can take us further.

What Context Misses: The "Filing Cabinet" Fallacy

"What happens with AGI and pre-training is that, in a sense, they are over-tuned... Humans are not AGI. Yes, humans have a skill base, but they lack massive amounts of knowledge. What we rely on is continual learning. Suppose I create a super-smart fifteen-year-old who knows almost nothing: a good student, very eager to learn. You might say, go be a programmer, go be a doctor. Deployment itself involves some kind of learning, some trial and error. It's a process, not just shipping a finished product." — Ilya Sutskever

Imagine a storage system with infinite capacity. The world's largest filing cabinet, where every fact is perfectly indexed and instantly retrievable. It can look up anything. Has it learned?

No. It has never been forced to compress.

This is the crux of our argument, echoing a point Ilya Sutskever has made: LLMs are fundamentally compression algorithms. During training, they compress the internet into parameters. Compression is lossy, and it is precisely this lossiness that makes it powerful. Compression forces the model to seek structure, to generalize, and to build representations that transfer across contexts. A model that memorizes every training sample is less effective than one that extracts the underlying patterns. Lossy compression itself is learning.
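
A toy example makes the "filing cabinet" point concrete. Below (entirely illustrative numbers, not from the article), a lookup table memorizes fifty noisy samples of y ≈ 2x perfectly, while a lossy one-parameter fit compresses them into a single slope; only the compressed model extrapolates beyond the data it saw.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, size=50)
y_train = 2.0 * x_train + 0.1 * rng.normal(size=50)

# "Filing cabinet": store every pair exactly. Zero training error, no structure.
table = dict(zip(x_train, y_train))
def memorizer(x: float) -> float:
    # Pure retrieval: return the stored answer for the nearest stored question.
    keys = np.array(list(table.keys()))
    return table[keys[np.argmin(np.abs(keys - x))]]

# Lossy compression: squeeze all 50 pairs into one parameter (the slope).
slope = float(np.sum(x_train * y_train) / np.sum(x_train ** 2))

x_test = rng.uniform(2, 3, size=20)  # outside the training range
y_test = 2.0 * x_test
err_mem = float(np.mean([(memorizer(x) - y) ** 2 for x, y in zip(x_test, y_test)]))
err_fit = float(np.mean((slope * x_test - y_test) ** 2))
# The compressed model generalizes; the filing cabinet does not.
```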

Ironically, the very mechanism that makes LLMs so powerful during training (compressing raw data into compact, transferable representations) is precisely what we refuse to let them continue doing after deployment. At the moment we release them, we stop compression, substituting it with external memory. Of course, most agent shells will compress context in some custom way. But doesn’t the bitter lesson tell us that the model itself should learn this compression, directly and at scale?

Yu Sun shared an example that illustrates this debate: mathematics. Take Fermat's Last Theorem. For over 350 years, no mathematician could prove it, not because they lacked the right literature but because the solution was genuinely novel; the conceptual distance between existing mathematical knowledge and the final answer was too vast. When Andrew Wiles finally cracked it in the 1990s, he spent seven years working almost in isolation and had to invent entirely new techniques to reach the answer. His proof relied on bridging two different branches of mathematics: elliptic curves and modular forms. Ken Ribet had previously proven that if this connection could be established, Fermat's Last Theorem would follow automatically, but before Wiles, no one possessed the theoretical tools to actually construct the bridge. A similar argument can be made about Grigori Perelman's proof of the Poincaré conjecture.

The critical question is: Do these examples prove that LLMs lack something—a capacity for updated priors and genuine creative thinking? Or does this narrative demonstrate the opposite conclusion—that all human knowledge is merely data available for training and reorganization, and Wiles and Perelman showcased what LLMs could achieve on a larger scale?

This question is empirical, and the answer remains uncertain. However, we do know there are many categories of problems where in-context learning today will fail, while parameter-level learning may succeed. For example:

[Figure: Categories of problems where in-context learning fails and parameter learning may prevail]

Moreover, in-context learning can only deal with things expressible in language, while weights can encode concepts that prompts cannot convey. Some patterns are too high-dimensional, too implicit, or too deeply structured to fit within context. Distinguishing the visual texture of a benign artifact from a tumor in a medical scan, for instance, or pinning down the audio micro-vibrations that characterize a speaker's unique rhythm: these are patterns that resist being broken down into precise vocabulary. Language can only approximate them. No matter how long the prompt, it cannot convey them; such knowledge can only survive within the weights, in the latent space of learned representations, not in words. However large the context window grows, some knowledge remains beyond textual description and can only be understood through parameters.

This might explain why explicit "the bot remembers you" features (like ChatGPT's memory) often leave users feeling discomfort rather than delight. What users truly want is not recollection but capability. A model that has internalized your behavioral patterns can generalize to new scenarios; a model that merely recalls your history cannot. The gap between "this is what you wrote the last time you replied to this email" (verbatim repetition) and "I understand your thought process well enough to anticipate what you need" is the gap between retrieval and learning.

Introduction to Continuous Learning

There are multiple paths to continuous learning. The dividing line is not about "having memory functions" but rather: Where does compression occur? These paths are distributed along a spectrum, ranging from no compression (pure retrieval, frozen weights) to fully internal compression (weight-level learning, where the model becomes smarter), with an important intermediate zone (modules).

[Figure: Three paths of continuous learning: context, modules, weights]

Context

On the context side, teams build smarter retrieval pipelines, agent shells, and prompt orchestration. This is the most mature category: the infrastructure is validated and the deployment paths are clear. The limitation is depth: everything must fit within the length of the context.

A noteworthy new direction is multi-agent architectures as a scaling strategy for context itself. If a single model is limited to a 128K token window, a group of coordinated agents—each holding its own context, focusing on a slice of the problem, and communicating results—can potentially operate as a near-infinite working memory. Each agent performs in-context learning within its own window; the system performs aggregation. Karpathy’s recent autoresearch project and Cursor's web browsing example are early cases. This is purely a non-parametric method (not changing weights), but it significantly raises the ceiling of what context systems can achieve.

Modules

In the modular space, teams build pluggable knowledge modules (compressed KV caches, adapter layers, external memory stores) to allow general models to specialize without re-training. An 8B model, coupled with suitable modules, can match the performance of a 109B model on target tasks, with memory usage being just a fraction. The appeal lies in its compatibility with existing Transformer infrastructure.
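
The mechanics of such a module can be sketched with a LoRA-style low-rank adapter, a well-known instance of this category (the dimensions and numbers below are arbitrary): the base weight stays frozen, and task knowledge is compressed into two small trainable matrices that can be plugged in, swapped, or removed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W_frozen = rng.normal(size=(d, d)) / np.sqrt(d)  # base model weight, never updated

rank = 4                               # task knowledge lives in 2*d*rank parameters
A = rng.normal(size=(rank, d)) * 0.01  # trainable down-projection
B = np.zeros((d, rank))                # trainable up-projection, initialized to 0

def forward(x: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    # Frozen base path plus low-rank adapter path.
    return W_frozen @ x + B @ (A @ x)

x = rng.normal(size=d)
# With B = 0 the module is an exact no-op: removing it recovers the base model.
assert np.allclose(forward(x, A, B), W_frozen @ x)
params_full, params_adapter = d * d, 2 * d * rank  # 4096 vs 512
```

Swapping in a different (A, B) pair specializes the same frozen base to a different task, which is what makes these modules composable and cheap to experiment with.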

Weights

On the weight update side, researchers pursue true parameter-level learning: sparsely updating relevant parameters in memory layers, optimizing models through reinforcement learning cycles based on feedback, and compressing context into weights during test-time training. These are the deepest methods and the most challenging to deploy, yet they truly allow the model to fully internalize new information or skills.

There are various specific mechanisms for parameter updates. Here are a few research directions:

[Figure: Overview of research directions in weight-level learning]

Weight-level research encompasses multiple parallel routes. Regularization and weight-space methods have the longest history: EWC (Kirkpatrick et al., 2017) penalizes parameter changes in proportion to their importance to previous tasks, while weight interpolation (Kozal et al., 2024) mixes new and old weight configurations in parameter space; both can be fragile at scale. Test-time training, pioneered by Sun et al. (2020) and later evolved into architectural primitives (TTT layers, TTT-E2E, TTT-Discover), takes a fundamentally different approach: perform gradient descent on test data, compressing new information into parameters at the moment it is needed. Meta-learning asks whether we can train models to understand how to learn, from MAML's few-shot-friendly parameter initialization (Finn et al., 2017) to Nested Learning (Behrouz et al., 2025), which structures the model as a hierarchical optimization problem running rapid adaptation and slow updates at different time scales, inspired by biological memory consolidation.
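
The EWC idea mentioned above reduces to a quadratic penalty: moving a parameter costs more the more the previous task relied on it, as measured by the Fisher information. A minimal sketch with made-up numbers (a two-parameter model, not a real network):

```python
import numpy as np

def ewc_grad(theta, theta_star, fisher, lam=10.0):
    # Gradient of the EWC penalty 0.5 * lam * sum(fisher * (theta - theta_star)^2).
    return lam * fisher * (theta - theta_star)

theta_star = np.array([1.0, 1.0])  # weights at the end of task A
fisher = np.array([5.0, 0.01])     # theta[0] mattered for task A; theta[1] barely did

theta = theta_star.copy()
for _ in range(500):
    grad_b = theta  # gradient of task B's (hypothetical) loss 0.5 * ||theta||^2
    theta = theta - 0.01 * (grad_b + ewc_grad(theta, theta_star, fisher))
# Task B pulls both parameters toward 0, but only the unimportant one moves:
# theta[0] stays near 1.0 while theta[1] drops toward 0.
```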

Distillation retains knowledge from previous tasks by having a student model match a frozen teacher checkpoint. LoRD (Liu et al., 2025) makes distillation efficient enough to run continuously by pruning the model and replaying buffers at the same time. Self-distillation (SDFT, Shenfeld et al., 2026) flips the source of supervision, using the model's own outputs under expert conditioning as the training signal and thereby sidestepping the catastrophic forgetting of sequential fine-tuning. Recursive self-improvement operates on a similar principle: STaR (Zelikman et al., 2022) bootstraps reasoning capability from self-generated reasoning chains; AlphaEvolve (DeepMind, 2025) has uncovered improvements to optimization algorithms that had gone unimproved for decades; and Silver and Sutton's "era of experience" (2025) frames agent learning as a continuous stream of ongoing experience.

These research directions are converging. TTT-Discover has now fused test-time training with RL-driven exploration. HOPE nests fast and slow learning cycles within a single architecture. SDFT has turned distillation into a foundational operation for self-improvement. The boundaries between categories are becoming blurred. The next generation of continuous learning systems will likely combine various strategies: regularization for stability, meta-learning for acceleration, and self-improvement for compounding benefits. An increasing number of startups are betting on different layers of this tech stack.

The Landscape of Continuous Learning Startups

The non-parametric end of the spectrum is the most well-known. Agent-shell companies (Letta, mem0, Subconscious) build orchestration layers and scaffolding to manage the content fed into the context window. External storage and RAG infrastructure providers (such as Pinecone, xmemory) supply the retrieval backbone. The data exists; the challenge is placing the right slice in front of the model at the right time. As context windows expand, the design space for these companies also grows, especially on the shell end, where a new wave of startups is emerging to manage increasingly complex context strategies.

The parametric end is more nascent and diverse. Companies here are trying some version of "post-deployment compression," enabling models to internalize new information within weights. The paths can be roughly classified into different bets regarding how models should "learn" after deployment.

Partial Compression: Learning without retraining. Some teams are building pluggable knowledge modules (compressed KV caches, adapter layers, external memory stores) that allow general models to specialize without altering core weights. The common argument is that meaningful compression (not just retrieval) can be achieved while keeping the stability-plasticity trade-off manageable, because learning is isolated rather than dispersed throughout the parameter space. An 8B model paired with the right modules can match the performance of much larger models on target tasks. The advantage lies in composability: modules drop into existing Transformer architectures and can be swapped or updated independently, at an experimentation cost far lower than retraining.

RL and Feedback Loops: Learning from signals. Other teams bet that the richest signals for post-deployment learning already exist within the deployment cycle itself—user corrections, task success and failure, rewards from real-world outcomes. The core idea is that models should treat every interaction as a potential training signal, not just inference requests. This closely mirrors how humans progress at work: doing tasks, receiving feedback, internalizing which methods work. The engineering challenge is converting sparse, noisy, and sometimes adversarial feedback into stable weight updates without catastrophic forgetting. But a model that can truly learn from deployment will generate compound value in ways that context systems cannot.
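
The core loop these teams describe can be sketched in a few lines: treat each deployment interaction as a (features, reward) pair and fold it into the weights with a small reward-weighted update. This is a deliberately crude, REINFORCE-flavored toy with hypothetical features and rewards; it ignores the hard parts the paragraph names, namely noise, adversarial feedback, and forgetting.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_from_feedback(w: np.ndarray, feats: np.ndarray, reward: float,
                         lr: float = 0.01) -> np.ndarray:
    # Reward-weighted update: reinforce feature directions that earned reward.
    return w + lr * reward * feats

true_pref = np.array([1.0, 0.0, 0.0])  # hidden user preference the feedback reflects
w = np.zeros(3)
for _ in range(500):
    feats = rng.normal(size=3)                              # features of an action taken
    reward = float(feats @ true_pref) + 0.5 * rng.normal()  # noisy user feedback
    w = update_from_feedback(w, feats, reward)
# The weights internalize the signal: w comes to point along the true preference.
```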

Data-Centric: Learning from the right signals. A related but distinct bet is that the bottleneck isn't in learning algorithms, but in training data and surrounding systems. These teams focus on filtering, generating, or synthesizing the right data to drive continuous updates: presupposing a model with high-quality, well-structured learning signals that can be meaningfully improved with far fewer gradient steps. This naturally connects with feedback loop companies, but emphasizes upstream issues: whether a model can learn is one matter; what it should learn from and to what extent is another.

New Architectures: Learning capabilities from the ground up. The most radical bet posits that the Transformer architecture itself is the bottleneck, and continuous learning necessitates fundamentally different computational primitives: architectures endowed with continuous time dynamics and built-in memory mechanisms. The argument here is structural: if you want a continuous learning system, you should embed learning mechanisms into the underlying infrastructure.

[Figure: Landscape of continuous learning startups]

All the major labs are actively placing bets across these categories. Some are exploring better context management and reasoning-chain inference; others are experimenting with external memory modules or sleep-time compute pipelines; still others are pursuing new architectures in stealth. The field is early enough that no single method has emerged victorious, and given the breadth of use cases, there need not be just one winner.

Why Naïve Weight Updates Fail

Updating model parameters in a production environment triggers a series of failure modes that remain unresolved at scale.

[Figure: Failure modes of naïve weight updates]

The engineering problems are well documented. Catastrophic forgetting means that a model plastic enough to learn from new data will destroy existing representations; this is the stability-plasticity dilemma. Temporal decoupling refers to invariant rules and variable state being compressed into the same set of weights, so that updating one damages the other. Logical integration fails because factual updates do not propagate to their inferences: edits operate at the level of token sequences, not semantic concepts. Unlearning remains impossible: there is no differentiable subtraction operation, and thus no surgical protocol for precisely removing false or toxic knowledge.
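
The first of these failure modes is easy to reproduce even in a two-parameter linear model. In the toy below (illustrative, not from the article), naively continuing SGD on task B erases what the shared weights learned on task A:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two tasks that need the same two weights to be different things.
w_a, w_b = np.array([2.0, 0.0]), np.array([0.0, 2.0])
Xa = rng.normal(size=(100, 2)); ya = Xa @ w_a
Xb = rng.normal(size=(100, 2)); yb = Xb @ w_b

def sgd(w, X, y, lr=0.1, steps=300):
    # Plain per-sample SGD on squared error; no protection for old knowledge.
    for _ in range(steps):
        i = rng.integers(len(X))
        w = w - lr * (w @ X[i] - y[i]) * X[i]
    return w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = sgd(np.zeros(2), Xa, ya)      # learn task A
loss_a_before = mse(w, Xa, ya)    # near zero
w = sgd(w, Xb, yb)                # naively keep training on task B
loss_a_after = mse(w, Xa, ya)     # task A performance collapses
```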

There is also a second class of problems that has received less attention. The current separation of training and deployment is not just an engineering convenience; it is the boundary of safety, auditability, and governance. Opening this boundary can cause multiple issues at once. Safety alignment may degrade unpredictably: even narrow fine-tuning on benign data can yield broadly misaligned behavior. Continuous updates create a data poisoning attack surface, a slow, persistent version of prompt injection that lives within the weights. Auditability collapses because a continuously updated model is a moving target, making version control, regression testing, and one-time certification impossible. When user interactions are compressed into parameters, privacy risks increase, because sensitive information gets baked into representations, where it is harder to filter out than information retrieved from context.

These are open questions, not fundamental impossibilities. Solving them is as much a part of the continuous learning research agenda as tackling core architectural challenges.

From "Memento" to True Memory

Leonard’s tragedy in "Memento" is not that he cannot function—in any scene, he is clever, even outstanding. His tragedy lies in his inability to compound. Each experience remains external—a Polaroid, a tattoo, a note written by someone else. He can retrieve but cannot compress new knowledge.

As Leonard navigates this self-constructed maze, the line between reality and belief begins to blur. His condition not only deprives him of memory; it forces him to constantly reconstruct meaning, making him both a detective and an unreliable narrator within his story.

Today’s AI operates under the same constraints. We have built incredibly powerful retrieval systems: longer context windows, smarter shells, coordinated multi-agent groups, and they work. But retrieval is not equivalent to learning. A system capable of retrieving any fact has not been compelled to seek structure. It has not been forced to generalize. The mechanism that renders training so powerful—transforming raw data into transferable representations—is exactly what we turn off at the moment of deployment.

The path forward will likely not be a single breakthrough but rather a layered system. In-context learning will remain the first line of adaptive defense: it is native, validated, and constantly improving. Modular mechanisms can handle the intermediate ground of personalization and domain specialization. However, for those truly challenging problems—discovery, adversarial adaptation, and implicit knowledge that cannot be expressed in words—we may need to let models continue compressing experiences into parameters after training. This implies advancements in sparse architectures, meta-learning objectives, and self-improvement loops. It may also require us to redefine what a "model" means: not a set of fixed weights, but an evolving system that encompasses its memory, its update algorithms, and its ability to abstract from its experiences.

The filing cabinet keeps getting bigger. But no matter how large it grows, it remains a filing cabinet. The breakthrough lies in letting the model keep doing, after deployment, the powerful thing it does during training: compress, abstract, and learn. We stand at the turning point from amnesiac models to models with a glimmer of experience. Otherwise, we will remain trapped in our own "Memento."

Disclaimer: This article represents only the personal views of the author and does not represent the position or views of this platform. It is provided for information sharing only and does not constitute investment advice to anyone. Any dispute between users and the author is unrelated to this platform. If any article or image on this page involves infringement, please email the relevant proof of rights and proof of identity to support@aicoin.com, and platform staff will verify it.
