a16z: Can continual learning cure AI's "amnesia"?

  • The article uses *Memento* as a metaphor for LLMs' predicament: frozen parameters, reliant on in-context learning (ICL), with no true internalization of new knowledge.
  • ICL is retrieval, not learning; it lacks compression, failing with tacit knowledge, creative discovery, and adversarial scenarios.
  • Three paths for continual learning:
    • Context (agent scaffolds, multi-agent systems)
    • Modules (plug-and-play knowledge modules: adapters, compressed KV caches)
    • Weights (parameter-level updates: test-time training, meta-learning, self-improvement)
  • Weight updates face challenges: catastrophic forgetting, temporal decoupling, safety alignment degradation.
  • Future systems will likely be layered: ICL for fast adaptation, modules for specialization, weights for deep internalization.

Original authors: Malika Aubakirova, Matt Bornstein, a16z crypto

Original translation: Deep Tide TechFlow

In Christopher Nolan's *Memento*, the protagonist, Leonard Shelby, lives in a fragmented present. Brain damage has left him with anterograde amnesia, preventing him from forming new memories. Every few minutes, his world resets, trapping him in an eternal "now," unable to remember what just happened or what will happen next. To survive, he gets tattoos and takes Polaroid photos, using these external tools to compensate for the brain's inability to retain memories.

Large language models live in a similar perpetual present. After training, massive amounts of knowledge are frozen into their parameters; the model cannot form new memories or update its weights from new experience. To compensate, we've built scaffolding around it: chat history serves as short-term notes, retrieval systems as an external notebook, and system prompts as tattoos. But the model itself never truly internalizes any of this new information.

A growing number of researchers believe this is insufficient. In-context learning (ICL) can solve problems whose answers (or fragments of them) already exist somewhere in the world. But for problems that require genuine discovery (entirely new mathematical proofs), adversarial settings (security attack and defense), or knowledge too tacit to be put into words, there are strong reasons to believe the model needs a way to incorporate new knowledge and experience directly into its parameters after deployment.

In-context learning is temporary. True learning requires compression. Until we let models keep compressing, we may remain trapped in the eternal present of *Memento*. Conversely, if we can train models to learn their own memory architecture, rather than relying on external, hand-built tools, it might unlock a whole new dimension of scaling.

This research field is called continual learning. The concept isn't new (see McCloskey and Cohen's 1989 paper), but we believe it is one of the most important research directions in AI today. The explosive growth in model capabilities over the past two or three years has made the gap between what a model "knows" and what it "could know" increasingly apparent. The purpose of this article is to share what we've learned from top researchers in the field, to clarify the different paths to continual learning, and to promote the topic within the startup ecosystem.

Note: This article was made possible by in-depth exchanges with a group of outstanding researchers, doctoral students, and entrepreneurs who generously shared their work and insights on continual learning. From theoretical foundations to the engineering realities of post-deployment learning, their insights make this article far more solid than anything we could have written alone. Thank you for your time and ideas!

Let's talk about context first.

Before defending parametric learning (that is, learning by updating model weights), it's necessary to acknowledge a fact: in-context learning really does work. And there's a strong argument that it will keep winning.

The essence of a Transformer is a conditional next-token predictor over a sequence. Give it the right sequence, and you get surprisingly rich behavior without touching the weights at all. This is why context management, prompt engineering, instruction fine-tuning, and few-shot examples are so powerful. Intelligence is encapsulated in static parameters, while the exhibited capabilities vary dramatically depending on what you feed into the window.

Cursor's recent in-depth article on scaling autonomous coding agents is a good example: the model weights are fixed, and what really makes the system work is the careful arrangement of context: what to include, when to summarize, and how to maintain coherent state across hours of autonomous operation.

OpenClaw is another good example. Its popularity isn't due to special model access (the underlying models are available to everyone), but to how efficiently it turns context and tools into working state: tracking what you're doing, structuring intermediate artifacts, deciding when to re-inject prompts, and maintaining a persistent memory of previous work. OpenClaw elevates the "shell design" of agents to an independent discipline.

When prompt engineering first emerged, many researchers were skeptical that "mere prompts" could be a legitimate interface. It seemed like a hack. But prompting is a native product of the Transformer architecture: it requires no retraining and upgrades automatically as the model improves. The stronger the model, the more powerful prompting becomes. "Simple but native" interfaces often win because they are directly coupled to the underlying system rather than working against it. This is precisely the trajectory of LLM development to date.

State-space models: context on steroids

As mainstream workloads shift from raw LLM calls to agent loops, in-context learning comes under increasing pressure. Filling the context window used to be relatively rare: it typically happened when an LLM was asked to complete a long series of discrete tasks, and the application layer could prune and compress the chat history fairly directly.

For an agent, however, a single task can consume a large share of the total available context. Each step of the agent's loop depends on the context passed forward from previous iterations. Agents often fail after 20 to 100 steps because they lose the thread: the context becomes cluttered, coherence degrades, and the task fails to converge.

Major AI labs are therefore investing heavily (that is, large-scale training runs) in models with extremely long context windows. This is a natural path: it builds on a method that already works (in-context learning) and aligns with the industry's broader shift toward inference-time computation. The most common architecture intersperses fixed-size-memory layers among regular attention layers, namely state-space models (SSMs) and linear-attention variants (collectively "SSMs" below). SSMs offer a fundamentally better scaling curve in long-context scenarios.

Caption: Comparison of scaling between SSM and traditional attention mechanisms
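To make the scaling difference concrete, here is a minimal sketch of the recurrent state update behind linear attention and SSM-style layers; it is our own simplified illustration, not any lab's production architecture, and it omits the feature maps and normalizers real variants use. The memory is a fixed-size matrix updated once per token, so per-step compute and memory stay constant however long the sequence grows:

```python
import numpy as np

def linear_attention_step(state, k, v, q):
    """One recurrent step of a simplified linear-attention / SSM-style layer."""
    state = state + np.outer(k, v)  # write: rank-1 update into fixed-size memory
    out = q @ state                 # read: cost O(d_k * d_v), independent of history
    return state, out

d_k, d_v = 64, 64
state = np.zeros((d_k, d_v))        # the entire "context" lives in this matrix
rng = np.random.default_rng(0)
for _ in range(100_000):            # constant cost per token; softmax attention
    k = rng.normal(size=d_k)        # would rescan all 100k past KV pairs instead
    v = rng.normal(size=d_v)
    q = rng.normal(size=d_k)
    state, out = linear_attention_step(state, k, v, q)
```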

The goal is to help agents increase the number of consecutive steps by several orders of magnitude, from approximately 20 steps to approximately 20,000 steps, without losing the extensive skills and knowledge offered by traditional Transformers. If successful, this would be a major breakthrough for long-running agents.

You can even think of this approach as a form of continual learning: the model weights are not updated, but an external memory layer is introduced that almost never needs to be reset.

These non-parametric methods, then, are real and powerful. Any assessment of continual learning must start here. The question isn't whether today's context-based systems are useful; they clearly are. The question is whether we have reached their ceiling, and whether new methods can take us further.

What context is missing: the "filing cabinet fallacy"

"What happens with AGI and pre-training is that, in a sense, they overshoot... Humans are not AGI. Yes, humans do have a skill base, but humans lack a vast amount of knowledge. What we rely on is continuous learning."

Imagine I deploy a superintelligent 15-year-old: he doesn't know much yet, but he's a good student, eager to learn. You could say: become a programmer, become a doctor. Deployment itself involves a process of learning and trial and error. It's a process, not just throwing a finished product out into the world. —Ilya Sutskever

Imagine a system with unlimited storage: the world's largest filing cabinet, with every fact perfectly indexed and instantly searchable. It can find anything. Has it learned anything?

No. It was never forced to compress.

This is the core of our argument, echoing a point Ilya Sutskever has made: LLMs are essentially compression algorithms. During training, they compress the internet into parameters. That compression is lossy, and the lossiness is precisely what makes it powerful. Compression forces the model to find structure, to generalize, to build representations that transfer across contexts. A model that memorizes every training sample is less capable than one that extracts the underlying patterns. Lossy compression is itself a form of learning.

Ironically, the very mechanism that makes LLMs so powerful during training (compressing raw data into compact, transferable representations) is exactly what we forbid them from doing after deployment. We halt compression at the moment of release and substitute external memory for it.

Of course, most agent shells compress context in some hand-rolled way. But doesn't the bitter lesson tell us that the model itself should learn this compression, directly and at scale?

Yu Sun shared an example that illustrates this debate: mathematics. Consider Fermat's Last Theorem. For over 350 years, no mathematician could prove it, not because they lacked the right documents, but because the solution was genuinely novel: the conceptual gap between existing mathematical knowledge and the final answer was too large.

When Andrew Wiles finally cracked it in the 1990s, he spent seven years working in near isolation and had to invent entirely new techniques to reach the answer. His proof hinged on bridging two distinct branches of mathematics: elliptic curves and modular forms. Ken Ribet had already shown that establishing this connection would automatically settle Fermat's Last Theorem, but before Wiles no one had the theoretical tools to actually build the bridge. Grigori Perelman's proof of the Poincaré conjecture tells a similar story.

The core question: do these examples prove that LLMs lack something, some capacity to update their priors and think truly creatively? Or does the story prove the opposite, that all human knowledge is just data to be trained on and recombined, and that Wiles and Perelman simply did what LLMs could do at far greater scale?

This is an empirical question, and the answer is still uncertain. What we do know is that in-context learning fails today on many categories of problems where parametric learning might help. For example:

Caption: Problem categories where in-context learning fails and parametric learning may succeed.

More importantly, in-context learning can only handle what can be expressed in language, while weights can encode concepts that prompts cannot put into words. Some patterns are too high-dimensional, too tacit, or too deeply structured to fit into a context window. The visual textures that distinguish benign artifacts from tumors in medical scans, or the subtle audio fluctuations that define a speaker's unique rhythm, do not decompose into precise words.

Language can only approximate them. No prompt, however long, can convey them; this kind of knowledge can only live in weights, in the latent space of learned representations rather than in words. No matter how large the context window grows, some knowledge will never be describable in text and can only be carried by parameters.

This might explain why explicit "the bot remembers you" features (like ChatGPT's memory) often make users uncomfortable rather than delighted. What users really want isn't recollection but capability. A model that has internalized your behavioral patterns can generalize to new scenarios; a model that merely recalls your history cannot. The gap between "this is what you wrote last time you replied to this kind of email" (verbatim recall) and "I understand your thinking well enough to anticipate your needs" is the gap between retrieval and learning.

Getting started with continual learning

There are multiple paths to continual learning. The dividing line is not whether a system has memory, but where compression happens. The paths sit along a spectrum, from no compression (pure retrieval, frozen weights) to full internal compression (weight-level learning, where the model itself gets smarter), with an important middle zone (modules) in between.

Caption: Three paths to continual learning: context, modules, and weights.

Context

On the context side, teams are building smarter retrieval pipelines, agent shells, and prompt orchestration. This is the most mature category: the infrastructure is proven and the deployment path is clear. The limitation is depth: context length bounds how much can be held.

A noteworthy new direction: multi-agent architectures as a scaling strategy for context itself. If a single model is confined to a 128K-token window, a coordinated swarm of agents, each holding its own context, focusing on one slice of the problem, and passing results to the others, can approach unbounded working memory in aggregate. Each agent learns in context within its own window; the system aggregates. Karpathy's recent autoresearch project and Cursor's example of building a web browser are early instances. This is purely non-parametric (no weights change), but it significantly raises the ceiling of what a context-based system can achieve.
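As a rough illustration of the pattern (all names here are hypothetical placeholders, not a real framework; `call_llm` stands in for any chat-completion API):

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real model API client; swap in your own."""
    return f"[model output for: {prompt[:40]}...]"

def solve_with_swarm(task: str, subtasks: list[str]) -> str:
    """Each sub-agent gets a fresh window holding only its slice of the
    problem, so no single context must contain the whole task state."""
    results = [
        call_llm(f"Overall task: {task}\nYour subtask: {sub}")
        for sub in subtasks
    ]
    # The aggregator sees only short summaries, not the sub-agents'
    # full working contexts: working memory scales with the swarm.
    summary = "\n".join(f"- {r}" for r in results)
    return call_llm(f"Task: {task}\nSubtask results:\n{summary}\nCombine into a final answer.")
```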

Modules

In the module space, teams build pluggable knowledge modules (compressed KV caches, adapter layers, and external memory stores) that let general-purpose models specialize without retraining. An 8B model with the right modules can match a 109B model on the target task while using a fraction of the memory footprint. The appeal is compatibility with existing Transformer infrastructure.
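A hand-rolled sketch of the adapter idea, assuming nothing beyond PyTorch (this illustrates the general low-rank-adapter technique, not any particular company's product):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a small trainable low-rank module."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # the general model stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # B starts at zero, so behavior is unchanged until the module trains.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8,192 trainable adapter weights vs 262,656 frozen ones
```

Because the module is isolated, it can be trained per task or per customer, swapped at load time, and discarded without ever touching the base weights.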

Weights

On the weights side, researchers are pursuing true parameter-level learning: sparse memory layers that update only the relevant slices of parameters, reinforcement-learning loops that improve the model from feedback, and test-time training that compresses context into weights during inference. These are the deepest methods and the hardest to deploy, but they are the ones that let a model fully internalize new information or skills.

There are various specific mechanisms for parameter updates. Here are a few research directions:

Caption: Overview of research directions in weight-based learning

Research on weight-based methods spans several parallel approaches. Regularization and weight-space methods have the longest history: EWC (Kirkpatrick et al., 2017) penalizes changes to parameters in proportion to their importance for previous tasks, while weight interpolation (Kozal et al., 2024) mixes old and new weight configurations in parameter space. Both are relatively fragile at scale.
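For reference, the published EWC objective, where $\mathcal{L}_B$ is the loss on the new task $B$, $\theta^{*}_{A}$ the parameters learned on the old task $A$, and $F_i$ the Fisher information estimating how much task $A$'s performance depends on parameter $i$:

$$
\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2}\, F_i \left(\theta_i - \theta^{*}_{A,i}\right)^2
$$

The hyperparameter $\lambda$ is the stability-plasticity dial: larger values protect old knowledge at the cost of adaptation to the new task.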

Test-time training, pioneered by Sun et al. (2020) and later developed into architectural primitives (TTT layers, TTT-E2E, TTT-Discover), takes a completely different approach: perform gradient descent on the test data itself, compressing new information into the parameters at the moment it is needed.
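A minimal sketch of the idea (the published TTT layers are a dedicated architecture; this shows only the core move of briefly fine-tuning on the test input under an assumed self-supervised objective):

```python
import copy
import torch

def test_time_adapt(model, x_test, self_supervised_loss, steps=5, lr=1e-4):
    """Compress the test input into (a copy of) the weights before predicting.

    `self_supervised_loss` is a placeholder objective computed on x_test
    alone, e.g. reconstructing masked-out pieces of the input.
    """
    adapted = copy.deepcopy(model)   # leave the deployed weights untouched
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    adapted.train()
    for _ in range(steps):           # a few gradient steps on the test data itself
        loss = self_supervised_loss(adapted, x_test)
        opt.zero_grad()
        loss.backward()
        opt.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(x_test)
```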

Meta-learning asks: can we train a model that knows how to learn? The line runs from few-shot-friendly parameter initialization in MAML (Finn et al., 2017) to Nested Learning (Behrouz et al., 2025), which structures the model as a hierarchy of optimization problems, running fast-adapting and slow-updating modules at different time scales, inspired by biological memory consolidation.
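A first-order MAML sketch under the same caveats (illustrative shapes; each task supplies a support set to adapt on and a query set to evaluate the adapted copy):

```python
import copy
import torch

def maml_outer_step(model, tasks, loss_fn, meta_opt, inner_lr=0.01):
    """One meta-update: the shared initialization is trained so that a
    single inner gradient step per task already performs well."""
    meta_opt.zero_grad()
    for support, query in tasks:
        fast = copy.deepcopy(model)             # per-task throwaway copy
        loss_fn(fast, support).backward()       # inner adaptation step
        with torch.no_grad():
            for p in fast.parameters():
                p -= inner_lr * p.grad
                p.grad = None
        # First-order approximation: push the adapted copy's query-loss
        # gradients back onto the shared initialization.
        grads = torch.autograd.grad(loss_fn(fast, query), fast.parameters())
        for p, g in zip(model.parameters(), grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()
```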

Distillation preserves knowledge of previous tasks by having a student model match frozen teacher checkpoints. LoRD (Liu et al., 2025) makes distillation cheap enough to run continuously by pruning the model and the replay buffer simultaneously. Self-distillation (SDFT, Shenfeld et al., 2026) flips the source: the model's own outputs, generated under expert conditioning, become the training signal, sidestepping the catastrophic forgetting of sequential fine-tuning.
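The core distillation loss is simple; here is a sketch using the standard temperature-softened KL divergence (our own minimal version, not LoRD's or SDFT's full training recipe):

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Match the student to a frozen teacher checkpoint: the softened
    teacher distribution acts as a snapshot of old-task behavior."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    # The T*T factor keeps gradient scale comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```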

Recursive self-improvement builds on similar ideas: STaR (Zelikman et al., 2022) bootstraps reasoning ability from self-generated reasoning chains; AlphaEvolve (DeepMind, 2025) discovers algorithmic optimizations that had not been improved in decades; Silver and Sutton's "Era of Experience" (2025) frames agent learning as a never-ending stream of experience.
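A minimal sketch of one STaR round (the `model.generate`, `extract_answer`, and `fine_tune` hooks are hypothetical placeholders, not the paper's exact pipeline):

```python
def extract_answer(rationale: str) -> str:
    """Placeholder: pull the final answer out of a reasoning chain."""
    return rationale.rsplit("Answer:", 1)[-1].strip()

def star_round(model, problems, fine_tune):
    kept = []
    for question, gold in problems:
        rationale = model.generate(f"Q: {question}\nThink step by step.")
        if extract_answer(rationale) == gold:   # keep only chains that reach
            kept.append((question, rationale))  # the verified correct answer
    # Fine-tune on the model's own successful reasoning; iterating this
    # loop bootstraps ability from self-generated data.
    return fine_tune(model, kept)
```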

These research directions are converging. TTT-Discover integrates test-time training with RL-driven exploration. HOPE nests fast and slow learning loops within a single architecture. SDFT turns distillation into a primitive operation of self-improvement. The boundaries between the columns are blurring, and the next generation of continual learning systems will likely combine several strategies: regularization for stability, meta-learning for speed, self-improvement for compounding. A growing number of startups are betting on different layers of this stack.

The continual learning startup landscape

The non-parametric side of the spectrum is the best known. Shell companies (Letta, mem0, Subconscious) build orchestration layers and scaffolding that manage what goes into the context window. External storage and RAG infrastructure (such as Pinecone, xmemory) provide the retrieval backbone. The data already exists; the challenge is putting the right slice in front of the model at the right time. As context windows expand, the design space for these companies grows, especially on the shell side, where a new wave of startups is emerging to manage increasingly complex context strategies.

The parametric side is earlier-stage and more diverse. Companies here are experimenting with some form of post-deployment compression, letting models internalize new information in their weights. The approaches break down into several distinct bets on how a model should learn after deployment.

Partial compression: learning without retraining. Some teams are building pluggable knowledge modules (compressed KV caches, adapter layers, external memory stores) that let general-purpose models specialize without altering core weights. The common argument: because learning is isolated rather than scattered across the parameter space, you get meaningful compression (not just retrieval) while keeping the stability-plasticity trade-off manageable. An 8B model with suitable modules can match much larger models on the target task. The advantage is composability: modules plug into existing Transformer architectures, can be swapped or updated independently, and cost far less to experiment with than retraining.

RL and feedback loops: learning from signals. Other teams are betting that the richest signals for post-deployment learning already exist inside the deployment loop itself: user corrections, task successes and failures, and reward signals from real-world outcomes. The core idea is that the model should treat every interaction as a potential training signal, not just an inference request. This closely mirrors how humans improve at a job: do the work, get feedback, internalize what works. The engineering challenge is turning sparse, noisy, and sometimes adversarial feedback into stable weight updates without catastrophic forgetting. But a model that truly learns from deployment compounds value in ways context-based systems cannot.
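One concrete shape this can take is a REINFORCE-style reward-weighted update. This is a sketch under assumed inputs: each episode pairs the summed token log-likelihood of a response (still attached to the autograd graph) with a scalar reward, such as +1 for an accepted suggestion and -1 for a user correction:

```python
import torch

def feedback_update(opt, episodes, baseline=0.0):
    """Turn deployment feedback into a weight update: raise the likelihood
    of responses that earned positive reward, lower the rest."""
    opt.zero_grad()
    loss = torch.stack(
        [-(reward - baseline) * log_prob for log_prob, reward in episodes]
    ).mean()
    loss.backward()
    opt.step()
```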

Data-centric: learning from the right signals. A related but distinct bet is that the bottleneck isn't the learning algorithm but the training data and the systems around it. These teams focus on filtering, generating, or synthesizing the right data to drive continual updates, on the premise that a model fed high-quality, well-structured learning signals needs far fewer gradient steps to improve meaningfully. This connects naturally to the feedback-loop companies but emphasizes the upstream problem: whether a model can learn is one question; what it should learn from, and to what extent, is another.

New architectures: learning designed in from the ground up. The most radical bet holds that the Transformer architecture itself is the bottleneck, and that continual learning requires fundamentally different computational primitives: architectures with continuous-time dynamics and built-in memory mechanisms. The argument is structural: if you want a system that learns continually, the learning mechanism should be built into the substrate.

Caption: The continual learning startup landscape

All the major labs are actively positioning across these categories. Some are exploring better context management and chain-of-thought reasoning, others are experimenting with external memory modules or sleep-time compute pipelines, and a few stealth companies are pursuing new architectures. The field is early enough that no single approach has won, and given the breadth of use cases, there probably shouldn't be only one winner.

Why naive weight updates fail

Updating model parameters in a production environment triggers a series of failure modes that remain unsolved at scale.

Caption: Failure modes of naive weight updates

The engineering problems are well documented. Catastrophic forgetting: a model plastic enough to learn from new data can destroy its existing representations (the stability-plasticity dilemma). Temporal decoupling: invariant rules and mutable state get compressed into the same set of weights, so updating one corrupts the other. Logical integration fails because fact updates do not propagate to their implications: edits land at the level of token sequences, not semantic concepts. And unlearning remains out of reach: there is no differentiable subtraction, hence no surgical excision of false or toxic knowledge.
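Catastrophic forgetting is easy to reproduce at toy scale. The sketch below, assuming nothing beyond PyTorch, fits a network on task A (y = sin x), then naively fine-tunes it on task B (y = cos x) with no replay or regularization; task A's error typically rises by orders of magnitude:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
x = torch.linspace(-3, 3, 256).unsqueeze(1)
task_a, task_b = torch.sin(x), torch.cos(x)

def fit(target, steps=2000):
    for _ in range(steps):
        loss = ((net(x) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

fit(task_a)
err_before = ((net(x) - task_a) ** 2).mean().item()
fit(task_b)   # naive sequential update: no replay, no penalty
err_after = ((net(x) - task_a) ** 2).mean().item()
print(f"task A MSE before: {err_before:.5f}, after fine-tuning on B: {err_after:.5f}")
# Task B's gradients overwrite the representations task A depended on:
# the stability-plasticity dilemma in miniature.
```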

There is a second class of problems that gets less attention. The current separation between training and deployment is not merely an engineering convenience; it is the boundary on which safety, auditability, and governance rest. Open that boundary and several things can go wrong at once. Safety alignment can degrade unpredictably: even narrow fine-tuning on benign data can produce broad misalignment.

Continuous updates also create a data-poisoning attack surface: a slow, persistent version of prompt injection that lives in the weights. Auditability collapses, because a continuously updated model is a moving target that defeats version control, regression testing, and one-time certification. And privacy risks sharpen when user interactions are compressed into parameters: sensitive information gets baked into the representation, where it is far harder to filter out than information retrieved into context.

These are open problems, not fundamentally impossible ones. Solving them, alongside the core architectural challenges, is part of the continual learning research agenda.

From "fragments of memory" to true memory

Leonard's tragedy in *Memento* isn't that he can't function; he's resourceful, even brilliant, in every scene. His tragedy is that his knowledge cannot compound. Every experience remains external: a Polaroid photo, a tattoo, a note in someone else's handwriting. He can retrieve information, but he cannot compress new knowledge.

As Leonard navigates this self-constructed labyrinth, the lines between reality and belief begin to blur. His illness does more than just rob him of his memories; it forces him to constantly reconstruct meaning, making him both the detective and the unreliable narrator of his own story.

Today's AI operates under the same constraint. We've built incredibly powerful retrieval systems (longer context windows, smarter shells, coordinated multi-agent swarms), and they work. But retrieval is not learning. A system that can retrieve any fact is never forced to look for structure. It is never forced to generalize. The lossy compression that makes training so powerful, the mechanism that turns raw data into transferable representations, is precisely what we switch off the moment we deploy.

The path forward is likely not a single breakthrough but a layered system. In-context learning will remain the first line of adaptation: it is native, proven, and constantly improving. Modules can handle the middle ground of personalization and domain specialization.

But for truly challenging problems—discovery, adversarial adaptation, and tacit knowledge that cannot be expressed in words—we may need to allow models to continue compressing experience into their parameters after training. This implies advancements in sparse architectures, meta-learning objectives, and self-improvement loops. It may also require us to redefine the meaning of "model": not a fixed set of weights, but an evolving system encompassing its memory, its update algorithm, and its ability to abstract from its own experience.

Filing cabinets keep getting bigger. But even the biggest filing cabinet is still just a filing cabinet. The breakthrough lies in carrying what makes models powerful during training (compression, abstraction, learning) into deployment. We stand at the turning point between a model with amnesia and one with the beginnings of experience. Fail to cross it, and we stay trapped in our own fragmented memories.
