So a16z dropped a piece about continual learning last week, and I’ve been thinking about it obsessively since, in the way I used to think about the Transformer paper back in 2017 when I first read it three times in a row on a flight and arrived at my destination genuinely confused about what year it was.
The piece is good. Genuinely good. The Memento framing is a little overdone, but I’ll forgive it because the underlying point is correct: we have built extraordinarily capable retrieval systems and dressed them up as something that learns. And I think most people in the industry know this and just don’t say it out loud.
I want to dig into it — what they got right, what I think they’re glossing over, and some things happening in the research literature right now that I haven’t seen talked about enough.
“Retrieval is not learning. A system that can look up any fact has not been forced to find structure.”
That line from the piece is the one I keep returning to. It’s the crux of everything. Let me explain why I both love it and find it incomplete.
The compression argument is correct.
The core claim in the a16z article is that training is powerful because it’s lossy compression. When a model learns, it doesn’t memorize — it abstracts. It finds structure. The mess of the internet gets squeezed through a bottleneck, and what comes out the other side is a set of weights that generalize. And then the moment we deploy the model, we stop the compression and bolt on a filing cabinet instead.
This is, I think, deeply correct. And it’s not just a philosophical point — it has empirical consequences. Think about the difference between a model that has “seen” your email style during fine-tuning versus a model that has your last 500 emails in its context window. The fine-tuned model has internalized something. The context-window model is doing a very fast nearest-neighbor lookup. They produce different outputs, and anyone who’s played with both knows which one feels more like talking to something that actually understands you.
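To make that contrast concrete, here's a toy sketch of the two modes side by side. Everything in it (the embedding stub, the email snippets) is illustrative and assumes nothing about any real model; the point is just that retrieval leaves the model untouched while finetuning folds the same data into the weights.

```python
# Toy contrast between the two modes; all names and data are illustrative.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model: hash-based toy featurizer.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

# Mode 1: the filing cabinet. Nearest-neighbor lookup over past emails,
# stuffed into the prompt. Nothing about the model itself changes.
emails = ["Thanks for the intro!", "Circling back on the Q3 numbers.", "Let's push the sync to Thursday."]
index = np.stack([embed(e) for e in emails])

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [emails[i] for i in np.argsort(-sims)[:k]]

context_for_prompt = retrieve("draft a reply about the quarterly report")

# Mode 2: compression. The same emails pass through a training loop and
# become weights; afterwards there is nothing left to look up:
#   for batch in email_batches:
#       loss = next_token_loss(model(batch), batch)
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
```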
Research note
Sparse memory finetuning (arXiv:2510.15103, 2025) is one of the more interesting recent results here. When training on a stream of TriviaQA facts with full finetuning, performance on NaturalQuestions degraded by 89%. With sparse finetuning of dedicated memory layers, that number dropped to 11%. Sparsity — only updating the relevant fraction of parameters — looks like a meaningful ingredient. The filing cabinet and the weights don't have to be total opposites.
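The sparsity ingredient is easy to sketch in plain PyTorch. The parameter naming convention and the update step below are mine, not the paper's; the actual method also chooses which memory slots to touch per example, which this skips.

```python
import torch

# Minimal sketch of the sparsity idea (not the paper's implementation):
# freeze the base model and only let a small, dedicated memory block take
# gradient updates as new facts stream in.
def freeze_all_but_memory(model: torch.nn.Module, memory_tag: str = "memory") -> None:
    for name, param in model.named_parameters():
        param.requires_grad = memory_tag in name   # e.g. "blocks.7.memory.weight"

def continual_step(model, optimizer, batch, loss_fn) -> float:
    # Only the unfrozen memory parameters move, so the rest of the network
    # (and whatever NaturalQuestions-style knowledge it encodes) stays put.
    loss = loss_fn(model(batch["input"]), batch["target"])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```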
The filing cabinet critique has limits, though.
Here’s where I’ll push back a little. The piece sets up a dichotomy between “retrieval (bad, shallow)” and “weight updates (good, real learning).” But that’s too clean. The human brain doesn’t work like that either.
Memory consolidation in humans involves something remarkably like retrieval-augmented generation. You don’t compress every experience directly into long-term memory during the experience itself. You replay it — sometimes during sleep, sometimes consciously — and that replay is what drives consolidation. The hippocampus is, in a very rough sense, doing something like RAG over episodic memory and writing summaries back to the cortex. So the dichotomy between “retrieval” and “parametric learning” isn’t as clean as the framing suggests.
The a16z piece does mention biological memory consolidation briefly in the context of HOPE and nested learning, but I’d argue this point deserves more weight. Some of the most promising continual learning architectures are explicitly designed around this two-speed memory system — fast-adapting modules for recent experience, slow-updating parameters for consolidated knowledge. That’s not “filing cabinet vs. weights.” It’s a hybrid that more honestly mirrors what we know about intelligence.
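To make the hybrid concrete, here's a toy two-speed layer in PyTorch. The structure (a fast residual path that adapts quickly, plus an occasional consolidation step that folds it into the slow weights) is my own cartoon of the idea, not HOPE or nested learning.

```python
import torch

# Toy two-speed memory layer (my cartoon, not HOPE or nested learning):
# a fast residual path adapts to recent experience, and the slow weights
# absorb it only occasionally, like replay-driven consolidation writing
# summaries back to the cortex.
class TwoSpeedLinear(torch.nn.Module):
    def __init__(self, dim: int, consolidation_rate: float = 0.01):
        super().__init__()
        self.slow = torch.nn.Linear(dim, dim)                # consolidated knowledge
        self.fast = torch.nn.Linear(dim, dim, bias=False)    # recent experience
        torch.nn.init.zeros_(self.fast.weight)               # starts as a no-op
        self.consolidation_rate = consolidation_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.slow(x) + self.fast(x)                   # slow path + fast residual

    @torch.no_grad()
    def consolidate(self) -> None:
        # Fold a fraction of the fast weights into the slow ones, then decay
        # the fast weights back toward zero.
        self.slow.weight += self.consolidation_rate * self.fast.weight
        self.fast.weight *= 1.0 - self.consolidation_rate
```

In training you'd give the fast path a much larger learning rate than the slow path (or freeze the slow path between consolidation passes), and call consolidate() on whatever schedule plays the role of replay.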
· · ·
The spectrum of approaches, and what’s actually hard.
The piece does a nice job laying out the spectrum from pure context to weight updates. But let me make it concrete, because I think the engineering difficulty of each tier is undersold:
The continual learning spectrum: approaches and trade-offs

- Context / RAG: Mature, deployable now. Ceiling is context length and retrieval quality. Doesn't generalize across sessions.
- Modules / Adapters: LoRA, KV-cache modules, adapter layers. Compositional and cheaper than full retraining. Still can't subtract false beliefs.
- Weight Updates: TTT, EWC, sparse finetuning, RL loops. The most powerful tier, and also the most brittle: catastrophic forgetting, poisoning risk, auditability breaks.
The piece is honest that naive weight updates fail, but I think the list of failure modes is actually longer than they describe. The four problems they name — catastrophic forgetting, temporal disentanglement, logical integration failure, and unlearning impossibility — are real. But they don’t mention the evaluation problem, which I think about a lot: how do you know when a continuously updating model has gotten worse at something?
With a static model, you can run evals before and after any intervention. With a continuously updating model, your baseline keeps moving. You’d need something like continuous automated regression testing across capability dimensions, running in parallel with deployment, all the time. That’s not a solved problem, and it’s not cheap.
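Concretely, you end up needing something like a regression gate that runs alongside deployment the whole time. The suite names, baseline scores, and the tolerance below are placeholders, not recommendations.

```python
# Sketch of a continuous regression gate for a model whose weights keep moving.
# Suite names, baseline scores, and the tolerance are placeholders.
FROZEN_BASELINE = {"math": 0.71, "coding": 0.64, "long_context": 0.58}
TOLERANCE = 0.02  # max allowed drop per capability before we stop updates

def run_suite(model, suite_name: str) -> float:
    """Run a fixed, held-out eval suite and return a score (stub)."""
    raise NotImplementedError

def regression_check(model) -> dict[str, float]:
    """Compare the live model against the frozen baseline on every capability."""
    regressions = {}
    for suite, baseline in FROZEN_BASELINE.items():
        score = run_suite(model, suite)
        if score < baseline - TOLERANCE:
            regressions[suite] = baseline - score
    return regressions  # non-empty result => freeze updates / roll back a snapshot

# The uncomfortable part: this has to run continuously, across every capability
# you care about, against a baseline measured on a model that no longer exists.
```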
Test-time training is the most interesting thing happening right now.
The a16z piece mentions test-time training (TTT) in passing. I want to spend more time on it because I think it’s the approach closest to being practically useful, and it’s moving fast.
The basic idea: instead of treating inference as a read-only operation (input goes in, tokens come out, weights unchanged), you run a few gradient steps on the input itself before generating. You’re compressing the current context into the weights — but only temporarily, only locally. TTT-E2E (out of Astera Institute, Nvidia, Stanford, Berkeley, and UC San Diego, late 2025) pushed this further by making inference-time weight updates constant latency regardless of context length. That’s a big deal.
One of the core complaints about long-context models is that inference gets slower as context grows — TTT-E2E sidesteps that by compressing context into weights instead of attending over it.
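To pin down what "a few gradient steps on the input itself" means, here's a generic TTT loop assuming a Hugging Face-style causal LM interface. This is the vanilla form of the idea, not TTT-E2E's constant-latency formulation, and the hyperparameters are made up.

```python
import copy
import torch

def test_time_train_then_generate(model, tokenizer, prompt: str,
                                  steps: int = 3, lr: float = 1e-5) -> str:
    """Generic TTT sketch (not TTT-E2E): a few gradient steps on the prompt
    itself, then generate, then throw the adapted weights away."""
    adapted = copy.deepcopy(model)              # ephemeral copy; base stays frozen
    adapted.train()
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)

    inputs = tokenizer(prompt, return_tensors="pt")
    for _ in range(steps):
        # Self-supervised objective on the context itself: next-token
        # prediction over the prompt, i.e. compress the context into weights.
        out = adapted(**inputs, labels=inputs["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    adapted.eval()
    with torch.no_grad():
        generated = adapted.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(generated[0])       # the adapted copy is discarded
```

Nobody would deep-copy a full model per request, of course; updating a small designated subset of weights and resetting it afterwards is the obvious refinement, and making the whole thing constant-latency is the hard part that TTT-E2E targets.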
Hot off the press
ByteDance Seed dropped “In-Place TTT” in April 2026 — it treats the final projection matrices of MLP blocks as fast weights, updating them in-place during inference without architectural changes. The claim is that this is “drop-in” — you can apply it to existing transformers without retraining from scratch. I haven’t reproduced the results, but the idea is elegant: no new architecture, just a different update rule on weights that already exist.
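I haven't read their code, so the following is nothing more than my mental model of "final MLP projections as fast weights": a per-request delta accumulates on the down-projection during inference and gets reset afterwards. The update rule here is a toy Hebbian-style one, not whatever the paper actually does.

```python
import torch

# My mental model of "final MLP projections as fast weights", NOT ByteDance's
# implementation: W_down picks up an in-place, per-request delta driven by the
# tokens seen so far, and the delta is reset between requests.
class FastWeightDownProj(torch.nn.Module):
    def __init__(self, d_ff: int, d_model: int, fast_lr: float = 1e-3):
        super().__init__()
        self.w_down = torch.nn.Linear(d_ff, d_model, bias=False)
        self.register_buffer("delta", torch.zeros_like(self.w_down.weight))
        self.fast_lr = fast_lr

    def forward(self, h: torch.Tensor) -> torch.Tensor:       # h: (batch, seq, d_ff)
        w = self.w_down.weight + self.delta                    # slow weights + fast delta
        return h @ w.T

    @torch.no_grad()
    def fast_update(self, h: torch.Tensor, target: torch.Tensor) -> None:
        # Toy Hebbian-style rule from the current request's activations;
        # the real update rule is whatever the paper specifies.
        err = target - self.forward(h)
        self.delta += self.fast_lr * torch.einsum("bsd,bsf->df", err, h) / h.shape[1]

    @torch.no_grad()
    def reset(self) -> None:
        self.delta.zero_()                                     # next request starts clean
```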
What I find intellectually exciting about TTT is that it dissolves the training/inference boundary without fully committing to continuous learning. It’s a middle ground: the model does compress new information into parameters, but it does so ephemerally, at inference time, without the catastrophic forgetting problem that comes from persistent weight updates.
Whether that’s “real learning” in the sense the a16z piece cares about is debatable. My intuition is that it’s real enough to matter for a large class of practical problems.
The safety argument deserves more weight.
Here’s my biggest critique of the a16z framing. The piece mentions the safety problems with continuous weight updates — alignment degradation, data poisoning surface, auditability — almost as an afterthought, tucked at the end of the technical section. “These are open problems, not fundamental impossibilities.” True! But let me be more concrete about why this keeps me up at night.
The current training/deployment separation is not just an engineering artifact. It’s the primary mechanism by which we have any meaningful accountability over model behavior. When a model is trained, its behavior can be characterized. Tested. Red-teamed. Compared against a known baseline.
When you make the weights updatable after deployment, you have a model whose behavior is a function of every interaction it’s had since release — and you have to audit and safety-test that continuously, in production, under adversarial conditions. That’s a qualitatively different security problem than anything we currently know how to solve.
The case for parametric learning
- Genuine compression, not just retrieval
- Can encode knowledge too tacit for text
- Compounds over time — gets better at your domain
- Doesn’t degrade with context length
- Closer to how biological intelligence works
The case for caution
- Catastrophic forgetting is unsolved at the LLM scale
- Data poisoning surface is persistent and stealthy
- Auditability breaks when weights keep moving
- Safety alignment can degrade from benign updates
- Privacy: user data baked into parameters is hard to remove
The Wiles example cuts both ways.
The Fermat’s Last Theorem example in the piece is genuinely interesting. The argument is: Andrew Wiles had to invent new mathematics — bridge elliptic curves and modular forms — and no amount of context retrieval would have gotten him there. Therefore, LLMs, which can only recombine what exists in their training data, are fundamentally limited for novel discovery.
But I’m not sure this argument holds even on its own terms. Wiles didn’t invent from nothing. He stood on decades of prior work, including Ken Ribet’s proof that the Taniyama–Shimura conjecture implies Fermat’s Last Theorem. Every piece of mathematical scaffolding Wiles used existed in the literature. What he did was form an unprecedented connection between them, and do it with obsessive focus over seven years.
The empirical question is: is that kind of connection-forming something that scales with model capability, or does it require something architecturally different? I genuinely don’t know. Nobody does. The a16z piece correctly identifies this as an open question. But I’d resist the strong version of the claim that novel discovery is impossible for systems without continuous weight updates. AlphaEvolve’s discovery of algorithmic improvements that humans had missed for decades is a data point in the other direction.
What I actually think is going to happen
The a16z piece ends with a nice layered model: in-context learning as the first line, modules for personalization, weight updates for the hard stuff. I agree with that architecture broadly. But I think the timeline is longer than the framing implies, and the module layer is going to be more important than either the context layer or the weight-update layer for most practical applications in the near term.
The reason is deployment reality. Persistent weight updates require solving the safety, auditability, and catastrophic forgetting problems simultaneously. That’s a multi-year research agenda. Context management is already good and getting better fast — the TTT-E2E results suggest that we might actually solve the long-context coherence problem in 2026 without touching persistent weights at all.
In the meantime, the most productive thing — and I say this as someone who thinks about this from both a research and an application standpoint — is probably investing deeply in the module layer. LoRA adapters you can swap, compressed KV caches for domain knowledge, feedback loops that generate training data but batch the actual weight updates with human review. It’s less elegant than the continuous-learning dream. It’s also actually deployable.
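The swap-the-module pattern is also the easiest one to sketch. Here's a hand-rolled LoRA swap on a single linear layer (a real deployment would use a library like peft and register the adapters as proper submodules for training); the point is that the base weights never move, and domain knowledge lives in small deltas you can version, review, and promote.

```python
import torch

# Hand-rolled LoRA swap on one linear layer (illustrative only; a real system
# would use peft or similar). The base weight stays frozen; each domain gets
# its own low-rank (A, B) delta that can be attached, detached, and versioned.
class SwappableLoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # base weights never move
        self.rank = rank
        self.adapters: dict[str, tuple[torch.nn.Parameter, torch.nn.Parameter]] = {}
        self.active: str | None = None

    def add_adapter(self, name: str) -> None:
        out_f, in_f = self.base.weight.shape
        A = torch.nn.Parameter(torch.randn(self.rank, in_f) * 0.01)
        B = torch.nn.Parameter(torch.zeros(out_f, self.rank))  # starts as a no-op
        self.adapters[name] = (A, B)

    def set_adapter(self, name: str | None) -> None:
        self.active = name                     # swap domains without retraining

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base(x)
        if self.active is not None:
            A, B = self.adapters[self.active]
            y = y + x @ A.T @ B.T              # add the active domain's delta
        return y
```

Batched, human-reviewed weight updates then amount to training new adapter pairs offline and promoting them, rather than ever touching the base model.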
“The filing cabinet keeps getting bigger. But a bigger filing cabinet is still a filing cabinet.” — the a16z piece gets this right. It just underestimates how hard it is to build something that’s actually a library instead.
The Memento analogy is good. But Leonard Shelby’s tragedy isn’t just that he can’t form new memories. It’s that when he tries to compensate with external systems — the Polaroids, the tattoos, the notes — those systems become attack surfaces. People can manipulate what he writes on himself. The compensation mechanism is also a vulnerability. Which is, I think, a more complete metaphor for what we’re actually dealing with.
Anyway. This is where my head is at. I’m curious what the people actually doing this research think — if you’re working on TTT, sparse finetuning, or any of the architectural approaches, I’d love to hear how the experimental reality compares to the framing in pieces like this one. Hit me in the comments.
References: a16z, “From Memento to Memory” (2026) · “Sparse Memory Finetuning,” arXiv:2510.15103 (2025) · TTT-E2E (Tandon et al., 2025) · In-Place TTT, ByteDance Seed (2026) · van de Ven et al., “Continual Learning and Catastrophic Forgetting” (2024) · Lai et al., “Pareto Continual Learning” (2025) · McCloskey & Cohen (1989), because apparently we’ve been at this for 35 years.