Karpathy in 2026 — The Capability Gap, Cognitive Cores, and the Harness Mindset

Type
Article
Published
2026-05-02
Aliases
Karpathy 2026, cognitive core, harness mindset, capability gap
William Mumler's 'spirit photograph' of Mary Todd Lincoln, c.1872 — a seated woman in dark mourning dress with the pale, ghostly figure of a man (purportedly Abraham Lincoln) standing behind her, hands resting on her shoulders.
William H. Mumler, "Mrs Lincoln with the spirit of her husband," c.1869–72. Karpathy describes LLMs as ghosts — useful when properly summoned, deceptive when read at face value. Mumler's photographs are the canonical case of literal ghost imagery; the secondary reading (a hoax that some took to be real) maps neatly onto the capability-gap argument. Source: Wikimedia Commons · public domain.
Summary

Across April 2026, Andrej Karpathy made three closely related arguments that together describe how to think about LLMs this year: the discourse is split between people on free or year-old models and people running frontier agents in technical domains; current models are oversized because their training data is mostly noise; and the part you should be optimising is the harness around the model, not the model itself.

Overview

Andrej Karpathy, one of the original architects of modern deep learning and the coiner of the term “vibe coding,” spent much of April 2026 explaining a pattern he saw in public conversation about AI capability — and proposing what to do about it. His argument shows up in three forms across that month: a Twitter thread about why people are talking past each other, a Sequoia masterclass on agentic engineering, and a model-design thesis about cognitive cores. The three threads share a single underlying claim: the noisy debate about whether AI is “real” obscures the fact that the meaningful design decisions have shifted out of the model and into the surrounding system.

For practitioners, the consequence is concrete. The capability frontier of 2026 is being set by people who pay for state-of-the-art agentic models and use them in technical domains — and who have learned to invest in the layer around the model. Anyone studying how LLMs are actually deployed today benefits from understanding why Karpathy frames it this way.

Key Concepts

  • Agent Harness — the deterministic code wrapping the model
  • Context Window — the finite buffer the harness shapes
  • Sub-agents — the parallelisation pattern that survives Karpathy’s “what’s dead” list
  • LLM Knowledge Bases — Karpathy’s earlier compilation pattern, on which the cognitive-core argument builds

The capability gap

Karpathy’s April 9 thread opens with an observation about his timeline: people are reaching radically different conclusions about AI capability and largely talking past each other. He attributes the split to two confounders. The first is recency and tier of use: many of the loudest sceptical voices last meaningfully tested AI on a free or deprecated model — typically ChatGPT’s free tier sometime in the previous year — and let that experience inform their view. They are not wrong about what they saw; they are extrapolating from the wrong data.

The second confounder is that progress is uneven. Even paying users on state-of-the-art models will see modest gains on “typical” queries — search, advice, writing — because those domains are hard to train against with reinforcement learning. They lack what Karpathy calls verifiable rewards: there’s no equivalent of “the unit test passed” for a piece of advice or a persuasive essay. Frontier-lab effort flows toward the domains where verifiable rewards exist and where business value concentrates: programming, math, research. So the strongest improvements of the year — which Karpathy describes as “nothing short of staggering” — show up specifically in agentic coding tools like OpenAI Codex and Claude Code, used professionally by people doing technical work.
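To make the asymmetry concrete, here is a minimal sketch of what a verifiable reward looks like for code versus advice. The function names are illustrative, not anything from Karpathy's thread; the coding reward assumes a project whose test suite can be run with pytest.

```python
import subprocess


def coding_reward(candidate_file: str) -> float:
    """Binary reward for RL on code: 1.0 if the test suite passes, else 0.0."""
    result = subprocess.run(
        ["pytest", "--quiet", candidate_file],
        capture_output=True,
    )
    return 1.0 if result.returncode == 0 else 0.0


def advice_reward(advice: str) -> float:
    """No mechanical verifier exists: 'good advice' has no exit code."""
    raise NotImplementedError("this asymmetry is the point of the argument")
```

The first function is cheap, deterministic, and scalable to millions of rollouts; the second has no honest implementation, which is why frontier-lab RL effort pools in the domains where something like `coding_reward` can be written.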

The result, in Karpathy’s framing, is two populations speaking past each other. One group experiences the median state of free AI products and laughs at hallucinations and fumbled voice queries. The other group hands a terminal to a frontier model and watches it coherently restructure a codebase or surface a security vulnerability over a 60-minute autonomous session. Both observations are accurate. Neither captures the full picture without the other.

Cognitive cores instead of memorisation engines

The capability-gap thread describes the what. A separate Karpathy argument, surfaced by MilkRoadAI, describes the why — and points toward what comes next. Frontier models, Karpathy argues, are not large because intelligence requires that many parameters. They are large because training data is overwhelmingly noise. When researchers sample documents from real pretraining corpora, they find stock-ticker pages, malformed HTML, spam, and gibberish — not Wikipedia and the Wall Street Journal. One estimate puts Llama 3’s information compression at roughly 0.07 bits per token, meaning the model retains only a hazy memory of most of what it has seen.
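A rough back-of-envelope makes the point vivid. It combines two figures: Meta's reported ~15 trillion pretraining tokens for Llama 3 (a public figure, assumed here) and the 0.07 bits-per-token estimate quoted above.

```python
tokens = 15e12           # ~15 trillion pretraining tokens, as reported for Llama 3
bits_per_token = 0.07    # the compression estimate quoted above
total_bits = tokens * bits_per_token
print(f"{total_bits / 8 / 1e9:.0f} GB of retained information")  # ≈ 131 GB
```

Roughly 130 GB retained from what is, at a few bytes per token, tens of terabytes of raw text: a fraction of a percent, which is the "hazy memory" in numbers.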

Most of a trillion parameters, in this view, are doing memory work rather than cognitive work. The model is functioning as a compression engine for the open internet, and the cognitive layer rides on top of that mass.

Karpathy’s prediction is that the two can be separated. A cognitive core would contain only the reasoning and problem-solving algorithms, paired with external memory that the model queries on demand. He estimates such a core, trained on high-quality data, could reach genuine intelligence at roughly one billion parameters — versus the 200 billion to 1.8 trillion of current flagship models. The trend already supports this direction: GPT-4o, at around 200B parameters, outperforms the original 1.8T-parameter GPT-4, and inference cost for GPT-3.5-level capability fell roughly 280× between 2022 and 2024. The bottleneck on AI progress, Karpathy argues, is no longer compute. It is data quality.
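The shape of the idea is simple to sketch. Below, a hypothetical `small_model` callable does the reasoning while a retrieval index supplies the facts; nothing here is an API Karpathy specified, only the division of labour his thesis describes, with a keyword-overlap stand-in where a real index would go.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    """External memory: facts live here, not in model weights."""
    documents: list[str]

    def search(self, query: str, k: int = 3) -> list[str]:
        # Stand-in scoring: count query words present in each document.
        # A real system would use a vector or keyword index.
        scored = sorted(
            self.documents,
            key=lambda d: -sum(w in d.lower() for w in query.lower().split()),
        )
        return scored[:k]


def answer(query: str, memory: Memory, small_model) -> str:
    """Cognitive-core pattern: look facts up, spend parameters on reasoning."""
    facts = memory.search(query)
    prompt = "Context:\n" + "\n".join(facts) + f"\n\nQuestion: {query}"
    return small_model(prompt)
```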

This thesis maps directly onto the existing wiki entry for LLM Knowledge Bases — a cognitive-core-with-external-memory design at the personal-knowledge scale.

The harness mindset

The third thread, surfaced by DeRonin_ summarising Karpathy’s recent talks, distils the practical advice into a slogan: harness > model, always. The improvements that compound in 2026 are not in the model layer — they are in the harness around it. Karpathy’s working list of what compounds, with a minimal sketch of the pattern after the list:

  • context engineering
  • tool design
  • the orchestrator–subagent pattern
  • eval discipline
  • the harness mindset itself
  • MCP as the protocol layer
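A minimal sketch of what "harness" means in this list: deterministic code that owns the context, the tool dispatch, and the stopping rule, while the model only fills in single steps. `call_model` and the tool registry are hypothetical stand-ins, not any specific product's API.

```python
def run_agent(task: str, call_model, tools: dict, max_steps: int = 10) -> str:
    # Context engineering: the harness decides what the model sees each step.
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        reply = call_model("\n".join(context))
        # The stopping rule lives in the harness, not the model.
        if reply.startswith("DONE:"):
            return reply.removeprefix("DONE:").strip()
        # Everything else is treated as a tool call: "<tool_name> <argument>".
        name, _, arg = reply.partition(" ")
        tool = tools.get(name)
        if tool is None:
            context.append(f"Error: unknown tool {name!r}")
            continue
        # Tool results are fed back into the context, bounded and inspectable.
        context.append(f"{name}({arg!r}) -> {tool(arg)}")
    # Supervised and bounded: the budget, like everything else, is harness code.
    return "step budget exhausted"
```

Everything on the list above (what gets into `context`, how `tools` is designed, whether the loop's transcripts feed an eval) lives in that loop rather than in the model weights.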

The same talk is summarised by aerockrose covering Karpathy’s Sequoia masterclass on agentic engineering — described as “the serious layer above vibe coding.” The framing there is complementary: LLMs as “ghosts” that are useful when properly summoned, “outsource thinking, not understanding,” and the discipline of building applications that survive contact with model variance.

What Karpathy says is already dead

The flip side of the harness mindset is a list of things experienced engineers have stopped investing in. DeRonin_’s thread catalogues ten:

  1. AutoGen / AG2 — moved to community maintenance, releases stalled
  2. CrewAI — demos well, breaks in production
  3. Fully autonomous agent pitches — the AutoGPT/BabyAGI wave; the industry settled on supervised, bounded, evaluated agents
  4. Agent app stores and marketplaces — promised since 2023, no enterprise traction
  5. SWE-bench leaderboard chasing — most public benchmarks have been shown gameable without solving the underlying task
  6. Microsoft Semantic Kernel — unless locked into the Microsoft stack
  7. DSPy — philosophical merit, niche audience; not a general agent framework
  8. Horizontal “build any agent” platforms — Google Agentspace, AWS Bedrock Agents, Copilot Studio
  9. Per-seat SaaS pricing for agent products — the market has moved to outcome-based pricing
  10. Whatever framework went viral on Hacker News this week — wait six months; if it still matters, it will be obvious

Christophe Cazes compresses the through-line: 90% of AI advice has a six-month half-life, but the harness primitives — context engineering, evaluation discipline, the orchestrator-subagent pattern — have proven durable across cycles.

Practical applications

For a researcher or practitioner choosing where to invest learning effort in 2026, Karpathy’s framing yields a short list of bets that look stable:

  • Context engineering as a craft — designing what gets included in each model call, when memory is read versus written, how sub-agents partition work
  • Tool design — the API surface a model sees, written for a model rather than a human (a sketch follows this list)
  • Eval discipline — building task-specific verifiers rather than chasing public leaderboards
  • External memory architectures — including personal compilation patterns like the LLM Knowledge Base approach
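As one concrete instance of the tool-design bullet, here is a hedged sketch of a tool definition written for the model that will read it rather than for a human API consumer. The schema shape follows common function-calling conventions; the `search_notes` tool itself is hypothetical.

```python
# Model-facing tool design: the description carries behavioural instructions
# (when to call, how to phrase the argument, what to do on empty results),
# because the model is the consumer of this "documentation".
search_notes_tool = {
    "name": "search_notes",
    "description": (
        "Search the user's personal notes. Use this BEFORE answering any "
        "question about the user's past work. Returns up to 5 snippets. "
        "If it returns nothing, say so; do not guess."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "2-6 keywords, not a full sentence",
            },
        },
        "required": ["query"],
    },
}
```

The design choice worth noticing: none of that text is for a developer. It is context engineering applied to the tool layer, and it survives model swaps in a way that framework-specific glue does not.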

For institutions teaching AI, the implication is that the syllabus should follow the harness layer, not the model layer. Students who understand context windows, sub-agents, evaluation, and the orchestrator pattern will retain that knowledge across model generations; students trained on a specific framework that goes stale in six months will not.

Limitations and open questions

Karpathy’s cognitive-core thesis is a prediction, not a result. No one has yet built a 1B-parameter cognitive core matched with external memory that demonstrably outperforms a frontier dense model on the same tasks. The architecture exists in pieces — retrieval-augmented systems, knowledge bases, sub-agent forking — but the integrated design is hypothetical.

The “harness > model” claim is also not universally true. A weaker model paired with an excellent harness still loses to a stronger model on the same harness, and the harness can only compensate for so much. The honest reading is that for current model variance, the harness is where leverage concentrates — which may shift again with the next capability jump.

Finally, the dead-tools list is a snapshot of consensus among engineers Karpathy and his interlocutors travel with. Edge cases will exist for every entry on it — DSPy users in research settings, SWE-bench targets that genuinely advance the field. The list is best read as “where energy is no longer accumulating,” not “where no work is happening.”

Sources

  • @karpathy — original capability-gap thread, distinguishes free/old-model users from frontier-agentic-model users
  • @MilkRoadAI — cognitive core vs. memorisation engine; data-quality bottleneck
  • @DeRonin_ — “harness > model” framing, what’s dead, what compounds
  • @aerockrose — Sequoia masterclass on agentic engineering, “LLMs as ghosts”
  • @DataChaz — 90% of AI advice dies in 6 months; the durable primitives