What Karpathy's LLM Wiki Doesn't Tell You


This post follows on from “Building Laura” and “Laura in the Wild”. If you want to try the approach yourself, see the hands-on tutorial.


The problem with RAG isn’t retrieval - it’s forgetting.

Every time you query a RAG system, it re-derives the answer from scratch. The connections you made yesterday? Gone. The synthesis across three documents? Recomputed, slightly differently, with no memory that you already did this work.

For research workflows, this is maddening. You’re not just looking up facts - you’re building understanding. And understanding compounds. Or it should.

The LLM Wiki pattern

Andrej Karpathy recently published a gist that crystallised something I’d been circling for months. He calls it the “LLM Wiki” - a persistent markdown knowledge base where the LLM does the grunt work of summarising, cross-referencing, and filing, while you curate the sources.

Three layers:

  • Raw sources - immutable documents you feed in
  • Wiki pages - LLM-generated markdown with wikilinks
  • Schema - config defining structure and workflows

The key insight: cross-references are pre-computed. When you query, the synthesis is already there. No re-derivation.

He also describes a lint operation - health checks for contradictions, orphan pages, missing links. The wiki maintains itself.

It’s a clean model. I’d been building something similar with my research assistant Laura, and reading his gist felt like validation. But it also surfaced gaps - things you hit when you actually use this pattern day-to-day.

Problem 1: Trust

Karpathy’s model assumes the LLM output is reliable enough to build on. In practice, this is the hardest part.

When Laura creates a concept note, it starts with confidence: low. This isn’t pessimism - it’s bookkeeping. AI-generated content from a single web search isn’t the same as a claim I’ve verified across multiple sources.

The confidence ladder:

Level      Meaning
low        AI-generated, single source, unverified
medium     multiple sources agree, partially verified
high       well-sourced, cross-referenced, no contradictions
verified   I’ve confirmed it myself

Notes can climb or fall. Adding a second source that agrees? Bump to medium. Finding a contradiction? Back to low with a #verify tag.
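
To make those transitions concrete, here’s a minimal sketch in Python. The dict-based note shape and the function names are my illustration, not Laura’s actual internals; the one rule that matters is that automation never reaches verified - that rung is reserved for a human.

```python
# A minimal sketch of confidence transitions. The note format and function
# names are illustrative, not Laura's actual internals.

LADDER = ["low", "medium", "high", "verified"]

def bump_confidence(note: dict) -> dict:
    """Promote a note one rung when a new agreeing source is added."""
    level = LADDER.index(note["confidence"])
    # "verified" is reserved for human sign-off, so automation stops at "high".
    if level < LADDER.index("high"):
        note["confidence"] = LADDER[level + 1]
    return note

def demote_on_contradiction(note: dict) -> dict:
    """A contradiction sends the note back to low and tags it for review."""
    note["confidence"] = "low"
    note.setdefault("tags", []).append("#verify")
    return note

note = {"title": "Vector indexes", "confidence": "low"}
bump_confidence(note)          # a second source agrees: low -> medium
demote_on_contradiction(note)  # contradiction found: back to low, tagged
print(note)
```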

Without this, you end up with a wiki full of plausible-sounding claims you can’t actually rely on. The synthesis compounds, but so do the errors.

Problem 2: Decay

The gist describes a wiki for “stable, curated knowledge.” But knowledge isn’t stable. APIs change. Best practices evolve. That “current” summary of a tool becomes stale the moment a new version ships.

I added review scheduling after a documentation audit at work revealed the obvious: docs without review dates don’t get reviewed. They rot.

Every note can have:

  • review-by - when it’s due for a check
  • review-interval - how often it should recur (30 days, 90 days, or 0 for evergreen)
  • last-reviewed / reviewed-by - accountability

The /lint reviews command surfaces what’s overdue. It’s not glamorous, but it’s the difference between a wiki that stays useful and one that quietly becomes a liability.
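
Under the hood, the check is simple. A sketch, assuming frontmatter parsed into dicts with ISO dates - the field names match the list above, the parsing and function name are illustrative:

```python
# A sketch of the overdue-review check, assuming frontmatter parsed into
# dicts with ISO date strings. Field names match the list above.
from datetime import date

def overdue_reviews(notes: list[dict], today: date | None = None) -> list[dict]:
    """Return notes whose review-by date has passed, skipping evergreen ones."""
    today = today or date.today()
    flagged = []
    for note in notes:
        if note.get("review-interval") == 0:  # 0 means evergreen: never nag
            continue
        review_by = note.get("review-by")
        if review_by and date.fromisoformat(review_by) < today:
            flagged.append(note)
    return flagged

notes = [
    {"title": "httpx tips", "review-by": "2025-01-15", "review-interval": 90},
    {"title": "CAP theorem", "review-interval": 0},  # evergreen
    {"title": "CI pipeline", "review-by": "2099-01-01", "review-interval": 30},
]
for note in overdue_reviews(notes):
    print(f"overdue: {note['title']} (due {note['review-by']})")
```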

Problem 3: Teams

Karpathy explicitly scopes his approach to individual researchers. Fair enough - it’s a personal workflow. But I work with others, and “who owns this note?” became a real question.

Not enterprise-level access controls. Just: if this note is wrong, who should fix it?

I added an owner field on notes and a team-members list in the config. The linter flags notes with no owner, and notes whose owner no longer appears in the team list - usually a sign someone has moved on and their notes are orphaned.
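
A sketch of what those checks amount to, assuming the config exposes a team-members list - the shapes and names here are illustrative:

```python
# A sketch of the ownership checks, assuming a config with a team-members
# list and notes with an optional owner field. Names are illustrative.

def lint_ownership(notes: list[dict], config: dict) -> list[str]:
    """Flag notes with no owner, or an owner not in the team list."""
    team = set(config.get("team-members", []))
    problems = []
    for note in notes:
        owner = note.get("owner")
        if owner is None:
            problems.append(f"{note['title']}: missing owner")
        elif owner not in team:
            problems.append(f"{note['title']}: unknown owner {owner!r} (left the team?)")
    return problems

config = {"team-members": ["alice", "bob"]}
notes = [
    {"title": "Deploy runbook", "owner": "alice"},
    {"title": "Old migration notes", "owner": "carol"},  # carol moved on
    {"title": "Vendor eval"},                            # nobody claimed it
]
print("\n".join(lint_ownership(notes, config)))
```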

It’s lightweight. No permissions, no locks. Just enough structure that a small team can share a vault without the “I thought you were maintaining that” problem.

What I stole

Reading Karpathy’s gist, I realised my linting was too informal. His description of lint as a first-class operation - not just “check for broken links” but a health score, tracked over time - was sharper than what I had.

I formalised it:

  • Machine-readable reports - _meta/lint-report.md with consistent structure
  • History tracking - is the health score trending up or down?
  • Confidence distribution - what percentage is actually verified?
  • Cluster detection - notes linking only to each other, isolated from the wider vault
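
Of these, cluster detection is the only one that needs a real algorithm; the rest is bookkeeping. A sketch that treats wikilinks as an undirected graph and flags components cut off from the largest one - the in-memory shape is my assumption about how a vault might be modelled:

```python
# A sketch of cluster detection: treat wikilinks as an undirected graph and
# flag connected components that are cut off from the largest one.
from collections import defaultdict

def clusters(links: dict[str, set[str]]) -> list[set[str]]:
    """Return connected components of the wikilink graph, largest first."""
    graph = defaultdict(set)
    for page, targets in links.items():
        graph[page].update(targets)
        for t in targets:
            graph[t].add(page)  # undirected: a link counts both ways
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative DFS over one component
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(graph[page] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)

links = {
    "rag": {"embeddings", "chunking"},
    "embeddings": {"chunking"},
    "sourdough": {"hydration"},  # linked to each other, to nothing else
}
main, *isolated = clusters(links)
for cluster in isolated:
    print(f"isolated cluster: {sorted(cluster)}")
```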

The lint history is underrated. Seeing “orphaned notes: 12 to 8 to 5” over three sessions is motivating in a way that a one-off report isn’t.

What’s still hard

Contradictions. The linter can surface notes on the same topic for manual review, but it can’t detect semantic contradictions. If two notes disagree, I have to notice.

Scope creep. A vault that tries to cover everything covers nothing well. The discipline of “what is this vault for?” remains human work.

The bootstrap problem. An empty vault is intimidating. The two-phase research approach (light trawl first, go deeper on request) helps, but there’s still a cold-start cost before the compounding kicks in.

The pattern works

Karpathy’s framing is useful: the LLM handles the grunt work, you handle the curation. But the practice reveals details the theory doesn’t cover. Trust needs tracking. Knowledge decays. Teams need ownership.

If you’re building something similar, start with confidence scoring. You’ll thank yourself later.