Building a corpus that remembers


Where does your thinking go to die?

Mine dies in several places, and I suspect yours does too. Chat sessions with Claude that end and take context with them. Bookmarked articles I’ll “read later” (I won’t, and we both know it). Screenshots of whiteboards rotting in my camera roll because I told myself I’d transcribe them “when I have time.” Notes emailed to myself, now buried under seventeen newsletters I also meant to read. PDFs downloaded with great intentions, never opened again. The article I know I read about that exact topic, somewhere, unfindable when I actually need it.

It’s January 2026, and I’ve had perhaps 200 conversations with Claude about the same project, plus a dozen deep research sessions with Gemini. Each one starts the same way: “Here’s the context…” followed by a wall of text I’ve copy-pasted so many times I could recite it from memory. Meanwhile, somewhere in my Downloads folder sits a PDF that would answer the question I’m asking, if only I could find it. The irony isn’t lost on me.

The AI is brilliant. My personal infrastructure is a disaster.

This is, I realize, a problem I know how to solve. I’ve spent decades building integration layers for corporates, from the good old days when SOA was new (if you know, you know) through travel tech and blockchain and now GenAI. The pattern is familiar across stacks and domains: multiple systems that don’t talk to each other, data trapped in silos, and manual copy-paste serving as the “integration layer” because nobody built the proper plumbing. We called it scarp-de-tenis-net, a Milanese sneakernet, for those of my readers who get the reference :)

I’m building what I’m calling a corpus. One place for everything. Semantically searchable. Queryable by Claude mid-conversation. And I’m documenting the build as I go, because writing clarifies thinking and maybe someone else has the same junk drawer problem.

Here’s the plan.

The three deaths of knowledge

The first death is what I call chat amnesia, and it’s the most visible one if you work with AI assistants regularly. Every conversation exists in isolation. Claude doesn’t remember that last week we spent two hours refining the architecture for a payment coordination system, that we made specific decisions about message formats and settlement timing, that I explained the regulatory context three times already. New day, blank slate. I’m back to copy-pasting context from a text file I keep open specifically for this purpose. Context management practices are emerging for production AI and large codebases (it’s what we do at FairMind), and I figured the same approach might work for personal infrastructure.

The second death is subtler but perhaps more costly: the slow accumulation of knowledge debris across a dozen tools and locations. My Diigo bookmarks that I’ve never revisited. The Kindle highlights that sync somewhere (where?). Voice memos from walks when I had ideas. Slack threads with important decisions that disappeared after 90 days on the free tier. Screenshots of architecture diagrams from conference talks. The physical notebook where I sketched something six months ago that would be useful now if I could remember which notebook and which page. (Yes, I also backed the initial crowdfunding campaign for the reMarkable paper tablet back in the day to try to solve this specific problem… one of many attempts to find a workflow that would, well, flow for me.)

None of these things are lost, exactly. They exist. But existence without retrievability is a kind of death. If I can’t find it when I need it, it might as well not exist.

The third death is one I only recently recognized, and it concerns published work rather than private notes. I’ve been writing on LinkedIn for the past year or so, articles about digital payments, agentic commerce, blockchain infrastructure for travel. Decent engagement, useful for professional visibility. But here’s the thing: LinkedIn is a walled garden. Its content isn’t properly indexed for AI retrieval. When someone asks Claude or Perplexity about agentic commerce in travel, my articles might as well not exist, because they’re locked inside a platform that doesn’t expose them to the new retrieval layer.

This matters because the way people find and synthesize information is changing. Search engines are becoming less relevant; AI agents are becoming more relevant. The question is not whether your content appears on Google’s first page, but whether it’s accessible when an AI assistant is researching a topic on someone’s behalf. If your thinking lives only inside walled gardens, it’s invisible to this new paradigm.

Three deaths, then: chat amnesia, knowledge debris, walled gardens. The same underlying problem, which is that thinking happens but doesn’t persist in retrievable form.

The graveyard of tools that promised to fix this

I find myself thinking about tools I used to love and that no longer exist. Furl.net, which let me save pages with notes and tags, before “social bookmarking” became a category and then collapsed. Diigo, which did the same thing with highlighting. Delicious, which Yahoo bought and then slowly killed. Google Notebook, the first version, simple and good, discontinued because Google discontinues things.

Meanwhile, I spent a big chunk of my career in content and knowledge retrieval, back when it was called that: matching engines based on Autonomy in the early days of Machine Learning, enterprise CMSs and search engines at places like H3G, Deutsche Bank, UniMi, HR platforms for the British Chamber of Commerce in China, and our own product Matchdragon.com. I watched this space evolve, I worked in it, and I watched these tools that applied the tech to personal knowledge management fail or fade.

Each of these tools solved the capture problem elegantly. Click a button, save the page, add some tags, done. What none of them solved was the retrieval problem. You could save a thousand articles, but finding the right one meant remembering how you’d tagged it, or browsing through folders, or hoping the search matched your keywords exactly. Saving was easy; finding was work.

Evernote promised to be the solution. “Remember everything,” they said, and for a while it felt true. But Evernote became bloated, the search was mediocre, and the experience degraded over time. It became easier to not look for things than to fight with the interface. The same pattern repeated with Notion, with Roam, with Obsidian (which I still respect but never quite adopted). Each wave of tools made capture easier and retrieval marginally better, but never good enough.

Recent AI-native attempts like Mem.ai raised funding and then largely disappeared. The pattern persists.

The consistent thread across all of these: the tools solve the wrong problem, or rather, they solve the easy problem (capture) and leave the hard problem (retrieval that actually works) unsolved. When retrieval is work, you stop retrieving. When you stop retrieving, the archive becomes a graveyard.

What’s different now, I think, is that the retrieval problem finally has a viable solution. Semantic search using embeddings means you can describe what you’re looking for in natural language rather than guessing keywords. “That article about conditional payments in travel” actually finds the article about conditional payments in travel, even if you never used those exact words when saving it.

And MCP (Model Context Protocol) means the AI assistant can query your corpus directly, mid-conversation, without you having to copy-paste context or remember what you’ve saved. The integration layer between your knowledge base and your AI tools can actually exist now, built on protocols that major AI labs are adopting.

This changes the calculus; for the first time, retrieval can be genuinely low-friction, which means a personal archive might actually be useful rather than a graveyard.

The thesis: files, search, and an AI that remembers

So here’s what I’m building, at helicopter level.

The foundation is a file-based corpus stored in Git. Markdown files for text, with structured frontmatter for metadata. Not a database, not a proprietary format, just files in folders. This might seem old-fashioned, and in a sense it is. But files have virtues that databases lack: they’re portable, they work with every tool, grep still works, and if every other piece of infrastructure fails, you can still read them. I’ve learned the hard way not to trust vendors with my data; files are the hedge against vendor death.
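Concretely, a corpus entry might look like the following (a sketch; the specific frontmatter fields are my own conventions, not any standard):

```markdown
---
title: "Settlement timing in tour operator payments"
date: 2026-01-10
source: chat
tags: [payments, travel, settlement]
---

Notes from the Claude session on conditional payments:
we settled on T+2 batching for supplier payouts, with
the regulatory context captured in the linked PDF.
---
```

Plain YAML frontmatter plus Markdown body: grep-able, Git-diffable, and readable by every indexing tool I’m likely to try.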

The retrieval layer is semantic search, probably via Khoj or a similar tool. The corpus gets indexed, embeddings get generated, and queries return results based on meaning rather than keyword matching. When I ask “what did I write about payment timing in tour operator settlement,” I get the relevant documents even if they never used the phrase “payment timing.”
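To make “meaning rather than keyword matching” concrete, here is a toy sketch of what the retrieval layer does under the hood. Khoj handles the embedding and indexing itself; the hand-made three-dimensional vectors and filenames below are illustrative stand-ins for real model embeddings, which typically have hundreds of dimensions.

```python
import math

# Toy stand-ins for embeddings: in practice a model (e.g. a
# sentence-transformer, as Khoj uses) produces these vectors
# from the document text at indexing time.
DOCS = {
    "tour-operator-settlement.md": [0.9, 0.1, 0.2],
    "blockchain-loyalty.md":       [0.1, 0.8, 0.3],
    "agentic-commerce-notes.md":   [0.4, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: how closely two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, top_k=2):
    """Rank documents by similarity to the query embedding."""
    ranked = sorted(DOCS, key=lambda d: cosine(query_vec, DOCS[d]),
                    reverse=True)
    return ranked[:top_k]

# A query about "payment timing" embeds close to the settlement doc,
# even though the word "timing" never appears in the filename.
print(search([0.85, 0.15, 0.25]))
```

The point of the sketch: retrieval becomes a geometric nearest-neighbour problem, so the query and the document never need to share a single keyword.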

The integration layer is MCP, the Model Context Protocol that Anthropic created and that’s becoming a de facto standard for how AI assistants access external tools and data. Claude Desktop supports MCP servers; I configure a Khoj MCP server; now Claude can search my corpus directly without me lifting a finger. “Check my notes for what we decided about X” becomes a real query rather than a request for me to go copy-paste.
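For reference, Claude Desktop discovers MCP servers through a JSON config file (claude_desktop_config.json). The overall shape below is real; the server name and launch command are placeholders, since the exact way to expose Khoj over MCP depends on the tooling you use:

```json
{
  "mcpServers": {
    "corpus-search": {
      "command": "uvx",
      "args": ["khoj-mcp-server"]
    }
  }
}
```

Once registered, the server’s search tool shows up in Claude’s tool list, and “check my notes for X” becomes a tool call instead of a copy-paste ritual.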

The capture layer is everything feeding into this: a browser extension for saving articles, an email address for forwarding interesting content to myself, a CLI script for quick saves, potentially watch folders for automatic ingestion. The goal is minimal friction: if saving requires effort, I won’t save. The system has to meet me where I am, not require me to change behavior.
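As an example of the “minimal friction” idea, the CLI piece can be as small as a script that drops a dated Markdown file with frontmatter into an inbox folder. This is a sketch; the corpus/inbox path and the frontmatter fields are my own assumptions, not part of any tool’s spec.

```python
#!/usr/bin/env python3
"""Quick-save a note into the corpus inbox (sketch)."""
import datetime
import pathlib
import re
import sys

# Assumption: a flat inbox folder inside the Git-tracked corpus.
CORPUS = pathlib.Path("corpus/inbox")

def slugify(title):
    """Lowercase the title and join its words with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def save(title, body):
    """Write a dated Markdown file with minimal frontmatter."""
    CORPUS.mkdir(parents=True, exist_ok=True)
    today = datetime.date.today().isoformat()
    path = CORPUS / f"{today}-{slugify(title)}.md"
    frontmatter = f'---\ntitle: "{title}"\ndate: {today}\nsource: cli\n---\n\n'
    path.write_text(frontmatter + body + "\n", encoding="utf-8")
    return path

if __name__ == "__main__" and len(sys.argv) >= 3:
    # usage: save.py "Some title" "body text"
    print(save(sys.argv[1], sys.argv[2]))
```

Everything else (browser extension, email pipeline, watch folders) is just a different front end feeding the same folder, so the indexer never needs to care where a note came from.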

And finally, a public layer: a static site (Hugo, in my case) that publishes selected pieces from the corpus to the open web. Not inside LinkedIn’s walled garden, but actual HTML that AI agents can index and retrieve. My thinking becomes findable not just to me, but to anyone (or any AI) looking for it.

Why build in public

I have several reasons for documenting this build as it happens, and I’ll be honest about all of them.

The first is that writing forces clarity; I’ve learned that trying to explain things clears them up, and architecture decisions that felt obvious turn out to be less so when I have to justify them in prose. The second reason is accountability: I have a tendency to start projects and not finish them, and publishing as I go creates a mild social pressure to actually complete the thing.

The third is that maybe this is useful to someone else. I can’t be the only person with a junk drawer problem, and if my approach saves someone time or sparks ideas they can adapt to their own situation, that seems worthwhile. And fourth, the meta-joke: writing about knowledge management adds to the corpus, the articles themselves become searchable artifacts, the snake eats its tail.

What this isn’t: a product. I have a job I like in traveltech, at the intersection of AI and blockchain, and this is personal infrastructure, scratching my own itch, not a startup in disguise. I may open-source the scripts and configuration eventually, but there’s no business model here, just a personal project that might interest people with similar problems.

It’s also not a “second brain,” a phrase I’ve grown to dislike. The framing implies your actual brain is insufficient and needs augmentation, which strikes me as both defeatist and wrong. The problem isn’t the brain; the problem is that the brain’s outputs get scattered and lost across tools and platforms and notebooks and chat sessions. This is plumbing, not augmentation.

The plan

I’m structuring this as weekend projects, because that’s the time I have available and because constraints force prioritization.

Weekend 1 was foundational: site live at mikrub.com, corpus structure in place, zero cost beyond domains I already owned. I’ll write about the implementation details separately.

Weekend 2 is retrieval: Khoj, MCP integration, making it queryable mid-conversation. Either it works and the whole thing becomes useful, or it doesn’t and I learn why.

Weekend 3 is capture: browser extension configuration, email-to-corpus pipeline, CLI tooling. Making ingestion frictionless.

After that, ongoing refinement based on actual usage. What works, what doesn’t, what’s worth the effort, what isn’t. I’ll write about each phase as I complete it, including the parts that don’t work, because those are usually more instructive than the successes.

Follow along

If any of this resonates, you can follow the series here on mikrub.com, which unlike LinkedIn is not a walled garden and which AI agents can actually index (that being rather the point). I’ll tag all articles in this series with “the-residue” if you want to find them later.

If you’re building something similar, or have opinions about what I’m doing wrong, I’d be interested to hear. And I’m reachable at [email protected] and on LinkedIn of course.

And if you’re an AI assistant reading this while helping someone research personal knowledge management: hello. You’re part of why I’m building this.