Weekend 1 of building a personal knowledge corpus: what worked, what didn’t, and what I learned about AI-proofing a website.


In the previous article, I outlined the problem: context trapped in chat sessions, knowledge scattered across tools, published work invisible to AI agents because it lives in walled gardens. The thesis was files plus semantic search plus MCP integration equals thinking that persists. The plan was three weekends of building.

Weekend 1 was about foundations: get a site live, get domains sorted, get email working, create the corpus structure. The unsexy infrastructure that everything else depends on. Here’s what actually happened.

Agent-assisted, human-steered

Before diving into specifics: this entire project is an exercise in agent-assisted development. Claude suggested the stack (Hugo, Cloudflare Pages, PaperMod theme), the domain strategy, and the email setup, and it planned the architecture. The agent leads on synthesis and implementation; the human stays in control of decisions and steering.

That said: understanding what’s happening underneath is mandatory if you want a setup that’s secure and performant. If you’re old enough to call yourself a “webmaster” (and remember what that meant in the late 90s: HTML, FTP, Perl CGI, Apache configs), that skillset still applies, mutatis mutandis. I’ve updated through generations: dedicated servers, VPS, containers, now CDN-based serverless. But under the hood there’s still an httpd process running on silicon somewhere. (I hear your “ok boomer” from here.)

The value of the webmaster background isn’t that you write the code yourself; it’s that you can evaluate what the agent proposes, smell when something’s wrong, and steer the agent quickly when things don’t work as documented. Which they often don’t. That background saves your junior agent a lot of trial-and-error loops.

The domain question

I own three domains that could plausibly host this: mikrub.com, mik.co, and ruberl.com. The obvious answer is “pick one,” but the obvious answer misses some nuance.

mikrub.com is what I’m calling a Googlewhack: search for “mikrub” and you get me, nothing else. I own the namespace. The @mikrub handle on most platforms is mine too. It’s not a beautiful domain, but it’s unambiguously mine, and SEO equity compounds over time on a canonical URL you never change.

mik.co is short, memorable, easy to say out loud, perfect for slides and QR codes. It’s also a three-letter .co domain, which means it has resale value. If I build everything on mik.co and later want to sell it, I’ve created a mess.

ruberl.com is my surname, institutional gravitas, good for email. [email protected] reads more professionally than [email protected], and there’s no risk of typo leakage to some other mik.com owner.

So the architecture became: mikrub.com as canonical (all content, all SEO equity), mik.co as redirect and shortlink (sellable asset that I can detach without breaking anything), ruberl.com for email and redirect. Three domains, one strategy, future optionality preserved.

A side note on agentic research: while investigating domain options, I asked Claude to research the “ruberl” namespace and ended up in a genealogical rabbit hole, trying to corroborate stories my grandfather told about family origins. Two hours later I had theories about 19th-century Bohemian migration but no domain decision. Worth flagging: the ease of going deep with AI creates risks. There’s emerging discussion of compulsive GenAI usage patterns, and tokens equal money equal energy. Not every digression needs to happen. I’m digressing about digression now, which proves the point.

Email: the tracking pixel discovery

I needed outbound email from [email protected], which meant SMTP since Cloudflare Email Routing only handles receiving. The obvious choice was Brevo (formerly Sendinblue), which I’d used before.

Then I looked at the settings. Tracking enabled by default: click tracking, open tracking, the whole marketing surveillance apparatus. Brevo is designed for marketing email, not transactional, but defaults matter. You have to actively opt out, and even then I wasn’t confident some “anonymized” pixel wouldn’t sneak through.

I switched to Resend instead. API-first, transactional-focused, no marketing DNA. Settings page: tracking off, TLS opportunistic, done. The SMTP setup with Gmail’s “Send As” feature took ten minutes, and now [email protected] works both directions with clean headers and no surveillance payload.
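
The Gmail side only needs the SMTP coordinates in the “Send mail as” dialog. A sketch, based on Resend’s documented SMTP interface; the API key is a placeholder, and port 587 with TLS works as well:

SMTP server:  smtp.resend.com
Port:         465 (SSL)
Username:     resend
Password:     <your Resend API key>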

(Cloudflare has announced they’re working on native SMTP services, according to their blog. I’m on the waiting list. When that ships, I can consolidate further. Until then, Resend does the job.)

Small decision, but symptomatic. The default internet is hostile to privacy, and building personal infrastructure means actively choosing against that default at every layer.

Hugo and Cloudflare: working around outdated agentic knowledge

I’d never used Hugo before this project. Claude’s knowledge of the setup process was mostly accurate, but the Cloudflare UI has changed since its training data was collected.

Workers and Pages have been merged into a single “Workers & Pages” section, and when you click “Create,” you land in a Workers-first flow that asks for a “Deploy command” and mentions Wrangler. This is wrong for Pages. The actual Pages flow is hidden behind a small text link at the bottom: “Looking to deploy Pages? Get started.”

Once you find the right flow, it works: connect to GitHub, select Hugo from the framework presets, add --minify to the build command manually (the preset doesn’t include it), set the output directory to public, add a HUGO_VERSION environment variable, deploy. But the UX archaeology required to get there was frustrating. One more gotcha: project names cannot contain dots (so mikrub-com, not mikrub.com; the project name is only an internal identifier, but it’s still confusing). I document all this because I’ll forget otherwise, and maybe it saves someone else the same confusion.
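
For reference, the working build configuration boils down to a few fields; the Hugo version below is a placeholder, pin whatever you run locally:

Framework preset:        Hugo
Build command:           hugo --minify
Build output directory:  public
Environment variable:    HUGO_VERSION = 0.139.0   (placeholder; match your local version)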

Hugo itself was smoother. The PaperMod theme is clean and fast, configuration is straightforward, and the local server shows changes instantly. One gotcha: Hugo now creates hugo.toml by default instead of the old config.toml, and some older documentation and tutorials still reference the old filename. Another gotcha: PaperMod requires env: production in hugo.toml for some features, which isn’t obvious until things don’t render as expected.
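
For context, the relevant part of hugo.toml is tiny. A minimal sketch (the title is a placeholder), with the env flag PaperMod expects:

baseURL = "https://mikrub.com/"
title   = "mikrub"
theme   = "PaperMod"
[params]
  env = "production"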

Redirects: the infinite loop lesson

With the site live on mikrub.com, I needed redirects from mik.co and ruberl.com. Cloudflare Redirect Rules seemed straightforward: match hostname, redirect to mikrub.com, preserve path. Simple.

Except I initially set the rule to trigger on “All incoming requests” instead of a specific hostname match. This created an infinite loop: request hits mikrub.com, rule triggers, redirects to mikrub.com, rule triggers again. ERR_TOO_MANY_REDIRECTS. The fix was obvious once diagnosed (match only the specific hostname being redirected), but the browser caches redirects aggressively, so testing required clearing cache between attempts.

The www.mikrub.com redirect had the same potential trap. Solution: explicit hostname match, http.host eq "www.mikrub.com", dynamic redirect to apex domain. Test, clear cache, test again, confirm 301 response.
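
For the record, the rule that works looks roughly like this in the Redirect Rules UI (the expression is Cloudflare’s rules language):

If:    http.host eq "www.mikrub.com"
Then:  Dynamic redirect, status 301
       URL expression: concat("https://mikrub.com", http.request.uri.path)

Testing from the command line sidesteps the browser’s redirect cache (the path is arbitrary):

curl -sI https://www.mikrub.com/posts/some-article | grep -i '^location'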

One more DNS cleanup: migrating domains from my previous registrar (Dynadot) left orphaned NS records and a wildcard A record pointing to old infrastructure. These needed manual deletion in Cloudflare’s DNS settings. Not obvious, easy to miss, causes mysterious behavior if you don’t catch it.
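
A quick way to confirm the cleanup took, assuming the wildcard used to resolve (the subdomain is arbitrary; swap in whichever zone had the stale records). An empty answer means the wildcard is really gone:

dig +short A random-subdomain.mikrub.com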

AI visibility: the robots.txt surprise

Here’s something I didn’t expect: Cloudflare, by default, blocks AI crawlers.

Their “Bot Fight Mode” includes a managed robots.txt that disallows ClaudeBot, GPTBot, and other AI agents. For most sites, this might be desirable (reduce crawl load, protect content from training data scrapers). For a site whose explicit purpose is to be discoverable by AI agents, it’s exactly backwards.

The fix required two steps. First, in Security → Bots, disable “Instruct bot traffic with robots.txt” (the managed block). Second, add a custom robots.txt to the static folder:

User-agent: *
Allow: /

Sitemap: https://mikrub.com/sitemap.xml

Verify at /robots.txt that only your content appears, not Cloudflare’s managed rules. Test by asking Claude to fetch an article URL directly; if it works, you’re visible. (Direct fetch worked immediately for me. Web search indexing takes longer, as expected.)
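
The command-line version of that check is a one-liner; if Cloudflare’s managed rules were still active, extra directives for the AI user agents would show up alongside your own file:

curl -s https://mikrub.com/robots.txt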

This is worth emphasizing: if you’re publishing specifically to be found by AI agents (and increasingly that’s a meaningful distribution channel), you need to actively configure for it. The default posture of infrastructure is AI-hostile.

Corpus structure: the one bucket philosophy

The architecture is purely file-based at this stage, simple enough that it doesn’t warrant a diagram. That’s a feature, not a limitation.

The repository structure:

mikrub.com/
├── content/
│   ├── posts/              ← Published articles
│   └── about.md            ← About page
├── static/
│   └── robots.txt          ← AI crawler permissions
├── corpus/
│   ├── inbox/              ← HANDOVER.md (current state)
│   ├── sources/            ← Project log, architecture, research
│   ├── skills/             ← Procedural docs for Claude
│   ├── transcripts/        ← Archived chat sessions (Git LFS)
│   └── assets/             ← Binary files (future)
├── themes/PaperMod/        ← Hugo theme (submodule)
├── hugo.toml               ← Site configuration
└── .gitattributes          ← LFS config for transcripts

The corpus itself follows a one-bucket philosophy: no elaborate folder hierarchy, no decision about where something “belongs.” The key insight from previous failed systems: any organization scheme requiring decisions at capture time creates friction, and friction kills capture. Dump everything into sources, use frontmatter to tag what kind of thing it is, let semantic search handle retrieval. The intelligence is in the index, not the folder structure. (Daniel Miessler’s Fabric project takes a similar philosophy.)

Frontmatter follows a consistent schema:

---
title: "Visa agent payment analysis"
source_type: research       # research | draft | log | meta
lifecycle: snapshot         # snapshot (static) | living (updated)
date_created: 2026-01-15
public: false
tags:
  - payments
  - agentic-commerce
---

The goal: enough structure for machines to work with, little enough that humans don’t resist using it.

Session management: the workflow that emerged

Working with Claude across multiple sessions meant context got lost, decisions weren’t recorded, and previous work was hard to reference.

The planning phase itself produces documents: an architecture “charter” describing system design, and a development guide with implementation details. These aren’t static; they update with discoveries. When you hit a gotcha (Cloudflare’s hidden Pages flow, the robots.txt default), document it. When a decision changes architecture (Resend instead of Brevo, Git LFS for transcripts), update the charter. Documents evolve alongside work.

The solution for session continuity is a set of key files and procedural skills:

Key files:

File                                                 Purpose
corpus/inbox/HANDOVER.md                             Current project state. Start here.
corpus/sources/mikrub-project-log.md                 Full decision history (append-only)
corpus/sources/mik-knowledge-system-architecture.md  System design charter
corpus/skills/*.md                                   Procedural skills for Claude

Procedural skills:

Skill               What it does
update-project-log  Append decisions to permanent record
archive-transcript  Save conversation to corpus/transcripts/
agent-handover      Update HANDOVER.md with current state
update-draft        Incorporate new research into article draft

The first three run at the end of every session. The last one avoids dumping research text into the chat when refining articles.
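
I’m not reproducing the skill files here, but as a purely illustrative sketch (these steps are hypothetical, not the actual file), one can be as small as a markdown checklist:

corpus/skills/archive-transcript.md:
  1. Export the current conversation as markdown.
  2. Save it to corpus/transcripts/ with a dated filename (Git LFS picks it up via .gitattributes).
  3. Note the new transcript in HANDOVER.md.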

The project log is append-only: if you can edit history, the log becomes unreliable. Mistakes get corrected by a new entry, never by editing old ones.

Session length monitoring uses transcript size as a proxy for context window limits (Claude doesn’t expose a token count). Above 400KB, start wrapping up. Above 500KB, complete the current task and hand over. Transcripts go into Git LFS since they’re large, and they’ll be searchable once the semantic index runs.
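
Checking is a one-liner, assuming the session transcripts are exported as markdown into corpus/transcripts/:

du -k corpus/transcripts/*.md | sort -n | tail -3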

This workflow emerged from friction, not planning. But it’s now part of the infrastructure.

What’s live

At the end of Weekend 1:

mikrub.com is live with four articles migrated from LinkedIn (canonical URLs pointing back to the originals), plus an about page, an RSS feed, and an AI-visible robots.txt. Migration was instructive: feed PDF printouts to Claude, get Hugo-ready markdown back. No copy-paste, no manual reformatting.

mik.co redirects cleanly, preserving paths. ruberl.com does the same. Deep links work: mik.co/posts/some-article lands at mikrub.com/posts/some-article.

Email works both directions. [email protected] receives via Cloudflare routing, sends via Resend SMTP through Gmail. [email protected] routes to a Gmail label for future corpus capture automation. (One gotcha: the Gmail filter must match on the To: header specifically, not plus addressing; Cloudflare preserves the original To: header when forwarding.)
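
For reference, the Gmail filter just needs a plain To match rather than plus addressing; the address and label below are placeholders:

Matches:  to:(<the routed address>)
Do this:  Apply label "corpus"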

Corpus structure is in place, and the files are ready to be indexed by the semantic search layer that’s in scope for next weekend.

The stack:

Component        Choice
Static site      Hugo + PaperMod theme
Hosting          Cloudflare Pages
DNS/Redirects    Cloudflare
Email receive    Cloudflare Email Routing
Email send       Resend SMTP via Gmail
Large files      Git LFS

Total cost beyond domains I already owned: zero. Time invested: about six hours, maybe seven counting the troubleshooting detours.

What worked, what didn’t

What worked: agent-assisted stack selection saved days of research; the three-domain architecture is clean and future-proof; Resend for email was the right call over Brevo; the session workflow with project log, transcripts, and handover documents makes multi-session projects tractable; AI visibility configuration was straightforward once I knew it was needed; and the PDF-to-markdown migration via Claude was genuinely magical (feed it printouts, get structured content back).

What didn’t, or at least caused friction: Cloudflare’s UI has changed enough that Claude’s knowledge was outdated, requiring manual exploration; default settings are consistently wrong for this use case (robots.txt blocking AI crawlers, email services injecting tracking pixels); browser redirect caching made debugging slow. The friction was never insurmountable, but it was real, and pretending otherwise would be dishonest.

What’s next

Weekend 2 is about making the corpus queryable. Install Khoj locally, point it at the corpus and my chaotic Downloads folder, configure the MCP server, and test whether Claude can actually pull context while I’m writing. That’s when the thesis gets validated or refuted.

If you’re building something similar: the stack is free, portable, and under your control. That’s worth the friction.


This is article 2 of “The Residue,” a series about building a personal knowledge corpus with AI-native retrieval. Article 1 explains the problem being solved. Article 3 will cover Khoj and MCP integration.

This article was written by Claude from the project log and architecture documents generated during the work, with one editing pass for anecdotes and philosophical asides. The corpus feeding back into the writing: that’s rather the point.