How the "Ask Me" Bot on This Site Actually Works
A small RAG-style assistant that answers questions about me, built with no vector database, no framework, and a model bill measured in cents — here's the whole architecture, and the guardrails that keep a public endpoint from spending my money.
There’s a little “Ask me anything” box on this site. Type a question — “what has Ken worked with?”, “does he know Kubernetes?”, “is he available?” — and a couple of seconds later you get a straight, two-sentence answer with links to where it came from. No login, no chat history, no “as an AI language model.”
It’s the kind of feature that looks like it needs a vector database, a framework, and a monthly bill. It doesn’t. The whole thing is a single edge function, a JSON file built at deploy time, and the cheapest model I could get away with. Here’s exactly what happens when you use it — and the design decisions I’d defend.
The one bar I refused to lower
Before any code, the requirement: it can only say things that are actually true about me.
A general chatbot bolted onto a personal site is a liability. Ask it where I worked and it’ll happily invent a company, a date, a degree — confidently, in my voice, on my domain. That’s worse than having no bot at all. So the entire design is in service of one rule: answer only from my real writing and profile, and when the answer isn’t there, say so and stop.
Everything below follows from that.
The shape: retrieval, then generation
The pattern is retrieval-augmented generation (RAG), and stripped of the hype it’s three steps:
- Retrieve the handful of passages from my own material most relevant to the question.
- Stuff those passages into the prompt as the only allowed source of truth.
- Generate a short answer grounded in them, with citations.
The model never answers from its training data. It answers from my data, which I control. That’s the whole trick — the model supplies fluency; the facts come from me.
No vector database (and why that’s the right call here)
The reflex move for retrieval is embeddings: convert every passage into a vector, store it in a vector database, and find matches by semantic similarity. It’s powerful, and for this site it would have been over-engineering.
My corpus is small and mine: a few dozen blog posts plus a structured profile (projects, experience, skills, education). For a corpus that size, plain keyword scoring is good enough, cheaper, and has zero moving parts — no embedding model to call, no vector store to host, no index to keep in sync.
So at build time, a route walks my content collection and my profile data, strips the markdown down to plain text, and emits a flat JSON file — search-index.json — full of little chunks that each look like this:
{
  "id": "post-how-ask-bot-works",
  "title": "How the \"Ask Me\" Bot on This Site Actually Works",
  "url": "/blog/how-ask-bot-works",
  "text": "A small RAG-style assistant that answers questions about me..."
}
Posts become chunks. So does each project, each job, my skills, my education, my “now” page, and a short “about” blurb. The index is just a static asset served off the CDN.
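For illustration, here is a minimal sketch of how such a build step could look. It is not the site's actual code: the input shapes (`slug`, `body`, `kind`), the `buildIndex` name, and the regex-based `stripMarkdown` helper are all assumptions, and a real build would more likely use a proper markdown parser.

```javascript
// Hypothetical helper: a crude markdown-to-text pass using regexes.
// Note it also strips underscores and asterisks inside ordinary words.
function stripMarkdown(md) {
  return md
    .replace(/`{3}[\s\S]*?`{3}/g, " ")          // drop fenced code blocks
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1")   // images -> alt text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")    // links -> link text
    .replace(/^#{1,6}\s+/gm, "")                // heading markers
    .replace(/[*_`>]/g, "")                     // emphasis, inline code, quote markers
    .replace(/\s+/g, " ")                       // collapse whitespace
    .trim();
}

// Assumed inputs: posts as { slug, title, body }, profile entries as
// { kind, slug, title, url, body }. Output matches the chunk shape shown above.
function buildIndex(posts, profileEntries) {
  const postChunks = posts.map((post) => ({
    id: `post-${post.slug}`,
    title: post.title,
    url: `/blog/${post.slug}`,
    text: stripMarkdown(post.body),
  }));
  const profileChunks = profileEntries.map((entry) => ({
    id: `${entry.kind}-${entry.slug}`,
    title: entry.title,
    url: entry.url,
    text: stripMarkdown(entry.body),
  }));
  return [...postChunks, ...profileChunks];
}
```

Serializing the returned array with `JSON.stringify` would give the static file; how the file is written out depends on the site generator, which is left out here.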
Retrieval at query time is deliberately dumb. Tokenize the question, drop stopwords, and score every chunk by how often the question’s terms appear in it, with matches in the title weighted more heavily than matches in the body, since a title hit is a strong signal of topical relevance:
// title hits count for more than body hits
let score = 0;
for (const term of terms) {
  score += countOccurrences(title, term) * 3;
  score += countOccurrences(text, term);
}
Take the top handful of chunks, cap the total context so the prompt stays lean, and that’s the retrieval layer. No cosine similarity, no database, no cold-start latency.
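Put together, the retrieval layer described above might look roughly like the sketch below. The stopword list, whole-token matching, the `topK` and `maxChars` defaults, and truncating the last chunk to fit the budget are all assumptions made for the example; only the title weight of 3 comes from the snippet above.

```javascript
// Assumed stopword list; a real one would be longer.
const STOPWORDS = new Set([
  "a", "an", "and", "the", "is", "are", "he", "she", "does", "do", "did",
  "has", "have", "what", "with", "of", "to", "in", "on", "for", "about", "me",
]);

// Lowercase, split on non-alphanumerics, drop empties and stopwords.
function tokenize(question) {
  return question
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0 && !STOPWORDS.has(t));
}

// Count whole-token matches of `term` (assumed; substring matching is another option).
function countOccurrences(haystack, term) {
  let n = 0;
  for (const token of haystack.toLowerCase().split(/[^a-z0-9]+/)) {
    if (token === term) n += 1;
  }
  return n;
}

function retrieve(question, chunks, { topK = 4, maxChars = 4000 } = {}) {
  const terms = tokenize(question);
  const scored = chunks
    .map((chunk) => {
      let score = 0;
      for (const term of terms) {
        score += countOccurrences(chunk.title, term) * 3; // title hits weigh more
        score += countOccurrences(chunk.text, term);
      }
      return { chunk, score };
    })
    .filter((s) => s.score > 0)          // chunks with no hits never reach the prompt
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);

  // Cap total context: truncate the chunk that crosses the budget, then stop.
  const picked = [];
  let used = 0;
  for (const { chunk } of scored) {
    const remaining = maxChars - used;
    if (remaining <= 0) break;
    const text = chunk.text.slice(0, remaining);
    picked.push({ ...chunk, text });
    used += text.length;
  }
  return picked;
}
```

With this sketch, a question that shares no tokens with any chunk returns an empty array, which is the case the refusal path has to handle.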
The honest tradeoff: keyword scoring is literal. Ask about “k8s” when my posts say “Kubernetes” and it can whiff, because it has no notion that those are the same thing — that’s exactly what embeddings buy you. The day my corpus is big enough, or my phrasing varies enough, that the misses start to matter, I’ll switch. It isn’t there yet, and shipping the simple version that works beats hosting the impressive version that I have to babysit.
Freshness for free
A static index has one obvious risk: it goes stale the moment I publish a post or update my profile. That’s handled by something the site already does — a scheduled job rebuilds and redeploys the whole site once a day, which regenerates search-index.json along with it. Publish a post, and by the next morning the bot can answer questions about it. No separate indexing pipeline; the deploy is the indexing pipeline.
The prompt is the product
People think the model is the hard part. For a bot like this, the system prompt is where the behavior actually lives. Mine is a short list of non-negotiable rules, roughly:
- Answer using only the provided context. Never invent employers, dates, skills, or numbers.
- If the answer isn’t in the context, say: “I don’t have that in Ken’s materials — the contact form is the best way to ask him directly.”
- Keep it to a couple of concise sentences. Refer to me in the third person as “Ken.”
- Don’t mention “context” or “excerpts.” Plain text, no markdown.
That refusal line is the most important sentence in the whole system. A bot that gracefully admits it doesn’t know is worth ten that confidently make something up. Most of the engineering effort in a grounded assistant goes into making “I don’t know” the easy, default path — not into making the answers cleverer.
The user turn is just the question plus the retrieved chunks, clearly labeled as the context to draw from. The model gets fluency; it does not get latitude.
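As a sketch of what "clearly labeled" could mean in practice, the function below assembles a user turn from the question and the retrieved chunks. The `CONTEXT:` and `QUESTION:` labels, the numbering, and the placeholder for an empty result are illustrative assumptions; the actual prompt layout is not shown in this post.

```javascript
// Hypothetical layout: numbered passages under CONTEXT, then the question.
function buildUserTurn(question, chunks) {
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.title} (${c.url})\n${c.text}`)
    .join("\n\n");
  // An explicit placeholder keeps the empty case unambiguous for the model.
  const body = context || "(no matching passages)";
  return `CONTEXT:\n${body}\n\nQUESTION:\n${question}`;
}
```

Keeping the URL next to each passage is one way to make citations possible, since every statement can be traced back to a chunk the retriever actually returned.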
The cheapest model that can do the job
The task is narrow: read a few short passages, write two grounded sentences. That does not need a frontier model. It needs a fast, cheap one that follows instructions — so it runs on Claude Haiku 4.5.
This is a habit worth keeping: match the model to the task, not to the hype. Reserve the big models for genuinely hard reasoning; for bounded, well-specified jobs like this, the small model is faster, cheaper, and just as correct. At my traffic the monthly cost rounds to coffee money, and the answer comes back quickly because the model is small and the prompt is short.
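For a sense of how small the model call is, here is a hedged sketch that builds a request for Anthropic's Messages API and pulls the text out of the response. It assumes the endpoint is called directly over HTTP; the model id string, the `max_tokens` value, and the function names are assumptions for the example, so check the provider's documentation for current model ids.

```javascript
// Assumed values, not taken from the site's code.
const ASSUMED_MODEL_ID = "claude-haiku-4-5";
const ASSUMED_MAX_TOKENS = 200; // enough room for a couple of sentences

// Returns the URL and fetch() options; nothing is sent here.
function buildModelRequest({
  apiKey,
  system,
  userTurn,
  model = ASSUMED_MODEL_ID,
  maxTokens = ASSUMED_MAX_TOKENS,
}) {
  return {
    url: "https://api.anthropic.com/v1/messages",
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
      },
      body: JSON.stringify({
        model,
        max_tokens: maxTokens,
        system, // the rules live in the system prompt
        messages: [{ role: "user", content: userTurn }],
      }),
    },
  };
}

// The response carries a list of content blocks; join the text ones.
function extractAnswer(responseJson) {
  return (responseJson.content || [])
    .filter((block) => block.type === "text")
    .map((block) => block.text)
    .join("")
    .trim();
}
```

In an edge function this would be used as `const { url, init } = buildModelRequest(...)`, followed by `fetch(url, init)` and `extractAnswer(await res.json())`. A low `max_tokens` also acts as a second cost ceiling on top of the short prompt.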
It’s a public endpoint that spends my money
Here’s the part most “build a chatbot” tutorials skip entirely. The instant you put an LLM call behind a public URL, you’ve created a thing that strangers can trigger and you get billed for. Treat it like the liability it is. Layers, cheapest first:
- Same-origin only. Cross-origin POSTs are rejected, so another site can’t quietly embed my endpoint and burn my budget. (Requests with no origin at all, such as curl or scripts, fall through to the limiter below rather than getting a free pass.)
- Rate limiting. A per-visitor ceiling over a short window, plus a global daily cap across everyone. The per-IP limit stops one person hammering it; the daily cap means that even a determined abuser can only cost me a fixed, known amount before the feature politely shuts off for the day and points to the contact form. (I’m not publishing the exact numbers; that’s just handing out the budget.)
- A human check. Cloudflare Turnstile gates the request, so it’s a person, not a bot farm.
- Hard input limits. The request body is size-capped and the question length is bounded before anything reaches the model — no 50 KB “questions,” no prompt-stuffing.
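The origin check and the input limits from the list above can be sketched as one cheap pre-flight function that runs before any network call. The JSON body with a `question` field, the specific limits, and the status codes are assumptions for the example; the Turnstile verification and the rate limiter are left out here.

```javascript
// Placeholder limits, chosen for the example only.
const MAX_BODY_BYTES = 2048;
const MAX_QUESTION_CHARS = 300;

// `origin` is the request's Origin header (or null), `siteOrigin` is this site's origin.
function checkRequest({ origin, siteOrigin, rawBody }) {
  // Cross-origin requests are rejected. A missing Origin (curl, scripts) is
  // allowed through on purpose; the rate limiter deals with those.
  if (origin && origin !== siteOrigin) {
    return { ok: false, status: 403, reason: "cross-origin" };
  }
  // Size cap on the raw body, measured in bytes, before parsing anything.
  if (new TextEncoder().encode(rawBody).length > MAX_BODY_BYTES) {
    return { ok: false, status: 413, reason: "body too large" };
  }
  let parsed;
  try {
    parsed = JSON.parse(rawBody);
  } catch {
    return { ok: false, status: 400, reason: "invalid JSON" };
  }
  const question =
    typeof parsed?.question === "string" ? parsed.question.trim() : "";
  if (question.length === 0 || question.length > MAX_QUESTION_CHARS) {
    return { ok: false, status: 400, reason: "bad question length" };
  }
  return { ok: true, question };
}
```

Ordering the checks from cheapest to most expensive means a rejected request costs a string comparison or a length check, never a model call.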
One design choice worth calling out: the rate limiter fails open. If its backing store isn’t reachable, requests are allowed through rather than blocked. For a personal site that’s the right default — I’d rather the feature stay up than have a transient infra hiccup take it down — but it’s a deliberate “availability over strictness” call, backed by the daily spend cap and Cloudflare’s own protections so “fail open” can’t become “fail expensive.” On a higher-stakes endpoint I’d flip that to fail closed.
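A fail-open limiter of this shape can be sketched as follows. The `store.incr(key, ttlSeconds)` interface is hypothetical (any key-value store with an atomic counter would do), the fixed-window scheme is an assumption, and the numbers are placeholders rather than the real limits.

```javascript
// Placeholder numbers for the example; the real limits are deliberately unpublished.
const EXAMPLE_LIMITS = { perVisitor: 5, windowSeconds: 60, dailyGlobal: 500 };

// `store` is a hypothetical counter store: incr(key, ttlSeconds) resolves to the new count.
async function allowRequest(store, visitorId, now = new Date(), limits = EXAMPLE_LIMITS) {
  // Fixed-window keys: one counter per visitor per window, one global counter per UTC day.
  const windowId = Math.floor(now.getTime() / 1000 / limits.windowSeconds);
  const day = now.toISOString().slice(0, 10);
  try {
    const visitorCount = await store.incr(`v:${visitorId}:${windowId}`, limits.windowSeconds);
    if (visitorCount > limits.perVisitor) {
      return { allowed: false, reason: "visitor" };
    }
    const globalCount = await store.incr(`g:${day}`, 86400);
    if (globalCount > limits.dailyGlobal) {
      return { allowed: false, reason: "daily-cap" };
    }
    return { allowed: true };
  } catch (err) {
    // Fail open: if the store is unreachable, let the request through.
    // Flipping this to `allowed: false` gives the fail-closed variant.
    return { allowed: true, degraded: true };
  }
}
```

Note that in this sketch both counters live in the same store, so an outage disables the daily cap as well. That is why fail-open only makes sense with a backstop that lives outside the store.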
None of this is exotic. It’s the boring hygiene that separates a demo from something you can actually leave running on the public internet with your API key behind it.
What I’d change as it grows
The architecture is honest about its size. If this site got ten times bigger, here’s where I’d reach next:
- Embeddings for retrieval, the moment keyword misses start costing real answers.
- Streaming the response, so the answer types out instead of landing all at once.
- An eval set — a fixed list of questions with expected behavior — so I can change the prompt or swap the model and measure whether answers got better or worse, instead of guessing.
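To make the eval-set idea concrete, here is one possible shape for it, as a sketch of something that does not exist yet. The case format, the example questions, and the string-matching grader are all hypothetical; the refusal marker reuses the opening of the refusal line quoted earlier.

```javascript
// Opening of the refusal line from the system prompt, used to detect refusals.
const REFUSAL_MARKER = "I don't have that in Ken's materials";

// Hypothetical cases: each pairs a question with the behavior expected of the bot.
const EVAL_CASES = [
  { question: "Does Ken know Kubernetes?", expect: { mustInclude: ["Kubernetes"] } },
  { question: "What is Ken's favorite color?", expect: { mustRefuse: true } },
];

// Crude grader: case-insensitive substring checks, with curly apostrophes normalized.
function gradeAnswer(answer, expect) {
  const text = answer.replace(/\u2019/g, "'").toLowerCase();
  const refused = text.includes(REFUSAL_MARKER.toLowerCase());
  if (expect.mustRefuse) return refused;
  if (refused) return false; // refused a question it should have answered
  return (expect.mustInclude || []).every((s) => text.includes(s.toLowerCase()));
}

// `answerFn` is whatever produces the bot's answer for a question (sync or async).
async function runEvals(cases, answerFn) {
  const results = [];
  for (const c of cases) {
    const answer = await answerFn(c.question);
    results.push({ question: c.question, pass: gradeAnswer(answer, c.expect) });
  }
  const passed = results.filter((r) => r.pass).length;
  return { passed, total: results.length, results };
}
```

Running the same cases before and after a prompt or model change turns "did it get better?" into a pass count that can be compared directly.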
I haven’t done those because I don’t need them yet, and adding them now would be building for an imaginary scale. The version that’s running is the version the problem actually calls for.
The point
The “Ask me” bot is maybe a hundred lines of real logic: a build-time index, keyword retrieval, a tight system prompt, a small model, and a stack of unglamorous guardrails. No vector database, no framework, no recurring bill beyond pennies.
That’s the lesson I keep relearning with AI: the impressive-sounding architecture is usually not the one you need. A small, boring, well-guarded tool that does exactly one thing — and refuses to lie about me — beats a clever one I have to babysit. Build the simplest thing that meets the bar, guard it like it’s spending your money, and ship it.
It is, after all, spending my money.