RAG for Builders: The Mental Model You Actually Need
A clear technical model for Retrieval-Augmented Generation: when to use it, where it fails, and what to measure.
View all rag depths →Depth ladder for this topic:
You’ve built a chatbot that answers questions about your product documentation. It works great on the demo. Then you deploy it and within 24 hours a user finds a question where it confidently answers with information that contradicts your actual documentation. Someone else asks about a feature released last month and the bot gives a description from an old version. A third user quotes the bot’s answer in a support ticket, and it’s wrong.
This is the baseline failure mode of LLMs without retrieval. The model knows what it learned during training — not what’s in your current documents. RAG (Retrieval-Augmented Generation) is the architecture that fixes this, and understanding it properly is what separates teams that ship reliable AI features from teams that ship impressive demos.
What RAG actually is
RAG is not a feature, a product, or a library. It’s an architectural pattern: instead of asking the LLM to recall information from training, you retrieve relevant information from your own documents and inject it into the prompt as context. The model’s job becomes synthesis and reasoning — not recall.
The pipeline has five stages:
1. Ingest your source documents. PDFs, markdown files, database records, web pages — whatever contains the knowledge your system needs to answer questions. Ingest is often underinvested; teams push raw files without thinking about quality. Garbage in, garbage out applies directly to RAG.
2. Chunk documents into retrievable units. You can’t stuff an entire document into context for every query (it’s expensive and degrades quality). Chunking breaks documents into pieces — a few hundred to a few thousand tokens each — that can be retrieved individually when relevant. Good chunking preserves semantic coherence: a chunk should be a complete idea, not an arbitrary slice of a page.
3. Embed chunks into a vector space. An embedding model converts each chunk into a numerical vector that represents its semantic meaning. Similar concepts produce similar vectors. This is what enables semantic search — finding chunks that are meaningfully relevant to a query, not just chunks that share keywords.
4. Retrieve the most relevant chunks for a given query. The user’s question is embedded using the same model, and the system finds the chunks whose vectors are closest to the query vector. Typically retrieve top-5 to top-10, then apply a reranking step to select the best 3–5 for the final context window.
5. Generate the answer by providing the retrieved chunks as context to the LLM. The prompt looks something like: “Using only the following documentation: [chunks]. Answer this question: [user query]. If the answer isn’t in the documentation, say so.”
Input → process → output: a real example
Scenario: A B2B SaaS company builds a support chatbot that answers questions about their product, grounded in their documentation.
User query: “How do I set up SSO with Okta?”
Retrieval step:
- Query embedding is computed
- Vector search retrieves 8 candidate chunks from the documentation corpus
- Reranker selects 3 highest-relevance chunks: the SSO configuration guide, the Okta-specific integration page, and the FAQ about authentication errors
Generation step: The LLM receives:
You are a helpful support assistant for [Product]. Answer the user's question
using ONLY the documentation provided below. If the answer is not in the
documentation, say "I don't have that information — please contact support."
Documentation:
[Chunk 1: SSO Configuration Overview — 340 tokens]
[Chunk 2: Okta Integration Guide — 520 tokens]
[Chunk 3: Authentication FAQ — 280 tokens]
User question: How do I set up SSO with Okta?
Output: Step-by-step SSO setup instructions grounded in the actual current documentation, with a note pointing to the specific guide for more detail.
The model didn’t recall this from training. It read the relevant documentation and synthesized an answer. This is why RAG handles version-specific, company-specific, and recently-updated content where a base LLM would hallucinate.
Where teams fail first
Bad chunking. This is the most underestimated failure mode. Teams chunk by page (too long, semantically mixed), by sentence (too short, loses context), or by arbitrary token count with no regard for paragraph boundaries. The result: retrieved chunks are confusing, redundant, or missing the key context needed to answer the question. Good chunking is an editorial decision as much as a technical one — each chunk should be able to stand alone as a meaningful unit.
No metadata filtering. Not all documents are equally relevant to all queries. A chunk from a 2022 blog post is less authoritative than a chunk from your current documentation. A chunk tagged for “Enterprise tier” shouldn’t be served to Basic tier users. Without metadata (date, source type, product version, permission level), your retrieval is a flat pile. Add metadata to every chunk and filter before ranking.
No evaluation set. You cannot improve a RAG system you haven’t measured. The teams that build RAG systems without an evaluation set are optimizing by feel — and feel is a terrible guide for retrieval quality. Build a set of 50–100 question-answer pairs from your real documentation before you write a line of retrieval code. Use it to measure every significant change to your pipeline.
Treating RAG output as truth. RAG reduces hallucination — it doesn’t eliminate it. The model can misread a retrieved chunk, misapply information to the wrong context, or cite a chunk as supporting something it doesn’t say. RAG output is evidence-backed synthesis, not verified fact. Keep humans in the loop for high-stakes queries and return citations with every answer.
Practical defaults that work
These aren’t universal — but they’re the right starting point for most RAG systems:
Chunk size: 300–800 tokens. Stay in this range unless you have a specific reason to go higher. Larger chunks reduce the precision of retrieval; smaller chunks lose context.
Overlap: 10–20% overlap between adjacent chunks prevents losing information at boundaries. A concept that starts at the end of one chunk will appear at the beginning of the next.
Retrieval: Retrieve top-5 to top-10 candidates, then rerank to top-3 to top-5 for the context window. The retrieval step casts wide; the reranker narrows. Using both consistently outperforms using either alone.
Always return citations. Every answer should include the source chunks or document references used. This enables user verification, simplifies debugging, and builds trust. There’s no good reason not to include them.
Instruction for uncertainty: Explicitly instruct the model to say “I don’t have that information” rather than guessing. “If the answer is not in the provided documents, say so directly” is one of the most important lines in a RAG system prompt.
Metrics that actually matter
Without measurement, you’re optimizing vibes. These four metrics give you a real signal:
Retrieval recall@k. For a set of questions where you know the answer is in your corpus, what percentage of the time does the correct chunk appear in the top-k retrieved results? This measures retrieval quality independently of generation. If recall@5 is 60%, you’re missing the right chunk 40% of the time before the LLM even sees it — no amount of prompt engineering fixes this.
Groundedness. Does the generated answer actually match the retrieved chunks? A grounded answer cites claims that appear in the context. An ungrounded answer makes claims that aren’t supported by the retrieved chunks — this is RAG-specific hallucination. Groundedness can be evaluated automatically (using another LLM as a judge) or via human review.
Answer usefulness. Do users get what they need? Groundedness doesn’t guarantee usefulness — a perfectly grounded answer can still fail to address the actual question. Collect human ratings (thumbs up/down, 1–5 scale) on real answers. This is the ultimate measure.
Latency and cost per query. RAG adds steps to every query: embedding, vector search, reranking. These have latency and cost implications. Measure p50 and p95 latency for the full pipeline, and track cost per query as you scale.
RAG vs fine-tuning: the decision
Teams frequently ask whether to use RAG or fine-tune the model. The answer depends on what problem you’re actually solving.
Use RAG when:
- Knowledge changes frequently and needs to stay current (documentation, policies, product updates)
- Answers must be traceable to specific source documents
- You need to add a new knowledge domain without retraining
- Cost of maintaining a fine-tuned model is too high
Use fine-tuning when:
- The main issue is behavioral — how the model responds, not what it knows
- You need consistent style, tone, or format that prompting alone can’t reliably produce
- You have high-quality labeled examples of the exact behavior you want
- The knowledge domain is stable and comprehensive training data exists
The most capable production systems often use both: fine-tuned behavior (so the model reliably follows your format and tone) plus RAG for current factual knowledge. They solve different problems.
The common path to production
Most teams go through the same evolution:
Week 1–2 (prototype): Simple retrieval with naive chunking, no reranking, no evaluation set. Good enough for internal demos. Not good enough to show customers.
Week 3–4 (basic production): Better chunking strategy, metadata added, evaluation set built and measured, citations returned with answers. Acceptable for low-stakes internal use.
Month 2+ (reliable production): Reranking added, retrieval metrics tracked, groundedness evaluated, failure modes documented and addressed, update pipeline for keeping the corpus current. Ready for customer-facing deployment.
Don’t skip the middle stage to rush to launch. The teams that do are the ones with the chatbots that confidently cite outdated documentation.
Pitfalls and failure modes
Not updating the corpus. A RAG system is only as current as its document corpus. If your documentation updates but your RAG corpus doesn’t, you’re serving stale information confidently. Build a corpus update pipeline as part of the initial system, not as a future improvement.
Context window stuffing. Larger context windows exist, but filling them with loosely relevant chunks degrades generation quality. Focus on precision: 3–5 highly relevant chunks consistently outperform 20 loosely relevant ones.
Over-relying on semantic similarity. Vector search finds semantically similar chunks. But user queries often have specific constraints (a version number, a date, a product tier) that semantic similarity doesn’t handle. Combine semantic retrieval with metadata filtering for better precision.
No graceful degradation. What happens when the retrieval step returns no relevant chunks? Many systems either hallucinate or produce a confusing non-answer. Design an explicit fallback: “Based on our documentation, I couldn’t find an answer to this. Please contact support.” Users handle honest uncertainty far better than a confident wrong answer.
Bottom line
RAG is a search-and-reasoning architecture. The LLM is the reasoning layer. Your leverage as a builder is entirely in the other layers: how well you curate and chunk documents, how precisely you retrieve, and how rigorously you measure.
Build your evaluation set before you write your first chunk. Measure retrieval quality before you debug generation quality. Return citations in every answer. Update your corpus as your knowledge base changes.
The teams winning with RAG in production are the ones treating it like a data system with AI on top — not an AI system that happens to use data.
Simplify
← RAG Freshness and Staleness: The Part Builders Underestimate
Go deeper
RAG in Production: Architecture Decisions That Actually Matter →
Related reads
Stay ahead of the AI curve
Weekly insights on AI — explained at the level that's right for you. No hype, no jargon, just what matters.
No spam. Unsubscribe anytime. We respect your inbox.