Retrieval-augmented generation is not just “put documents in a vector database.” A reliable RAG system is a chain of lossy transformations: documents become chunks, chunks become embeddings, embeddings become ranked candidates, candidates become context, and context becomes an answer. Every stage can lose signal.
Chunk around decisions, not arbitrary token counts
Token windows matter, but semantic boundaries matter more. If a policy document has an exception, the exception must travel with the rule. If a tutorial has a prerequisite, the prerequisite must travel with the command. I prefer chunking that preserves local hierarchy: title, section, paragraph, code block, and source URL. The chunk should answer “where did this come from?” without asking the model to infer provenance.
{
  "id": "security-policy:auth:invite-revocation",
  "heading_path": ["Security Policy", "Authorization", "Invites"],
  "text": "Revoked invitations must not be accepted...",
  "source_url": "/docs/security-policy",
  "updated_at": "2026-05-22"
}
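One way to produce records like the one above is to split on headings and carry the heading stack with every chunk. The following is a minimal sketch, assuming the source is Markdown with `#`-style headings. `Chunk`, `slugify`, and `chunk_by_heading` are hypothetical names, and the id scheme (slugified heading path joined with colons) is an assumption, not necessarily the scheme behind the example record.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


@dataclass
class Chunk:
    id: str
    heading_path: list[str]
    text: str
    source_url: str
    updated_at: str


def slugify(value: str) -> str:
    """Lowercase and replace runs of non-alphanumerics with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def chunk_by_heading(markdown: str, source_url: str, updated_at: str) -> list[Chunk]:
    chunks: list[Chunk] = []
    path: list[str] = []  # current heading stack, outermost first
    body: list[str] = []  # lines collected under the current heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(
                id=":".join(slugify(h) for h in path) or "root",
                heading_path=list(path),
                text=text,
                source_url=source_url,
                updated_at=updated_at,
            ))
        body.clear()

    in_code = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # keep code blocks with their section
        match = None if in_code else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Splitting only at headings keeps a rule and its exception in the same chunk as long as they share a section. Sections that exceed the embedding model's window would still need a secondary split, which this sketch does not attempt.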
Use multiple retrieval signals
Dense embeddings are useful, but they are not magic. Production retrieval often needs hybrid scoring: semantic similarity, keyword match, recency, document authority, user permissions, and sometimes product state. A support article updated yesterday may deserve a boost over a stale but semantically similar answer.
# Weights sum to 1.0; each signal is assumed to be normalized to [0, 1].
score = (
    0.55 * vector_similarity
    + 0.20 * bm25
    + 0.10 * recency_boost
    + 0.10 * authority_weight
    + 0.05 * user_context_match
)
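The retrieval layer should also return more candidates than the final context will use, so that a reranker has room to reorder them. A reranker scores each candidate passage directly against the user’s actual question, which is more precise than the first-stage index but too expensive to run over the whole corpus.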
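The weighted formula above only behaves as intended when every signal is on a comparable scale. Raw BM25 scores are unbounded, so they need normalizing before they are mixed with a similarity in [0, 1]. Here is a minimal sketch of the two stages, assuming min-max normalization across the candidate set. `Candidate`, `min_max`, `hybrid_score`, `first_stage`, and `rerank` are hypothetical names, and `score_fn` stands in for whatever reranking model is available.

```python
from dataclasses import dataclass
from typing import Callable

# Weights taken from the formula above.
WEIGHTS = {
    "vector_similarity": 0.55,
    "bm25": 0.20,
    "recency_boost": 0.10,
    "authority_weight": 0.10,
    "user_context_match": 0.05,
}


@dataclass
class Candidate:
    doc_id: str
    signals: dict[str, float]  # assumed already normalized to [0, 1]


def min_max(values: list[float]) -> list[float]:
    """Rescale raw scores (e.g. BM25) to [0, 1] across one candidate set."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def hybrid_score(signals: dict[str, float]) -> float:
    """Weighted sum; a missing signal contributes zero."""
    return sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())


def first_stage(candidates: list[Candidate], k: int = 50) -> list[Candidate]:
    """Return a generous top-k so the reranker has something to reorder."""
    ranked = sorted(candidates, key=lambda c: hybrid_score(c.signals), reverse=True)
    return ranked[:k]


def rerank(
    query: str,
    candidates: list[Candidate],
    score_fn: Callable[[str, Candidate], float],
    top_n: int = 5,
) -> list[Candidate]:
    """Reorder first-stage candidates with a more precise, slower scorer."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]
```

The defaults `k=50` and `top_n=5` are placeholders. The right values depend on reranker latency and on how much context the generator can use.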
Evaluate failures as product behavior
A RAG system needs evals before it needs a bigger model. I maintain question sets across four groups: answerable questions, unanswerable questions, adversarially similar questions, and permission-sensitive questions. The last category is critical: the system must not retrieve private context just because it is relevant.
metrics:
  - retrieval_recall_at_5
  - citation_precision
  - answer_groundedness
  - refusal_accuracy
  - permission_leak_rate
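Three of these metrics can be computed from retrieval logs alone. The sketch below assumes one plausible definition for each; the definitions, `EvalCase`, and `report` are illustrative assumptions rather than a fixed standard. `citation_precision` and `answer_groundedness` need a judged answer, so they are left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class EvalCase:
    retrieved_ids: list[str]  # ranked retrieval output for the question
    relevant_ids: set[str]    # gold passages; empty if unanswerable
    allowed_ids: set[str]     # passages this user is permitted to see
    should_refuse: bool
    did_refuse: bool


def recall_at_k(case: EvalCase, k: int = 5) -> float:
    """Fraction of gold passages found in the top-k results."""
    hits = set(case.retrieved_ids[:k]) & case.relevant_ids
    return len(hits) / len(case.relevant_ids)


def leaked(case: EvalCase) -> bool:
    """True if any retrieved passage is outside the user's permissions."""
    return any(doc_id not in case.allowed_ids for doc_id in case.retrieved_ids)


def _mean(values: Iterable[float]) -> Optional[float]:
    items = list(values)
    return sum(items) / len(items) if items else None


def report(cases: list[EvalCase], k: int = 5) -> dict[str, Optional[float]]:
    # Recall is undefined when there are no gold passages, so skip those.
    answerable = [c for c in cases if c.relevant_ids]
    return {
        f"retrieval_recall_at_{k}": _mean(recall_at_k(c, k) for c in answerable),
        "refusal_accuracy": _mean(c.should_refuse == c.did_refuse for c in cases),
        "permission_leak_rate": _mean(leaked(c) for c in cases),
    }
```

Counting a leak at retrieval time, before generation, is deliberate in this sketch: a private passage that reaches the context window is already a failure, whether or not the answer quotes it.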
Make uncertainty visible
Predictable failure is a feature. If retrieval confidence is low, the system should say what it could not verify, ask for a narrower question, or route to a human workflow. The worst answer is a confident paragraph with a weak citation. A better answer exposes uncertainty and lets the user decide the next step.
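The three fallbacks above can be expressed as a small gate in front of generation. This is a sketch under stated assumptions: `decide`, `Action`, and `Decision` are hypothetical names, the scores are assumed to be reranker relevance scores in [0, 1], and the thresholds are placeholders that would need calibrating against an eval set.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ASK_TO_NARROW = "ask_to_narrow"
    ROUTE_TO_HUMAN = "route_to_human"


@dataclass
class Decision:
    action: Action
    note: str


def decide(
    rerank_scores: list[float],
    answer_at: float = 0.7,  # placeholder threshold, not a recommendation
    floor: float = 0.4,      # placeholder threshold, not a recommendation
) -> Decision:
    """Choose a response mode from the best retrieval score."""
    best = max(rerank_scores, default=0.0)  # no results counts as zero
    if best >= answer_at:
        return Decision(Action.ANSWER, "Answer and cite the supporting passages.")
    if best >= floor:
        return Decision(
            Action.ASK_TO_NARROW,
            "Say what could not be verified and ask for a narrower question.",
        )
    return Decision(Action.ROUTE_TO_HUMAN, "Hand off to a human workflow.")
```

Keeping the gate outside the prompt means the refusal behavior can be tested directly, without depending on the model to notice that its context is weak.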
I think the best AI products feel boring in the best way: they cite sources, decline when they should, and make their limitations legible. Reliability is built less by one clever prompt and more by engineering the whole retrieval pipeline as a system that can be measured.