Building RAG Locally to Actually Understand It
I recently read through Databricks’ tutorial on Retrieval-Augmented Generation (RAG) and it finally clicked why people keep saying “RAG is the real product.” It’s not just “use an LLM.” It’s a pipeline: ingest → chunk → embed → retrieve → prompt → generate.
So I’m building a small RAG system locally—not to ship it, but to understand the mechanics and the limits (CPU, GPU, memory, latency) from first principles.
This post is my “learn-by-building” plan and a simple baseline implementation you can run on a laptop.
What RAG is (in one paragraph)
RAG = give the model relevant context at query time. Instead of hoping the LLM “knows” your docs, you: 1) convert docs into searchable vectors (embeddings), 2) fetch the most relevant chunks (retrieval), 3) stuff those chunks into the prompt (augmentation), 4) generate an answer grounded in that context.
No fine-tuning required. If your docs change, you re-index, not retrain.
The smallest RAG architecture that teaches you everything
Offline (index build):
- Load documents (PDF/text/markdown)
- Chunk them (split into small passages)
- Embed each chunk (turn text → vector)
- Store vectors + metadata (a “vector DB”, even if it’s just a local file)
Online (query):
- Embed the user query
- Search vectors for nearest neighbors (top-k chunks)
- Build a prompt with:
- instructions,
- retrieved chunks,
- user question
- Ask the LLM to answer using only the provided context (or to cite it)
That’s it. Everything else (reranking, hybrid search, caching, evals) is “RAG v2.”
Why I’m doing this locally
Local RAG is great for learning because you feel the tradeoffs:
- Chunk size too big → retrieval gets fuzzy + context fills up fast
- Chunk size too small → you lose meaning + answers get fragmented
- Embeddings too weak → “it retrieves the wrong stuff”
- Context too long → latency spikes + answers get worse, not better
- CPU-only → embeddings might be fine, generation can be slow
- GPU → fast generation, but VRAM becomes the hard ceiling
The “gotchas” that make or break a RAG system
1) Chunking matters more than you think
Most failures are chunking failures.
Common approaches:
- Fixed token/char windows + overlap (simple and solid)
- Markdown/heading-aware splitting (better for docs)
- Semantic splitting (fancier; not needed for v1)
Rules of thumb:
- Start with ~300–800 tokens per chunk
- Use overlap (~10–20%) so important context isn’t split away
- Store metadata: filename, section heading, page, etc.
2) Retrieval quality is the real product
If retrieval returns junk, the LLM will confidently summarize junk.
You’ll eventually want:
- better embeddings,
- reranking (a second model that re-orders results),
- or hybrid search (keyword + vectors).
But first: build the baseline.
3) Context limits are a budget
Even with long-context models, you should treat context as expensive:
- More context != better
- Irrelevant context actively harms answers
A minimal local RAG baseline
For my baseline I’m keeping the stack simple:
- FAISS for local vector search
- sentence-transformers for embeddings
- Any local LLM for generation (e.g. via Ollama)
You can swap components later. The point is to make the pipeline real.
The flow is straightforward: load your docs, split them into chunks, embed each chunk with a sentence-transformer model, and store the vectors in a FAISS index. At query time, embed the question, find the top-k nearest chunks, pack them into a prompt, and send it to the LLM with instructions to answer using only the provided context.
Even this bare-bones setup teaches you most of what matters—chunking tradeoffs, retrieval quality, and how context length affects answers.