Retrieval-Augmented Generation (RAG) systems don’t just answer questions from memory; they look things up first and answer from what they find.

Let’s see RAG in action. Imagine we have a simple document store:

[
  {"id": 1, "content": "The capital of France is Paris. It's known for the Eiffel Tower."},
  {"id": 2, "content": "The capital of Germany is Berlin. It has a rich history and vibrant arts scene."},
  {"id": 3, "content": "The Eiffel Tower is a famous landmark in Paris, France."}
]

And we want to ask: "What is the Eiffel Tower?"

A RAG system would first retrieve relevant documents. It might use a vector search (embedding both the documents and the query) to find the documents that are semantically closest to the question, rather than matching exact keywords. In our case, documents 1 and 3, which both mention the Eiffel Tower, are the most relevant.
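The retrieval step can be sketched in a few lines. This is a minimal illustration, not a production retriever: `toy_embed` is a hypothetical stand-in that counts words, whereas a real system would call an embedding model and a vector store. The names `toy_embed`, `cosine`, and `retrieve` are assumptions made for this example.

```python
import math
import re
from collections import Counter

DOCS = [
    {"id": 1, "content": "The capital of France is Paris. It's known for the Eiffel Tower."},
    {"id": 2, "content": "The capital of Germany is Berlin. It has a rich history and vibrant arts scene."},
    {"id": 3, "content": "The Eiffel Tower is a famous landmark in Paris, France."},
]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query, best first."""
    q = toy_embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, toy_embed(d["content"])), reverse=True)
    return ranked[:k]

top = retrieve("What is the Eiffel Tower?", DOCS)
print([d["id"] for d in top])  # documents 1 and 3 rank above document 2
```

Because the toy vectors only count shared words, this sketch behaves like keyword overlap. Swapping in a real embedding model is what lets retrieval match on meaning instead.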

Then, it augments the prompt to the Language Model (LLM) with this retrieved information. The prompt becomes something like:

"Based on the following information:

  • Document 1: The capital of France is Paris. It’s known for the Eiffel Tower.
  • Document 3: The Eiffel Tower is a famous landmark in Paris, France.

Answer the question: What is the Eiffel Tower?"

The LLM then generates an answer grounded in the provided context, not just its pre-trained knowledge. It would likely output: "The Eiffel Tower is a famous landmark in Paris, France."
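The augmentation step is mostly string formatting. The sketch below assembles a prompt like the one above from a list of retrieved documents. `build_prompt` is a hypothetical helper name, and the call to the LLM itself is left out because it depends on the provider you use.

```python
def build_prompt(question, docs):
    """Format retrieved documents and the user question into one prompt string."""
    context = "\n".join(f"- Document {d['id']}: {d['content']}" for d in docs)
    return (
        "Based on the following information:\n\n"
        f"{context}\n\n"
        f"Answer the question: {question}"
    )

# Documents assumed to have been returned by the retrieval step.
retrieved = [
    {"id": 1, "content": "The capital of France is Paris. It's known for the Eiffel Tower."},
    {"id": 3, "content": "The Eiffel Tower is a famous landmark in Paris, France."},
]

prompt = build_prompt("What is the Eiffel Tower?", retrieved)
print(prompt)
```

Keeping the document ids in the prompt makes it easier to ask the model to cite which document supported its answer.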

RAG addresses two core problems: the LLM’s tendency to hallucinate and its fixed knowledge cutoff. LLMs are trained on vast datasets, but that data is static, so they don’t know about events or information that appeared after their training data was collected. They can also confidently invent facts. RAG reduces both problems by injecting up-to-date, specific, or proprietary information into the prompt at inference time, so the model can answer from supplied context instead of relying only on what it memorized.

Internally, a RAG system typically involves these components:

  1. Document Loader: Ingests data from various sources (files, databases, APIs).
  2. Text Splitter: Breaks down large documents into smaller, manageable chunks. This is crucial because LLMs have token limits, and smaller chunks provide more focused context.
  3. Embedding Model: Converts text chunks into numerical vectors (embeddings) that capture semantic meaning. Models like text-embedding-ada-002 or all-MiniLM-L6-v2 are common.
  4. Vector Store: Stores these embeddings and allows for efficient similarity search. Examples include Chroma, FAISS, Pinecone, or Weaviate.
  5. Retriever: Queries the vector store using the embedded user query to find the most relevant text chunks.
  6. LLM (Generator): Receives the user query and the retrieved chunks, then generates a coherent answer.
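To show how the six components fit together, here is a toy end-to-end sketch. Every piece is a deliberately simple placeholder chosen for this example: the loader returns hard-coded strings, the splitter cuts on sentence boundaries, `embed` counts words instead of calling a real embedding model, `InMemoryVectorStore` does a brute-force search, and `generate` returns the prompt rather than calling an LLM.

```python
import math
import re
from collections import Counter

def load_documents():
    # 1. Document loader: hard-coded here; real loaders read files, databases, or APIs.
    return [
        "The capital of France is Paris. It's known for the Eiffel Tower.",
        "The capital of Germany is Berlin. It has a rich history and vibrant arts scene.",
    ]

def split_text(text):
    # 2. Text splitter: naive split after sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def embed(text):
    # 3. Embedding model: toy bag-of-words vector, NOT a real semantic embedding.
    return Counter(re.findall(r"[a-z]+", text.lower()))

class InMemoryVectorStore:
    # 4. Vector store: keeps (vector, chunk) pairs and searches them by cosine similarity.
    def __init__(self):
        self.items = []

    def add(self, chunk):
        self.items.append((embed(chunk), chunk))

    def search(self, query, k):
        q = embed(query)

        def score(vec):
            dot = sum(q[t] * vec[t] for t in q)
            norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in vec.values()))
            return dot / norm if norm else 0.0

        ranked = sorted(self.items, key=lambda item: score(item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

def generate(question, chunks):
    # 6. Generator: placeholder that returns the prompt an LLM would receive.
    return "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {question}"

store = InMemoryVectorStore()
for doc in load_documents():
    for chunk in split_text(doc):
        store.add(chunk)

# 5. Retriever: embed the query and fetch the closest chunk.
question = "What is the capital of Germany?"
chunks = store.search(question, k=1)
print(generate(question, chunks))
```

In a real deployment each placeholder would be replaced by a library or service, but the data flow (load, split, embed, store, retrieve, generate) stays the same.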

The levers you control most directly sit in the Retriever and the Embedding Model. For the Retriever, you tune parameters like k (the number of chunks to retrieve) and the minimum similarity score a chunk must reach to be included. For the Embedding Model, you choose one that best suits your domain and language. A model trained on legal texts, for instance, might perform better for legal documents than a general-purpose one.
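The effect of k and a similarity threshold can be illustrated on a list of already-scored chunks. The scores below are made up for the example, and `select_top_k` is a hypothetical helper, not an API from any particular vector store.

```python
def select_top_k(scored_chunks, k=3, min_score=0.0):
    """Keep at most k chunks whose similarity score is >= min_score, best first."""
    kept = [(score, chunk) for score, chunk in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]

# Illustrative (score, chunk) pairs; real scores come from the vector store.
scored = [(0.58, "doc 1"), (0.23, "doc 2"), (0.57, "doc 3")]

# k alone caps how many chunks are returned, whatever their scores.
print(select_top_k(scored, k=2))

# A threshold drops weak matches even when k would allow more results.
print(select_top_k(scored, k=3, min_score=0.5))
```

A larger k raises the chance that the right chunk is included but adds noise to the prompt. A threshold guards against padding the context with weak matches when few chunks are truly relevant.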

One aspect often overlooked is the chunking strategy. Simply splitting by a fixed number of characters or tokens can break sentences mid-thought or separate related pieces of information. Advanced splitters use techniques like recursive character splitting, which tries to split based on paragraphs, then sentences, then words, preserving semantic coherence. The size of these chunks is a delicate balance: too small, and you lose context; too large, and you might exceed LLM context windows or dilute the relevant information with noise.
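The idea behind recursive splitting can be sketched as follows. This is a simplified illustration written for this article, not the implementation used by any specific library: it tries paragraph breaks first, then sentence breaks, then spaces, and falls back to a hard cut only when nothing else fits. The function name and default separators are assumptions.

```python
def recursive_split(text, max_chars, separators=("\n\n", ". ", " ")):
    """Split text into chunks of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur: try the next, finer one.
        return recursive_split(text, max_chars, rest)
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # piece still fits: keep merging
        else:
            if current:
                chunks.append(current)
            if len(piece) <= max_chars:
                current = piece
            else:
                # The piece alone is too large: recurse with finer separators.
                chunks.extend(recursive_split(piece, max_chars, rest))
                current = ""
    if current:
        chunks.append(current)
    return chunks

text = (
    "RAG retrieves first. Then it generates.\n\n"
    "Chunking matters. Small chunks lose context."
)
chunks = recursive_split(text, max_chars=40)
for chunk in chunks:
    print(repr(chunk))
```

In this sketch the separator at a chunk boundary is dropped, so a chunk that ends at a sentence break loses its trailing period. Production splitters typically also add overlap between neighbouring chunks, which is omitted here for brevity.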

The next challenge is handling complex queries that require synthesizing information from multiple, distinct retrieved documents.
