Fine-tuning changes an LLM's internal weights, so anything it learns is baked into the model itself, while RAG leaves the weights untouched and instead gives the model the relevant facts to read at query time.

Let’s see RAG in action with a simple example. Imagine an LLM that knows a lot about general history but nothing about your company’s internal project "Phoenix."

# Assumes a 'phoenix_docs' directory of text files about Project Phoenix.
# LlamaIndex builds an in-memory vector index from them; OpenAI provides the LLM and embeddings.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
import os

# Use OPENAI_API_KEY from the environment if it is set; otherwise fall back
# to a placeholder that you must replace with your actual key
os.environ.setdefault("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")

# Configure LLM and embedding model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding()
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)

# Load documents (assuming a 'phoenix_docs' directory with text files)
documents = SimpleDirectoryReader("phoenix_docs").load_data()

# Create a vector index from the documents
index = VectorStoreIndex.from_documents(documents)

# Create a query engine. as_query_engine() builds the retriever internally,
# so a separate retriever should not be passed in; similarity_top_k sets how
# many chunks are retrieved per query.
query_engine = index.as_query_engine(similarity_top_k=2)

# Now, ask a question about Project Phoenix
response = query_engine.query("What is the current status of Project Phoenix and what are the main risks identified?")

print(response)

If phoenix_docs/status.txt contained: "Project Phoenix is currently in the Alpha testing phase. The main risks identified include potential integration issues with the legacy system and a tight deadline for the user acceptance testing (UAT) phase."

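The output would be something like: "Project Phoenix is currently in the Alpha testing phase. The main risks identified include potential integration issues with the legacy system and a tight deadline for the user acceptance testing (UAT) phase." The exact wording can vary from run to run, because the LLM writes the answer from the retrieved text rather than copying it.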

The LLM didn’t learn about Project Phoenix. The query engine searched the vector index, retrieved the relevant snippet, and passed it to the LLM as context; the LLM then based its answer on that text.

The core problems RAG addresses are the "knowledge cutoff" and "hallucination" in LLMs. LLMs are trained on massive datasets collected up to a certain point in time, so they know nothing about what happened afterwards, and nothing about private data that was never in the training set. When asked about such information, they may invent plausible-sounding but incorrect answers (hallucinate). RAG narrows this gap by letting the LLM access up-to-date or domain-specific external information at inference time, which reduces hallucination but does not eliminate it.

Internally, RAG works in a few steps. First, your external documents are processed, chunked into smaller pieces, and embedded into numerical vectors using an embedding model. These vectors are stored in a vector database. When a user asks a question, the question is also embedded into a vector. This query vector is used to search the vector database for the most semantically similar document chunks (using a similarity measure such as cosine similarity). The retrieved chunks are then combined with the original user question to form a prompt that is sent to the LLM. The LLM uses this augmented prompt, which includes the relevant external context, to generate its answer.

The key levers you control are the quality and scope of your external documents, how you chunk them (chunk size and overlap), the choice of embedding model, and the retrieval strategy (how many chunks to retrieve, and how to rank them).
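The steps above can be sketched in plain Python without any external services. This is a minimal illustration, not how LlamaIndex implements it: the `embed` function is a toy word-count vector standing in for a real embedding model, the in-memory `store` list stands in for a vector database, and all function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`) are hypothetical. The final prompt is what would be sent to the LLM.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=12, overlap=3):
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text):
    """Toy embedding (assumption): sparse word counts instead of a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, top_k=2):
    """Return the top_k (score, chunk) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in store]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


def build_prompt(query, retrieved):
    """Combine the retrieved chunks and the question into one augmented prompt."""
    context = "\n".join(f"- {chunk}" for _, chunk in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


documents = [
    "Project Phoenix is currently in the Alpha testing phase.",
    "The cafeteria menu changes every Monday.",
    "The main risks identified include integration issues with the legacy system.",
]

# Indexing: chunk every document and store (chunk, vector) pairs
store = [(chunk, embed(chunk)) for doc in documents for chunk in chunk_text(doc)]

# Query time: embed the question, retrieve similar chunks, build the prompt
query = "What is the status of Project Phoenix?"
retrieved = retrieve(query, store, top_k=2)
prompt = build_prompt(query, retrieved)
print(prompt)
```

Each lever from the paragraph above maps to one knob here: `chunk_size` and `overlap` control chunking, `embed` is where a different embedding model would be swapped in, and `top_k` sets how many chunks reach the prompt.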

A major strength of RAG is the traceability it gives to LLM answers. Because the response is grounded in specific retrieved text chunks, you can often see why the LLM gave a particular answer by inspecting the source documents that were retrieved for it. This is a significant advantage over fine-tuning, where the "knowledge" is spread across the model's weights and is hard to attribute to a specific training example.
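One way to make that traceability concrete is to carry each chunk's origin through the pipeline and print it next to the answer. The sketch below is illustrative only: the `Chunk` class and `format_answer_with_sources` function are hypothetical names, the similarity scores are made-up values, and the answer string is a placeholder for whatever the LLM returns. In LlamaIndex, the response object returned by a query engine exposes the retrieved nodes as `response.source_nodes`, which serves the same purpose.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str     # the retrieved passage
    source: str   # file the passage came from
    score: float  # similarity score assigned at retrieval time


def format_answer_with_sources(answer, chunks):
    """Append a numbered source list so a reader can check each claim."""
    lines = [answer, "", "Sources:"]
    for i, chunk in enumerate(chunks, start=1):
        lines.append(f"[{i}] {chunk.source} (score={chunk.score:.2f}): {chunk.text}")
    return "\n".join(lines)


# Illustrative retrieval results; the scores are invented for the example
retrieved = [
    Chunk("Project Phoenix is currently in the Alpha testing phase.",
          "phoenix_docs/status.txt", 0.91),
    Chunk("The main risks identified include potential integration issues "
          "with the legacy system.", "phoenix_docs/status.txt", 0.84),
]

# Placeholder standing in for the text generated by the LLM
answer = "Project Phoenix is in Alpha testing; the main risk is legacy integration."

output = format_answer_with_sources(answer, retrieved)
print(output)
```

Keeping the source path and score with every chunk costs little, and it lets a reviewer jump from a questionable sentence in the answer straight to the file it was based on.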

The next challenge you’ll face is how to effectively manage the trade-off between retrieval accuracy and response latency when dealing with very large knowledge bases.
