LlamaIndex RAG Quickstart: Build a Production Pipeline
The most surprising truth about RAG is that it’s not about finding the best answer, but about finding an answer that’s good enough, fast enough, and cheap enough for your users.
Let’s see this in action. Imagine we have a simple document about a fictional coffee shop, "The Daily Grind."
# The Daily Grind Coffee Shop
## About Us
The Daily Grind is a cozy coffee shop located in downtown Seattle. We specialize in ethically sourced, single-origin beans and offer a variety of brewing methods. Our baristas are passionate about crafting the perfect cup.
## Menu
### Espresso Drinks
- Latte: $4.50
- Cappuccino: $4.25
- Americano: $3.75
### Drip Coffee
- House Blend: $3.00
- Single Origin Pour-over: $4.00
### Pastries
- Croissant: $3.50
- Muffin: $3.25
## Hours
Monday - Friday: 7:00 AM - 6:00 PM
Saturday - Sunday: 8:00 AM - 5:00 PM
Now, let’s build a RAG pipeline with LlamaIndex to answer questions about this document.
First, we need to ingest the document. LlamaIndex can load data from various sources. For this example, we’ll use a simple file.
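If you want to follow along, the sketch below creates that file. It assumes the `data` directory and the `daily_grind.txt` filename used by the loading code; the helper name `write_sample_document` is made up for this example.

```python
from pathlib import Path

# The sample document from above, stored as a plain string.
DAILY_GRIND = """\
# The Daily Grind Coffee Shop
## About Us
The Daily Grind is a cozy coffee shop located in downtown Seattle. We specialize in ethically sourced, single-origin beans and offer a variety of brewing methods. Our baristas are passionate about crafting the perfect cup.
## Menu
### Espresso Drinks
- Latte: $4.50
- Cappuccino: $4.25
- Americano: $3.75
### Drip Coffee
- House Blend: $3.00
- Single Origin Pour-over: $4.00
### Pastries
- Croissant: $3.50
- Muffin: $3.25
## Hours
Monday - Friday: 7:00 AM - 6:00 PM
Saturday - Sunday: 8:00 AM - 5:00 PM
"""


def write_sample_document(data_dir: str = "./data") -> Path:
    """Create the data directory if needed and write the sample file into it."""
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "daily_grind.txt"
    target.write_text(DAILY_GRIND, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(f"Wrote {write_sample_document()}")
```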
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
import os
# Configure the API key. Prefer exporting OPENAI_API_KEY in your shell;
# setdefault() only applies the placeholder when the variable is not already set.
os.environ.setdefault("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
# Set the LLM and embedding model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
# Load data from a directory
# Assume the document above is saved as 'daily_grind.txt' in a 'data' directory
documents = SimpleDirectoryReader("./data").load_data()
# Build the index
index = VectorStoreIndex.from_documents(documents)
# Create a query engine
query_engine = index.as_query_engine()
With the index built and a query engine ready, we can ask questions.
User Question: "What are the opening hours on Saturdays?"
response = query_engine.query("What are the opening hours on Saturdays?")
print(response)
Expected Output (exact wording may vary between runs):
On Saturdays, The Daily Grind is open from 8:00 AM to 5:00 PM.
User Question: "How much is a latte?"
response = query_engine.query("How much is a latte?")
print(response)
Expected Output (exact wording may vary between runs):
A latte costs $4.50.
This is the core of RAG: you give LlamaIndex your data, it indexes it, and then you can query it using natural language. The magic happens when the LLM, guided by the retrieved document chunks, synthesizes an answer.
The mental model for RAG involves a few key components:
- Data Loading: This is where you bring your knowledge base into LlamaIndex. SimpleDirectoryReader is just one option; you can load from PDFs, websites, databases, Notion, etc. The goal is to get your raw content into a format LlamaIndex can process.
- Indexing: LlamaIndex takes your loaded documents and breaks them into smaller pieces (chunks). For each chunk, it generates an embedding, a numerical representation of its meaning. These embeddings are stored in a vector store. When you ask a question, LlamaIndex embeds your question and uses vector similarity search to find the most relevant document chunks from the vector store. This is why the choice of embedding model (Settings.embed_model) is crucial; it determines how well semantic similarity is captured.
- Retrieval: After embedding your query, LlamaIndex performs a similarity search against the indexed embeddings. It retrieves the top-k most relevant document chunks based on their vector similarity to your query. These chunks are your context.
- Synthesis: The retrieved document chunks (the context) are then passed to a Large Language Model (LLM) along with your original question. The LLM uses this context to generate a coherent and relevant answer. The LLM’s capabilities (Settings.llm) and the prompt it is given determine the quality and format of the final answer.
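The four stages can be mimicked in a few lines of plain Python. This is a toy sketch, not LlamaIndex's implementation: word counts stand in for a real embedding model, a Python list stands in for the vector store, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are made up for illustration.

```python
import math
import re
from collections import Counter

# Loading + chunking: three hand-made chunks taken from the sample document.
CHUNKS = [
    "Espresso Drinks - Latte: $4.50 - Cappuccino: $4.25 - Americano: $3.75",
    "Pastries - Croissant: $3.50 - Muffin: $3.25",
    "Hours Monday - Friday: 7:00 AM - 6:00 PM Saturday - Sunday: 8:00 AM - 5:00 PM",
]


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (a real pipeline calls an embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank chunks by similarity to the question and keep the top_k."""
    query_vector = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vector, embed(chunk)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Synthesis input: the text that would be sent to the LLM."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer using only the context above."


if __name__ == "__main__":
    question = "How much is a latte?"
    print(build_prompt(question, retrieve(question, CHUNKS)))
```

Word-count matching only works here because the question and the chunk share the literal word "latte"; a learned embedding model is what lets real retrieval match paraphrases as well.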
The "production pipeline" aspect comes from making these steps robust, scalable, and efficient. This involves choosing the right data connectors, optimizing chunking strategies, selecting appropriate embedding and LLM models for your use case and budget, and managing the vector store. For instance, if you have millions of documents, you’ll need a more sophisticated vector database than the default in-memory one. You might also implement strategies for re-ranking retrieved documents or using different LLMs for retrieval and synthesis.
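Re-ranking, mentioned above, is a two-stage idea: retrieve a broad candidate set cheaply, then re-score it with a more careful scorer and keep only the best few. The sketch below illustrates the shape of that step in plain Python. The `term_coverage` scorer is a deliberately simple stand-in for a real re-ranking model, and the candidate texts and first-stage scores are made-up values for illustration.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def term_coverage(query: str, chunk: str) -> float:
    """Stand-in second-stage scorer: fraction of query terms present in the chunk."""
    query_terms = tokens(query)
    return len(query_terms & tokens(chunk)) / len(query_terms) if query_terms else 0.0


def rerank(query: str, candidates: list[tuple[str, float]], top_n: int = 2) -> list[tuple[str, float]]:
    """Re-score first-stage candidates and keep the best top_n.

    candidates holds (chunk_text, first_stage_score) pairs from a broad vector search.
    The first-stage score is replaced by the second-stage score; because sorted()
    is stable, ties keep their first-stage order.
    """
    rescored = [(chunk, term_coverage(query, chunk)) for chunk, _ in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical first-stage results, ordered by (made-up) vector similarity.
candidates = [
    ("Pastries: Croissant $3.50, Muffin $3.25", 0.81),
    ("Espresso Drinks: Latte $4.50, Cappuccino $4.25", 0.79),
    ("Hours: Monday - Friday 7:00 AM - 6:00 PM", 0.42),
]

best = rerank("latte price", candidates, top_n=1)
print(best)
```

Here the first stage ranked the pastry chunk highest, and the second stage promotes the espresso chunk because it actually mentions the latte.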
One of the most powerful, yet often overlooked, aspects of RAG is the interplay between chunking strategy and retrieval. By default, LlamaIndex splits documents with SentenceSplitter, which targets a fixed token budget per chunk while trying to keep sentences whole. However, for structured text like our coffee shop example, you might want to chunk based on sections or paragraphs. This makes it likely that when you ask about "latte prices," the retrieved chunk contains the entire menu item and its price, rather than having them split across multiple chunks. Experimenting with different chunk_size and chunk_overlap parameters in SentenceSplitter or using more advanced parsing (like MarkdownNodeParser for Markdown files) can dramatically improve retrieval accuracy without changing the LLM.
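To see why the splitting strategy matters, here is a small plain-Python comparison. It is an illustration only, not how SentenceSplitter or MarkdownNodeParser are implemented: `fixed_size_chunks` cuts by character count (real splitters count tokens), `heading_chunks` starts a new chunk at each Markdown heading, and both function names are invented for this sketch.

```python
import re

MENU = """\
## Menu
### Espresso Drinks
- Latte: $4.50
- Cappuccino: $4.25
- Americano: $3.75
### Drip Coffee
- House Blend: $3.00
- Single Origin Pour-over: $4.00
"""


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def heading_chunks(text: str) -> list[str]:
    """Structure-aware splitter: start a new chunk at every Markdown heading."""
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    for label, chunks in [
        ("fixed", fixed_size_chunks(MENU, 40)),
        ("heading", heading_chunks(MENU)),
    ]:
        intact = any("Latte: $4.50" in chunk for chunk in chunks)
        print(f"{label}: {len(chunks)} chunks, latte price intact: {intact}")
```

With a 40-character window the cut lands in the middle of the latte price, so no single chunk contains "Latte: $4.50"; the heading-based split keeps the whole Espresso Drinks section together.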
The next step in building a production RAG pipeline is often dealing with multiple data sources and implementing advanced retrieval techniques like query transformations or re-ranking.