LangChain RAG Pipeline: Build and Deploy in Production (2026)

Retrieval-Augmented Generation (RAG) with LangChain isn’t just about stuffing documents into a vector database; it’s about querying that database at request time to fetch relevant context that a large language model (LLM) then uses to answer questions.

Let’s see it in action. Imagine you have a collection of product manuals and you want to build a chatbot that can answer user questions about them.

from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAI, OpenAIEmbeddings  # pip install langchain-openai
from langchain_chroma import Chroma  # pip install langchain-chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 1. Load Documents
# Note: DirectoryLoader's default loader relies on the `unstructured` package
loader = DirectoryLoader('./product_manuals/', glob="**/*.txt")
documents = loader.load()

# 2. Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# 3. Create Embeddings and Vector Store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")
retriever = vectorstore.as_retriever()

# 4. Set up the LLM and Prompt
llm = OpenAI(model_name="gpt-3.5-turbo-instruct") # Example LLM
prompt = ChatPromptTemplate.from_template("""
Answer the question based only on the following context:
{context}

Question: {question}
""")

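# 5. Build the RAG Chain
def format_docs(retrieved_docs):
    # Join the retrieved Document objects into plain text for the prompt;
    # otherwise the prompt would receive the list's Python repr
    return "\n\n".join(doc.page_content for doc in retrieved_docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)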

# 6. Query the Chain
question = "How do I reset the device to factory settings?"
answer = rag_chain.invoke(question)
print(answer)

This code does several things: it loads text files from a directory, breaks them into manageable chunks, creates numerical representations (embeddings) of these chunks using an embedding model (here, OpenAI’s), and stores them in a vector database (Chroma). When you ask a question, it first finds the most semantically similar document chunks in the vector database (retrieval) and then feeds these chunks, along with your question, to the LLM to generate a coherent answer (generation).

The core problem RAG solves is the LLM’s lack of specific, up-to-date, or proprietary knowledge. LLMs are trained on vast but static datasets. RAG allows you to inject external knowledge into the LLM’s reasoning process at inference time. This means your LLM can answer questions about documents it’s never seen before, or about events that happened after its training cut-off.

Internally, the retriever is the key component for accessing your knowledge base. It takes a query (your question), embeds it with the same embedding model used at indexing time, and returns the most similar document chunks from the vectorstore. The dictionary at the start of the chain runs each of its values on the same input: the retriever produces the context, while RunnablePassthrough() simply forwards the question unchanged. The prompt then formats the retrieved context and the original question into a single input for the llm.
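To make that data flow concrete, here is a minimal plain-Python sketch of the first two stages of the chain. It does not use LangChain: fake_retriever, passthrough, and build_prompt are hypothetical stand-ins, and the two "manual" chunks are made up for illustration.

```python
# Plain-Python stand-ins for the chain's first two stages (no LangChain needed).

def fake_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns canned, made-up chunks."""
    return [
        "Hold the power button for 10 seconds to open the recovery menu.",
        "Select 'Factory reset' in the recovery menu and confirm.",
    ]

def passthrough(value: str) -> str:
    """Stand-in for RunnablePassthrough: returns its input unchanged."""
    return value

PROMPT_TEMPLATE = (
    "Answer the question based only on the following context:\n"
    "{context}\n\n"
    "Question: {question}\n"
)

def build_prompt(question: str) -> str:
    # Stage 1: every branch of the dict receives the same input.
    inputs = {
        "context": "\n\n".join(fake_retriever(question)),
        "question": passthrough(question),
    }
    # Stage 2: the prompt fills its placeholders from that dict.
    return PROMPT_TEMPLATE.format(**inputs)

prompt_text = build_prompt("How do I reset the device to factory settings?")
print(prompt_text)
```

The printed string is what the LLM would receive: the retrieved chunks followed by the unchanged question.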

The main levers you control are document loading, splitting, embedding model choice, vector store configuration, and prompt engineering. For instance, chunk_size and chunk_overlap in RecursiveCharacterTextSplitter determine how large each retrieved chunk is and how much text neighboring chunks share, which affects both how much information reaches the LLM and how well context is preserved across chunk boundaries. Choosing the right embedding model (e.g., OpenAIEmbeddings, HuggingFaceEmbeddings) is crucial for semantic similarity. The retriever itself can be configured when you create it, for example vectorstore.as_retriever(search_kwargs={'k': 3}) to fetch the top 3 most relevant chunks.
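The effect of chunk_size and chunk_overlap is easiest to see on a tiny input. The sketch below uses a simple fixed-width sliding window; it is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences), and split_with_overlap is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.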

When setting up your vector store, you’re not just storing text; you’re storing embeddings. The quality of these embeddings, determined by the chosen embedding model, directly dictates how well the retriever can find relevant information. A poorly chosen embedding model might map semantically similar concepts to distant points in the embedding space, leading to the retriever fetching irrelevant chunks.
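As a rough illustration of why embedding quality matters, the sketch below ranks chunks by cosine similarity to a query vector. The three-dimensional vectors are hand-made for illustration and do not come from any real embedding model; cosine similarity is one common metric, and the metric a given vector store actually uses depends on its configuration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k entries most similar to query_vec."""
    scored = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

# Hand-made "embeddings" (hypothetical values, for illustration only).
indexed_chunks = [
    ("Factory reset instructions", [0.9, 0.1, 0.0]),
    ("Warranty terms", [0.0, 0.2, 0.9]),
    ("Restoring default settings", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of a question about resetting

results = top_k(query_vector, indexed_chunks, k=2)
print(results)
# ['Factory reset instructions', 'Restoring default settings']
```

If the embedding model had placed "Restoring default settings" near "Warranty terms" instead, the same query would have missed it, which is the failure mode described above.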

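Persisting the Chroma vector store (persist_directory="./chroma_db") means you don’t have to re-index your documents every time your application restarts, which is critical for production. Note that Chroma.from_documents embeds and adds the documents each time it runs, so call it only when building the index; on later starts, reopen the existing store with Chroma(persist_directory="./chroma_db", embedding_function=embeddings) and create the retriever from that.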
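The general "build once, reload afterwards" pattern can be sketched without any vector database. The example below is only an analogy: it persists a toy index as JSON in a temporary directory, and build_index, load_or_build, and the fake one-number "vectors" are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

build_count = 0  # counts how often the expensive indexing step runs

def build_index(chunks: list[str]) -> dict[str, list[float]]:
    """Stand-in for embedding + indexing: maps each chunk to a fake vector."""
    global build_count
    build_count += 1
    return {chunk: [float(len(chunk))] for chunk in chunks}

def load_or_build(index_path: Path, chunks: list[str]) -> dict[str, list[float]]:
    # Reuse the persisted index if it exists; otherwise build and save it.
    if index_path.exists():
        return json.loads(index_path.read_text())
    index = build_index(chunks)
    index_path.write_text(json.dumps(index))
    return index

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.json"
    first = load_or_build(path, ["reset steps", "warranty"])
    second = load_or_build(path, ["reset steps", "warranty"])  # simulated restart

print(build_count)
# 1
```

The second call returns the same index without rebuilding it, which is the behavior you want from a persisted vector store at application startup.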

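The StrOutputParser() at the end ensures the final output is a plain string, which is usually what you want for a chatbot response. A completion-style LLM such as the OpenAI class used here already returns a string, so the parser matters most if you swap in a chat model, whose output is a message object; in that case the parser extracts the message’s text content.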

The next logical step after building a basic RAG pipeline is to implement advanced retrieval strategies like re-ranking retrieved documents or using query expansion techniques to improve the relevance of the context provided to the LLM.
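To give a feel for re-ranking, here is a deliberately naive sketch that reorders already-retrieved chunks by how many question words they share. Production re-rankers typically use a dedicated scoring model rather than word overlap; rerank_by_term_overlap and the candidate chunks are hypothetical.

```python
def rerank_by_term_overlap(question: str, chunks: list[str], top_n: int) -> list[str]:
    """Reorder chunks by the number of lowercase words shared with the question."""
    question_terms = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(question_terms & set(chunk.lower().split()))

    # sorted() is stable, so chunks with equal scores keep their retrieval order.
    return sorted(chunks, key=score, reverse=True)[:top_n]

# Made-up chunks in the order a retriever might have returned them.
candidates = [
    "Warranty covers the device for two years.",
    "To reset the device hold the power button.",
    "Factory reset restores the device defaults.",
]
reranked = rerank_by_term_overlap("how to reset the device", candidates, top_n=2)
print(reranked)
# ['To reset the device hold the power button.', 'Factory reset restores the device defaults.']
```

The warranty chunk, which the retriever ranked first, drops out after re-ranking because it shares fewer words with the question.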