LangChain Production Deploy: Docker and FastAPI Setup (2026)

LangChain applications that work fine in local development often stall at the next step: packaging them into a container and serving them over HTTP.

Let’s see LangChain in action, not as a theoretical concept, but as a running service. Imagine we have a simple RAG (Retrieval Augmented Generation) application.

# main.py
import os

from fastapi import FastAPI
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI, OpenAIEmbeddings  # from the langchain-openai package in requirements.txt
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate

# --- Configuration ---
# Read the key from the environment (e.g. `docker run -e OPENAI_API_KEY=...`) instead of hard-coding it.
# A missing variable raises KeyError at startup, so the container fails fast.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
PERSIST_DIR = "./chroma_db"
COLLECTION_NAME = "my_rag_collection"
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}

Question: {question}

Helpful Answer:"""

# --- Initialization ---
# Initialize LLM
llm = OpenAI(temperature=0.7, openai_api_key=OPENAI_API_KEY)

# Initialize Embeddings
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# Load Vector Store
vector_store = Chroma(
    collection_name=COLLECTION_NAME,
    persist_directory=PERSIST_DIR,
    embedding_function=embeddings
)

# Create Retriever
retriever = vector_store.as_retriever()

# Create Prompt
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["context", "question"])

# Create QA Chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", # "stuff" is simple, but others exist like "map_reduce"
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True # Useful for debugging and understanding
)

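# --- FastAPI App ---
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    # Shape of the JSON request body: {"question": "..."}
    question: str

@app.post("/ask/")
def ask_question(request: AskRequest):
    # Plain `def`: FastAPI runs it in a threadpool, so the blocking chain call
    # does not stall the event loop
    result = qa_chain.invoke({"query": request.question})
    return {
        "answer": result["result"],
        "source_documents": [doc.page_content for doc in result["source_documents"]]
    }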

# --- Helper to build the vector store (run this once) ---
# For demonstration, we'll add some dummy data. In production, this would be a separate script
# or a background process that ingests documents.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

def build_vector_db():
    # Chroma(...) above already creates the collection if it is missing, so check
    # whether it holds any entries rather than whether it exists
    if vector_store.get(limit=1)["ids"]:
        print(f"Collection '{COLLECTION_NAME}' already has data. Skipping DB build.")
        return

    print(f"Collection '{COLLECTION_NAME}' is empty. Building DB...")
    # Write a small sample document for the demo, then load it
    with open("sample_doc.txt", "w") as f:
        f.write("LangChain is a framework for developing applications powered by language models. It enables applications that are context-aware and able to interact with their environment.")
    loader = TextLoader("sample_doc.txt")
    documents = loader.load()

    # Split documents
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)

    # Embed the chunks and add them to the store loaded above
    # (recent chromadb releases write to persist_directory automatically)
    vector_store.add_documents(texts)
    print(f"Vector DB built and persisted to {PERSIST_DIR}")

if __name__ == "__main__":
    build_vector_db() # Ensure DB is ready before starting the server
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

And here’s a Dockerfile to package it:

# Dockerfile
FROM python:3.10-slim

WORKDIR /app

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Expose the port FastAPI will run on
EXPOSE 8000

# Run main.py directly so that build_vector_db() executes before uvicorn starts
# (`uvicorn main:app` would skip the `if __name__ == "__main__"` block)
CMD ["python", "main.py"]

And a requirements.txt:

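fastapi
uvicorn
langchain<1.0 # provides langchain.chains.RetrievalQA used in main.py (0.x API)
langchain-core
langchain-community
langchain-openai
langchain-text-splitters
chromadb
python-dotenv # Good practice for managing keys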

To run this locally:

Save the Python code as main.py.
Save the Dockerfile as Dockerfile.
Create requirements.txt with the listed packages.
Create an empty file named sample_doc.txt.
Build the Docker image: docker build -t langchain-app .
Run the Docker container: docker run -p 8000:8000 -e OPENAI_API_KEY="sk-..." langchain-app (replace sk-... with your actual key).

Now you can send a POST request to http://localhost:8000/ask/ with a JSON body like {"question": "What is LangChain?"} and receive a JSON response.
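
As a rough illustration, the request above could be sent from Python using only the standard library. The helper names build_ask_request and ask are made up for this sketch, and it assumes the container is listening on localhost:8000 as in the docker run command.

```python
import json
import urllib.request


def build_ask_request(question: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /ask/ endpoint."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/ask/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str) -> dict:
    """Send the request and decode the JSON response (needs the running container)."""
    with urllib.request.urlopen(build_ask_request(question), timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container up, ask("What is LangChain?") should return a dictionary with the answer and source_documents keys produced by the endpoint.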

The core problem LangChain addresses is orchestrating complex LLM workflows. It provides abstractions for components like LLMs, prompt templates, document loaders, text splitters, vector stores, and chains. A "chain" is the fundamental concept here – it’s a sequence of calls, often involving an LLM, that can take an input and produce an output. The RetrievalQA chain, for example, first retrieves relevant documents from a vector store based on a query and then passes those documents along with the original query to an LLM to generate an answer.

Internally, the RetrievalQA chain in our example does this:

It takes the user’s question.
It uses the retriever (which is configured to query the Chroma vector store) to find documents semantically similar to the question.
It formats these retrieved documents and the original question into a prompt string using the PromptTemplate.
It sends this formatted prompt to the llm (OpenAI).
It receives the result from the LLM and returns it, along with the source_documents.
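
The five steps above can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: stuff_qa, fake_retrieve, and fake_llm are hypothetical stand-ins so the flow runs without a vector store or an API key.

```python
from typing import Callable, Dict, List

PROMPT_TEMPLATE = "Context: {context}\n\nQuestion: {question}\n\nHelpful Answer:"


def stuff_qa(
    question: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> Dict[str, object]:
    # Steps 1-2: take the question and fetch the documents relevant to it
    docs = retrieve(question)
    # Step 3: "stuff" every retrieved document into one prompt
    prompt = PROMPT_TEMPLATE.format(context="\n\n".join(docs), question=question)
    # Step 4: send the formatted prompt to the model
    answer = llm(prompt)
    # Step 5: return the answer together with the documents it was based on
    return {"result": answer, "source_documents": docs}


# Hypothetical stand-ins for the Chroma retriever and the OpenAI LLM
def fake_retrieve(question: str) -> List[str]:
    return ["LangChain is a framework for developing applications powered by language models."]


def fake_llm(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"


output = stuff_qa("What is LangChain?", fake_retrieve, fake_llm)
print(output["result"])
print(output["source_documents"])
```

The returned dictionary uses the same result and source_documents keys that the /ask/ endpoint reads from the real chain.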

You control the behavior through various parameters:

temperature on the OpenAI LLM: Controls randomness. Higher means more creative, lower means more deterministic.
chain_type in RetrievalQA: Determines how the retrieved documents are passed to the LLM. "stuff" packs all of them into a single prompt, which is simple but can exceed the model's context window. "map_reduce" runs the LLM on each document separately and then combines the partial answers in a final call, which handles larger contexts at the cost of more LLM calls.
chunk_size and chunk_overlap in CharacterTextSplitter: Dictate how your source documents are broken down before being embedded and stored. Crucial for retrieval quality.
persist_directory for Chroma: Where your vector embeddings are stored on disk. Essential for not re-indexing every time.
The PromptTemplate itself: This is your primary tool for guiding the LLM’s response.
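
To get a feel for how chunk_size and chunk_overlap interact, here is a minimal sketch that cuts text into fixed-width windows. It is a deliberate simplification: CharacterTextSplitter splits on a separator and then merges pieces, so its output will differ. The chunk_text function is hypothetical and exists only for this illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=0))  # ['abcd', 'efgh', 'ij']
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap produces more chunks (and more embeddings to store), but makes it less likely that a sentence relevant to a query is split across a chunk boundary.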

A less obvious strength of LangChain is that its "chains" can be nested and composed into larger pipelines, forming a directed acyclic graph (DAG) of operations. Agents go a step further: instead of following a fixed graph, an agent lets the LLM decide at each step which tool to use (calling another API, running code, or performing a database lookup). You define the tools; the agent LLM picks one, executes it, observes the result, and then decides on the next step, repeating until it has an answer. This makes LangChain far more powerful than a simple sequential processing pipeline.
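
The choose, execute, observe cycle can be sketched without any LLM at all. In the sketch below, scripted_policy is a hard-coded stand-in for the agent LLM's decision-making, and the tools and the run_agent function are hypothetical names for illustration; this is not LangChain's agent API.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]


def word_count(text: str) -> str:
    return str(len(text.split()))


def shout(text: str) -> str:
    return text.upper()


# The tools the agent is allowed to choose from
TOOLS: Dict[str, Tool] = {"word_count": word_count, "shout": shout}


def scripted_policy(question: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for the agent LLM: returns (action, argument) for the next step."""
    if not observations:
        return ("word_count", question)  # first step: pick a tool
    return ("finish", f"The question has {observations[-1]} words.")


def run_agent(
    question: str,
    policy: Callable[[str, List[str]], Tuple[str, str]],
    tools: Dict[str, Tool],
    max_steps: int = 5,
) -> str:
    observations: List[str] = []
    for _ in range(max_steps):
        action, argument = policy(question, observations)  # decide
        if action == "finish":
            return argument
        observations.append(tools[action](argument))  # execute and observe
    raise RuntimeError("agent did not finish within max_steps")


print(run_agent("What is LangChain?", scripted_policy, TOOLS))  # The question has 3 words.
```

The max_steps cap is a design choice worth keeping in a real service: because the model controls the loop, an upper bound protects against runaway tool calls and cost.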

The next logical step after getting your LangChain app running in Docker is managing its state, especially the vector database, across container restarts.