LlamaIndex Production: Deploy RAG Apps with FastAPI (2026)

LlamaIndex doesn’t actually build your RAG app for you; it provides the plumbing to connect your LLM, your data, and your query engine.

Here’s a basic FastAPI app that uses LlamaIndex to serve RAG queries:

import os
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.llms.openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

# --- Configuration ---
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-3.5-turbo")
PERSIST_DIR = "./storage"

# --- LlamaIndex Setup ---
def get_index():
    if not os.path.exists(PERSIST_DIR):
        # This part would typically involve loading documents and creating the index
        # For this example, we assume the index is already created and persisted.
        # In a real app, you'd have a separate script or process for index creation.
        raise FileNotFoundError(
            "Index not found. Please run an index creation script first."
        )
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
    return index

# Initialize LLM
llm = OpenAI(model=LLM_MODEL, temperature=0.1)

# Get the index (this will be done once at startup)
try:
    index = get_index()
    query_engine = index.as_query_engine(llm=llm)
except FileNotFoundError as e:
    print(f"Error: {e}")
    # In a production scenario, you might want to gracefully handle this
    # or ensure the index is created before the app starts.
    query_engine = None # Set to None to indicate failure

# --- FastAPI App ---
app = FastAPI()

class QueryRequest(BaseModel):
    query: str

@app.get("/")
def read_root():
    return {"message": "RAG API is running. Send POST requests to /query."}

@app.post("/query")
async def query_rag(request: QueryRequest):
    if query_engine is None:
        # 503 tells clients the service is not ready, rather than that the request failed
        raise HTTPException(status_code=503, detail="RAG engine not initialized.")

    try:
        # Await the async query method so the LLM call does not block the event loop
        response = await query_engine.aquery(request.query)
        return {"response": str(response)}
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"An error occurred during query: {str(e)}")

# To run this app:
# 1. Make sure you have LlamaIndex, FastAPI, Uvicorn, OpenAI, and python-dotenv installed:
#    pip install llama-index llama-index-llms-openai fastapi uvicorn python-dotenv
# 2. Set your OPENAI_API_KEY in a .env file.
# 3. Create and persist an index. For example, using a separate script:
#    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
#    documents = SimpleDirectoryReader("your_data_directory").load_data()
#    index = VectorStoreIndex.from_documents(documents)
#    index.storage_context.persist(persist_dir="./storage")
# 4. Run the FastAPI app:
#    uvicorn your_script_name:app --reload

Besides the root status route, this app exposes a single /query endpoint. When you send a POST request with a JSON body like {"query": "What is the capital of France?"}, it uses LlamaIndex to retrieve relevant chunks from its index and generate an answer with an LLM, returned as {"response": "..."}. PERSIST_DIR points to where LlamaIndex stores the index on disk, so the app can load the index at startup instead of re-indexing (and re-embedding) every document each time.
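To make the request shape concrete, here is a minimal client sketch using only the standard library. The base URL http://127.0.0.1:8000 is an assumption (Uvicorn's default host and port), and build_query_request and ask are hypothetical helper names, not part of the app above.

```python
import json
import urllib.request


def build_query_request(
    question: str, base_url: str = "http://127.0.0.1:8000"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /query endpoint."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(
    question: str, base_url: str = "http://127.0.0.1:8000", timeout: float = 30.0
) -> str:
    """Send the question and return the "response" field of the JSON reply."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

With the server running, ask("What is the capital of France?") should return the generated answer as a string. A generous timeout is a deliberate choice here, since each call waits on retrieval plus an LLM round trip.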

The core problem LlamaIndex solves here is abstracting the complexity of RAG. Instead of manually managing vector databases, chunking documents, embedding text, orchestrating LLM calls, and handling retrieval logic, LlamaIndex provides a unified interface. You load your data, tell LlamaIndex what LLM to use, and it handles the rest, exposing a simple query method.

Internally, when query_engine.query(request.query) is called:

Embedding the Query: The user’s query string is embedded into a vector. This has to be the same embedding model that was used to embed the documents in the index; if the two differ, the similarity scores in the next step are meaningless.
Vector Similarity Search: This query vector is used to search the vector store for the most similar document chunks (vectors). This is the "retrieval" part of RAG.
Context Augmentation: The retrieved document chunks, along with the original query, are formatted into a prompt.
LLM Call: This augmented prompt is sent to the LLM (e.g., OpenAI’s GPT-3.5-turbo).
Response Generation: The LLM generates an answer based on the provided context and query.
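The five steps above can be sketched in plain Python. This is a toy illustration of the flow, not LlamaIndex's implementation: the bag-of-words embed function stands in for a real embedding model, the in-memory list stands in for a vector store, the prompt template is made up, and llm is any callable you pass in.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query, then rank chunks by similarity."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Step 3: combine retrieved chunks and the query into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def answer(query: str, chunks: list[str], llm, top_k: int = 2) -> str:
    """Steps 4 and 5: send the augmented prompt to the LLM, return its reply."""
    return llm(build_prompt(query, retrieve(query, chunks, top_k)))
```

Passing a stub such as lambda prompt: prompt as llm is a quick way to see exactly what context the model would receive for a given query.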

The levers you control are primarily:

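LLM_MODEL: Which LLM to use (e.g., gpt-4 or gpt-3.5-turbo). The OpenAI class used here expects an OpenAI model name, so running a local model means swapping in a different LlamaIndex LLM class, not just changing this variable.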
PERSIST_DIR: Where the index is stored. Crucially, this allows for pre-computation of the index, saving significant time and cost on startup.
temperature: A parameter for the LLM that controls randomness. Lower values (like 0.1) make output more deterministic and focused; higher values make it more creative.
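Embedding Model: Not set explicitly in this minimal example, so LlamaIndex falls back to its default embedding model (an OpenAI model unless you configure another, for example via Settings.embed_model). The choice significantly impacts retrieval quality, and the model used at query time must match the one used to build the index.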
Chunking Strategy: How your original documents are split into smaller pieces before embedding. This is configured during index creation and is critical for RAG performance.
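To show what the chunking lever trades off, here is a deliberately simple character-based sketch with overlap. It is not how LlamaIndex splits text: its own splitters (for example SentenceSplitter, which takes chunk_size and chunk_overlap) work on tokens and sentence boundaries. chunk_text and its defaults are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Smaller chunks give more precise matches but less context per retrieved piece; the overlap reduces the chance that a sentence relevant to the query is cut in half at a chunk boundary.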

A surprising aspect is how little of the retrieval process you see or control by default once the index is built. LlamaIndex abstracts away the vector store queries, the similarity metric, and the number of chunks retrieved. You can override these when building the index or instantiating the query engine (e.g., similarity_top_k sets the number of chunks to retrieve), but with default settings the query_engine is largely a black box to its caller. This simplicity is a feature for rapid development but can be a hurdle for deep optimization.
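One way to open the box a little is to look at which chunks a query actually retrieved. The sketch below assumes the response object exposes a source_nodes list whose items carry a score and a node with a get_content() method, which is how LlamaIndex responses are commonly structured; treat those attribute names as assumptions to check against your installed version. summarize_sources is a hypothetical helper, and fake_response is a stand-in used only for demonstration.

```python
from types import SimpleNamespace


def summarize_sources(response, preview_chars: int = 80) -> list[dict]:
    """Return the similarity score and a text preview for each retrieved chunk."""
    summary = []
    # Assumed shape: response.source_nodes -> items with .score and .node.get_content()
    for item in getattr(response, "source_nodes", None) or []:
        text = item.node.get_content()
        summary.append({"score": item.score, "preview": text[:preview_chars]})
    return summary


# Stand-in for a real response, so the helper can be tried without an index.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.83,
            node=SimpleNamespace(get_content=lambda: "Paris is the capital of France."),
        )
    ]
)

# With a real engine, the number of chunks can be raised at instantiation, e.g.:
#   query_engine = index.as_query_engine(llm=llm, similarity_top_k=5)
```

Logging this summary next to each answer makes it much easier to tell whether a bad response came from poor retrieval or from the LLM itself.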

The next step you’ll likely encounter is managing multiple indexes or more complex data loading and preprocessing pipelines.