LlamaIndex token streaming with FastAPI is surprisingly easy because the core StreamingResponse abstraction in FastAPI is built for exactly this kind of generator-based output.

Let’s see it in action. Imagine you have a simple LlamaIndex query engine set up to answer questions from a document.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

# Load documents and build index
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
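# streaming=True makes the engine return a streaming response instead of a finished string
query_engine = index.as_query_engine(streaming=True)

app = FastAPI()

async def stream_response(query_text: str):
    # With streaming enabled, aquery() returns a response object whose
    # async_response_gen() yields text chunks as the LLM produces them
    response = await query_engine.aquery(query_text)
    async for text in response.async_response_gen():
        yield text
        await asyncio.sleep(0.01) # Simulate streaming delay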

@app.get("/stream-question/")
async def stream_question(query_text: str):
    return StreamingResponse(stream_response(query_text), media_type="text/plain")

# To run this:
# 1. Save the code as main.py
# 2. Create a directory named 'data' and put some text files in it.
# 3. Install necessary libraries: pip install llama-index fastapi uvicorn
#    (the default LlamaIndex setup calls OpenAI, so set OPENAI_API_KEY first)
# 4. Run with: uvicorn main:app --reload

When you send a request to /stream-question/?query_text=Your%20question%20here, the FastAPI server starts sending tokens back as soon as the LLM begins producing them, rather than waiting for the entire response to be completed.

The problem this solves is perceived latency in LLM applications. Users expect real-time feedback, and waiting for a full response can feel like the application is frozen. Streaming addresses this by sending text as it is generated, which makes the interaction feel much more responsive. Internally, a LlamaIndex query engine created with streaming=True returns a streaming response object from aquery, and that object's async_response_gen() is an asynchronous generator that yields chunks of text as the LLM produces them. FastAPI’s StreamingResponse is designed to consume exactly this kind of asynchronous generator, sending each yielded item over the HTTP connection as it arrives.
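To make the mechanism concrete, here is a minimal standard-library sketch of the same hand-off, with no LlamaIndex or FastAPI involved. The names fake_token_stream, send_as_it_arrives, and demo are hypothetical stand-ins invented for this illustration: the first plays the role of the LLM's chunk generator, the second is a simplified picture of what a streaming response does with it.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for an LLM: yields one word at a time."""
    for word in text.split(" "):
        # Hand control back to the event loop, as a real network wait would.
        await asyncio.sleep(0)
        yield word + " "


async def send_as_it_arrives(
    chunks: AsyncIterator[str],
    send: Callable[[bytes], Awaitable[None]],
) -> None:
    """Simplified picture of a streaming response: forward each chunk at once."""
    async for chunk in chunks:
        await send(chunk.encode("utf-8"))


async def demo() -> List[bytes]:
    """Collect what would have been written to the connection."""
    sent: List[bytes] = []

    async def send(data: bytes) -> None:
        sent.append(data)

    await send_as_it_arrives(fake_token_stream("tokens arrive one by one"), send)
    return sent
```

Calling asyncio.run(demo()) returns one bytes object per word, in order, which is the property that matters: nothing is buffered until the end.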

The key levers you control are the query_text and, within LlamaIndex, your QueryEngine configuration (like similarity_top_k, response_mode, etc.) and potentially how you process each text chunk before yielding it. The asyncio.sleep(0.01) in the example is purely illustrative to make the streaming effect more obvious in a demo; in a real application, you’d remove it, since the LLM provider’s own generation speed already paces the stream.
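As one example of processing chunks before yielding them, the sketch below wraps any async generator of text and merges very small chunks into larger ones, dropping empty chunks along the way. The names coalesce, tokens_from, and collect, and the min_chars parameter, are hypothetical helpers made up for this illustration; they are not part of LlamaIndex or FastAPI.

```python
import asyncio
from typing import AsyncIterator, Iterable, List


async def coalesce(chunks: AsyncIterator[str], min_chars: int = 16) -> AsyncIterator[str]:
    """Merge small chunks until at least min_chars characters are buffered."""
    buffer = ""
    async for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks entirely
        buffer += chunk
        if len(buffer) >= min_chars:
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # flush whatever is left when the source ends


async def tokens_from(items: Iterable[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streamed LLM response."""
    for item in items:
        yield item


async def collect(chunks: AsyncIterator[str]) -> List[str]:
    """Drain an async generator into a list, for inspection."""
    return [chunk async for chunk in chunks]
```

In the endpoint, a wrapper like this would sit between the response's chunk generator and the yield in stream_response. Larger min_chars values mean fewer, bigger writes at the cost of a slightly choppier display.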

The most surprising part for many is how little code is actually needed to achieve this. The magic isn’t in complex network protocols or custom serialization; it’s in aligning the asynchronous generator output of LlamaIndex with FastAPI’s built-in StreamingResponse. You’re essentially plumbing an async generator directly into an HTTP response stream.

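The next challenge is usually handling the client-side reception and display of these streamed tokens. For a text/plain stream like this one, that typically means reading the response body incrementally with the JavaScript Fetch API; EventSource only works if the endpoint emits Server-Sent Events (text/event-stream), and WebSockets require a different kind of endpoint altogether.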
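If the client is going to be a browser EventSource, the server side has to frame each chunk as a Server-Sent Event and the StreamingResponse would use media_type="text/event-stream" instead of "text/plain". The following is a rough sketch of that framing using only the standard library. The names to_sse, tokens_from, and collect are hypothetical, and the closing "done" event is an assumed convention for this example, not something EventSource defines.

```python
import asyncio
from typing import AsyncIterator, Iterable, List


async def to_sse(chunks: AsyncIterator[str]) -> AsyncIterator[str]:
    """Frame each text chunk as one Server-Sent Event."""
    async for chunk in chunks:
        # Every payload line needs its own "data:" field; a blank line ends the event.
        fields = "".join(f"data: {line}\n" for line in chunk.split("\n"))
        yield fields + "\n"
    # Assumed convention: a named event telling the client the answer is complete.
    yield "event: done\ndata: \n\n"


async def tokens_from(items: Iterable[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streamed LLM response."""
    for item in items:
        yield item


async def collect(chunks: AsyncIterator[str]) -> List[str]:
    """Drain an async generator into a list, for inspection."""
    return [chunk async for chunk in chunks]
```

Because an SSE parser strips only a single space after "data:", a chunk that starts with a space (common for LLM tokens) keeps its leading whitespace on the client. The client would typically close the EventSource when it sees the final event, since EventSource otherwise reconnects after the server ends the response.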
