LlamaIndex’s asynchronous streaming for query responses doesn’t make the LLM generate text any faster; it changes how long your application waits before it has something to show.

Imagine you’re asking a question that requires LlamaIndex to fetch data from multiple sources, process it, and then generate a response. Without streaming, the caller waits until the complete answer is ready. With async streaming, LlamaIndex starts sending back pieces of the answer as they become available. This means your application can begin processing or displaying the partial answer immediately, so it feels responsive even if the full answer takes time.

Let’s see this in action. We’ll set up a simple LlamaIndex query engine and enable streaming.

import asyncio
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure settings for LLM and embedding model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
Settings.node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

async def stream_query(query_engine, query_text):
    # aquery() is the async entry point; it returns a streaming response
    # because the engine was created with streaming=True.
    response_stream = await query_engine.aquery(query_text)
    full_response = ""
    print("Streaming response:")
    async for text in response_stream.async_response_gen():
        print(text, end="", flush=True)
        full_response += text
    print("\n--- End of stream ---")
    return full_response

async def main():
    # Load documents (assumes a 'data' directory with some text files)
    documents = SimpleDirectoryReader("./data").load_data()

    # Build the index
    index = VectorStoreIndex.from_documents(documents)

    # Create a query engine with streaming enabled
    query_engine = index.as_query_engine(streaming=True)

    # Perform a streaming query
    query = "What are the main topics discussed in the documents?"
    await stream_query(query_engine, query)

if __name__ == "__main__":
    # Create a dummy data directory and file for demonstration if it doesn't exist
    import os
    if not os.path.exists("./data"):
        os.makedirs("./data")
    if not os.path.exists("./data/example.txt"):
        with open("./data/example.txt", "w") as f:
            f.write("This document discusses the importance of asynchronous programming in modern software development. "
                    "It covers benefits like improved responsiveness and efficient resource utilization. "
                    "Another key topic is the role of LLMs in data analysis and retrieval, highlighting their potential to "
                    "streamline complex information processing tasks. The document also touches upon the challenges of "
                    "integrating these technologies effectively.")

    asyncio.run(main())

When you run this, you’ll see the output appear word by word, or sentence by sentence, rather than all at once. This is the LLM generating tokens and LlamaIndex immediately passing them back through the async_response_gen() generator.

The core problem async streaming solves is latency perception. For interactive applications, especially those involving LLMs where response times can be unpredictable and sometimes long, a perceived delay can lead to a poor user experience. By streaming, you give the user something to see and engage with immediately, making the application feel much more responsive. It’s not just about fetching data faster; it’s about how you deliver that data.
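To make the latency point concrete, here is a minimal sketch that measures time to first chunk against time to full answer. It does not call LlamaIndex: fake_llm_stream is a hypothetical stand-in for a streaming LLM, and the chunk text and delay are made up for illustration.

```python
import asyncio
import time


async def fake_llm_stream(chunks, delay=0.05):
    """Hypothetical stand-in for a streaming LLM: one chunk every `delay` seconds."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def measure_latency(stream):
    """Return (seconds until first chunk, seconds until stream end, full text)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    async for chunk in stream:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return first_chunk_at, total, "".join(parts)


if __name__ == "__main__":
    chunks = ["Async ", "streaming ", "shows ", "text ", "early."]
    first, total, text = asyncio.run(measure_latency(fake_llm_stream(chunks)))
    print(f"first chunk after {first:.2f}s, full answer after {total:.2f}s")
    print(text)
```

The total time is the same with or without streaming; what shrinks is the wait before the user sees the first piece of text.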

Internally, LlamaIndex builds on Python’s asyncio. When you create the engine with index.as_query_engine(streaming=True) and call await query_engine.aquery(...), LlamaIndex doesn’t wait for the LLM to finish its entire generation. Instead, it returns a streaming response object. This object exposes an asynchronous generator (async_response_gen()) that yields chunks of text (tokens or small sequences of tokens) as the LLM produces them. Your application then consumes these chunks iteratively. The Settings.llm object you configure must also support streaming for this to work end-to-end. Most modern LLM integrations in LlamaIndex do.
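The shape of that mechanism can be sketched without LlamaIndex at all. In the example below, FakeStreamingResponse is a hypothetical class that only imitates the interface described above (an object with an async_response_gen() method); it is not LlamaIndex’s implementation. The point is that while one coroutine waits on the next chunk, the event loop is free to run other work.

```python
import asyncio


class FakeStreamingResponse:
    """Hypothetical stand-in that only mirrors the interface described above."""

    def __init__(self, chunks, delay=0.01):
        self._chunks = chunks
        self._delay = delay

    async def async_response_gen(self):
        for chunk in self._chunks:
            await asyncio.sleep(self._delay)  # simulates waiting on the LLM
            yield chunk


async def consume(response, events):
    """Record each chunk as it arrives."""
    async for chunk in response.async_response_gen():
        events.append(("chunk", chunk))


async def other_work(events, ticks=3, delay=0.01):
    """Unrelated work that shares the event loop with the stream consumer."""
    for i in range(ticks):
        await asyncio.sleep(delay)
        events.append(("tick", i))


async def run_both():
    events = []
    response = FakeStreamingResponse(["Hello", ", ", "world"])
    # Both coroutines make progress while the other is awaiting.
    await asyncio.gather(consume(response, events), other_work(events))
    return events


if __name__ == "__main__":
    for event in asyncio.run(run_both()):
        print(event)
```

The exact interleaving of chunk and tick events depends on timing, so treat the printed order as illustrative only.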

The key levers you control are:

  1. streaming=True: This is the primary flag, passed to index.as_query_engine(), that enables streaming. In async code you then call await query_engine.aquery(...) instead of query().
  2. async_response_gen(): This is the method on the streaming response object that you iterate over asynchronously.
  3. LLM Integration: Ensuring your configured Settings.llm supports streaming. For OpenAI, this is typically handled by setting stream=True in the underlying API call, which LlamaIndex abstracts away.
  4. Chunking and Display Logic: How you handle the incoming text chunks in your UI or processing pipeline. You can append them to a text area, trigger UI updates, or perform further processing on each chunk as it arrives.
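As one possible take on the fourth lever, the sketch below regroups incoming chunks into whole sentences before handing them to a display layer. The names fake_chunks and by_sentence are hypothetical, and fake_chunks stands in for async_response_gen(). The sentence detection is deliberately naive: it splits on ".", "!" and "?" and would mis-split text such as "3.5".

```python
import asyncio


async def fake_chunks(chunks):
    """Hypothetical chunk source standing in for async_response_gen()."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def by_sentence(chunk_stream, terminators=".!?"):
    """Regroup arbitrary text chunks into complete sentences."""
    buffer = ""
    async for chunk in chunk_stream:
        buffer += chunk
        # Emit every complete sentence currently sitting in the buffer.
        while True:
            positions = [buffer.find(t) for t in terminators]
            cut = min((p for p in positions if p != -1), default=-1)
            if cut == -1:
                break
            sentence, buffer = buffer[: cut + 1], buffer[cut + 1:]
            yield sentence.strip()
    if buffer.strip():
        yield buffer.strip()  # trailing text without a terminator


async def collect(chunks):
    """Gather all sentences into a list (handy for testing)."""
    return [s async for s in by_sentence(fake_chunks(chunks))]


if __name__ == "__main__":
    print(asyncio.run(collect(["Streaming hel", "ps. It feels fa", "st! Done"])))
```

Buffering like this trades a little latency for cleaner UI updates; appending raw chunks directly is simpler when the display can tolerate partial words.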

What most people don’t realize is that async_response_gen() doesn’t necessarily yield individual tokens. The LLM provider might buffer a few tokens before sending them, and LlamaIndex then yields these "chunks." The exact size and content of each chunk can vary with the provider’s buffering and network conditions. You’re not guaranteed a per-token stream, but rather a stream of small, timely updates.
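A small illustration of why this matters: logic that treats each chunk as a unit gives different answers for different chunkings of the same text, while logic applied to the joined text does not. The two chunk lists below are invented for the example.

```python
def words_per_chunk(chunks):
    """Count words chunk by chunk (fragile: depends on chunk boundaries)."""
    return sum(len(chunk.split()) for chunk in chunks)


def words_in_joined(chunks):
    """Count words after joining (stable: independent of chunk boundaries)."""
    return len("".join(chunks).split())


# The same sentence, split in two different (made-up) ways.
clean = ["streaming ", "responses ", "arrive ", "early"]
ragged = ["stream", "ing respon", "ses arrive ear", "ly"]

if __name__ == "__main__":
    print(words_per_chunk(clean), words_per_chunk(ragged))
    print(words_in_joined(clean), words_in_joined(ragged))
```

If your processing depends on word or sentence boundaries, accumulate chunks into a buffer first rather than inspecting each chunk in isolation.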

The next logical step after mastering basic async streaming is handling errors gracefully within the stream, ensuring your application doesn’t crash if a partial response fails or the LLM encounters an issue mid-generation.
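As a starting point for that, here is a hedged sketch of one way to keep the partial text when a stream fails midway. flaky_stream is a hypothetical generator that raises after a set number of chunks, and ConnectionError is just a placeholder; which exceptions a real LLM integration raises depends on the provider.

```python
import asyncio


async def flaky_stream(chunks, fail_after):
    """Hypothetical stream that raises once `fail_after` chunks have been sent."""
    for i, chunk in enumerate(chunks):
        if i == fail_after:
            raise ConnectionError("stream interrupted")
        await asyncio.sleep(0)
        yield chunk


async def consume_safely(stream):
    """Return (text_so_far, error); error is None when the stream completes."""
    parts = []
    try:
        async for chunk in stream:
            parts.append(chunk)
    except Exception as exc:  # narrow this to your provider's errors in real code
        return "".join(parts), exc
    return "".join(parts), None


if __name__ == "__main__":
    text, error = asyncio.run(consume_safely(flaky_stream(["a", "b", "c"], fail_after=2)))
    print(repr(text), repr(error))
```

Returning the partial text alongside the error lets the caller decide whether to show what arrived, retry, or discard it.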
