Streaming LangChain output over a WebSocket works much like streaming any other data over a WebSocket: the LLM is just a very specific, computationally expensive data source.

Let’s see it in action. Imagine we have a simple FastAPI application that acts as our WebSocket server.

from fastapi import FastAPI, WebSocket
from fastapi.responses import HTMLResponse
import uvicorn
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

app = FastAPI()

@app.get("/")
async def get():
    return HTMLResponse("""
    <html>
        <head>
            <title>LangChain WebSocket Stream</title>
        </head>
        <body>
            <h1>LangChain WebSocket Stream</h1>
            <div id="output"></div>
            <script>
                var ws = new WebSocket("ws://localhost:8000/ws");
                ws.onmessage = function(event) {
                    var outputDiv = document.getElementById("output");
                    // textContent keeps model output from being parsed as HTML
                    outputDiv.textContent += event.data;
                };
                ws.onopen = function(event) {
                    console.log("WebSocket connection opened");
                };
                ws.onclose = function(event) {
                    console.log("WebSocket connection closed");
                };
                ws.onerror = function(event) {
                    console.error("WebSocket error:", event);
                };
            </script>
        </body>
    </html>
    """)

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    
    # LangChain setup
    llm = ChatOpenAI(model="gpt-3.5-turbo")
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant that tells short, interesting facts."),
        ("user", "{topic}")
    ])
    chain = prompt | llm | StrOutputParser()

    while True:
        data = await websocket.receive_text()
        if data == "ping":
            await websocket.send_text("pong")
            continue

        # Use async for streaming
        async for chunk in chain.astream({"topic": data}):
            await websocket.send_text(chunk)
        await websocket.send_text("\n--- END OF RESPONSE ---") # Signal end of message

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

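When you run this and navigate to http://localhost:8000/, you’ll see a basic HTML page, and your browser’s developer console will log that the WebSocket connection opened. The page has no input field, so to send a prompt, run ws.send("Tell me a fun fact about space") in that console (ws is the global variable the page’s script creates). The LLM’s output then appears on the web page, chunk by chunk, as it’s generated. You can also connect to the /ws endpoint with websocat or another client, but that client opens its own connection, so the response streams back to it rather than to the page.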

The core problem LangChain streaming solves is the perceived latency of LLM calls. A full LLM response can take several seconds. Streaming doesn’t make generation any faster, but it means you’re not waiting for the entire response before showing anything to the user: you send back pieces of the response as they become available from the LLM. WebSockets are ideal for this because they provide a persistent, full-duplex communication channel, meaning both the server and client can send messages at any time without the overhead of establishing a new connection for each piece of data.
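To make the latency point concrete, here is a minimal sketch that compares the time to the first chunk with the time to the full response. It does not call an LLM: fake_llm_stream is a hypothetical stand-in, and the delay is an invented number chosen only for illustration.

```python
import asyncio
import time


async def fake_llm_stream(chunks, delay=0.02):
    """Hypothetical stand-in for an LLM stream: one chunk every `delay` seconds."""
    for chunk in chunks:
        await asyncio.sleep(delay)
        yield chunk


async def measure(chunks):
    """Return (seconds until the first chunk, seconds until the last chunk)."""
    start = time.perf_counter()
    first_chunk_at = None
    async for _ in fake_llm_stream(chunks):
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
    total = time.perf_counter() - start
    return first_chunk_at, total


first, total = asyncio.run(measure(["Space ", "is ", "very ", "big", "."]))
print(f"first chunk after {first:.2f}s, full response after {total:.2f}s")
```

The total time is the same whether or not you stream; what changes is how soon the user sees the first piece.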

Internally, LangChain’s streaming relies on the underlying LLM provider’s API supporting streaming. For OpenAI, ChatOpenAI has an astream method (or stream for synchronous contexts) that yields chunks of the response as they are generated, and a chain built with it exposes the same methods. Our FastAPI application then takes these chunks and immediately forwards them to the connected WebSocket client. The StrOutputParser matters here because the chat model yields structured AIMessageChunk objects, while websocket.send_text expects a plain string; the parser converts each chunk into its text.
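Conceptually, the parser step is just a mapping from message chunks to their text. The sketch below illustrates that idea with the standard library only. FakeMessageChunk and to_text are hypothetical names invented for this example; they are not LangChain’s AIMessageChunk or its StrOutputParser implementation.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeMessageChunk:
    """Hypothetical stand-in for a chat model's message chunk."""
    content: str


async def fake_model_stream():
    """Yields message-like objects rather than plain strings."""
    for piece in ["Honey ", "never ", "spoils."]:
        yield FakeMessageChunk(content=piece)


async def to_text(stream):
    """Roughly the role the parser plays in the chain: keep only the text."""
    async for message_chunk in stream:
        yield message_chunk.content


async def collect():
    return [text async for text in to_text(fake_model_stream())]


texts = asyncio.run(collect())
print(texts)
```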

The async for chunk in chain.astream(...) loop is where the magic happens. chain.astream returns an asynchronous iterator. Each await websocket.send_text(chunk) sends that piece of generated text to the browser. The while True loop keeps the WebSocket connection open, ready to receive new prompts or commands. The await websocket.accept() establishes the connection, and await websocket.receive_text() waits for incoming messages from the client.
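The forwarding loop can be exercised without an API key or a network connection. In this sketch, fake_astream and RecordingWebSocket are hypothetical stand-ins for chain.astream and FastAPI’s WebSocket (only send_text is mimicked), so treat it as an illustration of the control flow rather than of either library.

```python
import asyncio

END_MARKER = "\n--- END OF RESPONSE ---"


async def fake_astream(inputs):
    """Hypothetical stand-in for chain.astream: yields text chunks."""
    for chunk in ["Octopuses ", "have ", "three ", "hearts."]:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield chunk


class RecordingWebSocket:
    """Hypothetical stand-in for a WebSocket that records what it is asked to send."""

    def __init__(self):
        self.sent = []

    async def send_text(self, text):
        self.sent.append(text)


async def stream_response(websocket, topic):
    """Forward each chunk as soon as it arrives, then signal the end."""
    async for chunk in fake_astream({"topic": topic}):
        await websocket.send_text(chunk)
    await websocket.send_text(END_MARKER)


ws = RecordingWebSocket()
asyncio.run(stream_response(ws, "octopuses"))
print(ws.sent)
```

Each chunk becomes its own WebSocket message, followed by one extra message carrying the end marker.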

Crucially, the "chunk" you receive from chain.astream isn’t necessarily a whole word, and it isn’t a single character either. Chunk boundaries typically follow the model’s tokens and how the provider’s API decides to segment the output, so a chunk may be a word fragment, a word, or several words. This means the user sees text appearing in bursts, not necessarily character-by-character or word-by-word, but much sooner than if they waited for the whole response. The HTML page here is deliberately minimal; in a real app, you’d likely append chunks to a dedicated element in a fuller front end, perhaps with some CSS to make the streaming effect look good.
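Because chunk boundaries are arbitrary, a client should treat chunks as fragments to concatenate, not as words or lines. The chunk list below is made up for illustration; real boundaries depend on the model and the provider.

```python
# Made-up chunks whose boundaries fall in the middle of words.
chunks = ["The Un", "iverse is ab", "out 13.8 bil", "lion years old", "."]


def append_chunk(buffer, chunk):
    """What a client does per message: append to what is shown, never replace it."""
    return buffer + chunk


display = ""
for chunk in chunks:
    display = append_chunk(display, chunk)

# Concatenating chunks in arrival order recovers the intended text.
full_text = "".join(chunks)
print(display)
```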

The most surprising thing about this setup is how little "streaming logic" you actually need to write in your application code. LangChain abstracts away the complexities of interacting with the LLM’s streaming API, and WebSocket libraries like FastAPI’s handle the low-level network details. Your primary job becomes connecting the two: iterating over the astream output and sending each yielded chunk over the WebSocket.

The next step in building a robust streaming application would involve handling errors gracefully, perhaps by sending an error message over the WebSocket, and implementing a mechanism to cancel ongoing LLM generation if the user disconnects or sends a new prompt.
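As a rough sketch of both ideas, the example below runs the forwarding loop inside an asyncio task so that it can be cancelled, and reports stream failures to the client as a message. Everything here is a stand-in: RecordingWebSocket, slow_stream, failing_stream, and the "[error]" message format are hypothetical and not part of LangChain or FastAPI.

```python
import asyncio


class RecordingWebSocket:
    """Hypothetical stand-in for a WebSocket that records outgoing messages."""

    def __init__(self):
        self.sent = []

    async def send_text(self, text):
        self.sent.append(text)


async def slow_stream():
    """Hypothetical stand-in for chain.astream: one chunk every 10 ms."""
    for i in range(100):
        await asyncio.sleep(0.01)
        yield f"chunk{i} "


async def failing_stream():
    """Stand-in for a stream that fails partway through."""
    yield "partial "
    raise RuntimeError("provider unavailable")


async def stream_safely(websocket, stream):
    """Forward chunks; report failures to the client instead of crashing."""
    try:
        async for chunk in stream:
            await websocket.send_text(chunk)
    except asyncio.CancelledError:
        raise  # let cancellation propagate so the task really stops
    except Exception as exc:
        await websocket.send_text(f"[error] {exc}")


async def demo():
    # Cancellation: stop generation after ~35 ms, as if the user disconnected.
    cancelled_ws = RecordingWebSocket()
    task = asyncio.create_task(stream_safely(cancelled_ws, slow_stream()))
    await asyncio.sleep(0.035)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

    # Error reporting: the failure reaches the client as a message.
    failed_ws = RecordingWebSocket()
    await stream_safely(failed_ws, failing_stream())
    return cancelled_ws.sent, failed_ws.sent


cancelled_sent, failed_sent = asyncio.run(demo())
print(len(cancelled_sent), failed_sent)
```

In the FastAPI endpoint, the same pattern would wrap chain.astream, and a disconnect or a new prompt arriving on the socket would be the trigger for task.cancel().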
