Streaming LLM output is what makes those chatbots feel alive, but getting it right in production means understanding how the model actually spits out text.

Here’s a typical LLM response when it is not streamed and arrives as one complete JSON object:

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The quick brown fox jumps over the lazy dog."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 9,
    "total_tokens": 19
  }
}

When you stream, you’re not waiting for that whole content string. Instead, you get a series of smaller JSON objects, each carrying a piece of the content. The chunks below are simplified; real ones also include fields such as id, object, created, and model.

{"choices": [{"delta": {"role": "assistant"}, "index": 0}]}
{"choices": [{"delta": {"content": "The "}, "index": 0}]}
{"choices": [{"delta": {"content": "quick "}, "index": 0}]}
{"choices": [{"delta": {"content": "brown "}, "index": 0}]}
{"choices": [{"delta": {"content": "fox "}, "index": 0}]}
{"choices": [{"delta": {"content": "jumps "}, "index": 0}]}
{"choices": [{"delta": {"content": "over "}, "index": 0}]}
{"choices": [{"delta": {"content": "the "}, "index": 0}]}
{"choices": [{"delta": {"content": "lazy "}, "index": 0}]}
{"choices": [{"delta": {"content": "dog."}, "index": 0}]}
{"choices": [{"delta": {}, "index": 0, "finish_reason": "stop"}]}

The delta field is key. It tells you what changed since the last chunk. When role appears, it’s usually the first chunk, indicating the assistant is about to speak. When content appears, it’s a piece of the generated text. When finish_reason is set (it sits on the choice, next to delta, not inside it), the generation is complete.
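To make the role of delta concrete, here is a minimal sketch that folds a list of chunks back into one complete message. It works on plain dictionaries shaped like the simplified chunks above rather than on live API objects, and the function name assemble_message is just an illustrative choice.

```python
from typing import Any, Dict, Iterable


def assemble_message(chunks: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold a sequence of streamed chunks into one complete message."""
    role = None
    parts = []
    finish_reason = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice.get("delta", {})
        if "role" in delta:
            role = delta["role"]
        if delta.get("content"):
            parts.append(delta["content"])
        # finish_reason is read from the choice itself, next to delta.
        if choice.get("finish_reason"):
            finish_reason = choice["finish_reason"]
    return {"role": role, "content": "".join(parts), "finish_reason": finish_reason}


# Hand-written sample chunks, not captured from a real API call.
chunks = [
    {"choices": [{"delta": {"role": "assistant"}, "index": 0}]},
    {"choices": [{"delta": {"content": "The "}, "index": 0}]},
    {"choices": [{"delta": {"content": "quick "}, "index": 0}]},
    {"choices": [{"delta": {"content": "fox."}, "index": 0}]},
    {"choices": [{"delta": {}, "index": 0, "finish_reason": "stop"}]},
]
message = assemble_message(chunks)
print(message)
```

Collecting the pieces in a list and joining once at the end avoids repeated string concatenation, which starts to matter for long responses.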

To implement this in Python using the openai library, you’d set stream=True in your API call.

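# Note: this uses the pre-1.0 interface of the openai package. In openai>=1.0 the
# equivalent call is client.chat.completions.create(...) on an OpenAI() client.
import openai

openai.api_key = "YOUR_API_KEY"

stream = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Tell me a short story."}
    ],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.get("content"):
        # flush=True shows each piece immediately instead of waiting for a newline
        print(delta.content, end="", flush=True)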

This simple loop iterates through the stream object. Each chunk is a dictionary-like object. We check if the delta field contains content and, if so, print it. The end="" prevents extra newlines between tokens.

The real magic happens when you need to handle this on the client side (your web app, for instance). You’ll be receiving these chunks over HTTP. A common pattern is to use Server-Sent Events (SSE). Your backend server receives the streamed chunks from the LLM API and then forwards them to the client as SSE messages.
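Before wiring this into a web framework, it may help to see what an SSE message looks like on the wire: optional field lines such as event:, one or more data: lines, and a blank line that ends the message. The helper below is a small sketch of that framing; the name format_sse is hypothetical and not part of any library.

```python
import json
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Serialize one payload as a Server-Sent Events message."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A payload that contains newlines is sent as several data: lines.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the message.
    return "\n".join(lines) + "\n\n"


chunk = {"choices": [{"delta": {"content": "The "}, "index": 0}]}
frame = format_sse(json.dumps(chunk))
print(repr(frame))
```

Because json.dumps emits a single line by default, each chunk becomes exactly one data: line followed by the blank separator.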

Here’s a conceptual Python Flask example for the backend:

from flask import Flask, Response, request
import openai
import json

app = Flask(__name__)
openai.api_key = "YOUR_API_KEY"

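@app.route('/stream_chat', methods=['GET', 'POST'])
def stream_chat():
    # A browser's EventSource can only send GET requests, so look for the message
    # in the query string first and fall back to a JSON POST body.
    body = request.get_json(silent=True) or {}
    user_message = request.args.get('message') or body.get('message')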
    if not user_message:
        return Response("No message provided", status=400)

    def generate_events():
        stream = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": user_message}],
            stream=True,
        )
        for chunk in stream:
            yield f"data: {json.dumps(chunk)}\n\n"

    return Response(generate_events(), mimetype="text/event-stream")

if __name__ == '__main__':
    app.run(debug=True, port=5001)

On the client side (JavaScript in a browser), you’d use the EventSource API:

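// EventSource always issues a GET request, so the message travels in the query
// string; the /stream_chat route must accept GET for this to work.
const message = "Tell me a short story.";
const eventSource = new EventSource(
    "http://localhost:5001/stream_chat?message=" + encodeURIComponent(message)
);

eventSource.onmessage = function(event) {
    const choice = JSON.parse(event.data).choices[0];
    if (choice.delta.content) {
        document.getElementById("output").innerText += choice.delta.content;
    }
    // finish_reason is a property of the choice, not of delta.
    if (choice.finish_reason) {
        eventSource.close();
        console.log("Stream finished.");
    }
};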

eventSource.onerror = function(err) {
    console.error("EventSource failed:", err);
    eventSource.close();
};

This client-side code sets up a listener for messages. When a message arrives, it parses the JSON, checks for content, and appends it to an HTML element with id="output". It also listens for the finish_reason to close the connection.
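The browser’s EventSource hides one detail that any other client has to handle itself: the network can cut the byte stream anywhere, even in the middle of a JSON payload. As a rough illustration of what happens under the hood, here is a small incremental parser in Python. The class name SSEParser is hypothetical, and the sketch assumes the server uses plain "\n" line endings and only data: fields, as the Flask example does.

```python
import codecs
import json
from typing import List


class SSEParser:
    """Incrementally split a byte stream into SSE data payloads."""

    def __init__(self) -> None:
        # The incremental decoder copes with multi-byte characters split across reads.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""

    def feed(self, data: bytes) -> List[str]:
        """Add raw bytes and return the payloads of every message completed so far."""
        self._buffer += self._decoder.decode(data)
        # Messages are separated by a blank line; the last piece may be incomplete.
        *complete, self._buffer = self._buffer.split("\n\n")
        payloads = []
        for event in complete:
            data_lines = []
            for line in event.split("\n"):
                if line.startswith("data:"):
                    value = line[5:]
                    # Drop the single optional space after the colon.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:
                payloads.append("\n".join(data_lines))
        return payloads


# Hand-made byte pieces that deliberately break messages at awkward places.
pieces = [
    b'data: {"choices": [{"delta": {"con',
    b'tent": "The "}, "index": 0}]}\n\ndata: {"cho',
    b'ices": [{"delta": {"content": "quick "}, "index": 0}]}\n\n',
]
parser = SSEParser()
received = []
for piece in pieces:
    for payload in parser.feed(piece):
        received.append(json.loads(payload)["choices"][0]["delta"]["content"])
print("".join(received))
```

The key design choice is to keep the unfinished tail in a buffer and only parse JSON once the blank-line terminator has arrived.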

A detail that often surprises people is that a chunk’s delta may carry no usable content even when it is not the final chunk. The first chunk, for instance, typically carries only the role (sometimes alongside an empty content string), and chunks that deliver function-call arguments have no text content at all. Your streaming logic should handle these deltas gracefully by simply doing nothing, rather than expecting a token every time.
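One way to get that robustness is to put the checks in a single generator so the rest of the code only ever sees real text. The sketch below works on plain dictionaries, and iter_text is a hypothetical helper name; it also tolerates a chunk with an empty choices list as a purely defensive measure.

```python
from typing import Any, Dict, Iterable, Iterator


def iter_text(chunks: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Yield only the non-empty text pieces from a stream of chunk dicts."""
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # defensive: nothing to read in this chunk
        delta = choices[0].get("delta") or {}
        text = delta.get("content")
        if text:  # skips a missing key, None, and "" alike
            yield text


# Hand-written sample chunks covering the empty cases.
chunks = [
    {"choices": [{"delta": {"role": "assistant", "content": ""}, "index": 0}]},
    {"choices": [{"delta": {"content": "Hi"}, "index": 0}]},
    {"choices": [{"delta": {"content": None}, "index": 0}]},
    {"choices": []},
    {"choices": [{"delta": {"content": "!"}, "index": 0}]},
    {"choices": [{"delta": {}, "index": 0, "finish_reason": "stop"}]},
]
text = "".join(iter_text(chunks))
print(text)
```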

The next hurdle is managing conversational history efficiently within a streaming context, especially when dealing with long conversations that exceed token limits.
