LangChain’s LCEL (LangChain Expression Language) enables streaming LLM responses in production by allowing you to construct LLM chains as a series of steps that can yield partial results as they are computed, rather than waiting for the entire response.
Here’s what a basic LCEL chain looks like, with streaming enabled:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# 1. Define the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
# 2. Define the prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        ("user", "{input}"),
    ]
)
# 3. Define the output parser
output_parser = StrOutputParser()
# 4. Create the LCEL chain
chain = prompt | llm | output_parser
# 5. Invoke the chain with streaming enabled
# The `stream()` method is the key here for streaming
for chunk in chain.stream({"input": "Explain the concept of quantum entanglement in simple terms."}):
    print(chunk, end="", flush=True)
When you run this, you won’t see the entire explanation appear at once. Instead, you’ll observe tokens (words or sub-word units) appearing on your console as the LLM generates them. This is the essence of streaming: delivering output incrementally.
The Core Problem LCEL Streaming Solves:
The primary challenge with traditional LLM API calls is latency. A user asks a question, and the application waits for the LLM to finish generating the entire response before displaying anything. For complex queries or slower models, this can lead to a noticeable delay, making the application feel unresponsive and frustrating for the user. Streaming bypasses this by showing the user that something is happening and providing them with information as it becomes available. This dramatically improves perceived performance.
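The gap between "first chunk visible" and "full response ready" is easy to measure. The sketch below uses only the standard library; fake_token_stream is a hypothetical stand-in for a streaming model (it is not a LangChain API), and the per-token delay is an arbitrary assumption chosen for illustration.

```python
import time
from typing import Iterator, List


def fake_token_stream(tokens: List[str], delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM: one token per delay."""
    for token in tokens:
        time.sleep(delay_s)  # simulated per-token generation time
        yield token


def measure(tokens: List[str], delay_s: float = 0.01) -> dict:
    """Record time to first chunk and total time for one simulated response."""
    start = time.perf_counter()
    first_chunk_s = None
    parts = []
    for chunk in fake_token_stream(tokens, delay_s):
        if first_chunk_s is None:
            # With streaming, this is when the user first sees output.
            first_chunk_s = time.perf_counter() - start
        parts.append(chunk)
    # Without streaming, the user would see nothing until this point.
    total_s = time.perf_counter() - start
    return {"first_chunk_s": first_chunk_s, "total_s": total_s, "text": "".join(parts)}


if __name__ == "__main__":
    result = measure(["Entangled ", "particles ", "share ", "one ", "state."])
    print(f"first chunk: {result['first_chunk_s']:.3f}s, full text: {result['total_s']:.3f}s")
```

The total generation time does not change; what changes is how long the user stares at an empty screen.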
How it Works Internally:
LCEL represents a chain as a sequence (more generally, a directed graph) of runnables. When you call stream() on a runnable such as our chain object, the input flows through the components in order. A component that supports streaming, such as ChatOpenAI, yields chunks as the underlying LLM API produces them. Each chunk is handed to the next component as soon as it arrives; the StrOutputParser in our example converts every incoming message chunk to a string and passes it on immediately rather than waiting for the full message. The stream() iterator then yields these processed chunks to your application.
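The flow described above can be modeled with plain Python generators. This is a conceptual sketch, not LangChain's implementation: fake_llm, to_text, build_prompt, and stream are hypothetical names, and the dict-shaped chunks merely imitate message chunks.

```python
from typing import Iterable, Iterator


def build_prompt(user_input: str) -> str:
    """Non-streaming step: runs once, before the model starts."""
    return f"You are a helpful assistant. {user_input}"


def fake_llm(prompt: str) -> Iterator[dict]:
    """Hypothetical model: yields message-like chunks one word at a time."""
    for word in f"Echo: {prompt}".split(" "):
        yield {"content": word + " "}


def to_text(chunks: Iterable[dict]) -> Iterator[str]:
    """Parser step: converts each chunk as it arrives, without buffering."""
    for chunk in chunks:
        yield chunk["content"]


def stream(user_input: str) -> Iterator[str]:
    """Compose the steps; nothing runs until the caller starts iterating."""
    return to_text(fake_llm(build_prompt(user_input)))


if __name__ == "__main__":
    for piece in stream("hi"):
        print(piece, end="", flush=True)
    print()
```

Because every stage is lazy, the caller receives the first piece after the model has produced only its first chunk, which is the behavior stream() exposes.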
The Levers You Control:
- Model Configuration: The LLM itself must support streaming. ChatOpenAI does: when you call stream(), the underlying API request is made in streaming mode, and langchain-openai handles the API interaction details.
- Chain Construction: LCEL's pipe operator (|) is crucial. It ensures that the output of one component becomes the input of the next and that stream() propagates through the chain. If a component cannot process its input incrementally, it buffers everything it receives, and the steps after it see one large chunk instead of many small ones.
- Invocation Method: You must use the .stream() method on your chain or runnable. Using .invoke() or .batch() will wait for the full response.
- Client-Side Rendering: Your application's frontend needs to be able to receive and render these chunks as they arrive. This usually involves technologies like WebSockets, Server-Sent Events (SSE), or similar mechanisms to push data from your backend (where LangChain is running) to the user's browser.
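For the client-side lever, one common option is to wrap each chunk in a Server-Sent Events frame before writing it to the HTTP response. The sketch below shows only the framing step and uses the standard library; sse_events is a hypothetical helper, and the closing "done" event is an assumed convention, not something defined by SSE or by LangChain.

```python
from typing import Iterable, Iterator


def sse_events(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap each text chunk in a Server-Sent Events frame."""
    for chunk in chunks:
        # A payload containing newlines needs one "data:" field per line.
        data_lines = "\n".join(f"data: {line}" for line in chunk.split("\n"))
        # A blank line terminates the event.
        yield f"{data_lines}\n\n"
    # Hypothetical end-of-stream marker so the client knows when to stop.
    yield "event: done\ndata: end\n\n"


if __name__ == "__main__":
    # In a real backend, the iterable would be chain.stream(...).
    for frame in sse_events(["Hello", " world"]):
        print(repr(frame))
```

In a web framework, a generator like this would typically be returned as a streaming response with the content type text/event-stream.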
A More Complex Example with Multiple Steps:
Let’s say you have a chain that first retrieves some information, then formats it, and then generates a response.
from langchain_community.retrievers import PubMedRetriever
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
# LLM and Parser remain the same
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
output_parser = StrOutputParser()
# A retriever that fetches documents from PubMed. It returns the full list of
# documents in one step; it does not stream them.
retriever = PubMedRetriever(top_k_results=2)
# A prompt that uses retrieved documents
template = """Answer the question based on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
# A function to format documents for the prompt
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Build the chain
# RunnablePassthrough allows us to pass the original input while also processing it
# through the retriever and formatter.
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | output_parser
)
# Invoke with streaming
user_question = "What are the latest advancements in CRISPR gene editing?"
print(f"Question: {user_question}\n")
print("Answer: ", end="", flush=True)
for chunk in chain.stream(user_question):
    print(chunk, end="", flush=True)
print() # Newline at the end
In this example, the retriever returns a list of documents and the format_docs function joins them into a single string. That string and the original question fill the prompt, and the LLM's output is streamed through the parser. The retriever and format_docs steps must finish before the LLM can start, so nothing reaches the caller during retrieval; chunks begin to arrive only once the model starts generating.
The most surprising aspect of LCEL streaming is how gracefully it handles components that don't support streaming. If a component early in your chain processes its whole input at once (e.g., a simple string manipulation function), LCEL runs it to completion, and any later component that supports streaming still streams. The order matters, though: a non-streaming component placed after a streaming one has to collect every chunk before it can produce anything, so the caller receives a single chunk at the end and the incremental effect is lost. Nothing errors; the output just stops being incremental. Understanding this flow is key to keeping a chain responsive. RunnablePassthrough is a common pattern to ensure the original input is available for later steps, even after intermediate processing.
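The difference between a step that preserves streaming and one that collapses it can be seen with two plain generator functions. This is a standard-library sketch with hypothetical names (token_source, upper_streaming, upper_buffering); it models the behavior described above rather than reproducing LCEL internals.

```python
from typing import Iterable, Iterator


def token_source() -> Iterator[str]:
    """Hypothetical upstream model output, one chunk at a time."""
    for token in ["quantum ", "entanglement ", "links ", "particles"]:
        yield token


def upper_streaming(chunks: Iterable[str]) -> Iterator[str]:
    """Transforms each chunk as it arrives, so streaming is preserved."""
    for chunk in chunks:
        yield chunk.upper()


def upper_buffering(chunks: Iterable[str]) -> Iterator[str]:
    """Needs the whole input first, so it emits a single chunk at the end."""
    whole = "".join(chunks)  # consumes the entire upstream before continuing
    yield whole.upper()


if __name__ == "__main__":
    print(list(upper_streaming(token_source())))  # four chunks
    print(list(upper_buffering(token_source())))  # one chunk
```

Both versions produce the same final text; only the first delivers it incrementally. In LCEL, a custom step written as a generator over its input chunks plays the role of upper_streaming.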
The next step in optimizing LLM applications involves managing token costs and latency more proactively, perhaps by implementing techniques like early exit or adaptive sampling.