LangChain’s memory isn’t about remembering past conversations; it’s about selectively forgetting them to manage token limits.
Let’s see how this plays out. Imagine a chatbot using LangChain’s ConversationBufferMemory.
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
memory = ConversationBufferMemory()
conversation = ConversationChain(llm=llm, memory=memory, verbose=True)
print(conversation.invoke("Hi there!"))
print(conversation.invoke("I'm feeling a bit lost today."))
print(conversation.invoke("Can you help me find my way?"))
Because verbose=True is set, running this prints the prompt sent to the LLM on every turn. Notice that each prompt includes all previous messages. The prompt for the third invoke looks something like this:
System: The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.
Human: Hi there!
AI: Hello! How can I help you today?
Human: I'm feeling a bit lost today.
AI: I'm sorry to hear that. Can you tell me more about what's making you feel lost?
Human: Can you help me find my way?
AI:
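This is ConversationBufferMemory in action. It literally buffers every single message and replays the whole transcript on each call. The problem? LLMs have finite context windows. Depending on the model version, gpt-3.5-turbo accepts roughly 4k or 16k tokens. If your conversation gets long enough, the prompt exceeds that limit and the request fails with a context-length error, unless you trim the history yourself and lose the early turns.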
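To make that growth concrete, here is a minimal sketch in plain Python. It does not use LangChain: BufferMemorySketch and rough_token_count are hypothetical names invented for illustration, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass, field


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


@dataclass
class BufferMemorySketch:
    """Hypothetical toy buffer memory; not LangChain's implementation."""

    turns: list[str] = field(default_factory=list)

    def save(self, human: str, ai: str) -> None:
        # Every exchange is appended verbatim; nothing is ever dropped.
        self.turns.append(f"Human: {human}")
        self.turns.append(f"AI: {ai}")

    def history(self) -> str:
        # The full transcript is what would be pasted into the next prompt.
        return "\n".join(self.turns)


memory = BufferMemorySketch()
sizes = []
for i in range(5):
    memory.save(f"question number {i}", f"answer number {i}")
    sizes.append(rough_token_count(memory.history()))

print(sizes)  # prints [8, 16, 24, 32, 40]
```

The history size grows by a fixed amount on every turn, so a long enough conversation will cross any fixed context limit.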
Now, consider ConversationSummaryMemory. Instead of storing every message, it uses the LLM itself to update a running summary of the conversation after each turn, at the cost of one extra LLM call per turn.
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationSummaryMemory
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
memory = ConversationSummaryMemory(llm=llm)  # the memory needs an LLM to write the running summary
conversation = ConversationChain(llm=llm, memory=memory, verbose=True)
print(conversation.invoke("Hi there!"))
print(conversation.invoke("I'm feeling a bit lost today."))
print(conversation.invoke("Can you help me find my way?"))
The verbose=True output for the third invoke will look different. Instead of the full transcript, you’ll see a summary and the latest human input.
System: The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.
AI: The human greeted the AI and expressed feeling lost. The AI inquired for more details about why the human felt lost.
Human: Can you help me find my way?
AI:
The final, empty AI: line is where the LLM writes its response. The AI: line above the latest human input is not a past reply; it is the summary that ConversationSummaryMemory generated and placed in the prompt in lieu of the transcript. This keeps the prompt size manageable, even with many turns. The LLM doesn’t "remember" every word; it remembers the essence of the conversation.
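The rolling-summary idea can be sketched in plain Python as well. This is not LangChain’s implementation: SummaryMemorySketch and stub_summarizer are hypothetical names, and the stub simply keeps the most recent words so the example runs offline and deterministically. That is truncation, not real summarization; in practice an LLM call would take the stub’s place.

```python
from typing import Callable


def stub_summarizer(previous_summary: str, new_lines: str, max_words: int = 12) -> str:
    """Hypothetical stand-in for the LLM summarization call.

    A real summarizer would paraphrase; this stub only keeps the last
    `max_words` words so the output size stays bounded.
    """
    words = f"{previous_summary} {new_lines}".split()
    return " ".join(words[-max_words:])


class SummaryMemorySketch:
    """Toy rolling-summary memory; not LangChain's implementation."""

    def __init__(self, summarize: Callable[[str, str], str]) -> None:
        self.summarize = summarize
        self.summary = ""

    def save(self, human: str, ai: str) -> None:
        # Fold the newest exchange into the running summary, then discard it.
        new_lines = f"Human: {human} AI: {ai}"
        self.summary = self.summarize(self.summary, new_lines)


summary_memory = SummaryMemorySketch(stub_summarizer)
summary_sizes = []
for i in range(5):
    summary_memory.save(f"question number {i}", f"answer number {i}")
    summary_sizes.append(len(summary_memory.summary.split()))

print(summary_sizes)  # prints [8, 12, 12, 12, 12]
```

Unlike a verbatim buffer, the stored text stops growing once it reaches the summarizer’s cap, which is the property that keeps the prompt within the context window.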
The core problem both memory types address is the finite size of LLM context windows. ConversationBufferMemory is simple and preserves every detail, but its prompt grows with each turn until it exceeds the token limit. ConversationSummaryMemory trades perfect recall for scalability by using the LLM to distill the conversation’s essence.
Here’s the counterintuitive part: ConversationSummaryMemory is often better for complex, long-running dialogues. While it loses the exact phrasing of early messages, the LLM’s summarization can often capture the intent and key information more effectively than a raw, overflowing transcript. The LLM is less likely to get bogged down in irrelevant details from early turns when it’s presented with a concise summary.
The next hurdle you’ll face is when the summary itself becomes too long or loses critical nuance, leading you to explore memory types like ConversationSummaryBufferMemory or custom solutions.