The LlamaIndex Chat Engine doesn’t actually "remember" in the way humans do; it reconstructs context from a history of messages, and how that history is managed is the core of its "memory."
Let’s see it in action. Imagine we have a simple RAG setup:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.chat_engine import CondensePlusContextChatEngine
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI
import os

# Assume you have a 'data' directory with some text files.
# For this example, let's create a dummy file:
if not os.path.exists("data"):
    os.makedirs("data")
    with open("data/context.txt", "w") as f:
        f.write("The capital of France is Paris. The Eiffel Tower is in Paris.")

# Configure your LLM (replace with your actual API key and model)
Settings.llm = OpenAI(model="gpt-3.5-turbo", api_key="YOUR_OPENAI_API_KEY")

# Load documents
documents = SimpleDirectoryReader("data").load_data()

# Create an index
index = VectorStoreIndex.from_documents(documents)

# Create a chat engine with memory
memory = ChatMemoryBuffer.from_defaults(token_limit=1000)  # Small token limit for demo
chat_engine = CondensePlusContextChatEngine.from_defaults(
    index.as_retriever(),  # this engine takes a retriever, not a query engine
    memory=memory,
    verbose=True,
)

# First query
response1 = chat_engine.chat("What is the capital of France?")
print(f"Response 1: {response1}")

# Second query, relying on previous context
response2 = chat_engine.chat("And what is the famous tower in that city?")
print(f"Response 2: {response2}")
When you run this, you’ll see the CondensePlusContextChatEngine take the previous turns (user query, AI response) together with your new message and condense them into a single standalone query against the underlying index. The ChatMemoryBuffer is what holds these turns.
The problem the chat engine solves is maintaining context across multiple user turns. Without explicit memory management, each query to the index.as_query_engine() would be treated as a fresh, isolated question. The chat engine, using its memory, wraps the LLM interaction to create a conversational flow.
Internally, ChatMemoryBuffer essentially holds a list of ChatMessage objects, each with a role (e.g., "user", "assistant") and content. When you call chat_engine.chat(), the engine combines your new message with the history held in memory to build its prompts, and the new message itself is recorded in memory as part of the turn. For CondensePlusContextChatEngine, this prompt involves:
- A system message (often implicit or part of the LLM’s base instructions), which for this engine also carries the retrieved context.
- The conversation history from memory, trimmed to fit the token limit.
- The current user query. Before retrieval, the LLM rewrites it into a standalone question using the history; this is the "condense" step.
- Relevant context retrieved from the index using that standalone question.

Finally, the LLM generates its response. After the LLM responds, that response is also added to the memory.
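The turn described above can be sketched in plain Python, without LlamaIndex. This is an illustration under stated assumptions, not the library's implementation: `Message`, `fake_llm`, `fake_retrieve`, and `chat_turn` are hypothetical stand-ins, and the real engine's prompt wording and exact ordering may differ.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str      # "system", "user", or "assistant"
    content: str


def fake_llm(messages):
    """Hypothetical stand-in for an LLM call; echoes the last message."""
    return f"(reply to: {messages[-1].content})"


def fake_retrieve(query):
    """Hypothetical stand-in for index retrieval; always returns one passage."""
    return ["The capital of France is Paris. The Eiffel Tower is in Paris."]


def chat_turn(history, user_text, llm=fake_llm, retrieve=fake_retrieve):
    """One condense-plus-context style turn over a plain list of messages."""
    # 1. Condense: rewrite the new message as a standalone question.
    #    With no history there is nothing to condense (an assumption here).
    if history:
        transcript = "\n".join(f"{m.role}: {m.content}" for m in history)
        condense_prompt = (
            "Rewrite the follow-up as a standalone question.\n"
            f"{transcript}\nFollow-up: {user_text}"
        )
        standalone = llm([Message("user", condense_prompt)])
    else:
        standalone = user_text

    # 2. Retrieve context for the standalone question.
    context = "\n".join(retrieve(standalone))

    # 3. Build the final prompt: system message with context, history, new message.
    prompt = [Message("system", f"Answer using this context:\n{context}")]
    prompt += history
    prompt.append(Message("user", user_text))

    # 4. Generate, then record both sides of the turn in the history.
    answer = llm(prompt)
    history.append(Message("user", user_text))
    history.append(Message("assistant", answer))
    return answer, prompt


history = []
chat_turn(history, "What is the capital of France?")
_, second_prompt = chat_turn(history, "And what is the famous tower in that city?")
print([m.role for m in second_prompt])
```

On the second turn the final prompt contains the system message, the first user/assistant exchange, and the new user message, which is what lets the model resolve "that city".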
The token_limit in ChatMemoryBuffer.from_defaults(token_limit=1000) is crucial. It’s not a hard limit on the number of messages, but a soft limit on the total tokens the history can consume before truncation. When the memory exceeds this limit, the oldest messages are discarded (or summarized, depending on the memory implementation) to make space. This prevents the prompt from becoming too long and expensive, and also avoids overwhelming the LLM’s context window.
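To make the truncation behaviour concrete, here is a minimal sketch of a token-budgeted history. It assumes a crude whitespace token count and a hypothetical helper `trim_to_token_limit`; the real ChatMemoryBuffer uses a proper tokenizer and may apply additional rules, so treat this only as an illustration of the idea.

```python
def count_tokens(text):
    """Crude token estimate: whitespace-separated words (an assumption for the demo)."""
    return len(text.split())


def trim_to_token_limit(messages, token_limit):
    """Return the most recent messages whose combined token count fits the limit."""
    kept = []
    used = 0
    # Walk backwards so the newest messages are considered first.
    for role, content in reversed(messages):
        cost = count_tokens(content)
        if used + cost > token_limit:
            break  # this message and everything older is left out
        kept.append((role, content))
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    ("user", "What is the capital of France?"),
    ("assistant", "The capital of France is Paris."),
    ("user", "And what is the famous tower in that city?"),
]

# With a budget of 16 "tokens", the oldest message no longer fits.
print(trim_to_token_limit(history, token_limit=16))
```

The limit constrains tokens rather than message count: one long message can push out several short ones.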
The CondensePlusContextChatEngine is just one type of chat engine. Others, like SimpleChatEngine, send the message history and the current message straight to the LLM, with no condensation and no retrieval step. The choice of engine dictates how the memory is used to form the final prompt sent to the LLM.
What most people don’t realize is that the "condensation" step itself is an LLM call. The CondensePlusContextChatEngine doesn’t just grab old messages; it asks the LLM to rewrite your latest message as a standalone question in light of them. This means the quality of retrieval depends on the LLM’s ability to condense, and it adds latency and cost to each turn that has prior history, even if the query is simple.
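The cost of that extra call is easy to estimate. The sketch below counts LLM calls for a condense-then-answer engine; the function name and the `skip_condense_on_first_turn` flag are hypothetical, and whether the first turn skips condensation is an assumption about the engine rather than a documented guarantee.

```python
def llm_calls_per_conversation(num_turns, skip_condense_on_first_turn=True):
    """Count LLM calls for a condense-then-answer engine over num_turns turns."""
    calls = 0
    for turn in range(num_turns):
        has_history = turn > 0
        # Condense call: rewrites the follow-up using prior history.
        if has_history or not skip_condense_on_first_turn:
            calls += 1
        # Answer call: generates the response from context and history.
        calls += 1
    return calls


print(llm_calls_per_conversation(5))
```

Under these assumptions a five-turn conversation costs nine LLM calls instead of five, so the condense step nearly doubles the call count for longer chats.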
The next step is often exploring more sophisticated memory types, like ChatSummaryMemoryBuffer, which uses an LLM to summarize older messages once they no longer fit within the token limit, rather than simply dropping them.
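The idea behind summarizing memory can be sketched without the library. In this hedged example, `fake_summarize` is a hypothetical stand-in for an LLM summarization call, `summary_buffer_view` is an illustrative helper rather than a LlamaIndex API, and tokens are counted by whitespace; the summary's own length is not charged against the limit here.

```python
def count_tokens(text):
    """Crude token estimate: whitespace-separated words (an assumption for the demo)."""
    return len(text.split())


def fake_summarize(messages):
    """Hypothetical stand-in for an LLM summarization call."""
    return f"Summary of {len(messages)} earlier messages."


def summary_buffer_view(messages, token_limit, summarize=fake_summarize):
    """Keep recent messages verbatim; fold everything older into one summary."""
    used = 0
    cut = len(messages)  # index of the oldest message kept verbatim
    for i in range(len(messages) - 1, -1, -1):
        cost = count_tokens(messages[i][1])
        if used + cost > token_limit:
            break
        used += cost
        cut = i
    older, recent = messages[:cut], messages[cut:]
    if not older:
        return list(recent)
    # Older turns survive as a single summary message instead of being dropped.
    return [("system", summarize(older))] + list(recent)


history = [
    ("user", "What is the capital of France?"),
    ("assistant", "The capital of France is Paris."),
    ("user", "And what is the famous tower in that city?"),
]
print(summary_buffer_view(history, token_limit=10))
```

Compared with plain truncation, the older exchange is still represented in the prompt, at the price of an additional LLM call whenever a new summary is needed.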