An LLM’s context window isn’t just about how much text the model can read, but about how much it can reason about simultaneously.
Let’s see it in action. Imagine you have a long document and you want to ask questions about it.
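# Requires the `openai` package. OpenAI() reads the API key from the
# OPENAI_API_KEY environment variable and raises an error if it is missing.
from openai import OpenAI
client = OpenAI()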
document = """
The quick brown fox jumps over the lazy dog. This is a classic pangram, a sentence that contains every letter of the alphabet.
Pangrams are often used for testing typefaces and keyboards, as they provide a comprehensive sample of characters.
The history of pangrams dates back to the late 19th century, with early examples appearing in typing manuals.
One of the earliest known pangrams is "The quick brown fox jumps over the lazy dog."
Another famous pangram is "Pack my box with five dozen liquor jugs."
The length of a pangram can vary, but the goal is always to include all 26 letters of the English alphabet.
Some pangrams are more natural-sounding than others. The "quick brown fox" is widely recognized for its fluency.
There are also pangrams in other languages, each designed to cover the unique characters of that language's alphabet.
For example, in French, a common pangram is "Portez ce vieux whisky au juge blond qui fume."
In German, "Victor jagt zwölf Boxkämpfer quer über den großen Sylter Deich."
These examples showcase the diversity and creativity involved in crafting these linguistic puzzles.
The practical applications extend beyond mere curiosities; they are essential tools for designers and developers.
"""
# Assume a hypothetical model with a limited context window of 150 tokens.
# In reality, you'd use a model like 'gpt-3.5-turbo' or 'gpt-4', whose windows are far larger.
# "Tokenize" the document by splitting on whitespace (simplified for demonstration).
# Words are only a rough proxy for tokens; in a real scenario, use a proper tokenizer like tiktoken.
tokens = document.split()
max_tokens_in_window = 150  # Let's say our model can handle 150 tokens for prompt + completion
# Note: this spends the whole budget on the document. A real prompt must also
# leave room for the question and for the model's completion.
if len(tokens) > max_tokens_in_window:
    print("Document is too long for the context window. Truncating for demonstration.")
    truncated_tokens = tokens[:max_tokens_in_window]
    processed_document = " ".join(truncated_tokens)
else:
    processed_document = document
# Now, we can construct a prompt and send it to the LLM
prompt = f"Based on the following text, what is the purpose of a pangram?\n\n{processed_document}"
# In a real application, this would be an API call:
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[
#         {"role": "system", "content": "You are a helpful assistant."},
#         {"role": "user", "content": prompt},
#     ],
# )
# print(response.choices[0].message.content)
print("\n--- Simulated LLM Interaction ---")
print(f"Prompt length (approximate tokens): {len(prompt.split())}")
print("Simulated LLM Response: Pangrams are used for testing typefaces and keyboards because they contain every letter of the alphabet.")
print("--------------------------------")
The core problem LLMs face with long inputs is the context window. Think of it as the LLM’s short-term memory. It can only hold and process a certain amount of information at once, measured in tokens (roughly words or sub-words). When your input exceeds this limit, the excess has to be cut off or the request is rejected outright; either way, the LLM can’t "see" all of it.
This isn’t a bug; it’s a fundamental architectural constraint. The self-attention mechanism, which allows LLMs to weigh the importance of different tokens in relation to each other, has a computational cost that scales quadratically with the sequence length. To keep inference fast and affordable, context windows are capped.
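To make the quadratic scaling concrete, here is a rough back-of-the-envelope sketch. It counts only the query-key scores in one full attention matrix (one head, one layer) and ignores the optimizations real systems use; the function name `attention_score_count` is made up for illustration.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of query-key scores in one full self-attention matrix."""
    # Every token attends to every token, so the matrix is seq_len x seq_len.
    return seq_len * seq_len


# Doubling the sequence length quadruples the number of scores.
for n in (1_000, 2_000, 4_000, 8_000):
    print(f"{n:>5} tokens -> {attention_score_count(n):>12,} scores")
```

Causal masking roughly halves this count, but the growth is still quadratic.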
The simplest way to deal with this is truncation. You chop off the end of your input once it hits the token limit. This is what the example above simulates. It’s fast and easy, but everything past the cutoff is lost, however important it was.
# Example of simple truncation (whitespace-separated words stand in for tokens)
input_text = "This is a very long piece of text that will definitely exceed the context window limit of most language models. We need to find a way to handle this without losing too much important information. " * 50
max_tokens = 100
truncated_text = " ".join(input_text.split()[:max_tokens])
print(f"Original length: {len(input_text.split())} words")
print(f"Truncated length: {len(truncated_text.split())} words")
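Truncating to the full window size overlooks the fact that the question and the model's answer also consume tokens. The sketch below reserves room for both before trimming the document. It is a minimal illustration: `truncate_to_budget` and its parameters are hypothetical names, and words again stand in for tokens.

```python
def truncate_to_budget(document: str, question: str,
                       context_limit: int, reserved_for_completion: int) -> str:
    """Trim `document` so that question + document + completion fit the limit.

    Counts whitespace-separated words as a stand-in for tokens; a real
    tokenizer would give different numbers.
    """
    budget = context_limit - reserved_for_completion - len(question.split())
    if budget <= 0:
        raise ValueError("No room left for the document in this context window.")
    return " ".join(document.split()[:budget])


doc = "word " * 500
question = "What is the purpose of a pangram?"
trimmed = truncate_to_budget(doc, question,
                             context_limit=150, reserved_for_completion=50)
print(f"Document words kept: {len(trimmed.split())}")
```

How much to reserve for the completion is a judgment call that depends on how long you expect the answer to be.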
A slightly more sophisticated approach is summarization. You can use another LLM call (or a dedicated summarization model) to condense the early parts of your document, then feed that summary together with the remaining text into your main LLM. This preserves the gist but might lose specific details.
def summarize_text(text, max_sentences=2):
    # In a real scenario, this would be an API call to a summarization model.
    # For demonstration, we just keep the first few sentences.
    sentences = [s.strip().rstrip('.') for s in text.split('. ') if s.strip()]
    # Re-join the kept sentences and restore the final period
    return ". ".join(sentences[:max_sentences]) + "."
long_document = "The first part of the document contains crucial setup information. It details the initial parameters and background. The second part discusses the experimental results, which are the main focus. Finally, the third part offers conclusions and future work."
summary_part = summarize_text(long_document)
remaining_part = "The experimental results show a significant improvement. The conclusions are that the new method is effective."
combined_input = f"{summary_part} {remaining_part}"
print(f"Combined input length (approximate tokens): {len(combined_input.split())}")
For tasks that draw on documents far larger than the context window, like complex code analysis or legal contract review, chunking and retrieval (RAG, Retrieval-Augmented Generation) is a widely used approach. You split your document into smaller, manageable chunks, embed them into vectors, and store them in a vector database. When you ask a question, you retrieve the most relevant chunks based on semantic similarity and then feed only those chunks, along with your question, to the LLM.
# Conceptual RAG example
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Assume `chunks` is a list of pre-defined text chunks from your document
chunks = [
    "The quick brown fox jumps over the lazy dog. This is a classic pangram.",
    "Pangrams are used for testing typefaces and keyboards.",
    "The history of pangrams dates back to the late 19th century.",
    "Pack my box with five dozen liquor jugs.",
]
# Load a pre-trained model for embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
chunk_embeddings = model.encode(chunks)
query = "What are pangrams used for?"
query_embedding = model.encode([query])[0]
# Calculate similarity between query and chunk embeddings
similarities = cosine_similarity([query_embedding], chunk_embeddings)[0]
# Get the index of the most similar chunk
most_similar_chunk_index = np.argmax(similarities)
retrieved_chunk = chunks[most_similar_chunk_index]
# Now, feed the retrieved chunk and the query to the LLM
prompt_rag = f"Based on this text: '{retrieved_chunk}', answer the question: {query}"
print("\n--- RAG Example ---")
print(f"Query: {query}")
print(f"Retrieved Chunk: {retrieved_chunk}")
print(f"RAG Prompt (approximate tokens): {len(prompt_rag.split())}")
print("Simulated LLM Response: Pangrams are used for testing typefaces and keyboards.")
print("-------------------")
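The RAG example above assumes `chunks` already exists. One simple way to produce them is a fixed-size splitter with overlap, so that a sentence straddling a boundary still appears whole in at least one chunk. This is a minimal sketch, not a recommendation: `chunk_words` is a hypothetical helper, the sizes are arbitrary, and it splits on words rather than real tokens or sentence boundaries.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split `text` into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(120))
for i, chunk in enumerate(chunk_words(sample, chunk_size=50, overlap=10)):
    print(i, len(chunk.split()), chunk.split()[0])
```

Larger overlaps reduce the chance of cutting a relevant passage in half, at the cost of storing and embedding more redundant text.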
Another technique, often used in conjunction with RAG, is sliding window summarization. Instead of just taking the first part of the document, you process it in overlapping windows. The output of processing one window (e.g., a summary) becomes part of the input for the next window, allowing information to propagate through the entire document without exceeding the context limit at any single step.
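A minimal sketch of that loop follows. The summarizer is a stand-in: `placeholder_summarize` simply keeps the first few words, where a real pipeline would call an LLM. All function names and sizes here are assumptions made for illustration.

```python
from typing import Callable


def placeholder_summarize(text: str, max_words: int = 30) -> str:
    """Hypothetical stand-in for an LLM summarization call."""
    return " ".join(text.split()[:max_words])


def sliding_window_summary(
    text: str,
    window_size: int = 100,
    overlap: int = 20,
    summarize: Callable[[str], str] = placeholder_summarize,
) -> str:
    """Summarize `text` window by window, carrying the summary forward."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")

    words = text.split()
    step = window_size - overlap
    running_summary = ""
    for start in range(0, len(words), step):
        window = " ".join(words[start:start + window_size])
        # The previous summary is prepended to the current window, so each
        # call sees at most (summary length + window_size) words.
        running_summary = summarize(f"{running_summary} {window}".strip())
        if start + window_size >= len(words):
            break
    return running_summary


long_text = " ".join(f"word{i}" for i in range(250))
print(sliding_window_summary(long_text))
```

With the placeholder summarizer the result is not meaningful; the point is the control flow, where no single call receives more than one window plus the running summary.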
When dealing with extremely long sequences, especially for code generation or analysis, hierarchical context can be employed. This involves creating multiple levels of abstraction. For instance, you might first summarize entire files, then use those summaries to inform processing of specific functions, and finally use function-level context for line-by-line analysis.
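The sketch below shows one possible shape for such a hierarchy, assuming the code has already been parsed into a mapping of file names to function sources. The data layout, `placeholder_file_summary`, and `build_hierarchical_context` are all hypothetical; a real system would use an LLM for the summaries and a parser to extract the functions.

```python
def placeholder_file_summary(text: str, max_words: int = 12) -> str:
    """Hypothetical stand-in for an LLM-generated file summary."""
    return " ".join(text.split()[:max_words])


def build_hierarchical_context(
    project: dict[str, dict[str, str]],
    target_file: str,
    target_function: str,
) -> str:
    """Assemble a prompt context from three levels of detail."""
    # Level 1: one short summary per file in the project.
    overview = "\n".join(
        f"- {name}: {placeholder_file_summary(' '.join(functions.values()))}"
        for name, functions in project.items()
    )
    # Level 2: names of the other functions in the target file.
    siblings = ", ".join(fn for fn in project[target_file] if fn != target_function)
    # Level 3: full source of the function being analyzed.
    source = project[target_file][target_function]
    return (
        f"Project overview:\n{overview}\n\n"
        f"Other functions in {target_file}: {siblings or 'none'}\n\n"
        f"Function under analysis ({target_function}):\n{source}"
    )


project = {
    "math_utils.py": {
        "add": "def add(a, b): return a + b",
        "mul": "def mul(a, b): return a * b",
    },
    "io_utils.py": {"read": "def read(path): return open(path).read()"},
}
print(build_hierarchical_context(project, "math_utils.py", "add"))
```

Only the function under analysis is included in full; everything else is represented by progressively coarser summaries, which keeps the prompt small.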
One of the more surprising things about managing LLM context is that the order in which you present information within the context window can drastically alter the LLM’s understanding, even if the exact same set of tokens is present. Models often pay more attention to the beginning and end of their context window, a phenomenon known as the "lost in the middle" problem. Crucial information placed in the middle of a very long prompt might be effectively ignored.
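One heuristic for working around this is to reorder retrieved chunks so the most relevant ones sit at the edges of the prompt and the least relevant ones end up in the middle. The sketch below assumes the input list is already sorted from most to least relevant; `reorder_for_edges` is a hypothetical name, and whether the reordering actually helps depends on the model.

```python
def reorder_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Put the most relevant chunks at the start and end of the list.

    Expects `chunks_by_relevance` sorted from most to least relevant.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-most relevant chunk comes last.
    return front + back[::-1]


ranked = ["most relevant", "second", "third", "fourth", "least relevant"]
for position, chunk in enumerate(reorder_for_edges(ranked)):
    print(position, chunk)
```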
The next hurdle you’ll likely face is handling multimodal inputs, like combining text with images or audio, which introduces entirely new dimensions of context management.