LLMs don’t actually "remember" your conversation between requests; on each call they see only the text that fits inside their context window, a fixed budget of tokens.

Let’s watch an LLM handle a complex, multi-turn conversation about a fictional product, "ChronoWatch," before we dive into the details.

from openai import OpenAI

client = OpenAI()

# Initial prompt with ChronoWatch product details
initial_prompt = """
Here are the specs for the ChronoWatch:
- Model: CW-Pro
- Display: AMOLED, 1.4-inch, 454x454 resolution
- Battery: 400mAh, up to 7 days typical use, 2 days with GPS
- Features: GPS, heart rate monitor, sleep tracking, NFC payments, Bluetooth 5.0
- Water Resistance: 5 ATM
- Price: $299
- Warranty: 1 year limited

Please describe the ChronoWatch to a potential buyer, highlighting its key features and benefits.
"""

response1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": initial_prompt}
    ]
)
print("--- Initial Description ---")
print(response1.choices[0].message.content)

# Follow-up question, relying on previous context
follow_up_question = "What about its battery life when using GPS extensively?"

response2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": initial_prompt}, # Resend the initial prompt: the API keeps no state between calls
        {"role": "assistant", "content": response1.choices[0].message.content}, # Previous assistant response
        {"role": "user", "content": follow_up_question} # The new user question
    ]
)
print("\n--- Follow-up on GPS Battery Life ---")
print(response2.choices[0].message.content)

# Another follow-up, testing memory of specific details
specific_detail_question = "Is the NFC payment feature available on all ChronoWatch models or just the Pro?"

response3 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": initial_prompt}, # Re-sending initial prompt
        {"role": "assistant", "content": response1.choices[0].message.content},
        {"role": "user", "content": follow_up_question},
        {"role": "assistant", "content": response2.choices[0].message.content}, # Previous assistant response
        {"role": "user", "content": specific_detail_question} # The new user question
    ]
)
print("\n--- Follow-up on NFC Feature ---")
print(response3.choices[0].message.content)

This code simulates a conversation. Notice how in the second and third calls, we’re not just sending the new question. We’re resending the original product specs and the previous turns of the conversation. This is the core idea behind "stuffing" context. The LLM doesn’t inherently remember the ChronoWatch specs from the first query to the second. It only sees the text provided in the current messages array.
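The resending pattern can be wrapped in a small helper. What follows is a minimal sketch, not part of the OpenAI SDK: the `Conversation` class and its method names are hypothetical, and the example only builds the `messages` list without calling the API.

```python
class Conversation:
    """Hypothetical helper that accumulates chat turns for a stateless API."""

    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user(self, content):
        self.messages.append({"role": "user", "content": content})
        return self

    def add_assistant(self, content):
        self.messages.append({"role": "assistant", "content": content})
        return self


convo = Conversation("You are a helpful assistant.")
convo.add_user("Here are the specs for the ChronoWatch: (spec sheet goes here)")
convo.add_assistant("(the model's description of the ChronoWatch)")
convo.add_user("What about its battery life when using GPS extensively?")

# The full history, not just the latest question, is what gets sent each time.
for message in convo.messages:
    print(message["role"], "->", message["content"])
```

Passing `convo.messages` as the `messages` argument on every call would reproduce what the three requests above do by hand.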

The problem LLMs solve is complex text generation and understanding. They can write essays, summarize documents, translate languages, and answer questions based on the text they’re given. The challenge is that their "memory" is limited by their context window, a fixed number of tokens they can process at once. This is where the distinction between "stuffing" (placing the needed information directly in the prompt, a technique closely related to in-context learning) and "retrieval-augmented generation" (RAG) becomes crucial.

Stuffing Context (In-Context Learning)

When you "stuff" context, you’re directly including relevant information within the prompt itself. Think of it like giving the LLM all the source material it needs for a specific task right before it performs that task. In our ChronoWatch example, we included the product specifications directly in the initial_prompt.

  • How it works: The LLM reads the prompt, including your stuffed context, and uses that information to generate its response. It’s like a student being given an open book and told to answer questions based on that book.
  • When to use it:
    • Short, self-contained tasks: When the information needed is small enough to fit comfortably within the LLM’s context window.
    • Few-shot learning: Providing a few examples of input/output pairs within the prompt to guide the LLM’s behavior for a specific task.
    • Personalization: Injecting user preferences or specific details about a single item.
  • Example: Asking an LLM to rewrite a paragraph in a specific tone, providing the original paragraph and a description of the desired tone in the prompt.
  • Limitation: The context window size. If you include too much information, you’ll exceed the token limit; depending on the API and the application, the request is then rejected with an error or part of the input is truncated. GPT-4o, the model used above, accepts up to 128k tokens, but older models and some other LLMs accept as few as 4k or 8k.
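Before stuffing, it helps to check whether the context is likely to fit. The sketch below is a rough illustration only: `estimate_tokens` and `fits_in_window` are hypothetical helpers, and the "about 4 characters per token" rule is an assumption standing in for a real tokenizer, which would give exact counts for a specific model.

```python
def estimate_tokens(text):
    """Crude estimate: assumes ~4 characters per token (not a real tokenizer)."""
    return max(1, len(text) // 4)


def fits_in_window(context, question, window_tokens, reserved_for_answer=500):
    """Return True if context + question still leave room for the reply."""
    used = estimate_tokens(context) + estimate_tokens(question)
    return used + reserved_for_answer <= window_tokens


specs = "Model: CW-Pro. Battery: 400mAh, up to 7 days typical use, 2 days with GPS."
question = "How long does the battery last with GPS on?"

# A short spec sheet against an 8k-token window.
print(fits_in_window(specs, question, window_tokens=8_000))

# The same text repeated 2,000 times against the same window.
print(fits_in_window(specs * 2_000, question, window_tokens=8_000))
```

When the check fails, the options are to trim the context, switch to a model with a larger window, or move to retrieval.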

Retrieval-Augmented Generation (RAG)

RAG is a more sophisticated approach for handling large amounts of external knowledge. Instead of stuffing everything into the prompt, RAG systems first retrieve relevant information from a knowledge base (like a database, document store, or vector store) and then augment the LLM’s prompt with only the most pertinent retrieved pieces.

  • How it works:
    1. Indexing: Your knowledge base (e.g., a collection of product manuals, research papers, company wikis) is processed and often converted into vector embeddings. These embeddings capture the semantic meaning of the text.
    2. Retrieval: When a user asks a question, the system converts the question into an embedding and searches the indexed knowledge base for the most semantically similar text chunks (vectors).
    3. Augmentation: The retrieved text chunks are then combined with the original user query and fed into the LLM’s prompt.
    4. Generation: The LLM uses this augmented prompt to generate an answer.
  • When to use it:
    • Large knowledge bases: When your data exceeds the LLM’s context window.
    • Dynamic data: When your information needs to be updated frequently. RAG allows you to update the knowledge base without retraining the LLM.
    • Fact-checking and grounding: To ensure LLM responses are based on specific, verifiable information, reducing hallucinations.
    • Domain-specific Q&A: Building chatbots that can answer questions about a company’s internal documentation or a specific product catalog.
  • Example: A customer service chatbot for a large electronics retailer. The chatbot doesn’t have every product manual stuffed into its prompt. Instead, when a user asks about a specific TV model’s setup, the RAG system retrieves the relevant section from the TV’s manual and provides it to the LLM.
  • The "Surprising" Part: The LLM itself doesn’t "know" where to find information. It’s the retrieval system that acts as a librarian, finding the right pages from the "book" (your knowledge base) and handing them to the LLM "student" to read and answer from. The quality of the LLM’s answer depends heavily on the quality of the retrieved context.
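The four steps can be sketched end to end in plain Python. This is a toy illustration under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model (it measures word overlap, not semantic meaning), the in-memory `index` list stands in for a vector store, and `retrieve` and `build_prompt` are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 1. Indexing: embed every chunk of the knowledge base once, up front.
chunks = [
    "Battery: 400mAh, up to 7 days typical use, 2 days with GPS.",
    "Display: AMOLED, 1.4-inch, 454x454 resolution.",
    "Warranty: 1 year limited.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question, k=1):
    """2. Retrieval: rank chunks by similarity to the question, keep the top k."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, k=1):
    """3. Augmentation: combine the retrieved chunks with the user's question."""
    context = "\n".join(retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# 4. Generation: this prompt is what would be sent to the LLM.
print(build_prompt("How long does the battery last with GPS?"))
```

Only the battery chunk reaches the prompt here; the display and warranty lines never consume tokens, which is the point of retrieving before generating.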

The key difference is scale and efficiency. Stuffing is direct but limited by the context window. RAG is indirect but scales to knowledge bases far larger than any prompt. With RAG, the retrieval step is critical; if it pulls irrelevant documents, the LLM is likely to produce an irrelevant or poorly grounded answer, no matter how capable the model is.

The next challenge you’ll face is optimizing the retrieval step in RAG: tuning the embedding models, chunking strategies, and similarity search parameters to ensure the most relevant context is consistently retrieved.
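As one example of a chunking strategy, documents are often split into fixed-size pieces that overlap slightly, so a sentence cut at a boundary still appears whole in one chunk. The sketch below is a simple word-based version; `chunk_words` and its default sizes are illustrative assumptions, not recommended settings.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for piece in chunk_words(sample, chunk_size=4, overlap=1):
    print(piece)
```

Larger chunks carry more context per retrieval hit but use more of the prompt budget; smaller chunks are more precise but can lose surrounding meaning.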
