LangChain’s text splitters are the unsung heroes of Retrieval-Augmented Generation (RAG). How you tune them can be the difference between a RAG system that retrieves the wrong context and hallucinates, and one that answers accurately.

Let’s see a RecursiveCharacterTextSplitter in action. Imagine you have a document about the history of the internet.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text = """
The history of the internet began with the development of electronic computers in the 1950s.
Early concepts of packet switching, a key technology for the internet, were developed in the early 1960s.
The ARPANET, the precursor to the modern internet, was established by the Advanced Research Projects Agency (ARPA) of the U.S. Department of Defense in 1969.
It initially connected four university nodes: UCLA, Stanford Research Institute (SRI), UC Santa Barbara, and the University of Utah.
The first message sent over ARPANET was "LO" on October 29, 1969, intended to be "LOGIN" but the system crashed.
By the mid-1970s, ARPANET had grown significantly, and protocols like TCP/IP were developed, laying the groundwork for interoperability.
The term "internet" itself began to be used in the 1970s, referring to a network of networks.
In 1983, ARPANET officially switched to the TCP/IP protocol suite, a pivotal moment that cemented its role as the foundation of the modern internet.
The National Science Foundation Network (NSFNET) was created in the mid-1980s to connect university supercomputer centers, and it eventually supplanted ARPANET as the backbone of the internet.
The World Wide Web, invented by Tim Berners-Lee at CERN in 1989, further revolutionized the internet by introducing hyperlinks and a user-friendly interface.
The first widely used graphical web browser, Mosaic, was released in 1993, making the web accessible to a wider audience.
Commercialization of the internet began in the early 1990s, leading to rapid growth and innovation.
"""

# Initialize the splitter with specific chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,  # Maximum characters per chunk
    chunk_overlap=30,  # Number of characters to overlap between chunks
    length_function=len,
    is_separator_regex=False,
)

# Split the text
chunks = text_splitter.split_text(text)

# Print the number of chunks and the first few chunks
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i+1}:\n{chunk}\n---")

This code will not group several sentences into one chunk. With chunk_size=150, no chunk can exceed 150 characters, and most sentences in the sample run roughly 90 to 190 characters, so no two of them fit together. Expect a dozen or more chunks (the exact count can vary with the library version), with the first two looking like this:

Chunk 1:
The history of the internet began with the development of electronic computers in the 1950s.
---
Chunk 2:
Early concepts of packet switching, a key technology for the internet, were developed in the early 1960s.
---

The third sentence, about the founding of ARPANET, is longer than 150 characters, so the splitter falls back to splitting it on spaces. Chunk 3 should hold the start of that sentence, cut at a word boundary, and the following chunk should repeat up to 30 characters from the end of Chunk 3 before finishing the sentence.

The core problem text splitters solve is that large language models (LLMs) have a context window, a limit on how much text they can process at once. If you feed in a document that exceeds that limit, the input will typically be truncated or the request rejected. For RAG, this means breaking the knowledge base into smaller pieces (chunks) that can be retrieved efficiently and still fit within the LLM’s context window.

LangChain’s RecursiveCharacterTextSplitter is the workhorse. It tries a list of separators in order. If a piece of text is still too large after splitting on the first separator (e.g., '\n\n'), it moves to the next (e.g., '\n'), then to a space (' '), and so on, recursively. This favors breaks at meaningful boundaries like paragraphs or lines rather than mid-word. The chunk_size parameter sets the maximum length of a chunk, measured by length_function (characters when it is len), and chunk_overlap is crucial for maintaining context: the beginning of a chunk can repeat up to chunk_overlap characters from the end of the previous one. This reduces the chance that important information is split right at a boundary and lost to the retrieval process.
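
To make the fallback order concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain’s implementation: the name recursive_split is made up for illustration, it ignores chunk_overlap entirely, and it discards the separator at each chunk boundary.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Illustrative only: split on the coarsest separator first, then finer ones."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters, possibly mid-word.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator cannot help, so fall through to the next one.
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece.strip():
            continue  # skip empty or whitespace-only pieces
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)  # flush the chunk built so far
        if len(piece) <= chunk_size:
            current = piece  # this piece starts a new chunk
        else:
            current = ""
            # The piece alone is too large: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer and will not fit in one chunk."
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))

# Text with no separators at all ends up hard-cut by the empty-string fallback.
hard_cut = recursive_split("a" * 25, chunk_size=10)
print(hard_cut)
```

In the sample, the first paragraph fits as-is, while the second is too long for one chunk and gets split on spaces instead. The string of 25 identical letters has no usable separator, so it is cut at fixed positions.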

The chunk_size and chunk_overlap parameters are the primary levers. A smaller chunk_size means more chunks, which allows more granular retrieval but adds overhead and raises the chance of missing broader context. A larger chunk_size means fewer, larger chunks, which capture more context but can dilute specific information or crowd the LLM’s context window once several are retrieved. The chunk_overlap helps mitigate the loss of context at chunk boundaries. If your chunks are too short, or the overlap too small, a query might retrieve two related chunks that lack the bridging information needed for a coherent answer. For instance, if chunk 1 ends with "He then decided to" and chunk 2 starts with "pursue his career in law," then without overlap neither chunk states the full decision on its own. With overlap, chunk 2 would instead begin "decided to pursue his career in law," and the connection is clear.
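
The boundary example can be reproduced with a small sliding-window sketch. This is an illustration under simplifying assumptions, not how LangChain computes overlap: the helper window_words is hypothetical, and it counts words rather than characters so the result is easy to read.

```python
def window_words(text, size, overlap):
    """Illustrative only: fixed-size word windows; each repeats `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap  # how far the window advances each time
    chunks, start = [], 0
    while True:
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
        start += step
    return chunks


sentence = "He then decided to pursue his career in law"
no_overlap = window_words(sentence, size=4, overlap=0)
with_overlap = window_words(sentence, size=4, overlap=2)
print(no_overlap)
print(with_overlap)
```

With overlap=2, the second window carries "decided to" forward, so the decision and its object appear in the same chunk. The cost is one extra chunk for the same sentence.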

A subtle but powerful aspect of RecursiveCharacterTextSplitter is how it adapts to different document structures. With a separator list like ['\n\n', '\n', ' ', ''] (the default), it is not just cutting at fixed character positions. It attempts to split on larger units first (double newlines, often paragraph breaks), then smaller ones (single newlines, often line breaks within paragraphs), then spaces (word breaks), and finally, if nothing else works, the empty string, which allows a split between any two characters, even mid-word (this is rare and usually indicates very dense, unstructured text). This recursive, ordered approach is why it’s often more effective than a simple fixed-size split.

The choice of chunk_size depends heavily on the nature of your data and the embedding model you’re using. Embedding models have their own input token limits, and you want your chunks to be small enough to be effectively represented by these embeddings. A common starting point for many general-purpose embedding models is a chunk_size between 500 and 1000 characters, with a chunk_overlap of 10-20% of the chunk_size. However, for highly technical documents with lots of jargon, you might need smaller chunks to isolate specific terms. For narrative text, larger chunks might be better.
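
As a rough planning aid, the arithmetic behind these starting points can be sketched as follows. Everything here is an assumption for illustration: plan_chunking is a hypothetical helper, the 4-characters-per-token ratio is only a loose heuristic for English text, and max_tokens=512 is a placeholder to replace with your embedding model’s documented limit. The chunk count is an estimate for fixed-size windows and ignores separator-aware boundaries.

```python
import math

# Rough heuristic for English prose; an assumption, not a property of any tokenizer.
CHARS_PER_TOKEN = 4


def plan_chunking(doc_chars, chunk_size, overlap_pct=15, max_tokens=512):
    """Illustrative only: derive chunk_overlap and rough estimates for one document."""
    if not 0 <= overlap_pct < 100:
        raise ValueError("overlap_pct must be between 0 and 99")
    chunk_overlap = chunk_size * overlap_pct // 100
    step = chunk_size - chunk_overlap  # new characters contributed by each extra chunk
    if doc_chars <= chunk_size:
        estimated_chunks = 1
    else:
        estimated_chunks = 1 + math.ceil((doc_chars - chunk_size) / step)
    estimated_tokens = math.ceil(chunk_size / CHARS_PER_TOKEN)
    return {
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "estimated_chunks": estimated_chunks,
        "estimated_tokens_per_chunk": estimated_tokens,
        "fits_token_limit": estimated_tokens <= max_tokens,
    }


plan = plan_chunking(doc_chars=10_000, chunk_size=1000)
too_big = plan_chunking(doc_chars=10_000, chunk_size=4000)
print(plan)
print(too_big)
```

The resulting chunk_size and chunk_overlap values can be passed straight to the splitter. Treat the token estimate as a sanity check only and measure with the model’s real tokenizer before relying on it.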

The next challenge you’ll likely face is how to effectively query these chunks, which often involves semantic search and re-ranking mechanisms.
