LangChain’s RecursiveCharacterTextSplitter is often the default choice for chunking, but it’s a blunt instrument that can destroy semantic meaning by splitting mid-sentence or mid-thought.

Let’s see this in action. Imagine we have a document with a few distinct sections. We’ll split it with the default RecursiveCharacterTextSplitter and then compare the result with a token-based splitter, SentenceTransformersTokenTextSplitter, which measures chunks with the tokenizer of a sentence-transformers model.

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter,
)

text1 = """
The quick brown fox jumps over the lazy dog. This is the first sentence.
This is the second sentence, and it continues the thought.
This is the third sentence, introducing a new idea.

The lazy dog, however, was not impressed. It barely twitched an ear.
This is another sentence about the dog.
And a final sentence for this paragraph.

Finally, a completely separate topic begins here. AI is transforming industries.
Machine learning models are at the core of this revolution.
Deep learning, a subset of machine learning, has seen remarkable advances.
"""

# Using the default RecursiveCharacterTextSplitter
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
recursive_chunks = recursive_splitter.split_text(text1)

print("--- RecursiveCharacterTextSplitter Chunks ---")
for i, chunk in enumerate(recursive_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

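# Using SentenceTransformersTokenTextSplitter (requires the sentence-transformers package)
# We'll use a smaller model for demonstration, though a larger one is generally better for RAG
# Keyword names (`model`, `tokens_per_chunk`) follow langchain-text-splitters 0.x; check your installed version
sentence_splitter = SentenceTransformersTokenTextSplitter(
    model="all-MiniLM-L6-v2",
    chunk_overlap=20,  # tokens shared between neighbouring chunks
    tokens_per_chunk=50,  # small token budget so the short sample yields several chunks
)
sentence_chunks = sentence_splitter.split_text(text1)

print("\n--- SentenceTransformersTokenTextSplitter Chunks ---")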
for i, chunk in enumerate(sentence_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

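When you run this, compare where each splitter places its boundaries. RecursiveCharacterTextSplitter works through its separators in order, so in this sample, where every line is shorter than 100 characters, its chunks should end at line breaks. Once a single line or paragraph is longer than chunk_size, it falls back to splitting on spaces and can produce fragments like "This is the second sentence, and it continues the" or "Machine learning models are at the core of this". SentenceTransformersTokenTextSplitter measures chunks in tokens from the model’s tokenizer, so every chunk fits the 50-token budget and shares 20 tokens with its neighbour. Its boundaries land wherever the budget runs out, though, which can also be mid-sentence.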

The core problem LangChain chunking strategies solve is preparing unstructured text for retrieval-augmented generation (RAG). Large Language Models (LLMs) have context window limits, and you can’t just feed them an entire book. Chunking breaks down large documents into smaller, manageable pieces that can be embedded and stored in a vector database. When a user asks a question, relevant chunks are retrieved and passed to the LLM as context. The quality of these retrieved chunks directly impacts the quality of the LLM’s answer.
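The retrieval step described above can be sketched without any libraries. This toy example uses a word-overlap score as a stand-in for embedding similarity. The helper names (tokenize, score, retrieve) are hypothetical, and a real pipeline would use an embedding model and a vector store instead.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, chunk: str) -> float:
    """Jaccard word overlap; an assumed stand-in for embedding similarity."""
    q, c = tokenize(query), tokenize(chunk)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks that score highest against the query."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]


chunks = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy dog, however, was not impressed. It barely twitched an ear.",
    "Machine learning models are at the core of this revolution.",
]
top = retrieve("What is at the core of the AI revolution?", chunks, k=1)
print(top[0])  # this chunk would be passed to the LLM as context
```

If the third sentence had been cut in half by the splitter, neither half would score as well against the query, which is why chunk boundaries matter for answer quality.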

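Internally, RecursiveCharacterTextSplitter works by trying to split text on an ordered list of separators (by default "\n\n", "\n", " ", and the empty string ""). It starts with the broadest separator, recursively falls back to narrower ones for any piece that is still too large, and merges adjacent pieces back together up to chunk_size. This is fast and simple but doesn’t understand language. SentenceTransformersTokenTextSplitter, on the other hand, uses the tokenizer of a sentence-transformers model. It encodes the text into tokens, slides a window of tokens_per_chunk tokens across them with chunk_overlap tokens shared between neighbours, and decodes each window back to text. Token counts are more closely aligned with model input limits than character counts, so every chunk fits the embedding model, but the splitter does not look at sentence boundaries or embedding similarity when it cuts. Splitters that place boundaries where the similarity between neighbouring sentences drops, such as SemanticChunker in langchain_experimental, are the ones that try to maintain semantic coherence.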
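The separator fallback can be sketched in plain Python. This is a simplified illustration, not LangChain’s implementation: the function name recursive_split is hypothetical, and the sketch leaves out chunk overlap and regex separators.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the broadest separator, recurse on oversized pieces, merge the rest."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep, rest = separators[0], separators[1:]
    # The empty separator means "cut into individual characters".
    pieces = list(text) if sep == "" else text.split(sep)
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits, keep merging
            continue
        if current.strip():
            chunks.append(current)  # flush what has been merged so far
        current = ""
        if len(piece) <= chunk_size:
            current = piece  # start a new chunk with this piece
        elif rest:
            # The piece alone is too large: retry it with narrower separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
        else:
            # Only reachable if the separator list does not end with "".
            chunks.append(piece)
    if current.strip():
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer and will not fit in one chunk."
)
result = recursive_split(sample, chunk_size=40)
for chunk in result:
    print(repr(chunk))
```

The first paragraph survives intact because it fits, while the second is longer than 40 characters and ends up cut between words, which is the mid-sentence behaviour discussed above.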

The key levers you control are chunk_size and chunk_overlap. chunk_size sets the target maximum size of a chunk, measured in characters or tokens depending on the splitter (SentenceTransformersTokenTextSplitter calls it tokens_per_chunk). A smaller size means more chunks and more granular retrieval, but also more embeddings to compute and store. A larger size means fewer chunks and less overhead, but a chunk that spans several distinct ideas can pull irrelevant text into the prompt. chunk_overlap repeats the end of one chunk at the start of the next, so a thought that straddles a boundary appears intact in at least one chunk. A common overlap is 10-20% of the chunk size.
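To make the interaction between the two settings concrete, here is a minimal sliding-window sketch. It assumes whitespace-separated words as a stand-in for real tokens, and the function name window_split is hypothetical rather than part of LangChain.

```python
def window_split(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Slide a fixed-size window over the tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window has reached the end of the text
    return chunks


sentence = (
    "AI is transforming industries and machine learning models "
    "are at the core of this revolution"
)
tokens = sentence.split()

with_overlap = window_split(tokens, chunk_size=6, chunk_overlap=2)
without_overlap = window_split(tokens, chunk_size=6, chunk_overlap=0)

print(len(with_overlap), len(without_overlap))
for chunk in with_overlap:
    print(" ".join(chunk))
```

Each chunk in with_overlap starts with the last two words of the previous one. The cost is visible in the counts: the same 15 words produce more chunks with overlap than without, so there is more to embed and store.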

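The most surprising thing to most users is that chunk_size in many splitters is a target, not a hard limit. Splitters that first cut the text into units (on a single separator, as CharacterTextSplitter does, or into sentences, as NLTKTextSplitter does) and then merge those units will not break a unit apart. If a single sentence exceeds the chunk_size, it will still be emitted as a chunk on its own. This is a feature, not a bug, designed to prevent semantically meaningless splits, but it means some chunks can be larger than your specified chunk_size. Sliding-window token splitters such as SentenceTransformersTokenTextSplitter behave differently: they enforce the token budget strictly and cut wherever it runs out.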
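The oversized-chunk behaviour is easy to reproduce with a small sketch. It assumes a naive regex sentence split and greedy merging, and the function name pack_sentences is hypothetical. Real sentence-based splitters use proper sentence tokenizers, but the merging logic has the same consequence.

```python
import re


def pack_sentences(text: str, chunk_size: int) -> list[str]:
    """Greedily merge whole sentences into chunks without ever cutting a sentence."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            # Either it fits, or it is a lone sentence that is kept whole
            # even though it is longer than chunk_size.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


text = (
    "Short one. This sentence is deliberately much longer than "
    "the forty character limit. Tiny."
)
packed = pack_sentences(text, chunk_size=40)
for chunk in packed:
    print(len(chunk), repr(chunk))
```

The middle chunk is longer than 40 characters because the sketch prefers an oversized chunk to a broken sentence. If downstream components have a hard input limit, check chunk lengths after splitting rather than trusting the configured size.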

The next concept you’ll encounter is how to evaluate the effectiveness of your chunking strategy, moving beyond just visual inspection to quantitative metrics for retrieval accuracy.
