The most surprising thing about chunking text for LLMs isn’t how you split it, but what you choose to split.

Let’s see LlamaIndex’s SentenceSplitter in action. Imagine you have a document about the solar system.

from llama_index.core.schema import Document
from llama_index.core.node_parser import SentenceSplitter

text = """
The Solar System is the gravitationally bound system of the Sun and the objects that orbit it.
It formed 4.6 billion years ago from the gravitational collapse of a giant interstellar molecular cloud.
The vast majority of the system's mass is concentrated in the Sun, with the remaining mass distributed in eight planets.
The eight planets, in order from the Sun, are Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune.
Pluto was formerly considered the ninth planet but is now classified as a dwarf planet.
The inner planets, Mercury, Venus, Earth, and Mars, are terrestrial planets, composed primarily of rock and metal.
The outer planets, Jupiter, Saturn, Uranus, and Neptune, are giant planets, composed primarily of hydrogen and helium.
Jupiter is the largest planet, followed by Saturn, Uranus, and Neptune.
The asteroid belt, located between the orbits of Mars and Jupiter, contains rocky objects.
The Kuiper Belt, a disc-shaped region beyond Neptune's orbit, contains icy bodies, including Pluto.
The Oort Cloud, a theoretical spherical cloud of icy planetesimals, is thought to surround the solar system at distances up to 100,000 AU.
"""

document = Document(text=text)

# Default SentenceSplitter: prefers sentence boundaries. In recent llama_index.core
# releases the defaults are chunk_size=1024 and chunk_overlap=200 tokens; check your version.
splitter = SentenceSplitter()
nodes = splitter.get_nodes_from_documents([document])

for i, node in enumerate(nodes):
    print(f"--- Node {i+1} ---")
    print(f"Text: {node.get_content()[:100]}...") # Truncate for display
    print(f"Metadata: {node.metadata}")
    print(f"Node ID: {node.id_}")
    print("-" * 10)

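This code takes a single Document object and uses SentenceSplitter to break it down into smaller Node objects. Each Node contains a chunk of text and associated metadata. Notice how the Node ID is automatically generated. Because this sample text is far shorter than the default chunk size, the loop prints a single node; pass a smaller limit, for example SentenceSplitter(chunk_size=64, chunk_overlap=10), to see the text split into several nodes.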

The core problem LlamaIndex’s node parsers solve is preparing unstructured text for LLM consumption. LLMs have token limits. You can’t just feed them an entire book. Furthermore, you want to preserve semantic meaning within each chunk. If you split a sentence in half, the meaning is lost. Node parsers, by intelligently chunking text, ensure that each piece fed to an LLM is coherent and retains its original context as much as possible. This is crucial for tasks like retrieval-augmented generation (RAG), where you need to retrieve relevant chunks of information to answer a query.

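The SentenceSplitter is the most common. It tries to keep paragraphs and sentences intact: it splits on paragraph separators first, then on sentence boundaries found by a sentence tokenizer, and only falls back to a regular expression and then to individual words when a single sentence is still larger than the chunk size. You can control two key parameters: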

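  • chunk_size: The maximum number of tokens a node can contain. In recent llama_index.core releases the default is 1024.
  • chunk_overlap: The number of tokens to overlap between consecutive chunks. For SentenceSplitter the default in recent releases is 200; check the values in your installed version.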

This overlap is vital. A chunk boundary can separate a statement from the sentence that gives it meaning. Suppose the text reads "The Sun is hot. Jupiter, by contrast, is a cold gas giant." Without overlap, one chunk might end with the first sentence and the next begin with the second, so a retrieved chunk about Jupiter would carry "by contrast" with nothing to contrast against. With overlap, the tail of one chunk is repeated at the head of the next, so context that bridges a boundary is retained in at least one of the chunks.
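To make the effect of overlap concrete, here is a minimal sketch of sentence-aware chunking with overlap. It is not LlamaIndex’s implementation: the chunk_sentences helper is hypothetical, whitespace-separated words stand in for tokens, and sentences are detected with a deliberately naive pattern.

```python
import re


def chunk_sentences(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Pack whole sentences into chunks of at most `chunk_size` words.

    Each new chunk starts with trailing sentences of the previous chunk,
    as long as they total at most `chunk_overlap` words. A single sentence
    longer than `chunk_size` is kept whole (this sketch never cuts inside
    a sentence).
    """
    # Naive sentence split: whitespace that follows '.', '?' or '!'.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

    chunks: list[str] = []
    current: list[str] = []  # sentences in the chunk being built
    current_len = 0          # word count of `current`

    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences into the next chunk, newest first,
            # while they fit both the overlap budget and the size limit.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                too_much_overlap = carried_len + prev_len > chunk_overlap
                too_big = carried_len + prev_len + n_words > chunk_size
                if too_much_overlap or too_big:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n_words

    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "The Sun is hot. Jupiter is a gas giant. Saturn has rings. Mars is red."
    for chunk in chunk_sentences(sample, chunk_size=9, chunk_overlap=5):
        print(chunk)
```

With chunk_size=9 and chunk_overlap=5, the sentence about Jupiter appears at the end of the first chunk and again at the start of the second; with chunk_overlap=0 each sentence appears exactly once.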

Beyond SentenceSplitter, LlamaIndex offers others:

  • TokenTextSplitter: Splits by token count, breaking on a separator (a space by default) instead of on sentence boundaries. Useful if your text is already highly structured or if you need tight control over token budgets.
  • SentenceWindowNodeParser: Emits one node per sentence and stores the surrounding sentences in the node’s metadata, so retrieval can match on a single sentence while the LLM still sees its neighbors.
  • CodeSplitter and MarkdownNodeParser: Split along the structure of source code and of Markdown headings respectively, instead of treating the input as plain prose.

Each splitter has its own strengths. For natural language text, SentenceSplitter is usually preferred; it already tries paragraph boundaries first (see its paragraph_separator parameter), so well-formatted prose does not need a separate paragraph splitter. For code or highly structured data, CodeSplitter, MarkdownNodeParser, or TokenTextSplitter might be better. The key is to match the splitter to the structure of your input data and the nature of the queries you expect.
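For contrast with sentence-aware splitting, the sketch below splits purely by token count. It is an illustration of the idea only, not TokenTextSplitter’s actual code: split_by_token_count is a hypothetical helper, and whitespace-separated words stand in for real tokens.

```python
def split_by_token_count(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut `text` into windows of `chunk_size` words, ignoring sentences.

    Consecutive windows share `chunk_overlap` words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "The Sun is hot. Jupiter is a gas giant."
    for chunk in split_by_token_count(sample, chunk_size=6, chunk_overlap=2):
        print(chunk)
```

In this run the first window ends at "Jupiter is", in the middle of a sentence. That is the trade-off of counting tokens alone: chunk sizes are predictable, but sentence boundaries are not respected.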

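Sentence boundary detection is more nuanced than just splitting on periods. SentenceSplitter’s default sentence detection relies on a tokenizer rather than on a single pattern, but a regular expression often used for the same job, (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s, shows what is involved. It uses negative lookbehind assertions ((?<!...)) to avoid splitting after abbreviations (like "Mr." or "U.S.A."), and because it only matches whitespace that follows a period, question mark, or exclamation point, a decimal point inside a number such as 4.6 never triggers a split. Details like these prevent common errors in sentence boundary detection, especially in technical texts.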
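The pattern above can be tried directly with Python’s re module. This is a minimal sketch: the sample text and the split_sentences helper are made up for illustration, and it exercises the regular expression on its own rather than LlamaIndex’s splitter.

```python
import re

# Split on whitespace that follows '.', '?' or '!', unless the text just
# before it looks like an abbreviation ("U.S.", "e.g.") or a short title ("Mr.").
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s")


def split_sentences(text: str) -> list[str]:
    """Return the pieces of `text` between sentence boundaries."""
    return SENTENCE_BOUNDARY.split(text)


if __name__ == "__main__":
    sample = (
        "Mr. Smith visited the U.S. in 2019. "
        "It formed 4.6 billion years ago. Was it hot? Yes!"
    )
    for sentence in split_sentences(sample):
        print(sentence)
```

On this sample the pattern yields four sentences and leaves "Mr.", "U.S." and "4.6" intact. It is still a heuristic: a sentence that ends in a capitalized two-letter word, such as "He said No.", matches the same lookbehind as "Mr." and is not split.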

When deciding on chunk size, consider the context window of the LLM you’ll be using. If your LLM has a 4096-token context window, and you’re feeding it retrieved chunks, you don’t want your chunk size to be so large that only one or two chunks fit, defeating the purpose of retrieval. A common strategy is to pick chunk_size so that the number of chunks you plan to retrieve, multiplied by chunk_size, fits in the context window with room left for the prompt and the LLM’s own output.
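The budgeting described above is simple arithmetic, sketched here. The max_retrievable_chunks helper is hypothetical, and the 300-token prompt and 512-token output reservation are assumed figures chosen only for illustration.

```python
def max_retrievable_chunks(
    context_window: int,
    chunk_size: int,
    prompt_tokens: int,
    output_tokens: int,
) -> int:
    """How many full-size chunks fit alongside the prompt and the reply."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Tokens left for retrieved text after reserving prompt and output space.
    available = context_window - prompt_tokens - output_tokens
    if available <= 0:
        return 0
    return available // chunk_size


if __name__ == "__main__":
    # Assumed budget: 4096-token window, 300-token prompt, 512 tokens of output.
    for size in (256, 512, 1024, 2048):
        fits = max_retrievable_chunks(4096, size, prompt_tokens=300, output_tokens=512)
        print(f"chunk_size={size}: up to {fits} chunks")
```

Under these assumptions 3284 tokens remain for retrieved text, so a chunk_size of 512 leaves room for six chunks while 2048 leaves room for only one.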

The next step after parsing nodes is often indexing them, which is where you’ll encounter vector stores and embeddings.
