LlamaIndex Transformations: Build Custom Ingestion Steps (2026)

LlamaIndex transformations are the building blocks of ingestion: they turn your unstructured documents into structured chunks that a Large Language Model (LLM) can retrieve and reason over.

Imagine you have a pile of PDFs about astrophysics. Before an LLM can answer "What is the event horizon of a black hole?", that raw PDF data needs a lot of work. LlamaIndex transformations are those work steps.

Let’s see this in action. We’ll take a short piece of text and apply a couple of transformations to get it ready for indexing.

import re

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import TransformComponent


class SentenceDeduplication(TransformComponent):
    """Custom step: drop sentences that already appeared earlier in the input.

    llama_index.core does not ship a sentence-level deduplication step under
    this name, so this small TransformComponent subclass stands in for one.
    """

    def __call__(self, nodes, **kwargs):
        seen = set()
        kept = []
        for node in nodes:
            # Naive sentence detection: split after ., ! or ? followed by whitespace
            sentences = re.split(r"(?<=[.!?])\s+", node.get_content().strip())
            unique = []
            for sentence in sentences:
                if sentence and sentence not in seen:
                    seen.add(sentence)
                    unique.append(sentence)
            if unique:  # discard nodes that contained only repeats
                node.set_content(" ".join(unique))
                kept.append(node)
        return kept


# For demonstration, let's use a simple text document
text = "This is the first sentence. This is the second sentence. This is a duplicate sentence. This is another sentence. And one more for good measure. This is a duplicate sentence."
document = Document(text=text)

# Let's explicitly define our transformations
transformations = [
    # Tiny chunk_size (in tokens) so this short sample splits into several nodes;
    # overlap is 0 so the dedup step only removes genuine repeats
    SentenceSplitter(chunk_size=20, chunk_overlap=0),
    SentenceDeduplication(),  # Remove duplicate sentences
    # TitleExtractor() from llama_index.core.extractors could follow here; it
    # calls an LLM, so it needs a configured model (e.g. via Settings.llm)
    # Hypothetical multimodal steps (e.g. ImageExtraction, NodeCaptionizer)
    # would be further custom components like SentenceDeduplication above
]

# Apply transformations: each one is a callable that takes a list of nodes
# (Documents count as nodes) and returns a list of nodes
nodes = [document]
for transform in transformations:
    nodes = transform(nodes)

# Print the resulting nodes
for i, node in enumerate(nodes):
    print(f"Node {i}: {node.get_content()}")
    print(f"  Metadata: {node.metadata}")

When you run this, you’ll see the original text broken into nodes and the repeated sentence removed; each node also carries a metadata dictionary that later steps can populate. The SentenceSplitter breaks the text into smaller pieces (nodes) while respecting sentence boundaries. SentenceDeduplication then drops sentences that have already appeared.

The core idea is that an LLM doesn’t "read" your documents directly. It needs data to be broken down into manageable, semantically meaningful chunks called "nodes." Each node typically contains a piece of text and associated metadata (like its source file, page number, or extracted title). Transformations are the pipeline that converts raw Document objects into these structured Node objects, ready for indexing.

Here’s the mental model:

Documents: Your raw input data (text files, PDFs, web pages, etc.). LlamaIndex loads these into Document objects.
Transformations: A sequence of operations applied to Documents or Nodes. They are designed to process, enrich, and structure the data.
Nodes: The output of transformations. These are the discrete pieces of information that get embedded and stored in your index.

Settings.transformations holds the default pipeline. If you don’t pass transformations when creating an index, LlamaIndex falls back to these defaults. You can override them for a single indexing operation, for example by passing a transformations list to VectorStoreIndex.from_documents, and that list can include your own custom transformations.

The SentenceSplitter is a workhorse. It detects sentence boundaries and tries to keep whole sentences together, so that a sentence is split across two nodes only when it is too long to fit in one. The chunk_size and chunk_overlap parameters are critical. chunk_size sets the maximum length of a node (measured in tokens for SentenceSplitter; other splitters may count characters), and chunk_overlap repeats a small amount of text from the end of one node at the beginning of the next. This overlap helps maintain context across chunk boundaries, which matters for retrieval accuracy.
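To make the two parameters concrete, here is a minimal, library-independent sketch of sentence-aware chunking with overlap. It is not LlamaIndex’s implementation: the function name chunk_sentences is hypothetical, sentences are detected with a simple regex, and length is counted in whitespace-separated words rather than tokens.

```python
import re


def chunk_sentences(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size words.

    Each new chunk starts with trailing sentences of the previous chunk, as
    long as they fit within chunk_overlap words. A single sentence longer
    than chunk_size is kept whole rather than cut.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")

    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[list[str]] = []
    current: list[str] = []
    current_len = 0

    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_len + n_words > chunk_size:
            chunks.append(current)
            # Carry over trailing sentences that fit in the overlap budget.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if carried_len + prev_len > chunk_overlap:
                    break
                carried.insert(0, prev)
                carried_len += prev_len
            # Make room so carried text plus the new sentence still fits.
            while carried and carried_len + n_words > chunk_size:
                carried_len -= len(carried.pop(0).split())
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n_words

    if current:
        chunks.append(current)
    return [" ".join(chunk) for chunk in chunks]


sample = "Alpha one two. Beta three four. Gamma five six. Delta seven eight."
for chunk in chunk_sentences(sample, chunk_size=6, chunk_overlap=3):
    print(chunk)
```

With an overlap of 3 words, each chunk in this sample begins with the last sentence of the previous one; with an overlap of 0, the chunks share nothing.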

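Consider the TitleExtractor (in llama_index.core.extractors). It asks an LLM to infer a title from the first few nodes of each document and adds that title to the metadata of all nodes derived from the document, so it needs a configured model to run. This context can be invaluable for retrieval. Multimodal pipelines follow the same pattern: steps such as ImageExtraction and NodeCaptionizer would turn images within documents into nodes with descriptive captions, but these are illustrative names rather than components shipped in llama_index.core, so you would write them as custom transformations or use an integration package.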
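The idea of stamping one document-level title onto every node can be sketched without any library. Everything here is an assumption made for illustration: SimpleNode is a stand-in for a real node class, add_document_title is a hypothetical step, and the title is taken from the first line of the first node of each document, a plain heuristic that only mimics the result of an LLM-backed extractor.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleNode:
    """Stand-in for a node: a text chunk, its source document id, and metadata."""

    text: str
    doc_id: str
    metadata: dict = field(default_factory=dict)


def add_document_title(nodes: list[SimpleNode]) -> list[SimpleNode]:
    """Copy one title per source document into every node from that document.

    Heuristic (an assumption of this sketch): the title is the first line of
    the first node seen for each document.
    """
    titles: dict[str, str] = {}
    for node in nodes:
        if node.doc_id not in titles:
            stripped = node.text.strip()
            titles[node.doc_id] = stripped.splitlines()[0] if stripped else ""
    for node in nodes:
        node.metadata["document_title"] = titles[node.doc_id]
    return nodes


nodes = [
    SimpleNode("Black Holes: A Primer\nA black hole is a region of spacetime.", "paper-1"),
    SimpleNode("The event horizon marks the point of no return.", "paper-1"),
]
for node in add_document_title(nodes):
    print(node.metadata["document_title"], "|", node.text[:30])
```

Both nodes end up with the same document_title, even though only the first one contains the title text.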

When you chain transformations, the output of one becomes the input of the next: each step receives a list of nodes and returns a list of nodes. This allows for complex data preprocessing pipelines. For instance, you might first extract text and images, then caption the images, then split the text into sentence-aware chunks, and finally deduplicate repeated sentences.
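Chaining is just function composition over a list. The sketch below shows the pattern with plain strings instead of nodes; the step functions and run_pipeline are hypothetical names chosen for this example, not LlamaIndex APIs.

```python
from functools import reduce
from typing import Callable

# A step takes a list of items and returns a new list of items.
Step = Callable[[list[str]], list[str]]


def split_paragraphs(items: list[str]) -> list[str]:
    """Split every item on blank lines."""
    return [part for item in items for part in item.split("\n\n")]


def strip_whitespace(items: list[str]) -> list[str]:
    """Trim each item and drop the ones that end up empty."""
    return [item.strip() for item in items if item.strip()]


def drop_duplicates(items: list[str]) -> list[str]:
    """Keep only the first occurrence of each item, preserving order."""
    return list(dict.fromkeys(items))


def run_pipeline(items: list[str], steps: list[Step]) -> list[str]:
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), steps, items)


raw = ["Intro.\n\n  Body text.  \n\n\n\nIntro."]
print(run_pipeline(raw, [split_paragraphs, strip_whitespace, drop_duplicates]))
```

Because every step has the same list-in, list-out shape, steps can be added, removed, or reordered without touching the others.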

What is easy to underestimate is how strongly the choice of splitter and its parameters affects retrieval. If chunk_size is too large, a single node mixes unrelated information, which makes it hard for the LLM to pinpoint the exact answer. If it is too small, context is lost, and a node may not hold enough information to answer a question that spans several original sentences. chunk_overlap matters most when the answer straddles the boundary between two chunks: with enough overlap, the context needed to understand the complete answer is present in at least one of the retrieved nodes.
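A toy experiment makes the boundary effect visible. This sketch uses fixed-size word windows instead of a sentence-aware splitter to keep it short, and both helper names (window_chunks, cooccur) are hypothetical. It checks whether two words that a query needs together land in the same chunk under different settings.

```python
def window_chunks(words: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Fixed-size word windows; consecutive windows share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def cooccur(chunks: list[list[str]], a: str, b: str) -> bool:
    """True if some single chunk contains both word a and word b."""
    return any(a in chunk and b in chunk for chunk in chunks)


words = "the event horizon is the boundary beyond which nothing escapes a black hole".split()

for size, overlap in [(4, 0), (4, 2), (13, 0)]:
    chunks = window_chunks(words, size, overlap)
    together = cooccur(chunks, "horizon", "boundary")
    print(f"chunk_size={size} overlap={overlap} chunks={len(chunks)} together={together}")
```

With 4-word windows and no overlap, "horizon" and "boundary" fall into different chunks. An overlap of 2 brings them together at the cost of more chunks, while one 13-word window trivially contains both but also everything else.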

After you’ve successfully built your index with transformations, the next challenge is often optimizing the retrieval process itself, which involves understanding retrievers and their interaction with the indexed nodes.