LlamaIndex’s batch ingestion can feel like magic for large document sets, but the real trick is how it splits, batches, and persists the work so that a big corpus doesn’t bog down your machine.

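Let’s see it in action. Imagine you have a directory full of documents (PDFs, text files, and so on), and you want to index them all. The script below creates three small text files as stand-ins, so it runs without any real data.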

import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.core.node_parser import SentenceSplitter

# Define the directory containing your documents
DATA_DIR = "./my_documents"
PERSIST_DIR = "./storage"

# Create dummy documents for demonstration if they don't exist
if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)
    with open(os.path.join(DATA_DIR, "doc1.txt"), "w") as f:
        f.write("This is the first document. It contains information about apples. Apples are fruits.")
    with open(os.path.join(DATA_DIR, "doc2.txt"), "w") as f:
        f.write("This is the second document. It discusses bananas. Bananas are yellow and grow in bunches.")
    with open(os.path.join(DATA_DIR, "doc3.txt"), "w") as f:
        f.write("The third document talks about cherries. Cherries are small, red, and often used in pies.")

# Check if index already exists
if not os.path.exists(PERSIST_DIR):
    print("Creating new index...")
    # Load documents from the directory
    documents = SimpleDirectoryReader(DATA_DIR).load_data()

    # Configure the node parser
    # We'll use a simple splitter here, but you can customize chunk size and overlap
    node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)

    # Build the index
    # from_documents runs the transformations, then embeds the resulting nodes in batches
    index = VectorStoreIndex.from_documents(
        documents,
        transformations=[node_parser]
    )

    # Persist the index to disk
    index.storage_context.persist(persist_dir=PERSIST_DIR)
    print(f"Index created and persisted to {PERSIST_DIR}")
else:
    print(f"Loading existing index from {PERSIST_DIR}...")
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
    print("Index loaded.")

# Now you can query the index
query_engine = index.as_query_engine()
response = query_engine.query("What are some fruits mentioned?")
print("\nQuery Response:")
print(response)

The core pieces of LlamaIndex’s batch ingestion are SimpleDirectoryReader and the Document and Node objects it feeds. When you point SimpleDirectoryReader at a directory, it walks the files and creates Document objects from them. Note that load_data(), as used above, returns the full list of Documents, so all of the parsed text is in memory at once. For collections too large for that, recent versions also provide iter_data(), which yields the Documents for one file at a time. Each Document is essentially a wrapper around your raw text and its metadata (like its file path).
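To make the difference between eager and lazy loading concrete, here is a minimal plain-Python sketch. It does not use LlamaIndex at all: iter_text_files is a hypothetical helper, and the dictionaries it yields only imitate the text-plus-metadata shape of a Document.

```python
import os
import tempfile
from typing import Dict, Iterator


def iter_text_files(directory: str) -> Iterator[Dict[str, object]]:
    """Yield one document-like dict per .txt file, reading each file only when asked."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and name.endswith(".txt")):
            continue
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
        # Only this file's text has been read when the caller receives it.
        yield {"text": text, "metadata": {"file_path": path}}


# Demo on a throwaway directory with three tiny files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        with open(os.path.join(tmp, f"doc{i}.txt"), "w", encoding="utf-8") as f:
            f.write(f"Document number {i}.")
    # Each document is consumed and discarded before the next file is read.
    lazy_lengths = [len(doc["text"]) for doc in iter_text_files(tmp)]

print(lazy_lengths)
```

Because the function is a generator, peak memory depends on the largest single file rather than on the size of the whole directory, which is the property you want when a corpus does not fit in RAM.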

The real work happens when VectorStoreIndex.from_documents() is called. This method runs each Document through the configured transformations, which by default include a node parser. The node parser breaks a Document down into smaller, more manageable pieces called Nodes, based on your chunk_size and chunk_overlap. These Nodes are what actually get embedded and stored in your vector store. Embedding and insertion happen in batches (controlled by the embedding model’s embed_batch_size and the index’s insert_batch_size), so no single embedding request grows with the size of your corpus. The Documents and Nodes themselves still live in memory during the build, so for a very large corpus it is worth ingesting in slices rather than relying on batching alone to prevent out-of-memory errors.

The transformations argument is where you plug in your node parser. SentenceSplitter is a common choice, but LlamaIndex supports other strategies like TokenTextSplitter and custom parsers. The chunk_size caps how large each Node can be (measured in tokens for SentenceSplitter), and chunk_overlap repeats the tail of one chunk at the start of the next so that context isn’t lost at chunk boundaries, which is crucial for accurate retrieval.
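The interaction between chunk_size and chunk_overlap is easier to see on a toy input. The sketch below is a simplified, word-based splitter written for illustration only. SentenceSplitter itself counts tokens and tries to respect sentence boundaries, so its output will differ; split_with_overlap is a hypothetical name.

```python
from typing import List


def split_with_overlap(words: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a word list into chunks of at most chunk_size words,
    where each chunk repeats the last chunk_overlap words of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


words = "a b c d e f g h i j".split()
chunks = split_with_overlap(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With a chunk size of 4 and an overlap of 1, the window advances three words at a time, so the last word of each chunk reappears as the first word of the next. A larger overlap preserves more context at the cost of more chunks to embed.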

Internally, from_documents works as a short pipeline. It applies the specified transformations (like node parsing) to the provided documents to generate nodes, embeds those nodes, and adds them to the index. The efficiency comes from batching: embedding and insertion happen a batch at a time, instead of in one giant request or in one request per node. The VectorStoreIndex acts as an orchestrator, managing the flow from raw documents to embedded nodes and finally to the vector store, which is written to disk when you call persist().
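The batching step of that pipeline can be sketched in a few lines of plain Python. This is an illustration of the pattern, not LlamaIndex’s internal code: batched, fake_embed, and ingest are hypothetical names, and fake_embed stands in for a real embedding model by deriving a two-number vector from each text.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an iterable into lists of at most batch_size items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller batch
        yield batch


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an embedding call: one vector per input text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def ingest(
    chunks: Iterable[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int,
) -> Tuple[Dict[str, List[float]], int]:
    """Embed chunks batch by batch and store the vectors; also count embedding calls."""
    store: Dict[str, List[float]] = {}
    calls = 0
    for batch in batched(chunks, batch_size):
        vectors = embed_batch(batch)  # one call per batch, not per chunk
        calls += 1
        for text, vector in zip(batch, vectors):
            store[text] = vector
    return store, calls


chunks = [
    "apples are fruits",
    "bananas are yellow",
    "cherries are red",
    "pies use cherries",
    "fruit grows",
]
store, calls = ingest(chunks, fake_embed, batch_size=2)
print(f"{len(store)} chunks embedded in {calls} calls")
```

Five chunks with a batch size of 2 need three embedding calls instead of five. With a hosted embedding API, fewer and larger calls usually mean less per-request overhead, while the fixed batch size keeps each request bounded.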

One aspect that often goes unappreciated is how metadata is handled. When you load documents with SimpleDirectoryReader, LlamaIndex automatically captures metadata such as file_path, file_name, file_type, and last_modified_date. This metadata is attached to each Document and, consequently, to the Nodes derived from it. This means you can later filter your search results on these attributes, provided your vector store supports metadata filtering. For example, if you only wanted to retrieve information from documents modified after a certain date, you could apply a filter on last_modified_date at query time.
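As a rough picture of what date-based filtering amounts to, the sketch below filters node-like dictionaries by their last_modified_date. It is plain Python rather than LlamaIndex’s own metadata filter classes; the sample dates are made up, modified_after is a hypothetical helper, and the "YYYY-MM-DD" string format is an assumption you should check against your own nodes’ metadata.

```python
from datetime import date
from typing import Dict, List

# Node-like records with illustrative (invented) metadata values.
nodes: List[Dict] = [
    {"text": "Apples are fruits.",
     "metadata": {"file_path": "doc1.txt", "last_modified_date": "2024-01-15"}},
    {"text": "Bananas are yellow.",
     "metadata": {"file_path": "doc2.txt", "last_modified_date": "2024-03-02"}},
    {"text": "Cherries are used in pies.",
     "metadata": {"file_path": "doc3.txt", "last_modified_date": "2024-06-20"}},
]


def modified_after(candidates: List[Dict], cutoff: date) -> List[Dict]:
    """Keep only nodes whose last_modified_date is strictly after the cutoff."""
    return [
        node for node in candidates
        if date.fromisoformat(node["metadata"]["last_modified_date"]) > cutoff
    ]


recent = modified_after(nodes, date(2024, 3, 1))
print([node["metadata"]["file_path"] for node in recent])
```

In a real index the filter is typically applied by the vector store during retrieval, so only matching nodes are scored for similarity, but the predicate is the same idea.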

The next step in mastering large-scale indexing is exploring custom embedding models and advanced storage options.
