LlamaIndex’s VectorStoreIndex can be surprisingly inefficient for high-cardinality lookups if you don’t prune its underlying data structure.
Let’s see this in action. Imagine we have a bunch of documents, and we want to query them using semantic similarity.
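# Note: by default LlamaIndex uses OpenAI models for embeddings and generation,
# so OPENAI_API_KEY must be set (or configure other models via Settings).
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
import os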
# Create dummy documents
if not os.path.exists("data"):
    os.makedirs("data")
with open("data/doc1.txt", "w") as f:
    f.write("This document is about apples and oranges. Apples are red and round.")
with open("data/doc2.txt", "w") as f:
    f.write("This document discusses bananas and grapes. Bananas are yellow and curved.")
with open("data/doc3.txt", "w") as f:
    f.write("A third document, talking about pears and plums. Pears are green and bell-shaped.")
with open("data/doc4.txt", "w") as f:
    f.write("Yet another document, mentioning kiwis and mangoes. Kiwis are fuzzy and brown.")
# Load documents
documents = SimpleDirectoryReader("data").load_data()
# Create a VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)
# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What fruits are mentioned?")
print(response)
This seems straightforward. VectorStoreIndex takes your documents, chunks them (by default), embeds them using a text embedding model, and stores these embeddings in a vector store (defaults to an in-memory one). When you query, it embeds your query and finds the most similar document embeddings.
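The embed-store-query flow described above can be sketched without any external services. This is only a toy illustration: `toy_embed` is a hypothetical stand-in that counts words, whereas a real embedding model produces dense vectors that capture meaning, and `store` and `retrieve` are illustrative names, not LlamaIndex APIs.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


chunks = [
    "Apples are red and round.",
    "Bananas are yellow and curved.",
    "Pears are green and bell-shaped.",
    "Kiwis are fuzzy and brown.",
]

# "Indexing": embed every chunk once and keep (chunk, vector) pairs in memory.
store = [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Embed the query and return the top_k most similar chunks."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


print(retrieve("Which fruit is yellow and curved?", top_k=1))
```

Because the toy embedding only matches exact words, it would miss a query such as "Which fruit is bent?"; closing that gap is exactly what a learned embedding model is for.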
The core problem it solves is enabling semantic search over unstructured text. Instead of keyword matching, you’re matching meaning. This is crucial for RAG (Retrieval Augmented Generation) pipelines where you need to find relevant context for a Large Language Model.
Internally, VectorStoreIndex uses a VectorStore (SimpleVectorStore for in-memory storage, or integrations with Pinecone, Weaviate, etc.) to store and retrieve vectors. Retrieval is a similarity search (e.g., cosine similarity) that returns the k nearest neighbors of your query vector. By default, documents are split into fixed-size, sentence-aware chunks with some overlap between neighbors, so text that straddles a boundary appears in both chunks; the chunk size and overlap are configurable.
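To make "fixed-size chunk with overlap" concrete, here is a minimal sketch. It splits on characters for simplicity, which is an assumption of this example; LlamaIndex's own splitters work on tokens and sentence boundaries, and `chunk_text` is a hypothetical helper rather than a library function.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "This document is about apples and oranges. Apples are red and round."
pieces = chunk_text(text, chunk_size=40, overlap=10)
for piece in pieces:
    print(repr(piece))
```

The last 10 characters of one chunk are repeated at the start of the next, so a phrase cut by a boundary still appears intact in at least one chunk.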
SummaryIndex
While VectorStoreIndex is great for retrieval, it doesn’t inherently summarize. If you want an overview of your documents, you’d use SummaryIndex.
from llama_index.core import SummaryIndex, SimpleDirectoryReader
import os
# Assuming the 'data' directory from previous example exists
# Load documents
documents = SimpleDirectoryReader("data").load_data()
# Create a SummaryIndex
index_summary = SummaryIndex.from_documents(documents)
# Query for a summary
query_engine_summary = index_summary.as_query_engine(response_mode="tree_summarize")
response_summary = query_engine_summary.query("Summarize the content of the documents.")
print(response_summary)
SummaryIndex itself stores your chunks as a flat, ordered list of nodes; it does not precompute any summaries. At query time it passes every node to the response synthesizer, and the response_mode decides how that information is aggregated. With "tree_summarize", the synthesizer summarizes groups of chunks, then summarizes those summaries, and repeats until a single coherent answer covering the whole corpus remains.
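The recursive aggregation behind "tree_summarize" can be sketched as a bottom-up reduction. This is an illustration under stated assumptions, not the library's implementation: `tree_summarize` and `fan_in` here are hypothetical names, and `keyword_summarizer` is a stand-in for the LLM call that merely extracts fruit names from a fixed vocabulary.

```python
from typing import Callable


def tree_summarize(
    chunks: list[str],
    summarize: Callable[[list[str]], str],
    fan_in: int = 2,
) -> str:
    """Summarize groups of fan_in texts, level by level, until one text remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        level = [summarize(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        if len(level) == 1:
            return level[0]


FRUITS = {"apples", "bananas", "pears", "kiwis"}  # hypothetical vocabulary
calls: list[int] = []  # records how many texts each "LLM call" received


def keyword_summarizer(texts: list[str]) -> str:
    """Stand-in for an LLM: keep only the fruit names found in the inputs."""
    calls.append(len(texts))
    words = {w.strip(".,").lower() for t in texts for w in t.split()}
    return ", ".join(sorted(words & FRUITS))


result = tree_summarize(
    [
        "Apples are red and round.",
        "Bananas are yellow and curved.",
        "Pears are green and bell-shaped.",
        "Kiwis are fuzzy and brown.",
    ],
    keyword_summarizer,
)
print(result)
print(len(calls), "summarizer calls")
```

With four chunks and a fan-in of two, the summarizer runs three times: twice on the leaves and once on the two intermediate summaries. With a real LLM, each of those calls costs tokens and latency, which is why summarizing a large corpus this way is expensive.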
TreeIndex
TreeIndex is a more general-purpose index that can be used for various tasks, including summarization and question answering. It builds a hierarchical tree structure where leaf nodes are your data chunks and parent nodes represent aggregations or summaries of their children.
from llama_index.core import TreeIndex, SimpleDirectoryReader
import os
# Assuming the 'data' directory from previous example exists
# Load documents
documents = SimpleDirectoryReader("data").load_data()
# Create a TreeIndex
index_tree = TreeIndex.from_documents(documents)
# Query the index (e.g., for a summary using tree_summarize)
query_engine_tree = index_tree.as_query_engine(response_mode="tree_summarize")
response_tree = query_engine_tree.query("What are the main topics discussed?")
print(response_tree)
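TreeIndex is a separate index type from SummaryIndex, even though both can produce hierarchical summaries. The difference is when the hierarchy is built: TreeIndex constructs it at index time, summarizing groups of child nodes into parent nodes, whereas SummaryIndex keeps a flat list and only aggregates at query time. The "tree" aspect means that instead of a flat list of embeddings or chunks, you have a branching structure that a query can traverse from the root down to the most relevant leaves.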
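A hierarchical structure of this kind, and a root-to-leaf traversal over it, can be sketched as follows. The assumptions are significant: parent "summaries" are just the concatenated child texts, and the child is chosen by word overlap with the query, whereas TreeIndex uses an LLM for both steps. `Node`, `build_tree`, and `select_leaf` are hypothetical names for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)


def build_tree(chunks: list[str], num_children: int = 2) -> Node:
    """Build the tree bottom-up: leaves are chunks, parents aggregate their children."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if num_children < 2:
        raise ValueError("num_children must be at least 2")
    level = [Node(chunk) for chunk in chunks]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), num_children):
            group = level[i:i + num_children]
            # Stand-in for an LLM summary: concatenate the children.
            parents.append(Node(" ".join(n.text for n in group), group))
        level = parents
    return level[0]


def words(text: str) -> set[str]:
    return {w.strip(".,?!").lower() for w in text.split()}


def select_leaf(root: Node, query: str) -> str:
    """Walk from the root to a leaf, picking the best-matching child at each level."""
    q = words(query)
    node = root
    while node.children:
        node = max(node.children, key=lambda child: len(q & words(child.text)))
    return node.text


root = build_tree([
    "Apples are red and round.",
    "Bananas are yellow and curved.",
    "Pears are green and bell-shaped.",
    "Kiwis are fuzzy and brown.",
])
print(select_leaf(root, "What colour are kiwis?"))
```

The traversal inspects only one branch per level, so the number of comparisons grows with the depth of the tree rather than with the number of leaves.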
A subtle point about VectorStoreIndex is how it handles the "last mile" of retrieval. The top k results you get back are not always the raw output of the vector search: if you configure them, metadata filters narrow the candidate set, and node postprocessors (such as a similarity cutoff or a re-ranker) drop or reorder results after the initial search. These steps are crucial for relevance, but they also mean that end-to-end query latency is not a direct measure of the underlying vector store’s performance.
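A rough sketch of such a post-processing stage is shown below. It assumes the vector search has already returned scored candidates; `ScoredChunk` and `postprocess` are hypothetical names for this illustration, not LlamaIndex classes, and in practice a vector database may apply metadata filters during the search itself rather than afterwards.

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    text: str
    score: float
    metadata: dict


def postprocess(
    candidates: list[ScoredChunk],
    required_metadata: dict,
    similarity_cutoff: float,
) -> list[ScoredChunk]:
    """Keep candidates that match the metadata and clear the cutoff, best first."""
    kept = [
        c
        for c in candidates
        if all(c.metadata.get(k) == v for k, v in required_metadata.items())
        and c.score >= similarity_cutoff
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)


candidates = [
    ScoredChunk("oranges", 0.78, {"source": "doc1.txt"}),
    ScoredChunk("apples", 0.91, {"source": "doc1.txt"}),
    ScoredChunk("bananas", 0.42, {"source": "doc2.txt"}),
    ScoredChunk("apple pie", 0.30, {"source": "doc1.txt"}),
]

final = postprocess(candidates, {"source": "doc1.txt"}, similarity_cutoff=0.5)
print([c.text for c in final])
```

Four candidates come out of the search but only two reach the LLM, which is why timing the whole query tells you little about the vector store on its own.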
If your VectorStoreIndex queries are slow and you’re not using a dedicated vector database, keep in mind that the default in-memory SimpleVectorStore has no approximate-nearest-neighbor index: it compares the query against every stored embedding, so query time grows linearly with the number of chunks. For large datasets, this becomes a bottleneck.
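The cost of a linear scan is easy to see in a small pure-Python sketch. This is not SimpleVectorStore's code, just an illustration of brute-force search; `brute_force_top_k` and `random_unit_vector` are hypothetical helpers, and the vectors are random rather than real embeddings.

```python
import heapq
import math
import random


def random_unit_vector(dim: int, rng: random.Random) -> list[float]:
    """Draw a random direction and scale it to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def brute_force_top_k(query: list[float], vectors: list[list[float]], k: int):
    """Score the query against every vector; return (top-k ids, comparison count)."""
    comparisons = 0
    scored = []
    for idx, vec in enumerate(vectors):
        # For unit vectors, the dot product equals cosine similarity.
        scored.append((sum(q * v for q, v in zip(query, vec)), idx))
        comparisons += 1
    top = heapq.nlargest(k, scored)
    return [idx for _, idx in top], comparisons


rng = random.Random(0)
vectors = [random_unit_vector(16, rng) for _ in range(1000)]

# Query with a stored vector: it should come back as its own nearest neighbor.
ids, comparisons = brute_force_top_k(vectors[7], vectors, k=3)
print(ids[0], comparisons)
```

Every query touches all 1000 vectors, and doubling the collection doubles the work. Dedicated vector databases avoid this with approximate-nearest-neighbor indexes that examine only a fraction of the stored vectors per query.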
You’ll likely next explore how to integrate these indices with LLMs for generative tasks and how to optimize retrieval performance by tuning chunking strategies and choosing appropriate vector stores.