The most surprising thing about integrating custom embedding models with LlamaIndex is that you’re not just swapping out one model for another; you’re fundamentally changing how your documents are represented and compared.

Let’s see this in action. Imagine we have some text documents and we want to embed them with a small, locally run model such as all-MiniLM-L6-v2 instead of LlamaIndex’s default, which is a hosted OpenAI embedding model unless you configure otherwise.

First, we need to install the necessary libraries:

pip install llama-index llama-index-embeddings-huggingface transformers torch sentence-transformers

Now, we can load our documents and set up a HuggingFaceEmbedding instance pointing to our chosen model.

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader

# Load documents
documents = SimpleDirectoryReader("./data").load_data() # Assuming you have a 'data' directory with your text files

# Define the custom embedding model
embed_model = HuggingFaceEmbedding(model_name="all-MiniLM-L6-v2")

# Configure LlamaIndex to use this embedding model
Settings.embed_model = embed_model

# Build the index
index = VectorStoreIndex.from_documents(documents)

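# Now, querying this index will use 'all-MiniLM-L6-v2' for all embeddings.
# Answer synthesis still goes through Settings.llm (OpenAI by default, which needs OPENAI_API_KEY).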
query_engine = index.as_query_engine()
response = query_engine.query("What is the main topic of the documents?")
print(response)

Here, Settings.embed_model = embed_model is the key. It tells LlamaIndex to use our HuggingFaceEmbedding instance, which is configured with all-MiniLM-L6-v2, for all embedding operations: when indexing documents and when creating embeddings for user queries. This means the semantic similarity calculations for retrieval will be based on the vector space generated by this specific model.
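
To make the retrieval step concrete, here is a minimal pure-Python sketch of cosine-similarity ranking. The three-dimensional vectors and the rank_by_similarity helper are hypothetical stand-ins: all-MiniLM-L6-v2 produces 384-dimensional vectors, and LlamaIndex’s vector store performs this comparison for you. The dimension check hints at why queries and documents have to be embedded by the same model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        # Vectors from different embedding models often differ in size, and
        # are not comparable even when the sizes happen to match.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document ids ordered from most to least similar to the query."""
    scores = {
        doc_id: cosine_similarity(query_vec, vec)
        for doc_id, vec in doc_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy vectors for illustration only; real embeddings come from the model.
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query_vec, doc_vecs)
print(ranking)
```

Swapping the embedding model changes every vector in doc_vecs and the query vector along with them, which is why an index built with one model should be rebuilt before it is queried with another.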

The problem this solves is multifaceted. Default embedding models are often general-purpose and can be computationally expensive or overkill for specific tasks. By using a custom model, you can:

  • Reduce latency: Smaller models are faster to run.
  • Lower costs: If you’re using API-based models, smaller models are cheaper. If self-hosting, less compute is needed.
  • Improve retrieval quality on niche tasks: A model fine-tuned for a specific domain might outperform a general-purpose model.
  • Control data privacy: Using local HuggingFace models means your text is not sent to an external API for embedding. The LLM used for answer synthesis is configured separately.
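
The latency claim is easy to check for your own data. Below is a small timing sketch, assuming only that the model exposes some callable that embeds a batch of texts. The time_embedding helper and the fake_embed_batch stand-in are hypothetical names used for illustration; the stand-in lets the snippet run without downloading anything.

```python
import time


def time_embedding(embed_batch, texts, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        embed_batch(texts)
        best = min(best, time.perf_counter() - start)
    return best


def fake_embed_batch(texts):
    """Hypothetical stand-in for a real model: one tiny vector per text."""
    return [[float(len(text)), 0.0] for text in texts]


sample_texts = ["first document", "second document", "third document"]
elapsed = time_embedding(fake_embed_batch, sample_texts)
print(f"best of 3 runs: {elapsed:.6f} s")
```

With a real model you would pass embed_model.get_text_embedding_batch in place of fake_embed_batch and run the same texts through each candidate model. Discard the first run if it includes model loading.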

Internally, LlamaIndex maintains a Settings object that acts as a global configuration for various components, including the embed_model. When you build an index with VectorStoreIndex.from_documents() and later query it through index.as_query_engine(), LlamaIndex resolves the embedding model from Settings.embed_model unless one is passed explicitly. If it’s set, it uses that instance; otherwise, it falls back to its default. In recent releases, the HuggingFaceEmbedding class is a wrapper around the sentence-transformers library, abstracting away the model loading and inference details. It takes your text, passes it to the specified HuggingFace model, and returns one dense vector per input text.
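
The lookup order described above can be sketched with a toy settings object. This is a simplified illustration of the pattern, not LlamaIndex’s actual implementation: ToySettings, build_index, and the two model functions are hypothetical names, and the "models" just return a label and the text length.

```python
class ToySettings:
    """Simplified stand-in for a global settings object."""

    def __init__(self, default_factory):
        self._embed_model = None
        self._default_factory = default_factory

    @property
    def embed_model(self):
        # Create the default lazily, only if nothing was assigned.
        if self._embed_model is None:
            self._embed_model = self._default_factory()
        return self._embed_model

    @embed_model.setter
    def embed_model(self, model):
        self._embed_model = model


def build_index(texts, settings, embed_model=None):
    """Embed texts with an explicit model if given, else the global one."""
    model = embed_model if embed_model is not None else settings.embed_model
    return [(text, model(text)) for text in texts]


def default_model(text):
    return ("default", len(text))


def custom_model(text):
    return ("custom", len(text))


settings = ToySettings(default_factory=lambda: default_model)

before = build_index(["hello"], settings)  # falls back to the default
settings.embed_model = custom_model
after = build_index(["hello"], settings)   # uses the assigned model
explicit = build_index(["hello"], settings, embed_model=default_model)

print(before, after, explicit)
```

The explicit argument mirrors the embed_model keyword that VectorStoreIndex.from_documents accepts, which is useful when one index should not follow the global setting.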

The main lever you control is the model_name parameter in HuggingFaceEmbedding, which can point to any compatible sentence-embedding model on the Hugging Face Hub. The wrapper also exposes options such as device, embed_batch_size, max_length, and cache_folder. You can also use LlamaCPPEmbedding or OpenAIEmbedding if you prefer different execution environments or providers.
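
As a sketch of how these levers fit together, the snippet below collects constructor arguments in a plain dictionary and only imports LlamaIndex when the model is actually built. The helper names and the ./model_cache path are hypothetical; model_name, device, embed_batch_size, and cache_folder are HuggingFaceEmbedding constructor arguments, though it is worth checking them against the version you have installed. The full Hub id is used here for the model.

```python
def huggingface_embedding_kwargs(
    model_name, device=None, embed_batch_size=32, cache_folder=None
):
    """Collect constructor arguments, dropping optional ones left unset."""
    kwargs = {"model_name": model_name, "embed_batch_size": embed_batch_size}
    if device is not None:
        kwargs["device"] = device
    if cache_folder is not None:
        kwargs["cache_folder"] = cache_folder
    return kwargs


def build_embed_model(**kwargs):
    """Instantiate the model; the import is deferred until it is needed."""
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    return HuggingFaceEmbedding(**kwargs)


kwargs = huggingface_embedding_kwargs(
    "sentence-transformers/all-MiniLM-L6-v2",
    device="cpu",
    cache_folder="./model_cache",  # hypothetical path
)
print(kwargs)

# Uncomment to build the model (downloads the weights on first use):
# embed_model = build_embed_model(**kwargs)
```

Keeping the arguments in one place makes it easier to build the same model for indexing and for querying, which matters because both sides must share a vector space.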

When you specify model_name="all-MiniLM-L6-v2", the model weights are downloaded from the Hugging Face Hub on first use and cached on local disk. Subsequent instantiations of HuggingFaceEmbedding with the same model_name reuse the cached files instead of downloading them again, which speeds up initialization. The weights are still loaded into memory on every instantiation, so if you need the model in several places, create one instance and share it rather than constructing new ones.

The next concept you’ll likely encounter is how to handle different embedding models for different parts of your data, perhaps using a hybrid approach where some documents are embedded with one model and others with another, or how to fine-tune your chosen embedding model for even better domain-specific performance.
