Build Semantic Search with Hugging Face Embedding Models (2026)

The most surprising thing about semantic search is that it doesn’t actually "understand" your query; it just finds text whose vector representation lies close to the query’s in a high-dimensional space.

Let’s see it in action. Imagine you have a collection of documents and you want to find the ones most relevant to "how to bake a cake," even if they don’t contain those exact words.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained sentence transformer model
# 'all-MiniLM-L6-v2' is a good balance of speed and performance
model = SentenceTransformer('all-MiniLM-L6-v2')

# Your documents
documents = [
    "Baking a delicious chocolate cake from scratch.",
    "Tips for making fluffy pancakes for breakfast.",
    "The science behind yeast fermentation in bread making.",
    "A guide to decorating birthday cakes with frosting.",
    "How to properly store fresh produce to avoid spoilage.",
    "Easy recipe for a simple vanilla sponge cake."
]

# Your query
query = "How to make a cake?"

# Encode documents and query into dense vectors (embeddings)
document_embeddings = model.encode(documents)
query_embedding = model.encode([query])[0]  # Encode as a list, then take the first element

# Calculate cosine similarity between the query and each document
similarities = cosine_similarity([query_embedding], document_embeddings)[0]

# Get the indices of the top N most similar documents
top_n = 3
top_indices = similarities.argsort()[-top_n:][::-1]

print(f"Top {top_n} most relevant documents for '{query}':")
for i in top_indices:
    print(f"- {documents[i]} (Similarity: {similarities[i]:.4f})")
```

When you run this, the cake-related documents should rank above less relevant ones such as the produce-storage tip or the pancake recipe, even where their wording differs from the query (exact scores depend on the model version). This works because the SentenceTransformer model, trained on very large collections of sentence pairs, has learned to map semantically similar sentences to nearby vectors in its embedding space (384 dimensions for all-MiniLM-L6-v2). Cosine similarity then measures the cosine of the angle between two vectors: a score near 1 means they point in almost the same direction, indicating strong semantic overlap, while a score near 0 means they are largely unrelated.
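To make "the cosine of the angle" concrete, here is a minimal pure-Python sketch of the formula (u · v) / (|u| |v|). The vectors are made-up 3-dimensional toys, and `cosine` is a hypothetical helper written for illustration; it should agree with scikit-learn’s `cosine_similarity` up to floating-point rounding.

```python
import math

def cosine(u, v):
    """Hypothetical helper: cosine of the angle between u and v, (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-D vectors, made up for illustration (real embeddings have hundreds of dimensions)
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]    # same direction as a, twice as long
c = [0.0, 0.0, 3.0]    # perpendicular to a
d = [-1.0, -2.0, 0.0]  # opposite direction to a

for name, v in [("b", b), ("c", c), ("d", d)]:
    sim = cosine(a, v)
    # Clamp before acos to guard against rounding slightly outside [-1, 1]
    angle = math.degrees(math.acos(max(-1.0, min(1.0, sim))))
    print(f"cos(a, {name}) = {sim:+.3f}  angle = {angle:.0f} degrees")
```

Here `b` scores about 1 (0°) despite being twice as long as `a`, `c` scores 0 (90°), and `d` scores about -1 (180°). Only direction matters, not length.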

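The core problem semantic search solves is the limitation of keyword-based search. Keyword-based engines score documents by the words they share with the query; classic schemes such as TF-IDF and BM25 add term weighting and often stemming, but they still match words. If a relevant document phrases the idea differently, it can be missed, while an irrelevant one that happens to share common words can rank well. Semantic search narrows this gap by comparing embeddings, which encode meaning learned from data, rather than surface words. It doesn’t grasp intent the way a person does, but paraphrases tend to land close together in the embedding space.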
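The failure mode is easy to reproduce with a deliberately crude keyword scorer. This is a hypothetical toy (`tokens` and `keyword_overlap` are illustrative helpers that just count shared words, with no stemming, stop-word removal, or term weighting), so it is far simpler than BM25, but it shows what exact-word matching does with the query from the example above.

```python
import re

def tokens(text):
    """Hypothetical helper: lowercase word tokens, punctuation stripped, no stemming."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_overlap(query, doc):
    """Hypothetical crude keyword score: number of distinct words shared."""
    return len(tokens(query) & tokens(doc))

query = "How to make a cake?"
docs = [
    "Baking a delicious chocolate cake from scratch.",
    "Tips for making fluffy pancakes for breakfast.",
    "How to properly store fresh produce to avoid spoilage.",
]
for doc in docs:
    print(keyword_overlap(query, doc), doc)
```

The produce-storage sentence ties with the chocolate-cake recipe (two shared words each: "how"/"to" versus "a"/"cake"), and "making" fails to match "make" at all. An embedding model has no such blind spot for inflections and is far less swayed by filler words.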

Internally, models like all-MiniLM-L6-v2 are transformer encoders. They turn each input token into a vector, then pool those token vectors (all-MiniLM-L6-v2 uses mean pooling) into a single fixed-size, dense embedding that represents the whole text, 384 numbers in this model’s case. The key work happens during training: the model is shown pairs of related sentences and learns to pull their embeddings closer together while pushing unrelated sentences further apart in the embedding space.
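Both ideas in that paragraph, pooling token vectors into one fixed-size embedding and training on sentence pairs, can be sketched in plain Python. This is a toy under stated assumptions: `mean_pool`, `l2_normalize`, and `in_batch_contrastive_loss` are hypothetical helpers rather than sentence-transformers APIs, the vectors are made up, and the loss is one common contrastive formulation (in-batch negatives with softmax cross-entropy), not necessarily the exact recipe behind any particular checkpoint.

```python
import math

# Hypothetical helpers for illustration; these are not sentence-transformers APIs.

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored),
    producing one fixed-size vector however long the sentence is."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def l2_normalize(v):
    """Scale v to unit length so dot products equal cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Softmax cross-entropy where anchor i should match positive i; the other
    positives in the batch act as negatives. `scale` is an illustrative temperature."""
    total = 0.0
    for i, a in enumerate(anchors):
        scores = [scale * sum(x * y for x, y in zip(a, p)) for p in positives]
        m = max(scores)  # subtract the max inside log-sum-exp for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[i]  # equals -log(softmax(scores)[i])
    return total / len(anchors)

# Three toy 2-D token vectors; the last one is padding (mask 0)
pooled = mean_pool([[1.0, 0.0], [3.0, 2.0], [0.0, 0.0]], [1, 1, 0])
print(pooled)  # [2.0, 1.0]

# The loss is near zero when each anchor lines up with its own positive,
# and large when the pairs are swapped.
anchors = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
matched = in_batch_contrastive_loss(anchors, anchors)
swapped = in_batch_contrastive_loss(anchors, list(reversed(anchors)))
print(f"matched loss {matched:.4f}, swapped loss {swapped:.4f}")
```

Minimizing a loss of this shape is what "pull similar pairs together, push the rest apart" means in practice: the only way to drive it down is to make each anchor’s score with its own positive dominate its scores with everything else in the batch.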

The main levers you control are the choice of embedding model and the similarity metric. Different models (e.g., paraphrase-MiniLM-L3-v2, multi-qa-mpnet-base-dot-v1) are trained on different datasets and for different tasks (question answering, paraphrase detection, general similarity), so their quality varies with your search needs; it is worth evaluating a few on real queries from your domain. The similarity metric (cosine similarity, dot product, Euclidean distance) determines how "closeness" is measured in the embedding space. Cosine similarity is the usual default because it ignores vector magnitude and compares only direction. When embeddings are normalized to unit length, cosine similarity and dot product produce identical scores and Euclidean distance produces the same ranking, so the choice matters mainly for unnormalized embeddings. Models trained for dot-product scoring, such as multi-qa-mpnet-base-dot-v1, are meant to be used with the dot product.
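A small sketch shows when the metric choice changes the result. The 2-D vectors and the document names `aligned_short` and `offset_long` are hypothetical, picked so that magnitude and direction pull in different directions; `dot`, `cosine`, `euclidean`, and `unit` are illustrative helpers.

```python
import math

# Hypothetical helpers and toy data for illustration.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def unit(u):
    n = math.sqrt(dot(u, u))
    return [x / n for x in u]

query = [1.0, 0.0]
docs = {
    "aligned_short": [0.5, 0.0],  # same direction as the query, small magnitude
    "offset_long": [3.0, 3.0],    # 45 degrees away, large magnitude
}

best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
best_by_euclidean = min(docs, key=lambda name: euclidean(query, docs[name]))
print(best_by_dot, best_by_cosine, best_by_euclidean)  # offset_long aligned_short aligned_short

# After normalizing to unit length, dot product equals cosine similarity,
# and squared Euclidean distance equals 2 - 2 * cosine, so all three agree.
q = unit(query)
unit_docs = {name: unit(v) for name, v in docs.items()}
best_by_unit_dot = max(unit_docs, key=lambda name: dot(q, unit_docs[name]))
best_by_unit_euclidean = min(unit_docs, key=lambda name: euclidean(q, unit_docs[name]))
print(best_by_unit_dot, best_by_unit_euclidean)  # aligned_short aligned_short
```

On raw vectors the dot product rewards the long, off-direction vector, while cosine and Euclidean distance prefer the short, aligned one. After normalization the disagreement disappears, which is why normalizing embeddings makes the metric choice mostly a matter of speed and convenience.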

With very large collections, comparing the query against every document embedding becomes too slow for real-time search: a brute-force scan costs on the order of N × d operations per query for N documents of dimension d. This is where Approximate Nearest Neighbor (ANN) search libraries such as FAISS, Annoy, or ScaNN come in. They build index structures (clustered inverted lists, trees, graphs, or compressed codes, depending on the library and index type) that find close-enough neighbors far faster than a linear scan, trading a small, tunable amount of recall for large speedups.
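To see where the speedup comes from, here is a toy version of the inverted-file (IVF) idea, one of the index families FAISS provides. It is a sketch under stated assumptions, not FAISS’s API: the function names are hypothetical, `nlist` and `nprobe` are borrowed FAISS terminology, centroids are random samples rather than k-means output, and the data is random. Squared Euclidean distance is used, which ranks unit-length embeddings the same way cosine similarity does.

```python
import random

# Hypothetical toy index for illustration; not the FAISS API.

def sq_dist(u, v):
    """Squared Euclidean distance (same ranking as cosine for unit-length vectors)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def brute_force(vectors, query, k):
    """Exact k nearest neighbours: scan every stored vector."""
    return sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))[:k]

def build_ivf(vectors, nlist, seed=0):
    """Pick nlist centroids (random samples here; real IVF indexes train them with
    k-means) and file each vector under its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, nlist)
    lists = [[] for _ in range(nlist)]
    for i, v in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return centroids, lists

def search_ivf(index, vectors, query, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    centroids, lists = index
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(vectors[i], query))[:k]

rng = random.Random(42)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]
query = [rng.gauss(0.0, 1.0) for _ in range(8)]
index = build_ivf(vectors, nlist=20)

exact = brute_force(vectors, query, k=5)
approx = search_ivf(index, vectors, query, k=5, nprobe=3)  # scans only 3 of the 20 lists
full = search_ivf(index, vectors, query, k=5, nprobe=20)   # probes every list: exact again

recall = len(set(approx) & set(exact)) / len(exact)
print(f"recall@5 with nprobe=3: {recall:.2f}")
```

Raising `nprobe` scans more lists, which raises both recall and cost; with `nprobe` equal to `nlist` the search degenerates into an exact scan, which is why `full` matches the brute-force result.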

The next concept you’ll want to explore is how to fine-tune these embedding models on your own domain-specific data to further improve relevance and accuracy.