LLM Embeddings: Build Semantic Search from Scratch (2026)

An LLM embedding is a dense vector representation of text that captures its semantic meaning, allowing for mathematical comparison of text similarity.

Let’s see this in action. Imagine we have a few product descriptions and want to find the most similar one to a query.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Our product descriptions
descriptions = [
    "A comfortable and durable cotton t-shirt, perfect for everyday wear.",
    "High-performance running shoes with advanced cushioning for marathon runners.",
    "A stylish and lightweight laptop bag, designed to protect your electronics.",
    "Soft and breathable linen pants, ideal for warm weather and casual outings."
]

# Generate embeddings for the descriptions
description_embeddings = model.encode(descriptions)

# Our search query
query = "What are good shoes for jogging?"

# Generate embedding for the query
query_embedding = model.encode([query])

# Calculate cosine similarity between the query and each description
similarities = cosine_similarity(query_embedding, description_embeddings)[0]

# Find the index of the most similar description
most_similar_index = similarities.argmax()

print(f"Query: '{query}'")
print(f"Most similar description: '{descriptions[most_similar_index]}'")
print(f"Similarity score: {similarities[most_similar_index]:.4f}")

# Let's try another query
query_2 = "I need a shirt made of natural fabric."
query_2_embedding = model.encode([query_2])
similarities_2 = cosine_similarity(query_2_embedding, description_embeddings)[0]
most_similar_index_2 = similarities_2.argmax()

print(f"\nQuery: '{query_2}'")
print(f"Most similar description: '{descriptions[most_similar_index_2]}'")
print(f"Similarity score: {similarities_2[most_similar_index_2]:.4f}")

This code takes raw text, converts it into numerical vectors (embeddings) with a Sentence Transformer model, and then uses cosine similarity to find which of the original texts is closest in meaning to a new piece of text. For "What are good shoes for jogging?", the top match is "High-performance running shoes with advanced cushioning for marathon runners." For "I need a shirt made of natural fabric.", it is "A comfortable and durable cotton t-shirt, perfect for everyday wear." In both cases the match comes from related meaning ("jogging" and "running", "natural fabric" and "cotton"), not from the query repeating the description's wording.
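To make the comparison step less of a black box, here is a minimal sketch of cosine similarity and a top-k ranking written in plain Python. The helper names `cosine` and `top_k` and the three-dimensional toy vectors are invented for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and in practice you would keep using `cosine_similarity` from scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=2):
    """Return (index, score) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, made up for this example (not produced by any model).
toy_docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 1.0, 0.0]

ranking = top_k(toy_query, toy_docs, k=2)
for index, score in ranking:
    print(f"doc {index}: {score:.4f}")
```

Returning the top k results instead of a single `argmax` is usually what a search interface needs, since the best match is not always the only relevant one.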

The core problem LLM embeddings solve is the challenge of comparing text based on its meaning rather than just keyword matching. Traditional search methods rely on exact word matches, which fail to understand synonyms, context, or paraphrased queries. For example, a keyword search for "jogging shoes" wouldn’t naturally connect with "running sneakers" or "athletic footwear." Embeddings, however, represent words, sentences, or even entire documents as points in a high-dimensional space where proximity signifies semantic similarity. This allows systems to understand that "jogging" and "running" are closely related concepts.
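The keyword-matching failure is easy to reproduce. The sketch below uses a simple word-set overlap (Jaccard similarity) as a stand-in for keyword search. The function name `keyword_overlap` is hypothetical, and production keyword engines use more refined scoring, but they share the same blind spot for paraphrases.

```python
def keyword_overlap(query, document):
    """Jaccard overlap of lowercase word sets: |A & B| / |A | B|."""
    q = set(query.lower().split())
    d = set(document.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0

# Same meaning, no shared words: the keyword score is zero.
score_paraphrase = keyword_overlap("jogging shoes", "running sneakers")

# Shared words: the keyword score is high regardless of meaning.
score_literal = keyword_overlap("jogging shoes", "cheap jogging shoes")

print(score_paraphrase, score_literal)
```

An embedding-based comparison of the first pair would be expected to score well above zero, which is the gap semantic search is meant to close.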

Internally, these embeddings are generated by language models trained on massive text datasets. During pretraining, the model learns from objectives such as predicting masked words, and in doing so picks up sentence structure and the relationships between pieces of text. The model's input embedding layer converts discrete tokens into continuous vectors, the Transformer layers turn those into context-aware token vectors, and a pooling step (often a simple average) collapses them into a single vector for the whole text. The choice of model (e.g., BERT or RoBERTa, both built on the Transformer architecture) and of training objective (e.g., masked language modeling, next sentence prediction, or fine-tuning on pairs of related sentences) influences the quality and characteristics of the resulting embeddings. Different models are fine-tuned for different tasks; some are general-purpose, while others are optimized for specific domains or languages.
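To make the token-to-sentence step concrete, here is a minimal sketch of mean pooling, the averaging strategy used by models such as `all-MiniLM-L6-v2`. The function name `mean_pool` and the two-dimensional toy token vectors are invented for illustration; a real model produces these vectors internally, and `model.encode` handles the pooling for you.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0).

    Assumes at least one token has mask == 1.
    """
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Three token positions; the last one is padding and must not affect the result.
toy_token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

sentence_vector = mean_pool(toy_token_vectors, toy_mask)
print(sentence_vector)  # [2.0, 3.0]
```

The attention mask matters because batches are padded to a common length, and averaging over padding positions would distort the vectors of shorter texts.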

The levers you control are primarily the choice of embedding model and the preprocessing of your text. Different models offer varying trade-offs between retrieval quality, dimensionality, and computational cost. For instance, all-MiniLM-L6-v2 is small and fast but might not capture nuances as well as a larger model like all-mpnet-base-v2. The input text also matters. Classic cleanup steps such as removing stop words, stemming, or lemmatizing can sometimes improve results, but for modern Transformer-based embeddings, minimal preprocessing is often best because the model is designed to handle raw text in context. The dimensionality of the embedding vector (384 for all-MiniLM-L6-v2, 768 for all-mpnet-base-v2) directly affects storage and computation. Higher-dimensional vectors have room to encode more information, but more dimensions do not guarantee better search results, so it is worth evaluating candidate models on your own data.
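The storage side of that trade-off is simple arithmetic. The sketch below assumes embeddings are stored as uncompressed 32-bit floats (4 bytes per value) and ignores any index overhead; the function name `index_size_bytes` and the one-million-document corpus are hypothetical.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw size of a dense float32 matrix: rows * columns * 4 bytes."""
    return num_vectors * dims * bytes_per_value

one_million = 1_000_000
size_384 = index_size_bytes(one_million, 384)  # all-MiniLM-L6-v2 dimensionality
size_768 = index_size_bytes(one_million, 768)  # all-mpnet-base-v2 dimensionality

print(f"384 dims: {size_384 / 1e9:.2f} GB")  # 1.54 GB
print(f"768 dims: {size_768 / 1e9:.2f} GB")  # 3.07 GB
```

Doubling the dimensionality doubles both the memory footprint and the work per similarity computation, which is why smaller models remain attractive for large collections.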

A subtle but critical point is that the "semantic space" created by embeddings is not uniform. Some directions in the vector space can correlate with specific attributes. For example, there might be a direction that consistently tracks "gender" or "sentiment." This property underlies vector arithmetic such as "king - man + woman ≈ queen," a classic result from word embeddings like word2vec. The effect is approximate, and it is less reliable for sentence-level embeddings, but the same idea can be used for analogy tasks, for reducing unwanted biases, or for steering embeddings towards desired characteristics. Understanding these emergent properties of the embedding space is key to advanced applications beyond simple similarity search.
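The analogy arithmetic can be illustrated with hand-built vectors. Everything in the sketch below is a toy: the three axes (roughly "royalty", "gender", and "fruit") and the vector values are invented so that the arithmetic works out exactly, whereas with learned embeddings the result only lands near the expected word.

```python
import math

# Hand-made toy vectors: [royalty, gender, fruit]. Not produced by any model.
toy = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# king - man + woman, computed component by component.
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Search the remaining words, excluding the three used in the query.
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(best)  # queen
```

Excluding the query words from the candidates mirrors how analogy evaluations are usually run, since the nearest vector to the result is otherwise often one of the inputs.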

The next step after building a semantic search system is often optimizing the retrieval process for large datasets, which typically involves using specialized vector databases.