BERT and Sentence Transformers can generate text embeddings, but they don’t "understand" text the way humans do. They learn to map semantically similar sentences to nearby points in a high-dimensional vector space: BERT through pre-training objectives such as predicting masked words, and Sentence Transformers through additional fine-tuning on sentence pairs.

Let’s see this in action. We’ll use the sentence-transformers library, which provides pre-trained models optimized for generating sentence embeddings.

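from sentence_transformers import SentenceTransformer

# Three positive and three negative movie reviews
sentences = [
    "This is a good movie.",
    "I really enjoyed the film.",
    "The acting was superb.",
    "This is a bad movie.",
    "I hated the film.",
    "The plot was terrible."
]

# Load the pre-trained model (downloaded on first use)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode all sentences into a 2-D array: one row per sentence
embeddings = model.encode(sentences)
print(embeddings.shape)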

This will output (6, 384), indicating we have 6 sentences, each represented by a 384-dimensional vector. The magic happens when we calculate the cosine similarity between these embeddings.

from sklearn.metrics.pairwise import cosine_similarity

# Compare sentence 0 against each of the other sentences in a single call
scores = cosine_similarity([embeddings[0]], embeddings[1:])[0]
for sentence, score in zip(sentences[1:], scores):
    print(f"{score:.3f}  {sentence}")

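You’ll see the highest scores for sentences with similar meaning (e.g., "This is a good movie." and "I really enjoyed the film.") and lower scores for less related ones. Don’t expect values near 0 or below for every "opposite" pair, though: "This is a good movie." and "This is a bad movie." share a topic and most of their words, so general-purpose models often score them as fairly similar despite the opposite sentiment. The embeddings behave this way because the model was trained on massive datasets, learning patterns of word co-occurrence and contextual relationships that allow it to place related meanings near each other in the embedding space.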
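To make the score itself less mysterious, here is a minimal sketch of cosine similarity written directly in NumPy. The helper name cosine_sim and the toy 3-dimensional vectors are illustrative assumptions standing in for real 384-dimensional embeddings; they are not part of sentence-transformers or scikit-learn.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (hypothetical, for illustration only)
u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 4.0, 0.0])   # same direction as u, twice the length
w = np.array([0.0, 0.0, 3.0])   # orthogonal to u

sim_uv = cosine_sim(u, v)    # same direction: similarity of 1
sim_uw = cosine_sim(u, w)    # orthogonal: similarity of 0
sim_neg = cosine_sim(u, -u)  # opposite direction: similarity of -1

print(sim_uv, sim_uw, sim_neg)
```

Because the norms are divided out, only the direction of each vector matters, not its length.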

The core problem these models solve is representing the meaning of text in a numerical format that machines can process for tasks like similarity search, clustering, and classification. Traditional methods like TF-IDF or Bag-of-Words lose word order and context, leading to less nuanced representations. BERT, and by extension Sentence Transformers, addresses this by using a Transformer architecture with self-attention mechanisms, allowing it to weigh the importance of different words in a sentence relative to each other.
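The word-order limitation is easy to see with a toy example. The sketch below uses a hypothetical bag_of_words helper built on the standard library rather than a real TF-IDF implementation, and the two example sentences are invented for illustration.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercase, whitespace-separated tokens, ignoring their order."""
    return Counter(sentence.lower().split())

# Same words, opposite meaning
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The two representations are indistinguishable
same_counts = (a == b)
print(same_counts)
```

A contextual model sees the tokens in order, so it has the information needed to give these two sentences different embeddings.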

The sentence-transformers library simplifies the process by providing a convenient API and pre-trained models. Internally, models like 'all-MiniLM-L6-v2' are BERT-style Transformer encoders (here MiniLM, a compact model distilled from a larger one; other checkpoints build on BERT or RoBERTa) fine-tuned on tasks that optimize for sentence-level similarity. This fine-tuning often involves training on datasets like NLI (Natural Language Inference) or STS (Semantic Textual Similarity) benchmarks, or on large collections of sentence pairs, where the model learns to produce embeddings that reflect the relationship between the two sentences in each pair.

When you call model.encode(), the library handles the tokenization, passes the tokens through the Transformer layers, and then applies a pooling strategy (like mean pooling) to aggregate the token embeddings into a single sentence embedding. The choice of pooling strategy affects the resulting embedding. Mean pooling averages the embeddings of all non-padding tokens, while CLS pooling uses the embedding of the special [CLS] token that BERT prepends to each sequence. For sentence similarity tasks, mean pooling is a common default (all-MiniLM-L6-v2 uses it), since it draws on every token rather than relying on a single token’s representation.
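The two pooling strategies can be sketched in a few lines of NumPy. This is an illustration on made-up token embeddings, not the library’s internal code; the function names mean_pool and cls_pool, the array shapes, and the values are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real tokens, skipping padding positions."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # avoid division by zero
    return summed / counts

def cls_pool(token_embeddings):
    """Use the embedding at position 0, where BERT-style models put [CLS]."""
    return token_embeddings[:, 0, :]

# One sentence, 4 positions (the last is padding), hidden size 2
tokens = np.array([[[1.0, 0.0],
                    [3.0, 2.0],
                    [5.0, 4.0],
                    [9.0, 9.0]]])
mask = np.array([[1, 1, 1, 0]])

mean_vec = mean_pool(tokens, mask)  # averages the first three rows only
cls_vec = cls_pool(tokens)          # takes the first row

print(mean_vec, cls_vec)
```

The attention mask matters: without it, the padding row would be averaged in and would distort the sentence embedding.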

The dimensionality of the embeddings (e.g., 384 for all-MiniLM-L6-v2) is fixed by the model architecture, typically the hidden size of the underlying Transformer, so it is chosen when the model is designed and trained, not when you call it. Higher dimensions can capture more nuance but also increase computational cost and memory usage. Models like all-MiniLM-L6-v2 are designed to strike a balance, offering good performance with a relatively compact embedding size.
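A quick back-of-the-envelope calculation shows how dimensionality drives memory usage. The sketch assumes embeddings are stored as 32-bit floats (4 bytes per value); the helper embedding_memory_mb and the corpus size of one million sentences are hypothetical, and index overhead is ignored.

```python
def embedding_memory_mb(num_vectors, dim, bytes_per_value=4):
    """Raw storage for num_vectors embeddings of size dim, in megabytes (10^6 bytes)."""
    return num_vectors * dim * bytes_per_value / 1e6

# One million sentences at two common embedding sizes
mb_384 = embedding_memory_mb(1_000_000, 384)
mb_768 = embedding_memory_mb(1_000_000, 768)

print(mb_384, mb_768)
```

Under these assumptions, doubling the dimension doubles the storage, and it has a similar effect on the cost of each similarity computation.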

The key to understanding why these embeddings work lies in the training objective. BERT’s original pre-training tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), force it to learn contextual word representations. Sentence Transformers then take this kind of foundation and fine-tune it further. For example, a common fine-tuning approach is a Siamese network: the same encoder, with shared weights, processes two sentences, and the two output embeddings are compared using a similarity metric. The model is trained to minimize the distance between embeddings of similar sentences and maximize the distance between dissimilar ones. This process effectively distills the contextual representations the encoder has learned into sentence-level vectors optimized for semantic comparison.
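One textbook way to express "pull similar pairs together, push dissimilar pairs apart" is a margin-based contrastive loss. The sketch below is only one formulation, shown on hand-picked 2-dimensional vectors; the function name, the margin value, and the example points are assumptions, and real models may be trained with other objectives.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """Margin-based contrastive loss for a single pair of embeddings.

    label == 1: similar pair, penalize any distance between them.
    label == 0: dissimilar pair, penalize only if closer than the margin.
    """
    d = float(np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b)))
    if label == 1:
        return d ** 2
    return max(0.0, margin - d) ** 2

anchor = np.array([0.0, 0.0])
near = np.array([0.3, 0.4])   # Euclidean distance 0.5 from the anchor
far = np.array([3.0, 4.0])    # Euclidean distance 5.0 from the anchor

loss_sim_near = contrastive_loss(anchor, near, label=1)  # small penalty
loss_sim_far = contrastive_loss(anchor, far, label=1)    # large penalty
loss_dis_near = contrastive_loss(anchor, near, label=0)  # inside the margin: penalized
loss_dis_far = contrastive_loss(anchor, far, label=0)    # beyond the margin: no penalty

print(loss_sim_near, loss_sim_far, loss_dis_near, loss_dis_far)
```

In a Siamese setup, both embeddings come from the same encoder, so the gradient of this loss updates a single set of weights.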

When you use a pre-trained model like 'all-MiniLM-L6-v2', you’re leveraging a model that has already undergone this extensive training and fine-tuning. The specific architecture and training data of that model determine its strengths and weaknesses for different downstream tasks.

The next logical step is to explore how to fine-tune these models on your own domain-specific data to improve performance for your unique use cases.
