Neo4j’s vector index lets you run semantic search over your graph data: it finds nodes that are conceptually similar, even when they share no direct relationships in the graph.
Let’s see it in action. Imagine we have a graph of movies, actors, and directors. We’ve embedded the plot summaries of these movies into high-dimensional vectors.
// Add a movie node with a plot summary and its vector embedding
CREATE (m:Movie {title: "The Matrix", plot: "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers."})
WITH m
CALL db.create.setNodeVectorProperty(m, "plot_embedding", [0.1, 0.2, 0.3, ..., 0.9]) // Placeholder for actual vector; procedure available in recent Neo4j 5.x releases
RETURN m
Now, we want to find movies semantically similar to "The Matrix" based on their plot summaries, not just movies that share actors or directors.
// Find movies semantically similar to "The Matrix"
MATCH (m:Movie {title: "The Matrix"})
WITH m AS query_movie
CALL {
WITH query_movie
MATCH (other:Movie)
// Skip the query movie itself and any movie without an embedding (null scores would sort first in DESC order)
WHERE other <> query_movie AND other.plot_embedding IS NOT NULL
WITH vector.similarity.cosine(query_movie.plot_embedding, other.plot_embedding) AS score, other
ORDER BY score DESC
LIMIT 5
RETURN other.title AS similar_movie, score
}
RETURN query_movie.title AS query, collect({movie: similar_movie, score: score}) AS similar_movies
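This query uses the vector.similarity.cosine function to calculate the cosine similarity between the vector embedding of "The Matrix" and the embeddings of all other movies, then keeps the five highest-scoring results. Note that it compares against every Movie node exhaustively and does not touch the vector index; that is fine for small graphs, but on larger ones you would query the index through db.index.vector.queryNodes instead.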
The problem this solves is the limitation of traditional graph queries. You can easily find movies by the same director or starring the same actors, but discovering movies with similar themes or narratives requires a different approach. Traditional graph traversals can’t capture the nuanced meaning within text data. Vector indexes bridge this gap by allowing you to represent and query unstructured text (like plot summaries, descriptions, or reviews) within the graph itself.
Internally, Neo4j’s vector index uses an Approximate Nearest Neighbor (ANN) algorithm, Hierarchical Navigable Small World (HNSW) graphs as implemented in Apache Lucene, to efficiently search through high-dimensional vector spaces. When you create a vector index, Neo4j stores your embeddings in a way that allows for rapid, albeit approximate, retrieval of the most similar vectors. The db.create.setNodeVectorProperty procedure stores an embedding on a node property; the index itself is created separately with CREATE VECTOR INDEX and queried with db.index.vector.queryNodes.
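As a rough sketch of what creating and querying such an index can look like, the Cypher below assumes a recent Neo4j 5.x release. The index name moviePlots and the dimension 1536 are illustrative assumptions, not values from this article; the dimension must match whatever embedding model produced your vectors.

```cypher
// Assumption: index name 'moviePlots' and dimension 1536 are illustrative only.
CREATE VECTOR INDEX moviePlots IF NOT EXISTS
FOR (m:Movie) ON m.plot_embedding
OPTIONS {indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: 'cosine'
}};

// Ask the index for the 6 approximate nearest neighbours of The Matrix's
// embedding, then drop the query movie itself, leaving up to 5 results.
MATCH (m:Movie {title: "The Matrix"})
CALL db.index.vector.queryNodes('moviePlots', 6, m.plot_embedding)
YIELD node AS other, score
WHERE other <> m
RETURN other.title AS similar_movie, score
ORDER BY score DESC;
```

Requesting one more neighbour than needed is a simple way to compensate for the query movie usually appearing as its own closest match.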
The core idea is that vectors close to each other in the high-dimensional space represent semantically similar concepts. Cosine similarity is a common metric because it measures the angle between two vectors, effectively capturing their directional similarity regardless of their magnitude. A raw cosine of 1 means identical direction (maximum similarity), 0 means orthogonal (no similarity), and -1 means opposite direction (maximum dissimilarity). Neo4j’s documentation describes its cosine score as normalized to the range [0, 1], so those three cases are reported as 1, 0.5, and 0 respectively.
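For reference, the standard definition of cosine similarity, together with the [0, 1] normalization that Neo4j's documentation describes for its cosine score, can be written as follows. The example vectors are made up purely for illustration.

```latex
\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert},
\qquad
\text{score} = \frac{1 + \cos\theta}{2}

% Worked example with a = (1, 2, 2) and b = (2, 1, 2):
% a . b = 1*2 + 2*1 + 2*2 = 8,  ||a|| = ||b|| = sqrt(9) = 3
\cos\theta = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889,
\qquad
\text{score} = \frac{1 + 8/9}{2} = \frac{17}{18} \approx 0.944
```

Scaling either vector by a positive constant leaves both values unchanged, which is the sense in which the metric ignores magnitude.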
Note that Cypher has no special vector.embedding placeholder: the query vector must always be supplied explicitly. In the example above it is read from the plot_embedding property of the "The Matrix" node. In practice, you’d typically pass a pre-computed vector as a query parameter, or generate it on the fly from a search term using an embedding model.
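A minimal sketch of the parameter-based variant is shown here. It assumes a vector index named moviePlots already exists on the plot_embedding property of Movie nodes, and that the client application has embedded the search text and passes the result as $queryVector; both names are hypothetical.

```cypher
// Assumptions: 'moviePlots' is an existing vector index on :Movie(plot_embedding),
// and $queryVector is a list of floats with the same dimension as the index.
CALL db.index.vector.queryNodes('moviePlots', 5, $queryVector)
YIELD node AS movie, score
RETURN movie.title AS title, score
ORDER BY score DESC;
```

The search text must be embedded with the same model as the stored plot summaries, otherwise the scores are not meaningful.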
One common misconception is that vector similarity replaces graph relationships. It doesn’t. It augments them. You can combine vector search with graph traversals. For example, you could find actors who have starred in movies semantically similar to "The Matrix," even if those actors haven’t directly worked together on those specific similar movies. This allows for discovery of connections that are not explicitly modeled as direct graph edges.
The real power comes when you start combining these vector search results with your existing graph structure. For instance, after finding semantically similar movies, you might then traverse to the actors and directors involved in those similar movies to find new collaborators or talent pools.
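One way this combination might look in Cypher is sketched below. The Actor label, the ACTED_IN relationship type, the name property, and the moviePlots index name are all assumptions about the data model, since the article does not spell out the schema.

```cypher
// Assumptions: (:Actor)-[:ACTED_IN]->(:Movie) and a vector index 'moviePlots'
// on :Movie(plot_embedding); adjust the names to your own schema.
MATCH (m:Movie {title: "The Matrix"})
CALL db.index.vector.queryNodes('moviePlots', 6, m.plot_embedding)
YIELD node AS similar, score
WHERE similar <> m
// Hop from each semantically similar movie to the actors who appeared in it
MATCH (a:Actor)-[:ACTED_IN]->(similar)
RETURN a.name AS actor,
       collect(similar.title) AS similar_movies,
       max(score) AS best_score
ORDER BY best_score DESC
LIMIT 10;
```

The vector search narrows the candidates by meaning, and the ordinary pattern match then follows explicit relationships from those candidates.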
The next step in exploring this feature is to understand how to tune the ANN index parameters for specific recall and performance trade-offs, and how to integrate external embedding models into Neo4j.