Virtual nodes let you return derived nodes and relationships alongside your existing ones without writing anything new to your database. In Neo4j they are provided by the APOC library.
Here are virtual nodes in action: the query builds a virtual PurchaseHistory node for each User and links it to the Product nodes that user purchased:
MATCH (u:User)-[:PURCHASED]->(p:Product)
WITH u, collect(p) AS productsPurchased
// Build one virtual PurchaseHistory node per user; nothing is written to the database.
// id() is deprecated in Neo4j 5; elementId(u) is the newer alternative.
WITH productsPurchased,
     apoc.create.vNode(['PurchaseHistory'],
                       { userId: id(u), productCount: size(productsPurchased) }) AS history
// Project a virtual relationship from the virtual node to each purchased product
RETURN history,
       productsPurchased,
       [p IN productsPurchased |
         apoc.create.vRelationship(history, 'BOUGHT_BY_USER', { timestamp: toString(date()) }, p)] AS historyRelationships
This query finds every User node together with the Product nodes it has PURCHASED and collects the products per user. It then calls apoc.create.vNode to build one virtual node per user, labeled PurchaseHistory, with the properties userId and productCount. The list comprehension calls apoc.create.vRelationship once per product, producing a virtual BOUGHT_BY_USER relationship from the virtual node to each Product the user purchased. Both functions come from the APOC library rather than core Cypher, so APOC must be installed on the server. Nothing is written to the database: the virtual node and its relationships exist only in the result of this query.
The problem virtual nodes solve is efficiently querying aggregated or derived information without the cost of materializing those aggregates as physical nodes. Imagine a social network where you want to see "friends of friends" or a product catalog where you want to see "products bought by users who also bought X." Building these as physical nodes would create a lot of redundant data and a complex graph. Virtual nodes allow you to define these projections on the fly.
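The "products bought by users who also bought X" case can be sketched as a virtual relationship between two stored Product nodes. This is only an illustration: the sku property, the value 'X-100', and the ALSO_BOUGHT relationship type are assumptions made for the example, not part of any real schema.

```cypher
// Assumption: Product nodes carry a `sku` property; 'X-100' is a placeholder value.
MATCH (x:Product {sku: 'X-100'})<-[:PURCHASED]-(u:User)-[:PURCHASED]->(other:Product)
WHERE other <> x
// Count the distinct users who bought both products
WITH x, other, count(DISTINCT u) AS sharedBuyers
// ALSO_BOUGHT is a hypothetical type; the relationship exists only in this result
RETURN x,
       other,
       sharedBuyers,
       apoc.create.vRelationship(x, 'ALSO_BOUGHT', { sharedBuyers: sharedBuyers }, other) AS alsoBought
ORDER BY sharedBuyers DESC
LIMIT 10
```

The aggregate is computed on each run, so no ALSO_BOUGHT relationships have to be stored or kept in sync with new purchases.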
Internally, nothing changes in the stored graph. APOC builds virtual nodes and relationships as in-memory objects while the query runs and returns them to the client like ordinary nodes and relationships, which is why tools such as Neo4j Browser can draw them. They are never persisted and disappear when the query finishes. APOC gives them negative internal IDs so that they cannot collide with stored nodes. Because they are not part of the stored graph, you cannot MATCH them or traverse them with a pattern; you can only pass them around as values inside the query that created them.
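A quick way to see that virtual nodes never reach storage is to run two separate statements, one after the other. This is a sketch that assumes APOC is installed and that the database holds no stored PurchaseHistory nodes.

```cypher
// Statement 1: returns a virtual node that lives only in this result
RETURN apoc.create.vNode(['PurchaseHistory'], { productCount: 0 }) AS history;

// Statement 2: looks for stored nodes with the same label.
// Under the assumption above, the virtual node from statement 1 is not found.
MATCH (h:PurchaseHistory)
RETURN count(h) AS storedPurchaseHistories;
```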
The property maps you pass to apoc.create.vNode and apoc.create.vRelationship let you attach arbitrary data to your virtual nodes and relationships. This is useful for enriching your projections with contextual information without altering your base schema. For example, you could add a purchaseCount to the virtual PurchaseHistory node or a recommendationScore to the BOUGHT_BY_USER relationship.
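As a sketch of that enrichment using APOC's apoc.create.vNode and apoc.create.vRelationship functions: here purchaseCount is the total number of PURCHASED relationships per user, and recommendationScore is simply the number of times the user bought that product. The scoring rule is a placeholder assumption; any expression that yields a value would do.

```cypher
MATCH (u:User)-[r:PURCHASED]->(p:Product)
// One row per user/product pair, with the number of purchases of that product
WITH u, p, count(r) AS timesBought
WITH u, collect({ product: p, timesBought: timesBought }) AS purchases
WITH purchases,
     apoc.create.vNode(['PurchaseHistory'], {
       userId: id(u),
       purchaseCount: reduce(total = 0, x IN purchases | total + x.timesBought)
     }) AS history
RETURN history,
       [x IN purchases | x.product] AS products,
       [x IN purchases |
         // Placeholder score: how often the user bought this product
         apoc.create.vRelationship(history, 'BOUGHT_BY_USER',
                                   { recommendationScore: x.timesBought }, x.product)] AS scoredRelationships
```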
The apoc.create.vNode and apoc.create.vRelationship functions are the workhorses here. apoc.create.vNode takes a list of labels and a map of properties and returns a node value. apoc.create.vRelationship takes a start node, a relationship type, a property map, and an end node, and returns a relationship value; either end can be a stored node or another virtual node. Both are also available as procedures, called with CALL ... YIELD node and CALL ... YIELD rel. The CALL {} subquery block is ordinary Cypher, and apoc.cypher.run executes a Cypher string; neither creates virtual nodes on its own.
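APOC also exposes virtual node and relationship creation as procedures, which return one row per created item instead of a list. A minimal sketch of the purchase-history projection in that style, assuming APOC is installed:

```cypher
MATCH (u:User)-[:PURCHASED]->(p:Product)
WITH u, collect(p) AS productsPurchased
// Procedure form: yields the virtual node in the `node` column
CALL apoc.create.vNode(['PurchaseHistory'],
                       { userId: id(u), productCount: size(productsPurchased) })
YIELD node AS history
UNWIND productsPurchased AS product
// Procedure form: yields the virtual relationship in the `rel` column
CALL apoc.create.vRelationship(history, 'BOUGHT_BY_USER', {}, product)
YIELD rel
RETURN history, rel, product
```

The function form keeps a projection inside a single expression, while the procedure form fits better when you want one result row per virtual relationship.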
You can chain virtual node projections. A virtual node created earlier in a query can serve as the start or end node of later virtual relationships, including relationships to other virtual nodes. This enables multi-layered projections that would be expensive to maintain as physical nodes, and it suits dynamic, exploratory graph analysis.
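One way to sketch such a chain with APOC: build a virtual PurchaseHistory node per user, then group all of them under a second virtual node. The Segment label, the CONTAINS relationship type, and the 'All buyers' name are hypothetical and chosen only for this illustration.

```cypher
MATCH (u:User)-[:PURCHASED]->(p:Product)
WITH u, count(DISTINCT p) AS productCount
// Layer 1: one virtual PurchaseHistory node per user
WITH apoc.create.vNode(['PurchaseHistory'],
                       { userId: id(u), productCount: productCount }) AS history
WITH collect(history) AS histories
// Layer 2: a single virtual Segment node built on top of the first layer
WITH histories,
     apoc.create.vNode(['Segment'], { name: 'All buyers', size: size(histories) }) AS segment
RETURN segment,
       histories,
       // Virtual relationships whose start and end nodes are both virtual
       [h IN histories | apoc.create.vRelationship(segment, 'CONTAINS', {}, h)] AS memberships
```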
The next hurdle is understanding how to optimize the performance of these virtual node projections when dealing with very large datasets, as the on-the-fly generation can still incur significant computation.