The most surprising thing about Neo4j’s Graph Data Science (GDS) library is that it’s not just about running algorithms; it’s about preparing your graph to run them efficiently in the first place.
Let’s see it in action. Imagine you’re analyzing a social network. You’ve got Person nodes and FRIENDS_WITH relationships. You want to find the most influential people using PageRank.
First, you need to load your graph into GDS. Algorithms don’t run against the stored database directly. They run on a "projected graph," which is a memory-optimized representation of the nodes and relationships you select.
// Create a graph projection
CALL gds.graph.project(
  'my-social-network', // Name of the projected graph
  'Person',            // Node labels to include
  'FRIENDS_WITH'       // Relationship types to include
);
This gds.graph.project call reads the selected nodes and relationships from the database and builds a compact in-memory copy of their structure, optimized for algorithm execution. The stored data is left untouched, and because no properties were requested here, only the topology is copied. The first argument, 'my-social-network', is the name you’ll use to refer to this projected graph for all subsequent algorithm runs. The node labels and relationship types specify what parts of your database are relevant for this analysis.
Now, to run PageRank:
// Run PageRank on the projected graph
CALL gds.pageRank.mutate(
  'my-social-network', // Name of the projected graph
  {
    mutateProperty: 'pagerank_score', // Property to store the results
    dampingFactor: 0.85,              // Standard PageRank damping factor
    maxIterations: 100                // Maximum iterations for convergence
  }
);
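This gds.pageRank.mutate call takes your projected graph, computes the PageRank score for each Person node, and stores the result as a new property named pagerank_score on the nodes of the projected graph. The database itself is unchanged at this point. If you only want to export the already computed property without recomputing it, recent GDS 2.x releases provide gds.graph.nodeProperties.write for that. Otherwise, you can run the algorithm in write mode, which computes the scores and persists them to your actual Neo4j database in a single call: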
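// Run PageRank in write mode: compute the scores and store them in the Neo4j database
CALL gds.pageRank.write(
  'my-social-network',
  {
    writeProperty: 'pagerank_score', // Node property created in the database
    dampingFactor: 0.85,             // Same settings as the mutate run
    maxIterations: 100
  }
);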
The problem GDS solves is performance. Running complex graph algorithms directly on a transactional Neo4j database is slow because the database is optimized for point lookups and transactional consistency, not for full graph traversals and computations. GDS bypasses the transactional layer, projects the relevant graph structure into memory, and runs highly optimized, parallel algorithm implementations inside the Neo4j JVM (the library is written in Java, not C++).
The key levers you control are in the gds.graph.project call: which nodes and relationships to include. If your social network has other node types (like Company or Post) or relationship types (WORKS_AT, LIKES), you can exclude them from the projection if they’re not relevant to your specific algorithm. This reduces the memory footprint and speeds up computation. You also control algorithm-specific parameters, like dampingFactor for PageRank or maxLevels for Louvain community detection, as well as concurrency for parallel execution.
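To make the effect of filtering concrete, here is a minimal conceptual sketch in plain Python. It is not how GDS stores graphs internally; the `project` function, the tuple layout, and the sample data are all hypothetical and exist only to show that a projection keeps the selected label and relationship type and drops everything else.

```python
def project(nodes, relationships, node_label, relationship_type):
    """Build an adjacency-list copy of the selected part of a graph.

    nodes: list of (node_id, label) tuples (hypothetical layout)
    relationships: list of (source, type, target) tuples (hypothetical layout)
    """
    # Keep only the nodes that carry the requested label.
    selected = {node_id for node_id, label in nodes if label == node_label}
    adjacency = {node_id: [] for node_id in sorted(selected)}

    # Keep only relationships of the requested type between selected nodes.
    for source, rel_type, target in relationships:
        if rel_type == relationship_type and source in selected and target in selected:
            adjacency[source].append(target)
    return adjacency


# Made-up sample data mirroring the labels and types mentioned in the text.
nodes = [
    ("alice", "Person"),
    ("bob", "Person"),
    ("acme", "Company"),
    ("post1", "Post"),
]
relationships = [
    ("alice", "FRIENDS_WITH", "bob"),
    ("alice", "WORKS_AT", "acme"),
    ("bob", "LIKES", "post1"),
]

projected = project(nodes, relationships, "Person", "FRIENDS_WITH")
print(projected)
```

Only the two Person nodes and the single FRIENDS_WITH relationship survive, so an algorithm running on `projected` never touches Company or Post data.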
When you run an algorithm like gds.pageRank.mutate, GDS doesn’t just do a single pass. It repeatedly updates each node’s score from the scores of the nodes that link to it, stopping once the scores change by less than the configured tolerance or the iteration limit is reached. The dampingFactor represents the probability that a random surfer will continue clicking links versus jumping to a random page. The maxIterations setting ensures the algorithm terminates even if it doesn’t fully converge, preventing runaway computations.
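The iteration described above can be sketched as a textbook power iteration in plain Python. This is an illustration of the idea, not GDS’s actual implementation, and its score normalization may differ from what GDS reports. The function name, the edge-list input, and the sample friendships are assumptions made for this example; `damping_factor`, `max_iterations`, and `tolerance` play the roles of the corresponding configuration options.

```python
def pagerank(edges, damping_factor=0.85, max_iterations=100, tolerance=1e-7):
    """Textbook PageRank by power iteration over (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    count = len(nodes)
    scores = {node: 1.0 / count for node in nodes}

    for iteration in range(1, max_iterations + 1):
        # Score held by nodes without outgoing links is spread evenly.
        dangling = sum(scores[node] for node in nodes if not out_links[node])
        new_scores = {
            node: (1.0 - damping_factor) / count + damping_factor * dangling / count
            for node in nodes
        }
        # Each node passes a damped, equal share of its score to its targets.
        for source, targets in out_links.items():
            for target in targets:
                new_scores[target] += damping_factor * scores[source] / len(targets)

        # Total change in this pass; stop early once it is small enough.
        delta = sum(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if delta < tolerance:
            break

    return scores, iteration


# Made-up directed friendships for illustration.
friendships = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
    ("dave", "carol"),
]
scores, iterations_used = pagerank(friendships)
top_person = max(scores, key=scores.get)
print(top_person, iterations_used)
```

Lowering `max_iterations` trades accuracy for a guaranteed upper bound on work, which is the same trade-off the maxIterations option exposes.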
Most people understand that GDS is faster because it uses memory. What they often miss is the explicit separation of concerns: projection, computation, and writing back. You must project the graph first, then run the algorithm on the projection, and then decide if and how to write the results back to your persistent graph. This three-step process, while seemingly verbose, is what allows GDS to manage memory effectively and avoid impacting your live transactional database. It’s not just about "loading" data; it’s about creating a specialized in-memory snapshot for computation, one that does not pick up later changes in the database and that only mutate-mode algorithms add properties to.
The next step is often exploring how to combine multiple algorithms or use GDS for feature engineering.