The most surprising thing about graph link prediction is that it often works best when you don’t try to predict the most obvious links.

Imagine you have a Neo4j graph representing users and their friendships. You want to predict who might become friends in the future. A common approach is to use the Graph Data Science (GDS) library in Neo4j. Let’s see it in action.

First, you need to install the GDS plugin if you haven’t already. Then, you’ll load your graph into GDS.

// Load the graph into GDS, creating a graph projection named 'myGraph'
CALL gds.graph.project(
    'myGraph',
    'User', // Node labels to project
    ['FRIENDS'] // Relationship types to project
);

Now, let’s run the linkPrediction.randomWalk algorithm. This algorithm simulates random walks on the graph. If two nodes are frequently visited in the same random walk, it suggests they are structurally similar and therefore likely to connect.

// Run the Random Walk link prediction algorithm
CALL gds.linkPrediction.randomWalk(
    'myGraph',
    {
        relationshipWeightProperty: 'weight', // If you have relationship weights
        maxDepth: 3, // How far the random walks go
        walksPerNode: 10, // Number of walks starting from each node
        concurrency: 4 // Number of threads to use
    }
)
YIELD
    node1, node2, score
RETURN
    gds.util.asNode(node1).userId AS user1,
    gds.util.asNode(node2).userId AS user2,
    score
ORDER BY score DESC
LIMIT 20;

This query will return pairs of users and a score indicating the likelihood of them connecting. The score is derived from the number of times node1 and node2 co-occur in random walks. A higher score means they are more likely to be linked.

The GDS library offers several link prediction algorithms, each with a different way of quantifying similarity:

  • randomWalk: As shown above, it leverages random walks.
  • commonNeighbors: Predicts links based on the number of shared neighbors between two nodes. Simple but effective.
  • jaccard: Similar to common neighbors, but normalizes the count by the total number of unique neighbors of both nodes. This accounts for nodes with very high degrees.
  • adamicAdar: Another neighbor-based algorithm that gives more weight to common neighbors with lower degrees, assuming they represent more meaningful connections.
  • preferentialAttachment: Assumes that nodes with higher degrees are more likely to form new connections. It’s a simple heuristic where the score is the product of the degrees of the two nodes.

The true power of link prediction isn’t just finding any potential link; it’s about finding meaningful potential links. For example, if you’re predicting new research collaborations, simply picking the two most popular researchers (preferential attachment) might not be as insightful as finding two researchers who, despite having moderate individual popularity, share a surprisingly large number of specific, niche collaborators (Adamic-Adar or Jaccard). The GDS algorithms allow you to tune these different similarity metrics.

When you’re tuning maxDepth in randomWalk, for instance, you’re not just setting a number. You’re defining the "radius" of structural influence you want to consider. A maxDepth of 2 might only capture direct friends-of-friends, while a maxDepth of 4 could reveal deeper, indirect community structures that suggest a potential link. The optimal maxDepth is a hyperparameter that depends entirely on the structure of your graph and what kind of relationships you’re trying to uncover.

The relationshipWeightProperty is crucial if your graph has weighted edges. Without it, all relationships are treated equally. By specifying a weight (e.g., strength of a past interaction, duration of a friendship), you can bias the algorithms to consider stronger existing connections as more indicative of future ones. For randomWalk, this weight can influence the probability of traversing a particular edge.

Many people assume link prediction is about finding the most connected nodes. In reality, it’s often about identifying nodes that are structurally proximate in a non-obvious way. For example, two researchers who haven’t published together but are both deeply embedded in the same, small sub-community of a field might be more likely to collaborate than two highly cited but broadly connected individuals. The GDS algorithms, by exploring different facets of graph structure (shared neighbors, random walk paths, degree distributions), help uncover these nuanced relationships.

After you’ve identified potential links, the next logical step is often to use these predictions to recommend content or connections to users, which leads into graph recommendation systems.

Want structured learning?

Take the full Neo4j course →