Link Prediction with Python — Core Concepts

Understand heuristic and machine learning approaches to link prediction — common neighbors, Jaccard index, and supervised models for predicting missing graph edges.

The link prediction problem

Given a graph at time T, predict which edges will appear by time T+1. This applies to any network: social connections, protein interactions, citation links, or financial transactions.

Formally: for every pair of unconnected nodes (u, v), compute a score indicating how likely they are to form a connection. Rank all pairs by score and evaluate the top predictions.

Heuristic approaches

These methods compute scores from graph structure alone — no training required.

Common neighbors

The simplest predictor: count shared neighbors.

score(u, v) = |neighbors(u) ∩ neighbors(v)|

If Alice and Bob have 12 mutual friends, their score is 12. Surprisingly competitive for social networks.

Jaccard coefficient

Normalizes common neighbors by total neighbors to account for popularity:

score(u, v) = |neighbors(u) ∩ neighbors(v)| / |neighbors(u) ∪ neighbors(v)|

A Jaccard score of 0.5 means half of their combined neighborhood is shared. This prevents high-degree nodes from dominating predictions.

Adamic-Adar index

Weights each shared neighbor by the inverse log of their degree. Rare shared neighbors count more than popular ones:

score(u, v) = Σ 1 / log(degree(w)) for w in common_neighbors(u, v)

The intuition: if two people share a friend who only has 3 friends, that’s a stronger signal than sharing a friend who has 3,000 friends.

Preferential attachment

High-degree nodes tend to attract more connections (the “rich get richer” effect):

score(u, v) = degree(u) × degree(v)

Simple but works well in networks with strong preferential attachment dynamics, like citation networks.

Supervised link prediction

Heuristics use one feature. Supervised models combine many features for better accuracy:

Extract features — For each candidate pair (u, v), compute multiple features: common neighbors, Jaccard, Adamic-Adar, shortest path length, community membership, node degrees.
Create labels — Existing edges are positive examples (label 1). Sample non-edges as negative examples (label 0).
Train a classifier — Random forest, gradient boosting, or logistic regression.
Predict — Score all unconnected pairs and rank by probability.

The training/test split must respect time: train on edges from time T, test on edges that appear between T and T+1. Random splits leak future information.

Evaluation metrics

Link prediction is heavily imbalanced — in a graph with n nodes, there are O(n²) possible edges but only O(n) actual edges. Standard accuracy is meaningless.

Appropriate metrics:

AUC-ROC — The probability that a random positive pair scores higher than a random negative pair. Standard benchmark metric.
Average Precision (AP) — Summarizes the precision-recall curve. Better than AUC when you care about the top-k predictions.
Hits@K — Fraction of true positives in the top K predictions. Practical for recommendation systems where only the top results matter.

Negative sampling strategy

Choosing which non-edges to include as negative examples matters:

Random negatives — Sample random unconnected pairs. Easy but may include trivially unlikely pairs (nodes in completely different parts of the graph).
Hard negatives — Sample pairs that are close in the graph (2-3 hops apart) but not connected. Harder to classify, producing better models.

A 1:1 ratio of positive to negative examples works for training. For evaluation, use the full set of candidates to get realistic metrics.

Common misconception

“More data always helps.” In link prediction, the quality of negative samples matters more than quantity. A model trained on carefully selected hard negatives often outperforms one trained on 10x more random negatives. The challenge isn’t finding enough data — it’s finding the right data.

One thing to remember: Start with Adamic-Adar or Jaccard as a baseline — they’re free to compute and hard to beat on many networks. Move to supervised models only when you have enough labeled temporal data to train and evaluate properly.

pythonmachine-learninggraph-theory