Link Prediction with Python — Core Concepts
The link prediction problem
Given a graph at time T, predict which edges will appear by time T+1. This applies to any network: social connections, protein interactions, citation links, or financial transactions.
Formally: for every pair of unconnected nodes (u, v), compute a score indicating how likely they are to form a connection. Rank all pairs by score and evaluate the top predictions.
Heuristic approaches
These methods compute scores from graph structure alone — no training required.
Common neighbors
The simplest predictor: count shared neighbors.
score(u, v) = |neighbors(u) ∩ neighbors(v)|
If Alice and Bob have 12 mutual friends, their score is 12. Surprisingly competitive for social networks.
Jaccard coefficient
Normalizes common neighbors by total neighbors to account for popularity:
score(u, v) = |neighbors(u) ∩ neighbors(v)| / |neighbors(u) ∪ neighbors(v)|
A Jaccard score of 0.5 means half of their combined neighborhood is shared. This prevents high-degree nodes from dominating predictions.
Adamic-Adar index
Weights each shared neighbor by the inverse log of their degree. Rare shared neighbors count more than popular ones:
score(u, v) = Σ 1 / log(degree(w)) for w in common_neighbors(u, v)
The intuition: if two people share a friend who only has 3 friends, that’s a stronger signal than sharing a friend who has 3,000 friends.
Preferential attachment
High-degree nodes tend to attract more connections (the “rich get richer” effect):
score(u, v) = degree(u) × degree(v)
Simple but works well in networks with strong preferential attachment dynamics, like citation networks.
Supervised link prediction
Heuristics use one feature. Supervised models combine many features for better accuracy:
- Extract features — For each candidate pair (u, v), compute multiple features: common neighbors, Jaccard, Adamic-Adar, shortest path length, community membership, node degrees.
- Create labels — Existing edges are positive examples (label 1). Sample non-edges as negative examples (label 0).
- Train a classifier — Random forest, gradient boosting, or logistic regression.
- Predict — Score all unconnected pairs and rank by probability.
The training/test split must respect time: train on edges from time T, test on edges that appear between T and T+1. Random splits leak future information.
Evaluation metrics
Link prediction is heavily imbalanced — in a graph with n nodes, there are O(n²) possible edges but only O(n) actual edges. Standard accuracy is meaningless.
Appropriate metrics:
- AUC-ROC — The probability that a random positive pair scores higher than a random negative pair. Standard benchmark metric.
- Average Precision (AP) — Summarizes the precision-recall curve. Better than AUC when you care about the top-k predictions.
- Hits@K — Fraction of true positives in the top K predictions. Practical for recommendation systems where only the top results matter.
Negative sampling strategy
Choosing which non-edges to include as negative examples matters:
- Random negatives — Sample random unconnected pairs. Easy but may include trivially unlikely pairs (nodes in completely different parts of the graph).
- Hard negatives — Sample pairs that are close in the graph (2-3 hops apart) but not connected. Harder to classify, producing better models.
A 1:1 ratio of positive to negative examples works for training. For evaluation, use the full set of candidates to get realistic metrics.
Common misconception
“More data always helps.” In link prediction, the quality of negative samples matters more than quantity. A model trained on carefully selected hard negatives often outperforms one trained on 10x more random negatives. The challenge isn’t finding enough data — it’s finding the right data.
One thing to remember: Start with Adamic-Adar or Jaccard as a baseline — they’re free to compute and hard to beat on many networks. Move to supervised models only when you have enough labeled temporal data to train and evaluate properly.
See Also
- Python Community Detection How Python finds hidden groups in networks — friend circles, customer segments, and research clusters — just by looking at who connects to whom.
- Python Graph Embeddings How Python turns tangled webs of connections into neat lists of numbers that computers can actually work with.
- Python Graph Neural Networks How Python teaches computers to learn from connections — not just data points — by combining neural networks with graph structures.
- Python Networkx Graph Analysis How Python maps connections between things — friends, roads, websites — and finds hidden patterns in those connections.
- Activation Functions Why neural networks need these tiny mathematical functions — and how ReLU's simplicity accidentally made deep learning possible.