Graph Embeddings with Python — Deep Dive

Node2Vec implementation

The node2vec library wraps the algorithm with a NetworkX-compatible API:

import networkx as nx
from node2vec import Node2Vec

# Build or load a graph
G = nx.karate_club_graph()

# Configure the walk generator
node2vec = Node2Vec(
    G,
    dimensions=128,    # embedding size
    walk_length=30,    # steps per walk
    num_walks=200,     # walks per node
    p=1.0,             # return parameter
    q=0.5,             # in-out parameter (q < 1 → BFS-like, community-focused)
    workers=4,
)

# Train Word2Vec on the generated walks
model = node2vec.fit(window=10, min_count=1, batch_words=4)

# Get embedding for node 0
embedding = model.wv[str(0)]  # shape: (128,)

# Find most similar nodes
similar = model.wv.most_similar(str(0), topn=5)

Hyperparameter tuning

The p and q parameters create a spectrum of behaviors:

pqBehaviorBest for
11Uniform random walk (DeepWalk)Balanced
0.52Stay local, avoid backtrackHomophily (similar neighbors)
20.5Explore widelyStructural equivalence

Tune these using a downstream task: train a classifier on embeddings with different (p, q) combinations and pick the pair with the best F1 score.

Scaling to large graphs

The node2vec Python package stores walks in memory, which fails for graphs with millions of nodes. Alternatives:

  • PecanPy — A memory-efficient Node2Vec implementation that uses CSR-format graphs and pre-computes transition probabilities compactly. Handles graphs with 100M+ edges.
  • PyTorch-Geometric’s Node2Vec — GPU-accelerated walk generation and embedding training.
from torch_geometric.nn import Node2Vec as TGNode2Vec
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="/tmp/Cora", name="Cora")
data = dataset[0]

model = TGNode2Vec(
    data.edge_index,
    embedding_dim=128,
    walk_length=20,
    context_size=10,
    walks_per_node=10,
    num_negative_samples=1,
    p=1.0,
    q=1.0,
).to("cuda")

optimizer = torch.optim.SparseAdam(model.parameters(), lr=0.01)

def train():
    model.train()
    total_loss = 0
    loader = model.loader(batch_size=128, shuffle=True, num_workers=4)
    for pos_rw, neg_rw in loader:
        optimizer.zero_grad()
        loss = model.loss(pos_rw.to("cuda"), neg_rw.to("cuda"))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

Knowledge graph embeddings with PyKEEN

PyKEEN (Python KnowlEdge EmbeddiNgs) provides a unified interface for training and evaluating KG embedding models:

from pykeen.pipeline import pipeline
from pykeen.datasets import Nations

result = pipeline(
    dataset="Nations",
    model="TransE",
    training_kwargs=dict(num_epochs=100, batch_size=256),
    optimizer_kwargs=dict(lr=0.01),
    evaluation_kwargs=dict(batch_size=256),
    random_seed=42,
)

# Access trained model
model = result.model

# Predict missing links
from pykeen.predict import predict_target

predictions = predict_target(
    model=model,
    head="brazil",
    relation="intergovernmentalagreement",
    triples_factory=result.training,
)
print(predictions.df.head(10))

TransE internals

TransE learns embeddings such that for a valid triple (h, r, t):

embedding(h) + embedding(r) ≈ embedding(t)

The loss function is a margin-based ranking loss:

L = Σ max(0, γ + d(h + r, t) - d(h' + r, t'))

Where (h’, r, t’) is a corrupted triple (replace h or t with a random entity), d is a distance function (L1 or L2), and γ is the margin. The model learns to score valid triples lower than corrupted ones.

TransE struggles with 1-to-N relations (one country has many cities) because all city embeddings would need to be at the same point. RotatE and DistMult handle these cases better.

Matrix factorization with sklearn

For smaller graphs, decompose the adjacency matrix directly:

import numpy as np
from sklearn.decomposition import TruncatedSVD
import scipy.sparse as sp

# Adjacency matrix as sparse CSR
A = nx.adjacency_matrix(G)

# Compute k-step transition: A + A^2 (captures 2-hop neighbors)
A2 = A @ A
M = A + A2

# Factorize
svd = TruncatedSVD(n_components=64, random_state=42)
embeddings = svd.fit_transform(M.astype(float))

# embeddings[i] is the 64-dim vector for node i

This approach has O(n² k) complexity for k-step matrices, limiting it to graphs under ~100K nodes.

Evaluation pipeline

from sklearn.metrics import roc_auc_score
import numpy as np

def evaluate_link_prediction(embeddings, test_edges, negative_edges):
    """Evaluate embeddings on link prediction using cosine similarity."""
    def score(u, v):
        eu, ev = embeddings[u], embeddings[v]
        return np.dot(eu, ev) / (np.linalg.norm(eu) * np.linalg.norm(ev) + 1e-8)

    pos_scores = [score(u, v) for u, v in test_edges]
    neg_scores = [score(u, v) for u, v in negative_edges]

    labels = [1] * len(pos_scores) + [0] * len(neg_scores)
    scores = pos_scores + neg_scores

    auc = roc_auc_score(labels, scores)
    return auc

Node classification evaluation

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_node_classification(embeddings, labels):
    clf = LogisticRegression(max_iter=1000, solver="lbfgs", multi_class="auto")
    scores = cross_val_score(clf, embeddings, labels, cv=5, scoring="f1_macro")
    return scores.mean(), scores.std()

Embedding alignment and comparison

When comparing embeddings from different methods, raw cosine similarity doesn’t work because methods may use different coordinate systems. Use Procrustes alignment:

from scipy.spatial import procrustes

def align_embeddings(emb1, emb2):
    """Align emb2 to emb1 using orthogonal Procrustes."""
    mtx1, mtx2, disparity = procrustes(emb1, emb2)
    return mtx1, mtx2, disparity

Production deployment

Serving embeddings

Pre-compute embeddings and serve them via a key-value store:

import redis
import numpy as np

r = redis.Redis()

def store_embeddings(node_embeddings: dict[str, np.ndarray]):
    pipe = r.pipeline()
    for node_id, emb in node_embeddings.items():
        pipe.set(f"emb:{node_id}", emb.tobytes())
    pipe.execute()

def get_embedding(node_id: str, dim: int = 128) -> np.ndarray:
    raw = r.get(f"emb:{node_id}")
    if raw is None:
        return None
    return np.frombuffer(raw, dtype=np.float32).reshape(dim)

Incremental updates

When the graph changes (new nodes, new edges), you have three options:

  1. Retrain from scratch — Simple but expensive. Acceptable for nightly batch jobs.
  2. Warm-start — Initialize with previous embeddings and run additional training epochs on the updated graph.
  3. Inductive methods — Use GNN-based approaches (GraphSAGE) that generalize to unseen nodes without retraining.

Dimensionality considerations

DimensionMemory per nodeTypical use case
32128 bytesSmall graphs, quick experiments
128512 bytesProduction, balanced accuracy/cost
2561 KBHigh-accuracy requirements
512+2+ KBDiminishing returns for most tasks

For a graph with 10M nodes at 128 dimensions, embeddings consume ~5 GB of memory — significant but feasible for a single server.

One thing to remember: The embedding method matters less than the evaluation. Always benchmark multiple methods (Node2Vec, matrix factorization, GNN-based) on your specific downstream task before committing to a production pipeline.

pythonmachine-learninggraph-theory

See Also

  • Python Community Detection How Python finds hidden groups in networks — friend circles, customer segments, and research clusters — just by looking at who connects to whom.
  • Python Graph Neural Networks How Python teaches computers to learn from connections — not just data points — by combining neural networks with graph structures.
  • Python Link Prediction How Python guesses which connections are missing from a network — predicting future friendships, recommendations, and undiscovered relationships.
  • Python Networkx Graph Analysis How Python maps connections between things — friends, roads, websites — and finds hidden patterns in those connections.
  • Activation Functions Why neural networks need these tiny mathematical functions — and how ReLU's simplicity accidentally made deep learning possible.