Graph Neural Networks — Core Concepts

Message passing framework, GCN vs. GAT vs. GraphSAGE, graph-level vs. node-level tasks, and why GNNs became essential for molecular property prediction and social recommendation.

Why Regular Neural Networks Fail on Graphs

A standard neural network assumes fixed-size, ordered inputs. A 224×224 RGB image has 50,176 pixels (150,528 values across three color channels) in a specific spatial arrangement. A sentence has words in a specific sequence. The architecture is designed around this structure.

Graphs violate these assumptions:

Variable size: Different graphs have different numbers of nodes and edges
Permutation invariance: The labeling of nodes is arbitrary (node “Alice” = node 0 or node 4 — shouldn’t matter)
Non-Euclidean structure: No natural coordinate system for “up”, “down”, “left”, “right”

GNNs are designed to process graph-structured data respecting these properties.

The Message Passing Framework

The core of most GNNs is the message passing framework (Gilmer et al., 2017 MPNN):

For each node $v$ with neighbors $\mathcal{N}(v)$, at each layer $k$:

$$m_v^{(k)} = \text{AGGREGATE}^{(k)}\left(\left\{h_u^{(k-1)} : u \in \mathcal{N}(v)\right\}\right)$$

$$h_v^{(k)} = \text{UPDATE}^{(k)}\left(h_v^{(k-1)}, m_v^{(k)}\right)$$

Where $h_v^{(k)}$ is node $v$’s representation at layer $k$, and $h_v^{(0)}$ is the initial feature vector.

The AGGREGATE function must be permutation-invariant (the order of neighbors shouldn’t matter): sum, mean, max, or attention-weighted sum.

The UPDATE function combines the current node representation with the aggregated message: typically a linear transformation followed by a nonlinearity.

After $K$ layers of message passing, each node’s representation encodes information from its $K$-hop neighborhood.
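The AGGREGATE/UPDATE loop can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `message_passing_layer`, the weight matrices `W_self` and `W_msg`, and the choice of sum aggregation with a ReLU update are all assumptions made for the example.

```python
import numpy as np

def message_passing_layer(h, neighbors, W_self, W_msg):
    """One round of message passing: sum aggregation, then a ReLU update.

    h         : (num_nodes, d_in) array of node states h_v^{(k-1)}
    neighbors : list of neighbor-index lists, one per node
    W_self, W_msg : (d_in, d_out) weight matrices (hypothetical names)
    """
    h_new = np.zeros((h.shape[0], W_self.shape[1]))
    for v, nbrs in enumerate(neighbors):
        # AGGREGATE: a permutation-invariant sum over neighbor states.
        m_v = h[nbrs].sum(axis=0) if nbrs else np.zeros(h.shape[1])
        # UPDATE: linear maps of the node's own state and the message, then ReLU.
        h_new[v] = np.maximum(0.0, h[v] @ W_self + m_v @ W_msg)
    return h_new

# Path graph 0 - 1 - 2 - 3 with one-hot initial features, so column j of a
# node's state shows whether information from node j has reached it.
neighbors = [[1], [0, 2], [1, 3], [2]]
h0 = np.eye(4)
W_self = np.eye(4)
W_msg = np.eye(4)

h1 = message_passing_layer(h0, neighbors, W_self, W_msg)
h2 = message_passing_layer(h1, neighbors, W_self, W_msg)
```

With identity weights, node 0 knows nothing about node 2 after one layer (`h1[0, 2]` is zero) but does after two (`h2[0, 2]` is positive), while node 3, three hops away, is still invisible to it. This is the $K$-hop receptive field in miniature.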

Key Architectures

GCN (Graph Convolutional Network)

Kipf & Welling (2017) simplified the spectral graph convolution to:

$$H^{(k+1)} = \sigma\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} H^{(k)} W^{(k)}\right)$$

Where $\tilde{A} = A + I$ (adjacency with self-loops), $\tilde{D}$ is the diagonal degree matrix, and $W^{(k)}$ is the layer weight matrix. The $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ term normalizes the adjacency matrix so high-degree nodes don’t dominate.

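In vector form, with $h_v$ as a column vector, each node takes a degree-normalized sum over itself and its neighbors (close to a mean when neighboring degrees are similar): $$h_v^{(k+1)} = \sigma\left(\left(W^{(k)}\right)^{\top} \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{\sqrt{\tilde{d}_v \tilde{d}_u}} h_u^{(k)}\right)$$ Here $\tilde{d}_v$ is the degree of $v$ in $\tilde{A}$, i.e. the $v$-th diagonal entry of $\tilde{D}$.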

Simple, fast, widely used. Limitation: neighbor weights are fixed by node degrees rather than learned, so the model cannot distinguish important neighbors from unimportant ones.
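The matrix form of the GCN layer maps almost directly onto NumPy. The sketch below assumes a small dense adjacency matrix and uses ReLU for $\sigma$; the helper names `normalized_adjacency` and `gcn_layer` and the toy graph are illustrative choices, and real implementations typically use sparse matrices instead.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops
    d_tilde = A_tilde.sum(axis=1)         # degrees, counting the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)
    # Scale row v by 1/sqrt(d_v) and column u by 1/sqrt(d_u).
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One GCN layer: ReLU(A_hat @ H @ W)."""
    return np.maximum(0.0, A_hat @ H @ W)

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A_hat = normalized_adjacency(A)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(4, 3))   # 3 input features per node
W0 = rng.normal(size=(3, 2))   # project to 2 hidden features
H1 = gcn_layer(A_hat, H0, W0)
```

The entry `A_hat[2, 3]` equals $1/\sqrt{4 \cdot 2}$, since node 2 has degree 4 and node 3 degree 2 once self-loops are counted. The weight on each edge is set entirely by the two endpoint degrees.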

GAT (Graph Attention Network)

Veličković et al. (2018) added an attention mechanism that learns how much weight to give each neighbor:

$$\alpha_{vu} = \frac{\exp\left(\text{LeakyReLU}\left(a^T [W h_v^{(k)} || W h_u^{(k)}]\right)\right)}{\sum_{w \in \mathcal{N}(v)} \exp\left(\text{LeakyReLU}\left(a^T [W h_v^{(k)} || W h_w^{(k)}]\right)\right)}$$

$$h_v^{(k+1)} = \sigma\left(\sum_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u^{(k)}\right)$$

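Multi-head attention runs several independent attention mechanisms (heads) in parallel and concatenates their outputs in hidden layers or averages them in the final layer. Because the weights $\alpha_{vu}$ are learned, GAT can capture the varying importance of neighbors, which makes it more flexible than GCN.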
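A single attention head can be sketched as follows. This is a hedged illustration of the two equations above, not the authors' code: `gat_layer` is a hypothetical name, the LeakyReLU slope of 0.2 is an assumed setting, the output nonlinearity $\sigma$ is omitted, and every node is assumed to have at least one neighbor.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(h, neighbors, W, a):
    """Single-head graph attention; returns (pre-activation outputs, weights).

    h : (num_nodes, d_in), W : (d_in, d_out), a : (2 * d_out,)
    Assumes every node has at least one neighbor.
    """
    z = h @ W  # row-vector convention: row v holds "W h_v" from the text
    out = np.zeros_like(z)
    alphas = []
    for v, nbrs in enumerate(neighbors):
        # Unnormalized scores a^T [W h_v || W h_u] for each neighbor u.
        scores = np.array([a @ np.concatenate([z[v], z[u]]) for u in nbrs])
        scores = leaky_relu(scores)
        # Softmax over the neighborhood (shifted by the max for stability).
        weights = np.exp(scores - scores.max())
        alpha = weights / weights.sum()
        alphas.append(alpha)
        # Attention-weighted sum of transformed neighbor features.
        out[v] = alpha @ z[nbrs]
    return out, alphas

neighbors = [[1, 2], [0, 2], [0, 1, 3], [2]]
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
a = rng.normal(size=4)

out, alphas = gat_layer(h, neighbors, W, a)
```

Each entry of `alphas` sums to 1 over that node's neighborhood. Node 3 has a single neighbor, so its only weight is 1 and its output is just the transformed features of node 2.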

GraphSAGE

Hamilton et al. (2017) designed GraphSAGE for inductive learning — generalizing to nodes not seen during training (crucial for dynamic graphs like social networks with new users):

$$h_{\mathcal{N}(v)}^{(k)} = \text{AGGREGATE}_k\left(\left\{h_u^{(k-1)}, \forall u \in \mathcal{N}(v)\right\}\right)$$

$$h_v^{(k)} = \sigma\left(W^k \cdot [h_v^{(k-1)} || h_{\mathcal{N}(v)}^{(k)}]\right)$$

Key innovation: concatenates (rather than averages) the node’s own representation with the aggregated neighbor representation, preserving the node’s identity information. Uses mini-batch training with neighbor sampling — samples a fixed-size set of neighbors rather than using all neighbors (scalable to billion-node graphs).
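The two ideas in this paragraph, fixed-size neighbor sampling and concatenation of self and neighbor representations, can be sketched together. The names `sample_neighbors` and `sage_layer`, the mean aggregator, the ReLU nonlinearity, and the fallback of sampling with replacement when a node has too few neighbors are assumptions made for this example.

```python
import numpy as np

def sample_neighbors(nbrs, num_samples, rng):
    """Draw a fixed-size sample of neighbors.

    Samples without replacement when enough neighbors exist, otherwise with
    replacement (an assumed fallback so every node gets num_samples entries).
    """
    replace = len(nbrs) < num_samples
    return rng.choice(nbrs, size=num_samples, replace=replace)

def sage_layer(h, neighbors, W, num_samples, rng):
    """GraphSAGE-style layer with a mean aggregator.

    h : (num_nodes, d_in), W : (2 * d_in, d_out)
    """
    out = []
    for v, nbrs in enumerate(neighbors):
        sampled = sample_neighbors(nbrs, num_samples, rng)
        h_nbr = h[sampled].mean(axis=0)           # aggregate sampled neighbors
        concat = np.concatenate([h[v], h_nbr])    # [h_v || h_N(v)]
        out.append(np.maximum(0.0, concat @ W))   # linear map, then ReLU
    return np.stack(out)

# Star-like toy graph: node 0 is a hub with four neighbors.
neighbors = [[1, 2, 3, 4], [0], [0, 3], [0, 2], [0]]
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 3))
W = rng.normal(size=(6, 4))

h_next = sage_layer(h, neighbors, W, num_samples=2, rng=rng)
```

Because every node contributes exactly `num_samples` neighbors, the cost per node is bounded regardless of degree. This is what makes mini-batch training on very large graphs practical.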

PinSage (Ying et al., 2018) applied GraphSAGE to Pinterest’s 3-billion-node graph. Its recommendations reportedly increased user engagement by 40%, and the system was deployed in production.

Node, Edge, and Graph Level Tasks

Node classification: Predict a property of each node. Example: in a citation network, classify each paper into a research area. GCN was originally proposed for this.

Link prediction: Predict whether an edge should exist between two nodes. Example: recommend new connections in a social network, predict protein-protein interactions. Output: probability of edge existence for each (u, v) pair.

Graph classification: Classify entire graphs. Example: classify molecules as drug candidates (positive/toxic/inactive). Requires a readout (pooling) function to aggregate node representations into a graph-level representation: sum/mean pooling, hierarchical pooling, or attention-weighted pooling.

Edge regression: Predict properties of edges (e.g., reaction rates for chemical bonds, traffic flow on road segments).
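Once node representations exist, the task-specific heads are small. The sketch below shows a graph-level readout and a simple link-prediction score. The function names `readout` and `link_score` are hypothetical, and the dot-product-plus-sigmoid decoder is one common choice assumed here, not something the tasks above prescribe.

```python
import numpy as np

def readout(h, mode="mean"):
    """Pool node representations (num_nodes, d) into one graph vector (d,)."""
    if mode == "sum":
        return h.sum(axis=0)
    if mode == "mean":
        return h.mean(axis=0)
    if mode == "max":
        return h.max(axis=0)
    raise ValueError(f"unknown readout mode: {mode}")

def link_score(h_u, h_v):
    """Edge probability from a dot product passed through a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(h_u @ h_v)))

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))     # stand-in for GNN output on a 6-node graph
perm = rng.permutation(6)       # an arbitrary relabeling of the nodes

graph_vec = readout(h, "mean")
graph_vec_relabeled = readout(h[perm], "mean")
p_edge = link_score(h[0], h[1])
```

Relabeling the nodes leaves the pooled vector unchanged (up to floating-point rounding), so a graph-level prediction does not depend on node order.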

GNNs for Molecular Property Prediction

Molecules are natural graphs: atoms as nodes (features: element, charge, hybridization), bonds as edges (features: bond type, stereochemistry).
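As a concrete illustration, here is one way to encode the heavy atoms of ethanol (C-C-O) as arrays. The vocabularies `ELEMENTS` and `BOND_TYPES`, the one-hot featurization, and the `(2, num_edges)` edge layout are simplifying assumptions for this sketch; production pipelines usually derive much richer features with a cheminformatics toolkit.

```python
import numpy as np

# Hypothetical, deliberately tiny feature vocabularies.
ELEMENTS = ["C", "N", "O"]
BOND_TYPES = ["single", "double", "triple", "aromatic"]

def one_hot(item, vocabulary):
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(item)] = 1.0
    return vec

# Ethanol without hydrogens: C(0) - C(1) - O(2), both bonds single.
atoms = ["C", "C", "O"]
bonds = [(0, 1, "single"), (1, 2, "single")]

# Node feature matrix: one row per atom.
node_features = np.stack([one_hot(a, ELEMENTS) for a in atoms])

# Each undirected bond is stored as two directed edges so that messages
# can flow in both directions.
edge_pairs, edge_rows = [], []
for u, v, bond_type in bonds:
    for src, dst in ((u, v), (v, u)):
        edge_pairs.append((src, dst))
        edge_rows.append(one_hot(bond_type, BOND_TYPES))

edge_index = np.array(edge_pairs).T      # shape (2, num_directed_edges)
edge_features = np.stack(edge_rows)      # shape (num_directed_edges, 4)
```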

Applications:

ADMET prediction: Absorption, Distribution, Metabolism, Excretion, Toxicity — critical for drug candidate screening
Reaction yield prediction: Which reaction conditions give highest yield?
De novo drug design: Generate new molecular structures with desired properties

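D-MPNN (Directed Message Passing Neural Network, Yang et al., 2019): Passes messages along directed bonds rather than between atoms. The message for a directed bond $v \to w$ aggregates the states of the bonds entering its source atom $v$, excluding the reverse bond $w \to v$, which keeps a message from immediately echoing back to where it came from. Yang et al. report that it outperforms atom-level MPNNs on many molecular property prediction benchmarks.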
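The bond-level aggregation can be sketched on a three-atom chain. This covers only the aggregation step, under the assumption that each directed bond skips its own reverse bond; the learned update that a full D-MPNN applies afterwards is omitted. The function name `incoming_bond_messages` is hypothetical.

```python
import numpy as np

def incoming_bond_messages(edges, h_edge):
    """For each directed bond (v -> w), sum the states of bonds (k -> v), k != w.

    edges  : list of (source, target) pairs, one per directed bond
    h_edge : (num_directed_bonds, d) array of bond hidden states
    """
    m = np.zeros_like(h_edge)
    for i, (v, w) in enumerate(edges):
        for j, (k, t) in enumerate(edges):
            # Bond j must enter the source atom v, and must not be the
            # reverse bond w -> v.
            if t == v and k != w:
                m[i] += h_edge[j]
    return m

# Chain 0 - 1 - 2 stored as four directed bonds.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
h_edge = np.eye(4)   # one-hot states make it easy to see who sent what

m = incoming_bond_messages(edges, h_edge)
```

The bond `1 -> 2` receives only the state of `0 -> 1`, and the bond `0 -> 1` receives nothing, because the only bond entering atom 0 is its own reverse.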

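Equivariant GNNs: For 3D molecular property prediction (where atomic positions are inputs), predictions should not depend on how the molecule is rotated or translated: scalar properties such as energy should be invariant, and vector properties such as forces should be equivariant. SchNet (Schütt et al., 2017) achieves invariance by using radial basis functions over interatomic distances. EGNN (Satorras et al., 2021) maintains equivariance while updating 3D coordinates during message passing.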
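The distance-based approach is easy to check numerically: features built only from interatomic distances do not change when the whole molecule is rotated and shifted. The sketch below uses a Gaussian radial basis expansion in the spirit of SchNet; the function names, the `centers` grid, and the `gamma` width are illustrative values, not the published hyperparameters.

```python
import numpy as np

def pairwise_distances(pos):
    """All interatomic distances for positions of shape (num_atoms, 3)."""
    diff = pos[:, None, :] - pos[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def rbf_expand(d, centers, gamma=10.0):
    """Expand each distance into Gaussian radial basis features."""
    return np.exp(-gamma * (d[..., None] - centers) ** 2)

rng = np.random.default_rng(0)
pos = rng.normal(size=(5, 3))            # 5 atoms at random positions
centers = np.linspace(0.0, 5.0, 16)      # assumed grid of RBF centers

# Random orthogonal transform (a rotation, possibly with a reflection)
# from a QR decomposition, plus a random translation.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
shift = rng.normal(size=3)
pos_moved = pos @ Q.T + shift

features = rbf_expand(pairwise_distances(pos), centers)
features_moved = rbf_expand(pairwise_distances(pos_moved), centers)
```

The coordinates change but `features` and `features_moved` agree to floating-point precision, so any network built on them is invariant by construction.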

DeepMind’s AlphaFold 2 uses Invariant Point Attention (IPA), an attention mechanism designed to be invariant to global rotations and translations, to process protein structure graphs and predict the 3D coordinates of all heavy atoms.

One thing to remember: GNNs are the natural tool for any problem where relationships matter — and because most real-world data has relational structure (proteins, molecules, social networks, knowledge graphs), they’re increasingly central to scientific discovery and recommendation systems.
