Property Graph Modeling with Python — Deep Dive
Schema design methodology
Property graph modeling follows a query-driven approach. Unlike relational modeling (which starts with entities and normalizes), graph modeling starts with the questions you need to answer and works backward to the structure.
Step 1: Define traversal questions
Write your target queries in natural language:
- “Which products did a customer’s friends purchase in the last 30 days?”
- “What’s the shortest path between two employees through project collaborations?”
- “Which suppliers have the highest defect rate for components used in product X?”
Step 2: Whiteboard the traversal paths
For question 1, the path is:
(Customer)-[:FRIENDS_WITH]->(Friend)-[:PURCHASED]->(Product)
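This tells you what the model needs: Customer and Product nodes (a friend is just another Customer), a FRIENDS_WITH relationship between customers, and a PURCHASED relationship with a date property so the 30-day window can be filtered.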
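The step above can be made mechanical. The sketch below pulls node labels and relationship types out of a whiteboard path string with two regular expressions. `required_elements` is a hypothetical helper name, and it assumes the simplified `(Label)-[:TYPE]->(Label)` notation used here rather than full Cypher.

```python
import re

# Simplified whiteboard notation only: (Label) and [:TYPE]. Not a Cypher parser.
NODE_RE = re.compile(r"\((\w+)\)")
REL_RE = re.compile(r"\[:(\w+)\]")


def required_elements(path: str) -> dict:
    """Return the node names and relationship types a path pattern implies."""
    return {
        "nodes": NODE_RE.findall(path),
        "relationships": REL_RE.findall(path),
    }


path = "(Customer)-[:FRIENDS_WITH]->(Friend)-[:PURCHASED]->(Product)"
elements = required_elements(path)
print(elements["nodes"])          # ['Customer', 'Friend', 'Product']
print(elements["relationships"])  # ['FRIENDS_WITH', 'PURCHASED']
```

Note that Friend is a role in the traversal, not necessarily a separate label: in the stored graph it would typically be another Customer node.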
Step 3: Add properties and constraints
# Model definition as Python dataclasses for validation
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Customer:
    customer_id: str  # unique identifier
    name: str
    email: str
    created_at: datetime
    labels: tuple = ("Customer",)


@dataclass
class Product:
    sku: str  # unique identifier
    name: str
    price: float
    category: str
    labels: tuple = ("Product",)


@dataclass
class Purchase:
    """Relationship: Customer -[:PURCHASED]-> Product"""
    quantity: int
    total_price: float
    purchased_at: datetime
    channel: str  # "web", "mobile", "in-store"
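One way such dataclasses could feed a write query is to turn an instance into a parameter map. The following is a minimal sketch, assuming a trimmed copy of the Customer model and a hypothetical helper named `merge_customer_statement`; it only builds the query text and parameters and does not talk to a database.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class Customer:
    # Trimmed copy of the Customer model, without the labels field
    customer_id: str
    name: str
    email: str
    created_at: datetime


def merge_customer_statement(customer: Customer) -> tuple[str, dict]:
    """Build a parameterized MERGE keyed on customer_id (hypothetical helper)."""
    props = asdict(customer)
    key = props.pop("customer_id")  # the key goes in the MERGE pattern, not in SET
    query = (
        "MERGE (c:Customer {customer_id: $customer_id}) "
        "SET c += $props"
    )
    return query, {"customer_id": key, "props": props}


query, params = merge_customer_statement(
    Customer("c-1", "Ada", "ada@example.com",
             datetime(2024, 1, 1, tzinfo=timezone.utc))
)
print(sorted(params["props"]))  # ['created_at', 'email', 'name']
```

Keeping the identifier in the MERGE pattern and everything else in `SET c += $props` means re-running the statement updates the node instead of duplicating it.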
Advanced modeling patterns
Versioned nodes
Track full history of changes without losing data:
def create_versioned_update(tx, entity_id: str, new_props: dict):
    """Archive the current version and create a new one.

    Requires an existing current version: if the MATCH finds nothing,
    the query creates nothing, so create the first version separately.
    """
    tx.run("""
        MATCH (current:Entity {entity_id: $id, _is_current: true})
        SET current._is_current = false,
            current._valid_to = datetime()
        CREATE (new:Entity)
        SET new = $props,
            new.entity_id = $id,
            new._is_current = true,
            new._valid_from = datetime(),
            new._version = coalesce(current._version, 0) + 1
        CREATE (current)-[:SUPERSEDED_BY]->(new)
    """, id=entity_id, props=new_props)
Fan-out control with summary nodes
When a node accumulates millions of relationships (a celebrity with millions of followers), traversal becomes expensive. Add summary nodes:
(Celebrity)-[:HAS_FOLLOWER_BATCH]->(FollowerBatch {batch_id: 1, count: 10000})
(FollowerBatch)-[:CONTAINS]->(Follower1)
(FollowerBatch)-[:CONTAINS]->(Follower2)
...
This limits the fan-out at any single node and enables efficient pagination.
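The batching itself can be done client-side before writing. Below is a minimal sketch, assuming follower ids are already available as a list; `follower_batches`, the `CREATE_BATCH` query, and the `id` properties on Celebrity and Follower are illustrative assumptions, not part of the model above.

```python
from typing import Iterator


def follower_batches(follower_ids: list[str], batch_size: int = 10_000) -> Iterator[dict]:
    """Yield one parameter map per FollowerBatch node."""
    for start in range(0, len(follower_ids), batch_size):
        chunk = follower_ids[start:start + batch_size]
        yield {
            "batch_id": start // batch_size + 1,  # 1-based, like batch_id: 1 above
            "count": len(chunk),
            "follower_ids": chunk,
        }


# Hypothetical write query; each batch map supplies its parameters,
# with $celebrity_id added by the caller.
CREATE_BATCH = """
MATCH (c:Celebrity {id: $celebrity_id})
CREATE (c)-[:HAS_FOLLOWER_BATCH]->(b:FollowerBatch {batch_id: $batch_id, count: $count})
WITH b
UNWIND $follower_ids AS fid
MATCH (f:Follower {id: fid})
CREATE (b)-[:CONTAINS]->(f)
"""

batches = list(follower_batches([f"u{i}" for i in range(25)], batch_size=10))
print([(b["batch_id"], b["count"]) for b in batches])  # [(1, 10), (2, 10), (3, 5)]
```

Pagination then becomes a matter of reading one FollowerBatch at a time by `batch_id`.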
Multi-tenancy patterns
For SaaS applications serving multiple customers from one graph:
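def tenant_query(tx, tenant_id: str, query_fragment: str):
    """All queries filter by tenant to prevent data leakage.

    query_fragment is a path pattern such as "(c:Customer)-[:PLACED]->(o:Order)".
    It is interpolated into the query text, so it must come from trusted
    code, never from user input.
    """
    full_query = f"""
        MATCH (t:Tenant {{id: $tenant_id}})
        MATCH path = {query_fragment}
        WHERE ALL(n IN nodes(path) WHERE (n)-[:BELONGS_TO_TENANT]->(t) OR n = t)
        RETURN path
    """
    return tx.run(full_query, tenant_id=tenant_id)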
Alternatively, use a separate database per tenant. Neo4j has supported multiple databases since 4.0, though running more than one user database requires Enterprise Edition.
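With a database per tenant, isolation moves from the query to the session. The sketch below assumes a `tenant-<slug>` naming convention, which is an illustrative choice, and two hypothetical helpers; the `database=` keyword on `driver.session()` is the neo4j Python driver's way of selecting a database.

```python
import re


def tenant_database_name(tenant_id: str) -> str:
    """Map a tenant id to a database name (the naming convention is an assumption)."""
    slug = re.sub(r"[^a-z0-9]+", "-", tenant_id.lower()).strip("-")
    if not slug:
        raise ValueError("tenant_id has no usable characters")
    return f"tenant-{slug}"


def open_tenant_session(driver, tenant_id: str):
    """Open a session bound to the tenant's own database."""
    # driver is expected to be a neo4j.Driver created elsewhere
    return driver.session(database=tenant_database_name(tenant_id))


print(tenant_database_name("Acme Corp"))  # tenant-acme-corp
```

Because the slug is lossy (for example, "Acme Corp" and "acme-corp" map to the same name), a real system would more likely store the database name alongside the tenant record than derive it on the fly.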
Schema enforcement
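Property graphs are traditionally schema-optional, but Neo4j lets you enforce parts of the model with constraints and indexes. The statements below use the Neo4j 5.x syntax (FOR ... REQUIRE):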
def apply_schema(session):
    """Apply constraints and indexes for the data model."""
    constraints = [
        # Uniqueness
        "CREATE CONSTRAINT customer_id IF NOT EXISTS FOR (c:Customer) REQUIRE c.customer_id IS UNIQUE",
        "CREATE CONSTRAINT product_sku IF NOT EXISTS FOR (p:Product) REQUIRE p.sku IS UNIQUE",
        # Existence (Enterprise only)
        "CREATE CONSTRAINT customer_email IF NOT EXISTS FOR (c:Customer) REQUIRE c.email IS NOT NULL",
        # Node key (composite uniqueness plus existence, Enterprise only)
        "CREATE CONSTRAINT order_item_key IF NOT EXISTS FOR (oi:OrderItem) REQUIRE (oi.order_id, oi.sku) IS NODE KEY",
    ]
    indexes = [
        "CREATE INDEX customer_email_idx IF NOT EXISTS FOR (c:Customer) ON (c.email)",
        "CREATE INDEX product_category_idx IF NOT EXISTS FOR (p:Product) ON (p.category)",
        "CREATE TEXT INDEX product_name_text IF NOT EXISTS FOR (p:Product) ON (p.name)",
    ]
    for stmt in constraints + indexes:
        session.run(stmt)
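As the model grows, hand-written statements tend to drift from the code. One possible approach is to generate the uniqueness constraints from a small mapping. This is a sketch only: `UNIQUE_KEYS`, `uniqueness_constraints`, and the constraint naming scheme are assumptions, and labels and property names are interpolated into the text, so they must come from trusted code.

```python
# Label -> property that must be unique (illustrative mapping)
UNIQUE_KEYS = {"Customer": "customer_id", "Product": "sku"}


def uniqueness_constraints(unique_keys: dict[str, str]) -> list[str]:
    """Build one CREATE CONSTRAINT statement per label, in label order."""
    statements = []
    for label, prop in sorted(unique_keys.items()):
        name = f"{label.lower()}_{prop}_unique"  # hypothetical naming scheme
        statements.append(
            f"CREATE CONSTRAINT {name} IF NOT EXISTS "
            f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        )
    return statements


statements = uniqueness_constraints(UNIQUE_KEYS)
for stmt in statements:
    print(stmt)
```

The resulting list can be passed to the same loop that runs the hand-written statements.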
Validation with neomodel
from neomodel import (
    StructuredNode, StringProperty, FloatProperty,
    DateTimeProperty, RelationshipTo, UniqueIdProperty, ZeroOrMore,
)


class Customer(StructuredNode):
    uid = UniqueIdProperty()
    name = StringProperty(required=True, max_length=200)
    email = StringProperty(required=True, unique_index=True)
    created_at = DateTimeProperty(default_now=True)
    orders = RelationshipTo("Order", "PLACED", cardinality=ZeroOrMore)
    friends = RelationshipTo("Customer", "FRIENDS_WITH", cardinality=ZeroOrMore)


class Order(StructuredNode):
    order_id = StringProperty(required=True, unique_index=True)
    total = FloatProperty(required=True)
    status = StringProperty(choices={"pending": "Pending", "shipped": "Shipped", "delivered": "Delivered"})
    placed_at = DateTimeProperty(default_now=True)
    items = RelationshipTo("Product", "INCLUDES", cardinality=ZeroOrMore)


# Minimal Product node so the "Product" reference in Order.items resolves
class Product(StructuredNode):
    sku = StringProperty(required=True, unique_index=True)
    name = StringProperty(required=True)
    price = FloatProperty(required=True)
Migration strategies
Adding a new relationship type
def migrate_add_category_hierarchy(session):
    """Migration: Add SUBCATEGORY_OF relationships between Category nodes."""
    session.run("""
        MATCH (sub:Category), (parent:Category)
        WHERE sub.parent_name = parent.name AND NOT (sub)-[:SUBCATEGORY_OF]->(parent)
        CREATE (sub)-[:SUBCATEGORY_OF]->(parent)
    """)
    # Clean up the denormalized property, but only where a link now exists,
    # so categories whose parent was not found keep parent_name for review
    session.run("""
        MATCH (c:Category)-[:SUBCATEGORY_OF]->(:Category)
        REMOVE c.parent_name
    """)
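Before dropping a denormalized property, it can help to check which values would not resolve to a node. The following is a minimal sketch, assuming the Category rows have been exported as plain dicts with `name` and an optional `parent_name`; `dangling_parent_names` is a hypothetical helper.

```python
def dangling_parent_names(categories: list[dict]) -> set[str]:
    """Return parent_name values that match no existing category name."""
    names = {c["name"] for c in categories}
    return {
        c["parent_name"]
        for c in categories
        if c.get("parent_name") and c["parent_name"] not in names
    }


categories = [
    {"name": "Electronics"},
    {"name": "Laptops", "parent_name": "Electronics"},
    {"name": "Sofas", "parent_name": "Furniture"},  # no Furniture category exists
]
print(dangling_parent_names(categories))  # {'Furniture'}
```

An empty result suggests every `parent_name` will become a SUBCATEGORY_OF relationship and nothing is lost when the property is removed.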
Splitting a node type
When a single node type becomes overloaded (a User that’s both a customer and an admin):
def migrate_split_user_roles(session):
    """Migration: Add secondary labels based on role property."""
    session.run("""
        MATCH (u:User) WHERE u.role = 'admin'
        SET u:Admin
    """)
    session.run("""
        MATCH (u:User) WHERE u.role = 'customer'
        SET u:Customer
    """)
Anti-patterns
The dense node anti-pattern
A single node connected to millions of others (the “god node”). Any query that traverses through this node may have to walk a large share of its relationships, so latency grows with the node's degree.
Fix: Introduce intermediate grouping nodes, or filter on indexed relationship properties so the query does not need a full scan.
The property-bag anti-pattern
Storing everything as properties on a single node type instead of modeling distinct entities:
# Bad: One node with 50 properties
(:Record {customer_name, customer_email, product_name, product_sku, order_date, ...})
# Good: Separate entities with relationships
(:Customer)-[:PLACED]->(:Order)-[:INCLUDES]->(:Product)
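Moving from the bad shape to the good one usually starts with splitting each flat record by entity. The sketch below assumes the property names follow an `<entity>_<field>` convention, as in the example above; `split_by_prefix` is a hypothetical helper.

```python
def split_by_prefix(record: dict, prefixes=("customer", "product", "order")) -> dict:
    """Group flat 'entity_field' keys into one property map per entity."""
    entities = {prefix: {} for prefix in prefixes}
    for key, value in record.items():
        prefix, _, field_name = key.partition("_")
        if prefix in entities and field_name:
            entities[prefix][field_name] = value
    return entities


record = {
    "customer_name": "Ada",
    "customer_email": "ada@example.com",
    "product_name": "Widget",
    "product_sku": "W-1",
    "order_date": "2024-01-01",
}
parts = split_by_prefix(record)
print(parts["customer"])  # {'name': 'Ada', 'email': 'ada@example.com'}
print(parts["product"])   # {'name': 'Widget', 'sku': 'W-1'}
print(parts["order"])     # {'date': '2024-01-01'}
```

Each resulting map can then serve as the parameters for one node in the (:Customer)-[:PLACED]->(:Order)-[:INCLUDES]->(:Product) pattern.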
The missing relationship direction anti-pattern
Property graph relationships are always directed. Modeling bidirectional concepts (friendship) with two relationships doubles storage and complicates queries.
Fix: Use a single direction and query with undirected pattern matching:
// Single relationship, query both directions
MATCH (a:Person)-[:FRIENDS_WITH]-(b:Person) // note: no arrow
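On the write side, storing a single direction requires choosing it consistently. A minimal sketch, assuming people are identified by comparable ids and that the smaller id always becomes the start node (an arbitrary convention); `canonical_friendships` is a hypothetical helper.

```python
def canonical_friendships(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Collapse (a, b) and (b, a) into one stored direction per friendship."""
    seen = set()
    for a, b in pairs:
        if a == b:
            continue  # skip self-friendships
        seen.add((a, b) if a < b else (b, a))
    return sorted(seen)


pairs = [("bob", "alice"), ("alice", "bob"), ("carol", "alice")]
print(canonical_friendships(pairs))  # [('alice', 'bob'), ('alice', 'carol')]
```

Each resulting pair maps to exactly one FRIENDS_WITH relationship, and the undirected match still finds it from either side.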
Testing graph models
import pytest
from neo4j import GraphDatabase


@pytest.fixture
def session():
    # Placeholder URI and credentials: point these at a disposable test instance
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as s:
        yield s
    driver.close()


class TestGraphModel:
    """Validate the graph model against business rules."""

    def test_every_order_has_customer(self, session):
        result = session.run("""
            MATCH (o:Order) WHERE NOT (o)<-[:PLACED]-(:Customer)
            RETURN count(o) AS orphans
        """).single()
        assert result["orphans"] == 0, "Found orders without customers"

    def test_no_self_relationships(self, session):
        result = session.run("""
            MATCH (n)-[r]->(n) RETURN count(r) AS self_loops
        """).single()
        assert result["self_loops"] == 0, "Found self-referencing relationships"

    def test_product_prices_positive(self, session):
        result = session.run("""
            MATCH (p:Product) WHERE p.price <= 0
            RETURN count(p) AS invalid
        """).single()
        assert result["invalid"] == 0, "Found products with non-positive prices"
Benchmarking model alternatives
When choosing between modeling approaches, benchmark with realistic data volumes:
import time
from typing import Optional


def benchmark_query(session, query: str, params: Optional[dict] = None, iterations: int = 100):
    """Time repeated runs of a query and summarize latency in milliseconds."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        result = session.run(query, params or {})
        list(result)  # consume results so the full query cost is measured
        times.append(time.perf_counter() - start)
    times.sort()
    return {
        "mean_ms": sum(times) / len(times) * 1000,
        "p99_ms": times[int(len(times) * 0.99)] * 1000,
        "min_ms": times[0] * 1000,
    }
Compare models at 10x and 100x your expected data volume to catch scaling issues early.
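Early iterations often run against cold caches and can skew the mean. One possible refinement is to discard a few warm-up runs before summarizing. The sketch below works on raw timings in seconds; `summarize_timings` and its `warmup` parameter are illustrative assumptions, and the p99 uses the nearest-rank method.

```python
import math


def summarize_timings(times_s: list[float], warmup: int = 0) -> dict:
    """Summarize timings in milliseconds after dropping the first `warmup` runs."""
    samples = sorted(times_s[warmup:])
    if not samples:
        raise ValueError("no samples left after warm-up")
    # Nearest rank: smallest sample with at least 99% of samples at or below it
    rank = math.ceil(0.99 * len(samples))
    return {
        "mean_ms": sum(samples) / len(samples) * 1000,
        "p99_ms": samples[rank - 1] * 1000,
        "min_ms": samples[0] * 1000,
    }


# First run is slow because caches are cold; drop it with warmup=1
stats = summarize_timings([0.050, 0.010, 0.012, 0.011], warmup=1)
print(stats)
```

Running the same summary for each candidate model at each data volume gives comparable numbers side by side.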
One thing to remember: Property graph modeling is query-driven design. Write your most important queries first, then shape the graph to make those queries natural single-traversal operations. The model should feel obvious when you see it — if it feels forced, redesign.
See Also
- Python Knowledge Graph Construction: How Python builds a web of facts about the world, connecting people, places, and ideas so computers can answer real questions.
- Python Neo4j Integration: How Python talks to a database that thinks in connections instead of rows and columns.
- Python Rdf Sparql Queries: How Python reads and asks questions about the web's universal language for describing things and their connections.