Legal Knowledge Graphs with Python — Deep Dive
Schema design in Neo4j
A well-designed legal graph schema balances expressiveness with query performance:
from neo4j import GraphDatabase
from dataclasses import dataclass, field


@dataclass
class LegalGraphSchema:
    """Define the legal knowledge graph schema with constraints and indexes."""

    CONSTRAINTS = [
        "CREATE CONSTRAINT IF NOT EXISTS FOR (s:Statute) REQUIRE s.citation IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (o:Opinion) REQUIRE o.citation IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (j:Judge) REQUIRE j.judge_id IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (c:Court) REQUIRE c.court_id IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (p:Party) REQUIRE p.party_id IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (t:LegalConcept) REQUIRE t.name IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (r:Regulation) REQUIRE r.cfr_citation IS UNIQUE",
    ]
    INDEXES = [
        "CREATE INDEX IF NOT EXISTS FOR (o:Opinion) ON (o.date_decided)",
        "CREATE INDEX IF NOT EXISTS FOR (o:Opinion) ON (o.court)",
        "CREATE INDEX IF NOT EXISTS FOR (j:Judge) ON (j.name)",
        "CREATE FULLTEXT INDEX opinion_text IF NOT EXISTS FOR (o:Opinion) ON EACH [o.text, o.case_name]",
    ]


class LegalGraphDB:
    def __init__(self, uri: str, user: str, password: str):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def initialize_schema(self):
        """Create constraints and indexes."""
        schema = LegalGraphSchema()
        with self.driver.session() as session:
            for constraint in schema.CONSTRAINTS:
                session.run(constraint)
            for index in schema.INDEXES:
                session.run(index)

    def close(self):
        self.driver.close()
Ingesting legal data
Loading court opinions and their relationships into the graph:
from dataclasses import dataclass


@dataclass
class OpinionData:
    citation: str
    case_name: str
    court: str
    date_decided: str  # ISO 8601 date string (e.g. "2020-06-15"), parsed by Cypher date()
    judge_name: str
    judge_id: str
    text: str
    cited_opinions: list[str]  # citations of opinions this one cites
    statutes_cited: list[str]  # statutory citations
    legal_concepts: list[str]  # extracted topics


class LegalGraphIngester:
    def __init__(self, db: LegalGraphDB):
        self.db = db

    def ingest_opinion(self, opinion: OpinionData):
        """Add an opinion and all its relationships to the graph."""
        with self.db.driver.session() as session:
            # Create or merge the opinion node
            session.run("""
                MERGE (o:Opinion {citation: $citation})
                SET o.case_name = $case_name,
                    o.court = $court,
                    o.date_decided = date($date_decided),
                    o.text = $text
            """, {
                "citation": opinion.citation,
                "case_name": opinion.case_name,
                "court": opinion.court,
                "date_decided": opinion.date_decided,
                "text": opinion.text[:10000],  # truncate for storage
            })

            # Create judge and link
            session.run("""
                MERGE (j:Judge {judge_id: $judge_id})
                SET j.name = $judge_name
                WITH j
                MATCH (o:Opinion {citation: $citation})
                MERGE (o)-[:DECIDED_BY]->(j)
            """, {
                "judge_id": opinion.judge_id,
                "judge_name": opinion.judge_name,
                "citation": opinion.citation,
            })

            # Create court and link
            session.run("""
                MERGE (c:Court {court_id: $court})
                WITH c
                MATCH (o:Opinion {citation: $citation})
                MERGE (o)-[:FILED_IN]->(c)
            """, {
                "court": opinion.court,
                "citation": opinion.citation,
            })

            # Create citation relationships; a cited opinion that has not been
            # ingested yet is created as a stub node holding only its citation
            for cited in opinion.cited_opinions:
                session.run("""
                    MATCH (citing:Opinion {citation: $citing})
                    MERGE (cited:Opinion {citation: $cited})
                    MERGE (citing)-[:CITES]->(cited)
                """, {"citing": opinion.citation, "cited": cited})

            # Link to statutes
            for statute in opinion.statutes_cited:
                session.run("""
                    MATCH (o:Opinion {citation: $opinion})
                    MERGE (s:Statute {citation: $statute})
                    MERGE (o)-[:INTERPRETS]->(s)
                """, {"opinion": opinion.citation, "statute": statute})

            # Link to legal concepts
            for concept in opinion.legal_concepts:
                session.run("""
                    MATCH (o:Opinion {citation: $opinion})
                    MERGE (t:LegalConcept {name: $concept})
                    MERGE (o)-[:ABOUT]->(t)
                """, {"opinion": opinion.citation, "concept": concept})

    def ingest_batch(self, opinions: list[OpinionData], batch_size: int = 100):
        """Ingest opinions in fixed-size chunks.

        Each opinion still issues its own queries, so chunking here only
        bounds the unit of work; it does not reduce round trips.
        """
        for i in range(0, len(opinions), batch_size):
            batch = opinions[i:i + batch_size]
            for opinion in batch:
                self.ingest_opinion(opinion)
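The loop above sends several queries per opinion. One common way to cut round trips is to pass each chunk as a list parameter and let Cypher iterate over it with UNWIND. The sketch below is an assumption about how that could look, not part of the original ingester: `BATCH_UPSERT`, `chunked`, and `ingest_rows` are hypothetical names, rows are plain dicts, and `run` stands in for something like `session.run`.

```python
from typing import Callable, Iterator

# Hypothetical batched upsert: one query per chunk instead of several per opinion.
# The opinion node is written before the inner UNWIND, so opinions with an
# empty cited_opinions list are still created.
BATCH_UPSERT = """
UNWIND $rows AS row
MERGE (o:Opinion {citation: row.citation})
SET o.case_name = row.case_name,
    o.court = row.court,
    o.date_decided = date(row.date_decided)
WITH o, row
UNWIND row.cited_opinions AS cited_citation
MERGE (cited:Opinion {citation: cited_citation})
MERGE (o)-[:CITES]->(cited)
"""


def chunked(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ingest_rows(
    run: Callable[[str, dict], object],
    rows: list[dict],
    batch_size: int = 100,
) -> int:
    """Send rows chunk by chunk and return the number of queries issued."""
    queries = 0
    for chunk in chunked(rows, batch_size):
        run(BATCH_UPSERT, {"rows": chunk})
        queries += 1
    return queries
```

With 250 rows and the default batch size this issues three queries. Judge, court, statute, and concept links would need similar UNWIND clauses or separate batched queries.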
Legal concept extraction
Automatically tagging opinions with legal concepts by comparing a sentence embedding of the opinion against embeddings of a fixed concept list:
from sentence_transformers import SentenceTransformer, util

LEGAL_CONCEPTS = [
    "due process", "equal protection", "free speech",
    "search and seizure", "cruel and unusual punishment",
    "right to counsel", "double jeopardy", "fair use",
    "patent infringement", "breach of contract",
    "negligence", "strict liability", "standing",
    "sovereign immunity", "qualified immunity",
    "class certification", "arbitration",
    "employment discrimination", "antitrust",
    "securities fraud", "environmental regulation",
]


class LegalConceptExtractor:
    def __init__(self):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        # Encode the concept labels once; they are reused for every opinion
        self.concept_embeddings = self.model.encode(
            LEGAL_CONCEPTS, convert_to_tensor=True
        )

    def extract_concepts(
        self, opinion_text: str, threshold: float = 0.4, top_k: int = 5
    ) -> list[tuple[str, float]]:
        """Identify legal concepts discussed in an opinion."""
        # Encode the opinion (use summary/headnotes if available)
        text_embedding = self.model.encode(
            opinion_text[:2000],  # first 2000 chars as proxy
            convert_to_tensor=True,
        )
        # cos_sim returns a 1 x len(LEGAL_CONCEPTS) matrix; take its only row
        similarities = util.cos_sim(text_embedding, self.concept_embeddings)[0]
        scored = [
            (LEGAL_CONCEPTS[i], float(similarities[i]))
            for i in range(len(LEGAL_CONCEPTS))
            if float(similarities[i]) >= threshold
        ]
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored[:top_k]
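Using only the first 2000 characters can miss concepts that an opinion discusses later on. One alternative, offered here as an assumption rather than the author's method, is to score overlapping windows of the text and keep each concept's best window score. In this sketch `score_window` is a hypothetical stand-in for the embedding similarity step, so the example runs without loading a model; `windows` and `max_pool_concepts` are likewise illustrative names.

```python
from typing import Callable


def windows(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    out = []
    start = 0
    while True:
        out.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += step
    return out


def max_pool_concepts(
    text: str,
    score_window: Callable[[str], dict[str, float]],
    threshold: float = 0.4,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Keep each concept's highest score across all windows, then filter and rank."""
    best: dict[str, float] = {}
    for window in windows(text):
        for concept, score in score_window(window).items():
            if score > best.get(concept, float("-inf")):
                best[concept] = score
    ranked = sorted(
        ((c, s) for c, s in best.items() if s >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]
```

Max pooling favors recall: a concept mentioned in a single window still surfaces. Averaging across windows would instead favor concepts that run through the whole opinion.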
Cypher queries for legal research
With opinions, statutes, judges, and concepts stored as nodes and relationships, common legal research questions become short Cypher queries:
class LegalResearchQueries:
    def __init__(self, db: LegalGraphDB):
        self.db = db

    def citation_chain(
        self, start_citation: str, max_depth: int = 3
    ) -> list[dict]:
        """Find the chain of precedent from a case."""
        # Cypher does not accept a parameter as a variable-length bound, so the
        # validated integer is interpolated into the query text instead.
        depth = int(max_depth)
        if depth < 1:
            raise ValueError("max_depth must be >= 1")
        query = f"""
            MATCH path = (start:Opinion {{citation: $citation}})
                  -[:CITES*1..{depth}]->(ancestor:Opinion)
            RETURN [node in nodes(path) | node.citation] AS chain,
                   [node in nodes(path) | node.case_name] AS names,
                   length(path) AS depth
            ORDER BY depth
            LIMIT 50
        """
        with self.db.driver.session() as session:
            result = session.run(query, {"citation": start_citation})
            return [dict(r) for r in result]

    def circuit_split(self, statute_citation: str) -> list[dict]:
        """Find candidate circuit splits: opinions from different courts
        interpreting the same statute.

        The DISTINGUISHES relationship is only present if citation treatment
        has been labeled; the ingester above creates plain CITES edges.
        """
        with self.db.driver.session() as session:
            result = session.run("""
                MATCH (o1:Opinion)-[:INTERPRETS]->(s:Statute {citation: $statute})
                MATCH (o2:Opinion)-[:INTERPRETS]->(s)
                WHERE o1.court <> o2.court
                  AND o1.citation <> o2.citation
                  AND o1.date_decided > o2.date_decided
                OPTIONAL MATCH (o1)-[r:DISTINGUISHES]->(o2)
                RETURN o1.citation AS later_case,
                       o1.court AS later_court,
                       o2.citation AS earlier_case,
                       o2.court AS earlier_court,
                       r IS NOT NULL AS explicitly_distinguished
                ORDER BY o1.date_decided DESC
                LIMIT 20
            """, {"statute": statute_citation})
            return [dict(r) for r in result]

    def judge_citation_patterns(self, judge_id: str) -> list[dict]:
        """Analyze which authorities a judge cites most frequently."""
        with self.db.driver.session() as session:
            result = session.run("""
                MATCH (j:Judge {judge_id: $judge_id})<-[:DECIDED_BY]-(o:Opinion)
                      -[:CITES]->(cited:Opinion)
                RETURN cited.citation AS cited_case,
                       cited.case_name AS case_name,
                       count(*) AS times_cited
                ORDER BY times_cited DESC
                LIMIT 20
            """, {"judge_id": judge_id})
            return [dict(r) for r in result]

    def concept_evolution(self, concept_name: str) -> list[dict]:
        """Track how a legal concept has evolved over time."""
        with self.db.driver.session() as session:
            result = session.run("""
                MATCH (o:Opinion)-[:ABOUT]->(t:LegalConcept {name: $concept})
                RETURN o.citation AS citation,
                       o.case_name AS case_name,
                       o.court AS court,
                       o.date_decided AS date_decided
                ORDER BY o.date_decided
            """, {"concept": concept_name})
            return [dict(r) for r in result]

    def find_related_authorities(
        self, citation: str, hops: int = 2
    ) -> list[dict]:
        """Find authorities connected within N hops (useful for research expansion)."""
        # As in citation_chain, the hop bound must be a literal in the query text.
        max_hops = int(hops)
        if max_hops < 1:
            raise ValueError("hops must be >= 1")
        query = f"""
            MATCH (start:Opinion {{citation: $citation}})
            MATCH (start)-[:CITES|INTERPRETS*1..{max_hops}]-(related)
            WHERE related <> start
            RETURN DISTINCT labels(related)[0] AS type,
                   COALESCE(related.citation, related.name) AS identifier,
                   COALESCE(related.case_name, related.citation) AS name
            LIMIT 50
        """
        with self.db.driver.session() as session:
            result = session.run(query, {"citation": citation})
            return [dict(r) for r in result]
Graph-enhanced RAG for legal AI
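A knowledge graph can improve retrieval-augmented generation for legal question answering by pulling in authorities that are linked to the initial hits through citations or shared statutes, even when they share few keywords with the question: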
from sentence_transformers import SentenceTransformer
import numpy as np


class LegalGraphRAG:
    """Combine graph traversal with vector search for legal Q&A."""

    def __init__(self, db: LegalGraphDB):
        self.db = db
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def retrieve_context(
        self, question: str, max_documents: int = 10
    ) -> list[dict]:
        """Retrieve relevant legal context using graph + vector hybrid search."""
        # Step 1: Full-text search for initial candidates
        with self.db.driver.session() as session:
            text_results = session.run("""
                CALL db.index.fulltext.queryNodes(
                    'opinion_text', $query
                ) YIELD node, score
                RETURN node.citation AS citation,
                       node.case_name AS case_name,
                       node.text AS text,
                       score
                ORDER BY score DESC
                LIMIT $limit
            """, {"query": question, "limit": max_documents // 2})
            initial = [dict(r) for r in text_results]

        # Step 2: Expand via graph: find cited/citing opinions
        expanded = []
        for doc in initial:
            with self.db.driver.session() as session:
                neighbors = session.run("""
                    MATCH (o:Opinion {citation: $citation})
                          -[:CITES|INTERPRETS]-(neighbor)
                    RETURN neighbor.citation AS citation,
                           COALESCE(neighbor.case_name, neighbor.citation) AS name,
                           neighbor.text AS text
                    LIMIT 5
                """, {"citation": doc["citation"]})
                expanded.extend([dict(r) for r in neighbors])

        # Step 3: Re-rank all candidates by semantic similarity to question
        all_candidates = initial + expanded
        if not all_candidates:
            return []
        question_emb = self.embedder.encode(question)
        scored = []
        for cand in all_candidates:
            text = cand.get("text", "")
            if text:
                cand_emb = self.embedder.encode(text[:500])
                similarity = float(np.dot(question_emb, cand_emb) / (
                    np.linalg.norm(question_emb) * np.linalg.norm(cand_emb)
                ))
                scored.append((cand, similarity))
        scored.sort(key=lambda x: x[1], reverse=True)
        return [s[0] for s in scored[:max_documents]]
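Graph expansion can return the same opinion more than once, for instance as an initial hit and again as a neighbor of another hit. A minimal sketch of removing duplicates by citation before re-ranking follows. It uses plain-Python cosine similarity and a caller-supplied `embed` function in place of the sentence-transformer model; `cosine`, `dedupe_by_citation`, and `rerank` are hypothetical helper names, not part of the class above.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe_by_citation(candidates: list[dict]) -> list[dict]:
    """Keep the first candidate seen for each citation; drop ones without a citation."""
    seen = set()
    unique = []
    for cand in candidates:
        key = cand.get("citation")
        if key is None or key in seen:
            continue
        seen.add(key)
        unique.append(cand)
    return unique


def rerank(
    question_vec: list[float],
    candidates: list[dict],
    embed: Callable[[str], list[float]],
    limit: int = 10,
) -> list[dict]:
    """Score unique candidates that have text and return the top `limit`."""
    scored = [
        (cosine(question_vec, embed(cand["text"])), cand)
        for cand in dedupe_by_citation(candidates)
        if cand.get("text")
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:limit]]
```

Deduplicating first also avoids embedding the same text twice, which matters once the candidate set grows past a handful of documents.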
Production considerations
Incremental updates — Legal databases publish new opinions daily. The graph needs an incremental ingestion pipeline that adds new opinions and their relationships without rebuilding the entire graph. Use MERGE operations in Neo4j to handle upserts idempotently.
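As a rough sketch of the selection step in such a pipeline, the function below filters incoming records against a stored watermark date. It assumes each record carries an ISO `date_decided` string; `select_new` and the `watermark` value are hypothetical. Because MERGE makes the writes idempotent, re-sending records dated exactly on the watermark is harmless, so the comparison is inclusive. A real feed may expose a publication or modification timestamp, which would likely be a better watermark than the decision date.

```python
from datetime import date


def select_new(records: list[dict], watermark: date) -> tuple[list[dict], date]:
    """Return records decided on or after the watermark, plus the next watermark."""
    fresh = [
        r for r in records
        if date.fromisoformat(r["date_decided"]) >= watermark
    ]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max(
        (date.fromisoformat(r["date_decided"]) for r in fresh),
        default=watermark,
    )
    return fresh, new_watermark
```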
Relationship classification — Not all citations are equal. A case might cite precedent favorably (following it), negatively (distinguishing or overruling it), or neutrally (mentioning in passing). Training a classifier to label citation treatment using the surrounding text improves graph quality significantly.
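Before training a classifier, a keyword baseline can show what typed citation edges would look like. The sketch below maps the text surrounding a citation to a relationship label. The cue phrases and the label names other than DISTINGUISHES (which the circuit_split query looks for) are assumptions for illustration, and a lookup like this mishandles negation such as "decline to follow".

```python
import re

# Hypothetical cue phrases, checked in priority order.
# A trained classifier would replace this lookup.
TREATMENT_CUES = [
    ("OVERRULES", re.compile(r"\boverrul(?:e|es|ed|ing)\b", re.I)),
    ("DISTINGUISHES", re.compile(r"\bdistinguish(?:es|ed|ing|able)?\b", re.I)),
    ("FOLLOWS", re.compile(r"\b(?:follow(?:s|ed|ing)?|consistent with|as held in)\b", re.I)),
]


def classify_treatment(context: str) -> str:
    """Return a relationship type for a citation based on the surrounding text."""
    for label, pattern in TREATMENT_CUES:
        if pattern.search(context):
            return label
    # No cue found: fall back to the generic citation edge.
    return "CITES"
```

The returned label could then be used as the relationship type when merging the edge, in place of the generic CITES.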
Scale considerations — The US legal corpus alone contains millions of opinions. Neo4j handles this well in production, but queries spanning many hops across millions of nodes need careful optimization. Use node labels, relationship types, and indexed properties to constrain traversals.
Temporal reasoning — Law changes over time. A statute amended in 2020 shouldn’t be linked to an opinion from 1990 interpreting the old version. Track version history on statute nodes and use date-aware queries to ensure temporal consistency.
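One way to model this, sketched here under assumptions, is to give each statute version an effective date range and resolve the version in force on the opinion's decision date before creating the INTERPRETS edge. `StatuteVersion` and `version_in_force` are hypothetical names, and the range is treated as half-open (start inclusive, end exclusive).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StatuteVersion:
    citation: str
    version_id: str
    effective_from: date
    effective_to: Optional[date] = None  # None means still in force


def version_in_force(
    versions: list[StatuteVersion], on: date
) -> Optional[StatuteVersion]:
    """Return the version whose [effective_from, effective_to) range contains `on`."""
    for version in versions:
        starts_ok = version.effective_from <= on
        ends_ok = version.effective_to is None or on < version.effective_to
        if starts_ok and ends_ok:
            return version
    return None
```

An opinion dated before the earliest version resolves to `None`, which is a useful signal that the statute history is incomplete rather than something to paper over with a default.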
RDF alternative — For standards-based interoperability, consider RDF with SPARQL using rdflib. The European Legislation Identifier (ELI) and Akoma Ntoso standards define RDF vocabularies for legal knowledge. This trades Neo4j’s query performance for W3C-standard interoperability.
The one thing to remember: A production legal knowledge graph uses Neo4j with automated NLP extraction to build a traversable network of opinions, statutes, judges, and concepts — enabling graph queries, citation analysis, and graph-enhanced RAG that transform legal research from keyword search to relationship navigation.