Legal Knowledge Graphs with Python — Deep Dive

Build a legal knowledge graph in Python with Neo4j, automated entity and relationship extraction from case law, Cypher queries for legal research, and graph-enhanced RAG for legal AI

Schema design in Neo4j

A well-designed legal graph schema balances expressiveness with query performance:

from neo4j import GraphDatabase
from dataclasses import dataclass


@dataclass
class LegalGraphSchema:
    """Define the legal knowledge graph schema with constraints and indexes."""

    CONSTRAINTS = [
        "CREATE CONSTRAINT IF NOT EXISTS FOR (s:Statute) REQUIRE s.citation IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (o:Opinion) REQUIRE o.citation IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (j:Judge) REQUIRE j.judge_id IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (c:Court) REQUIRE c.court_id IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (p:Party) REQUIRE p.party_id IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (t:LegalConcept) REQUIRE t.name IS UNIQUE",
        "CREATE CONSTRAINT IF NOT EXISTS FOR (r:Regulation) REQUIRE r.cfr_citation IS UNIQUE",
    ]

    INDEXES = [
        "CREATE INDEX IF NOT EXISTS FOR (o:Opinion) ON (o.date_decided)",
        "CREATE INDEX IF NOT EXISTS FOR (o:Opinion) ON (o.court)",
        "CREATE INDEX IF NOT EXISTS FOR (j:Judge) ON (j.name)",
        "CREATE FULLTEXT INDEX opinion_text IF NOT EXISTS FOR (o:Opinion) ON EACH [o.text, o.case_name]",
    ]


class LegalGraphDB:
    def __init__(self, uri: str, user: str, password: str):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def initialize_schema(self):
        """Create constraints and indexes."""
        schema = LegalGraphSchema()
        with self.driver.session() as session:
            for constraint in schema.CONSTRAINTS:
                session.run(constraint)
            for index in schema.INDEXES:
                session.run(index)

    def close(self):
        self.driver.close()

Ingesting legal data

Loading court opinions and their relationships into the graph:

from dataclasses import dataclass


@dataclass
class OpinionData:
    citation: str
    case_name: str
    court: str
    date_decided: str
    judge_name: str
    judge_id: str
    text: str
    cited_opinions: list[str]  # citations of opinions this one cites
    statutes_cited: list[str]  # statutory citations
    legal_concepts: list[str]  # extracted topics


class LegalGraphIngester:
    def __init__(self, db: LegalGraphDB):
        self.db = db

    def ingest_opinion(self, opinion: OpinionData):
        """Add an opinion and all its relationships to the graph."""
        with self.db.driver.session() as session:
            # Create or merge the opinion node
            session.run("""
                MERGE (o:Opinion {citation: $citation})
                SET o.case_name = $case_name,
                    o.court = $court,
                    o.date_decided = date($date_decided),
                    o.text = $text
            """, {
                "citation": opinion.citation,
                "case_name": opinion.case_name,
                "court": opinion.court,
                "date_decided": opinion.date_decided,
                "text": opinion.text[:10000],  # truncate for storage
            })

            # Create judge and link
            session.run("""
                MERGE (j:Judge {judge_id: $judge_id})
                SET j.name = $judge_name
                WITH j
                MATCH (o:Opinion {citation: $citation})
                MERGE (o)-[:DECIDED_BY]->(j)
            """, {
                "judge_id": opinion.judge_id,
                "judge_name": opinion.judge_name,
                "citation": opinion.citation,
            })

            # Create court and link
            session.run("""
                MERGE (c:Court {court_id: $court})
                WITH c
                MATCH (o:Opinion {citation: $citation})
                MERGE (o)-[:FILED_IN]->(c)
            """, {
                "court": opinion.court,
                "citation": opinion.citation,
            })

            # Create citation relationships
            for cited in opinion.cited_opinions:
                session.run("""
                    MATCH (citing:Opinion {citation: $citing})
                    MERGE (cited:Opinion {citation: $cited})
                    MERGE (citing)-[:CITES]->(cited)
                """, {"citing": opinion.citation, "cited": cited})

            # Link to statutes
            for statute in opinion.statutes_cited:
                session.run("""
                    MATCH (o:Opinion {citation: $opinion})
                    MERGE (s:Statute {citation: $statute})
                    MERGE (o)-[:INTERPRETS]->(s)
                """, {"opinion": opinion.citation, "statute": statute})

            # Link to legal concepts
            for concept in opinion.legal_concepts:
                session.run("""
                    MATCH (o:Opinion {citation: $opinion})
                    MERGE (t:LegalConcept {name: $concept})
                    MERGE (o)-[:ABOUT]->(t)
                """, {"opinion": opinion.citation, "concept": concept})

    def ingest_batch(self, opinions: list[OpinionData], batch_size: int = 100):
        """Ingest opinions in batches for performance."""
        for i in range(0, len(opinions), batch_size):
            batch = opinions[i:i + batch_size]
            for opinion in batch:
                self.ingest_opinion(opinion)
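
The loop above still issues several queries per opinion, so the batch size does not reduce round trips. One common way to get a real speedup is to send each batch as a list parameter and let Cypher expand it with UNWIND. The following is a minimal sketch that covers only the Opinion nodes; `OpinionRow`, `BATCH_OPINION_QUERY`, `opinion_rows`, `chunked`, and `ingest_opinion_nodes` are hypothetical names, and `session` is assumed to be any object with a Neo4j-style `run(query, params)` method.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class OpinionRow:
    """Hypothetical, trimmed-down stand-in for OpinionData."""
    citation: str
    case_name: str
    court: str
    date_decided: str  # ISO date string, e.g. "2000-01-31"
    text: str


# One query per batch: UNWIND turns the $rows list into one row per opinion.
BATCH_OPINION_QUERY = """
UNWIND $rows AS row
MERGE (o:Opinion {citation: row.citation})
SET o.case_name = row.case_name,
    o.court = row.court,
    o.date_decided = date(row.date_decided),
    o.text = row.text
"""


def opinion_rows(opinions: list[OpinionRow], max_chars: int = 10000) -> list[dict]:
    """Convert opinions to plain dicts that can be sent as a query parameter."""
    return [
        {
            "citation": o.citation,
            "case_name": o.case_name,
            "court": o.court,
            "date_decided": o.date_decided,
            "text": o.text[:max_chars],  # same truncation as ingest_opinion
        }
        for o in opinions
    ]


def chunked(rows: list[dict], batch_size: int = 100) -> Iterator[list[dict]]:
    """Yield successive slices of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]


def ingest_opinion_nodes(session, opinions: list[OpinionRow], batch_size: int = 100) -> int:
    """Run one UNWIND query per batch and return the number of batches sent."""
    batches = 0
    for batch in chunked(opinion_rows(opinions), batch_size):
        session.run(BATCH_OPINION_QUERY, {"rows": batch})
        batches += 1
    return batches
```

The same pattern would extend to judges, courts, and citation edges by adding one UNWIND query per relationship type.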

Legal concept extraction

Automatically tagging opinions with legal concepts by comparing sentence embeddings against a fixed concept list:

from sentence_transformers import SentenceTransformer, util


LEGAL_CONCEPTS = [
    "due process", "equal protection", "free speech",
    "search and seizure", "cruel and unusual punishment",
    "right to counsel", "double jeopardy", "fair use",
    "patent infringement", "breach of contract",
    "negligence", "strict liability", "standing",
    "sovereign immunity", "qualified immunity",
    "class certification", "arbitration",
    "employment discrimination", "antitrust",
    "securities fraud", "environmental regulation",
]


class LegalConceptExtractor:
    def __init__(self):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.concept_embeddings = self.model.encode(
            LEGAL_CONCEPTS, convert_to_tensor=True
        )

    def extract_concepts(
        self, opinion_text: str, threshold: float = 0.4, top_k: int = 5
    ) -> list[tuple[str, float]]:
        """Identify legal concepts discussed in an opinion."""
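        # Encode the opinion (use summary/headnotes if available).
        # Only the opening passage is scored: the slice caps input at 2000
        # characters, and the encoder truncates long inputs further.
        text_embedding = self.model.encode(
            opinion_text[:2000],  # first 2000 chars as proxy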
            convert_to_tensor=True,
        )

        similarities = util.cos_sim(text_embedding, self.concept_embeddings)[0]
        scored = [
            (LEGAL_CONCEPTS[i], float(similarities[i]))
            for i in range(len(LEGAL_CONCEPTS))
            if float(similarities[i]) >= threshold
        ]
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored[:top_k]
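
To see what `threshold` and `top_k` do without downloading a model, the selection step can be reproduced with plain Python lists. This is an illustrative sketch only: the three-dimensional vectors are made up, and `cosine` and `select_concepts` are hypothetical helpers that mirror the filter, sort, and slice logic above.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_concepts(
    text_vec: list[float],
    concept_vecs: dict[str, list[float]],
    threshold: float = 0.4,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Keep concepts scoring at or above threshold, best first, at most top_k."""
    scored = [(name, cosine(text_vec, vec)) for name, vec in concept_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Made-up embeddings: the text vector points mostly along "due process".
toy_concepts = {
    "due process": [1.0, 0.0, 0.0],
    "fair use": [0.0, 1.0, 0.0],
    "negligence": [0.6, 0.8, 0.0],
}
toy_text = [0.9, 0.1, 0.0]

selected = select_concepts(toy_text, toy_concepts, threshold=0.4, top_k=5)
print([name for name, _ in selected])
```

With these toy vectors, "fair use" falls below the 0.4 threshold and is dropped, while the other two concepts are returned in descending order of similarity.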

Cypher queries for legal research

With opinions, statutes, judges, and concepts stored as connected nodes, common research questions become short Cypher queries:

class LegalResearchQueries:
    def __init__(self, db: LegalGraphDB):
        self.db = db

    def citation_chain(
        self, start_citation: str, max_depth: int = 3
    ) -> list[dict]:
        """Find the chain of precedent from a case."""
        with self.db.driver.session() as session:
            result = session.run("""
                MATCH path = (start:Opinion {citation: $citation})
                    -[:CITES*1..$depth]->(ancestor:Opinion)
                RETURN [node in nodes(path) | node.citation] AS chain,
                       [node in nodes(path) | node.case_name] AS names,
                       length(path) AS depth
                ORDER BY depth
                LIMIT 50
            """, {"citation": start_citation, "depth": max_depth})
            return [dict(r) for r in result]

    def circuit_split(self, statute_citation: str) -> list[dict]:
        """Find cases where different circuits interpret the same statute differently."""
        with self.db.driver.session() as session:
            result = session.run("""
                MATCH (o1:Opinion)-[:INTERPRETS]->(s:Statute {citation: $statute})
                MATCH (o2:Opinion)-[:INTERPRETS]->(s)
                WHERE o1.court <> o2.court
                  AND o1.citation <> o2.citation
                  AND o1.date_decided > o2.date_decided
                OPTIONAL MATCH (o1)-[r:DISTINGUISHES]->(o2)
                RETURN o1.citation AS later_case,
                       o1.court AS later_court,
                       o2.citation AS earlier_case,
                       o2.court AS earlier_court,
                       r IS NOT NULL AS explicitly_distinguished
                ORDER BY o1.date_decided DESC
                LIMIT 20
            """, {"statute": statute_citation})
            return [dict(r) for r in result]

    def judge_citation_patterns(self, judge_id: str) -> list[dict]:
        """Analyze which authorities a judge cites most frequently."""
        with self.db.driver.session() as session:
            result = session.run("""
                MATCH (j:Judge {judge_id: $judge_id})<-[:DECIDED_BY]-(o:Opinion)
                      -[:CITES]->(cited:Opinion)
                RETURN cited.citation AS cited_case,
                       cited.case_name AS case_name,
                       count(*) AS times_cited
                ORDER BY times_cited DESC
                LIMIT 20
            """, {"judge_id": judge_id})
            return [dict(r) for r in result]

    def concept_evolution(self, concept_name: str) -> list[dict]:
        """Track how a legal concept has evolved over time."""
        with self.db.driver.session() as session:
            result = session.run("""
                MATCH (o:Opinion)-[:ABOUT]->(t:LegalConcept {name: $concept})
                RETURN o.citation AS citation,
                       o.case_name AS case_name,
                       o.court AS court,
                       o.date_decided AS date_decided
                ORDER BY o.date_decided
            """, {"concept": concept_name})
            return [dict(r) for r in result]

    def find_related_authorities(
        self, citation: str, hops: int = 2
    ) -> list[dict]:
        """Find authorities connected within N hops — useful for research expansion."""
        with self.db.driver.session() as session:
            result = session.run("""
                MATCH (start:Opinion {citation: $citation})
                MATCH (start)-[:CITES|INTERPRETS*1..$hops]-(related)
                WHERE related <> start
                RETURN DISTINCT labels(related)[0] AS type,
                       COALESCE(related.citation, related.name) AS identifier,
                       COALESCE(related.case_name, related.citation) AS name
                LIMIT 50
            """, {"citation": citation, "hops": hops})
            return [dict(r) for r in result]

Graph-enhanced RAG for legal AI

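Knowledge graphs can improve retrieval-augmented generation for legal question answering: full-text search finds candidate opinions, and the citation graph adds the authorities those opinions cite or are cited by: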

from sentence_transformers import SentenceTransformer
import numpy as np


class LegalGraphRAG:
    """Combine graph traversal with vector search for legal Q&A."""

    def __init__(self, db: LegalGraphDB):
        self.db = db
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def retrieve_context(
        self, question: str, max_documents: int = 10
    ) -> list[dict]:
        """Retrieve relevant legal context using graph + vector hybrid search."""
        # Step 1: Full-text search for initial candidates
        with self.db.driver.session() as session:
            text_results = session.run("""
                CALL db.index.fulltext.queryNodes(
                    'opinion_text', $query
                ) YIELD node, score
                RETURN node.citation AS citation,
                       node.case_name AS case_name,
                       node.text AS text,
                       score
                ORDER BY score DESC
                LIMIT $limit
            """, {"query": question, "limit": max_documents // 2})
            initial = [dict(r) for r in text_results]

        # Step 2: Expand via graph — find cited/citing opinions
        expanded = []
        for doc in initial:
            with self.db.driver.session() as session:
                neighbors = session.run("""
                    MATCH (o:Opinion {citation: $citation})
                          -[:CITES|INTERPRETS]-(neighbor)
                    RETURN neighbor.citation AS citation,
                           COALESCE(neighbor.case_name, neighbor.citation) AS name,
                           neighbor.text AS text
                    LIMIT 5
                """, {"citation": doc["citation"]})
                expanded.extend([dict(r) for r in neighbors])

        # Step 3: Re-rank all candidates by semantic similarity to question
        all_candidates = initial + expanded
        if not all_candidates:
            return []

        question_emb = self.embedder.encode(question)
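        scored = []
        seen = set()
        for cand in all_candidates:
            text = cand.get("text")  # None for neighbors without text, e.g. statutes
            # Skip text-less nodes and opinions already scored via another path
            if not text or cand["citation"] in seen:
                continue
            seen.add(cand["citation"])
            cand_emb = self.embedder.encode(text[:500])
            similarity = float(np.dot(question_emb, cand_emb) / (
                np.linalg.norm(question_emb) * np.linalg.norm(cand_emb)
            ))
            scored.append((cand, similarity))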

        scored.sort(key=lambda x: x[1], reverse=True)
        return [s[0] for s in scored[:max_documents]]
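
Step 1 passes the user's question straight to the full-text index, which parses it as a Lucene query. Characters such as `:`, `(`, or `"` carry syntactic meaning there, so an ordinary legal question can fail to parse or match unexpectedly. A minimal sketch of escaping the question first is shown below; `escape_lucene` is a hypothetical helper, and the character set is an assumption based on Lucene's classic query syntax that should be checked against the Neo4j version in use.

```python
# Characters with special meaning in Lucene's classic query syntax (assumed set).
LUCENE_SPECIAL = set('\\+-!():^[]"{}~*?|&/')


def escape_lucene(text: str) -> str:
    """Backslash-escape Lucene operators so the text is searched literally."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in text)


question = "standing: injury (Art. III)?"
print(escape_lucene(question))
```

The escaped string would then be passed as the `query` parameter in place of the raw question. Uppercase `AND`, `OR`, and `NOT` are not handled by this sketch and may still be read as operators.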

Production considerations

Incremental updates — Legal databases publish new opinions daily. The graph needs an incremental ingestion pipeline that adds new opinions and their relationships without rebuilding the entire graph. Use MERGE operations in Neo4j to handle upserts idempotently.
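
One way to decide which opinions to feed into the MERGE-based ingester is a date checkpoint with a small overlap window. Because MERGE is idempotent, re-ingesting a few already-loaded opinions is harmless, and the overlap catches opinions published late. This is a sketch under assumptions: `select_for_ingestion`, `advance_checkpoint`, the dict-shaped feed, and the 7-day default are all hypothetical.

```python
from datetime import date, timedelta


def select_for_ingestion(
    feed: list[dict], checkpoint: date, overlap_days: int = 7
) -> list[dict]:
    """Return opinions decided on or after (checkpoint - overlap), oldest first."""
    cutoff = checkpoint - timedelta(days=overlap_days)
    fresh = [o for o in feed if date.fromisoformat(o["date_decided"]) >= cutoff]
    # ISO date strings sort chronologically, so a string sort is sufficient
    return sorted(fresh, key=lambda o: o["date_decided"])


def advance_checkpoint(ingested: list[dict], previous: date) -> date:
    """Move the checkpoint to the newest decision date that was ingested."""
    if not ingested:
        return previous
    newest = max(date.fromisoformat(o["date_decided"]) for o in ingested)
    return max(newest, previous)


# Hypothetical feed entries; only the fields used here are shown.
feed = [
    {"citation": "A", "date_decided": "2024-01-01"},
    {"citation": "D", "date_decided": "2024-03-20"},
    {"citation": "B", "date_decided": "2024-03-10"},
    {"citation": "C", "date_decided": "2024-03-14"},
]
todo = select_for_ingestion(feed, checkpoint=date(2024, 3, 15))
new_checkpoint = advance_checkpoint(todo, previous=date(2024, 3, 15))
```

Each selected opinion would then go through `ingest_opinion`, and the new checkpoint would be persisted only after the batch succeeds.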

Relationship classification: Not all citations are equal. A case might cite precedent favorably (following it), negatively (distinguishing or overruling it), or neutrally (mentioning it in passing). Labeling each citation's treatment with a classifier trained on the surrounding text lets queries separate precedent that was followed from precedent that was distinguished or overruled.
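
Before training a classifier, a keyword baseline can show what the labels look like and produce a first set of typed edges. The sketch below is a rule-based stand-in, not the trained classifier described above; the cue phrases, the `classify_treatment` helper, and the `OVERRULES` and `FOLLOWS` relationship names are assumptions (only `DISTINGUISHES` appears in the queries earlier).

```python
import re

# Ordered cues: negative treatment is checked first so it wins over "follow".
TREATMENT_CUES = [
    ("OVERRULES", re.compile(r"\b(overrul\w*|abrogat\w*)\b", re.IGNORECASE)),
    ("DISTINGUISHES", re.compile(r"\b(distinguish\w*|inapposite|unlike)\b", re.IGNORECASE)),
    ("FOLLOWS", re.compile(r"\b(follow\w*|reaffirm\w*|consistent with)\b", re.IGNORECASE)),
]


def classify_treatment(context: str) -> str:
    """Return a relationship type for the sentence(s) surrounding a citation."""
    for label, pattern in TREATMENT_CUES:
        if pattern.search(context):
            return label
    return "CITES"  # neutral mention: keep the plain citation edge


examples = [
    "We decline to follow Smith and instead overrule it.",
    "Unlike in Jones, the defendant here consented.",
    "We follow the reasoning of Miranda.",
    "See also Doe v. Roe, 1 F.3d 1.",
]
labels = [classify_treatment(text) for text in examples]
```

A baseline like this mislabels negations such as "decline to follow" when no stronger cue is present, which is one reason a trained model on annotated citation contexts is the better long-term choice.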

Scale considerations: The US legal corpus alone contains millions of opinions. Neo4j can store a graph of that size, but queries spanning many hops across millions of nodes need careful optimization. Use node labels, relationship types, indexed properties, and explicit hop limits to constrain traversals.

Temporal reasoning — Law changes over time. A statute amended in 2020 shouldn’t be linked to an opinion from 1990 interpreting the old version. Track version history on statute nodes and use date-aware queries to ensure temporal consistency.
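
Version tracking can be sketched with effective-date intervals: each statute version records when it took effect and when it was superseded, and an opinion is linked to whichever version was in force on its decision date. The `StatuteVersion` fields, the `version_in_force` helper, and the example citation and dates below are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StatuteVersion:
    citation: str
    version_id: str
    effective_from: date
    effective_to: Optional[date]  # None means the version is still in force


def version_in_force(
    versions: list[StatuteVersion], on: date
) -> Optional[StatuteVersion]:
    """Return the version whose half-open interval [from, to) contains `on`."""
    for v in versions:
        if v.effective_from <= on and (v.effective_to is None or on < v.effective_to):
            return v
    return None  # statute did not exist yet on that date


# Made-up history for an illustrative statute amended in 2020.
history = [
    StatuteVersion("Example Act § 1", "v1", date(1980, 1, 1), date(2020, 7, 1)),
    StatuteVersion("Example Act § 1", "v2", date(2020, 7, 1), None),
]

old_opinion_version = version_in_force(history, date(1990, 5, 1))
new_opinion_version = version_in_force(history, date(2021, 2, 1))
```

In the graph, the same rule would become a date-range predicate on the statute version node when creating or querying `INTERPRETS` edges.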

RDF alternative — For standards-based interoperability, consider RDF with SPARQL using rdflib. The European Legislation Identifier (ELI) and Akoma Ntoso standards define RDF vocabularies for legal knowledge. This trades Neo4j’s query performance for W3C-standard interoperability.

The one thing to remember: A production legal knowledge graph uses Neo4j with automated NLP extraction to build a traversable network of opinions, statutes, judges, and concepts — enabling graph queries, citation analysis, and graph-enhanced RAG that transform legal research from keyword search to relationship navigation.
