Contract Analysis with Python NLP — Deep Dive

Build a production contract analysis pipeline in Python with clause extraction, transformer-based classification, risk scoring, and playbook comparison

Document ingestion and cleaning

Legal documents arrive in messy formats — scanned PDFs, Word files with tracked changes, and HTML exports from contract management systems. A robust ingestion layer handles all of them:

import pdfplumber
from docx import Document
import re


def extract_text_from_pdf(path: str) -> str:
    """Extract text from PDF, handling multi-column layouts."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Use layout extraction for better column handling
            text = page.extract_text(layout=True)
            if text:
                pages.append(text)
    return "\n\n".join(pages)


def extract_text_from_docx(path: str) -> str:
    """Extract text from Word documents, preserving structure."""
    doc = Document(path)
    sections = []
    for para in doc.paragraphs:
        if para.style.name.startswith("Heading"):
            sections.append(f"\n## {para.text}\n")
        else:
            sections.append(para.text)
    return "\n".join(sections)


def normalize_legal_text(text: str) -> str:
    """Normalize common legal formatting issues."""
    # Fix broken hyphenation from PDF extraction
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Normalize section references
    text = re.sub(r"Section\s+(\d+)\.(\d+)", r"Section \1.\2", text)
    # Remove excessive whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
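
The functions above cover PDF and Word input but not the HTML exports mentioned earlier. Here is a minimal sketch of that third path, using only the standard library's html.parser. The name extract_text_from_html and the two tag sets are assumptions for illustration, and unlike the other extractors it takes the markup string rather than a file path.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, inserting line breaks around block-level tags."""

    # Assumed tag sets; extend them for the exports you actually receive.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore anything inside <script> or <style>.
        if not self._skip_depth:
            self.parts.append(data)


def extract_text_from_html(html: str) -> str:
    """Return the visible text of an HTML string, one block per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

The result can be passed through normalize_legal_text like the output of the other extractors.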

Scanned documents need an additional OCR step, using pytesseract or a cloud OCR API. Quality matters enormously here: a single misread character in a dollar amount or date can cascade through the entire analysis.
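
One inexpensive guard against misread characters is to flag numeric-looking tokens that contain letters OCR engines tend to confuse with digits. The sketch below is a heuristic under stated assumptions: the look-alike list and the function name flag_suspect_numbers are illustrative, and it will produce some false positives.

```python
# Letters that OCR engines often produce in place of digits. This list is an
# illustrative assumption, not an exhaustive confusion table.
OCR_CONFUSABLES = set("OoIlSB")


def flag_suspect_numbers(text: str) -> list[str]:
    """Return tokens that look numeric but contain digit look-alike letters."""
    suspects = []
    for token in text.split():
        core = token.strip(".,;:()")   # drop surrounding punctuation
        body = core.lstrip("$€£")      # ignore a leading currency symbol
        if not body:
            continue
        has_digit = any(ch.isdigit() for ch in body)
        has_confusable = any(ch in OCR_CONFUSABLES for ch in body)
        numeric_like = all(
            ch.isdigit() or ch in OCR_CONFUSABLES or ch in ",." for ch in body
        )
        if has_digit and has_confusable and numeric_like:
            suspects.append(core)
    return suspects


print(flag_suspect_numbers("Pay $5O0,000 by 1 January 2O25 under Section 12."))
# ['$5O0,000', '2O25']
```

Flagged tokens are candidates for human review or a second OCR pass, not automatic correction.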

Clause segmentation

Splitting a contract into individual clauses is harder than it sounds. Legal documents use nested numbering systems (1, 1.1, 1.1(a), 1.1(a)(i)), and a single “clause” might span multiple paragraphs.

import re
from dataclasses import dataclass


@dataclass
class Clause:
    section_number: str
    heading: str
    text: str
    start_position: int
    end_position: int


def segment_clauses(text: str) -> list[Clause]:
    """Split contract text into individual clauses using section patterns."""
    # Common legal section numbering patterns
    section_pattern = re.compile(
        r"^(\d+\.(?:\d+\.?)*)\s+([A-Z][^\n.]+?)\.?\s*\n",
        re.MULTILINE,
    )

    matches = list(section_pattern.finditer(text))
    clauses = []

    for i, match in enumerate(matches):
        section_num = match.group(1).rstrip(".")
        heading = match.group(2).strip()
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        clause_text = text[start:end].strip()

        clauses.append(Clause(
            section_number=section_num,
            heading=heading,
            text=clause_text,
            start_position=match.start(),
            end_position=end,
        ))

    return clauses

More sophisticated approaches use spaCy’s sentence boundary detection combined with structural cues. Some teams train custom segmentation models on annotated contract data, achieving clause boundary F1 scores above 0.95.
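
The section_pattern above matches dotted numeric labels only, so sub-clauses such as 1.1(a)(i) stay inside their parent clause. If you need the full hierarchy, one option is to parse each label into a path whose length gives the nesting depth. This is a minimal sketch; parse_section_label is a hypothetical helper, and it assumes lowercase letters or roman numerals in the parentheses.

```python
import re

# Dotted numeric prefix followed by zero or more parenthesised parts.
_LABEL = re.compile(r"^(\d+(?:\.\d+)*)((?:\([a-z]+\))*)$")


def parse_section_label(label: str) -> tuple[str, ...]:
    """Turn '1.1(a)(i)' into ('1', '1', 'a', 'i')."""
    match = _LABEL.match(label.strip().rstrip("."))
    if match is None:
        raise ValueError(f"Unrecognised section label: {label!r}")
    numeric, parens = match.groups()
    parts = numeric.split(".")
    parts += re.findall(r"\(([a-z]+)\)", parens)
    return tuple(parts)


path = parse_section_label("1.1(a)(i)")
print(path, "depth:", len(path), "parent:", path[:-1])
# ('1', '1', 'a', 'i') depth: 4 parent: ('1', '1', 'a')
```

With paths in hand, a clause's parent is simply the path minus its last element, which makes it straightforward to rebuild the clause tree.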

Transformer-based clause classification

A widely used approach fine-tunes Legal-BERT or a similar domain-adapted transformer on labeled clause data:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

CLAUSE_TYPES = [
    "indemnification", "limitation_of_liability", "termination",
    "confidentiality", "governing_law", "assignment",
    "force_majeure", "warranty", "intellectual_property",
    "non_compete", "dispute_resolution", "insurance",
    "data_protection", "payment_terms", "representations",
]

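class ClauseClassifier:
    def __init__(self, model_path: str = "nlpaueb/legal-bert-base-uncased"):
        # NOTE: the base Legal-BERT checkpoint ships without a trained
        # classification head, so from_pretrained() initializes one randomly.
        # Point model_path at a checkpoint fine-tuned on CLAUSE_TYPES before
        # trusting the predictions.
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, num_labels=len(CLAUSE_TYPES)
        )
        self.model.eval()  # inference mode: disables dropout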

    def classify(self, clause_text: str) -> dict[str, float]:
        """Return probability distribution over clause types."""
        inputs = self.tokenizer(
            clause_text,
            max_length=512,
            truncation=True,
            padding=True,
            return_tensors="pt",
        )
        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.softmax(outputs.logits, dim=-1)[0]

        return {
            CLAUSE_TYPES[i]: probs[i].item()
            for i in range(len(CLAUSE_TYPES))
        }

    def predict(self, clause_text: str) -> tuple[str, float]:
        """Return the top clause type and its confidence."""
        scores = self.classify(clause_text)
        best = max(scores, key=scores.get)
        return best, scores[best]

Fine-tuning on the CUAD dataset typically involves:

Learning rate: 2e-5
Batch size: 16
Epochs: 3-5
Warm-up steps: 10% of total training steps
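
The warm-up figure depends on the size of your training set. As a worked example, the sketch below derives the step counts from the settings above. The training-set size of 8,000 clauses is a made-up number, and the calculation assumes a single device with no gradient accumulation.

```python
import math


def training_schedule(
    num_examples: int,
    batch_size: int = 16,
    epochs: int = 3,
    warmup_ratio: float = 0.10,
) -> dict[str, int]:
    """Derive optimizer step counts from dataset size and hyperparameters."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Hypothetical training set of 8,000 labeled clauses.
schedule = training_schedule(num_examples=8_000)
print(schedule)
# {'steps_per_epoch': 500, 'total_steps': 1500, 'warmup_steps': 150}
```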

Performance varies by clause type: governing law clauses reach 97%+ accuracy because they follow rigid patterns (“This Agreement shall be governed by the laws of…”), while indemnification clauses average around 88% due to their structural diversity.
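
To see this kind of spread on your own validation set, break accuracy down by true clause type rather than reporting one aggregate number. A minimal sketch follows; the helper name per_class_accuracy and the toy labels are illustrative only.

```python
from collections import defaultdict


def per_class_accuracy(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Fraction of examples of each true class that were predicted correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy data for illustration, not real evaluation results.
y_true = ["governing_law", "governing_law", "indemnification", "indemnification"]
y_pred = ["governing_law", "governing_law", "indemnification", "warranty"]
print(per_class_accuracy(y_true, y_pred))
# {'governing_law': 1.0, 'indemnification': 0.5}
```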

Risk scoring engine

Risk scoring compares extracted clause language against a playbook of approved positions. Each clause type has preferred language, acceptable fallback language, and a list of unacceptable patterns:

from dataclasses import dataclass
from enum import Enum
from sentence_transformers import SentenceTransformer, util


class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class PlaybookPosition:
    clause_type: str
    preferred_language: str
    acceptable_language: str
    unacceptable_patterns: list[str]
    risk_factors: list[str]


class RiskScorer:
    def __init__(self):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.playbook: dict[str, PlaybookPosition] = {}

    def load_playbook(self, positions: list[PlaybookPosition]):
        for pos in positions:
            self.playbook[pos.clause_type] = pos

    def score_clause(
        self, clause_text: str, clause_type: str
    ) -> tuple[RiskLevel, list[str]]:
        """Score a clause against the playbook and return risk level + reasons."""
        reasons = []
        position = self.playbook.get(clause_type)
        if not position:
            return RiskLevel.MEDIUM, ["No playbook position defined"]

        text_lower = clause_text.lower()

        # Check for unacceptable patterns (case-insensitive substring match)
        has_unacceptable = False
        for pattern in position.unacceptable_patterns:
            if pattern.lower() in text_lower:
                has_unacceptable = True
                reasons.append(f"Contains unacceptable pattern: '{pattern}'")

        # Semantic similarity to preferred language
        clause_emb = self.model.encode(clause_text, convert_to_tensor=True)
        preferred_emb = self.model.encode(
            position.preferred_language, convert_to_tensor=True
        )
        similarity = util.cos_sim(clause_emb, preferred_emb).item()

        if similarity < 0.3:
            reasons.append(
                f"Low similarity to preferred language ({similarity:.2f})"
            )

        # Check risk factors
        for factor in position.risk_factors:
            if factor.lower() in text_lower:
                reasons.append(f"Risk factor detected: '{factor}'")

        # Determine overall risk. An explicit flag is used instead of searching
        # the reason strings, which could also match a risk factor's own text.
        if has_unacceptable:
            level = RiskLevel.CRITICAL
        elif len(reasons) >= 3:
            level = RiskLevel.HIGH
        elif len(reasons) >= 1:
            level = RiskLevel.MEDIUM
        else:
            level = RiskLevel.LOW

        return level, reasons
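
Plain substring checks like the ones in score_clause can misfire: a playbook pattern such as "cap" also matches "capital". A hedged alternative is to match whole words and tolerate line breaks inside a phrase. The helper find_phrases below is a hypothetical sketch, not part of any library, and it assumes phrases begin and end with word characters.

```python
import re


def find_phrases(clause_text: str, phrases: list[str]) -> list[str]:
    """Return the phrases that occur in clause_text as whole words."""
    hits = []
    for phrase in phrases:
        words = phrase.split()
        if not words:
            continue  # skip empty playbook entries
        # \b anchors the phrase at word boundaries; \s+ lets a phrase span
        # line breaks or repeated spaces left over from extraction.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        if re.search(pattern, clause_text, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


text = "The Supplier accepts unlimited\nliability for capital losses."
print(find_phrases(text, ["unlimited liability", "cap"]))
# ['unlimited liability']
```

Here the substring test would have reported both patterns, while the word-boundary version reports only the phrase that is really present.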

Entity extraction with LexNLP

Legal entities go beyond standard NER. Contracts contain specific constructs that general models miss:

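from typing import Optional

import lexnlp.extract.en.courts as courts_ext
import lexnlp.extract.en.dates as dates_ext
import lexnlp.extract.en.money as money_ext
import lexnlp.extract.en.durations as dur_ext
import lexnlp.extract.en.entities.nltk_re as entity_ext


def extract_legal_entities(text: str, court_configs: Optional[list] = None) -> dict:
    """Extract legal-specific entities from contract text."""
    # Court extraction lives in lexnlp.extract.en.courts and matches against a
    # caller-supplied list of court configurations, so it is skipped when no
    # list is given.
    courts = list(courts_ext.get_courts(text, court_configs)) if court_configs else []
    return {
        "dates": list(dates_ext.get_dates(text)),
        "monetary_values": list(money_ext.get_money(text)),
        "durations": list(dur_ext.get_durations(text)),
        "companies": list(entity_ext.get_companies(text)),
        "courts": courts,
    }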


# Example usage
text = """
The Licensee shall pay a royalty of USD 500,000 per annum
for a period of five (5) years commencing on January 1, 2025.
Any disputes shall be resolved in the United States District
Court for the Southern District of New York.
"""

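entities = extract_legal_entities(text)
# Expected shape of the results (exact numeric types vary by LexNLP version):
# dates: [datetime.date(2025, 1, 1)]
# monetary_values: [(500000.0, 'USD')]
# durations: (unit, amount, length in days) tuples, here ('year', 5.0, ...)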

Putting it together: the analysis report

A production system combines all components into a structured report:

from dataclasses import dataclass, field
import json


@dataclass
class ClauseAnalysis:
    section: str
    heading: str
    clause_type: str
    classification_confidence: float
    risk_level: str
    risk_reasons: list[str]
    entities: dict
    text_preview: str


@dataclass
class ContractReport:
    filename: str
    total_clauses: int
    critical_findings: int
    high_risk_findings: int
    analyses: list[ClauseAnalysis] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(
            {
                "filename": self.filename,
                "summary": {
                    "total_clauses": self.total_clauses,
                    "critical": self.critical_findings,
                    "high_risk": self.high_risk_findings,
                },
                "findings": [
                    {
                        "section": a.section,
                        "heading": a.heading,
                        "type": a.clause_type,
                        "confidence": round(a.classification_confidence, 3),
                        "risk": a.risk_level,
                        "reasons": a.risk_reasons,
                        "entities": {
                            k: str(v) for k, v in a.entities.items()
                        },
                        "preview": a.text_preview[:200],
                    }
                    for a in self.analyses
                    if a.risk_level in ("critical", "high")
                ],
            },
            indent=2,
        )
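
ContractReport takes its summary counts as constructor arguments, so they can drift from the analyses list if computed by hand. One way to avoid that is to derive them from the risk levels directly. A minimal sketch; summarize_risk is a hypothetical helper whose keys mirror the "summary" object produced by to_json.

```python
from collections import Counter
from typing import Iterable


def summarize_risk(risk_levels: Iterable[str]) -> dict[str, int]:
    """Count clauses overall and at the two highest risk levels."""
    counts = Counter(risk_levels)
    return {
        "total_clauses": sum(counts.values()),
        "critical": counts["critical"],   # Counter returns 0 for missing keys
        "high_risk": counts["high"],
    }


print(summarize_risk(["low", "critical", "high", "low", "high"]))
# {'total_clauses': 5, 'critical': 1, 'high_risk': 2}
```

In the pipeline, the input would be something like (a.risk_level for a in analyses), with the result fed into the ContractReport constructor.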

Tradeoffs and production considerations

Transformer token limits — Legal clauses frequently exceed 512 tokens. Options include truncation (loses context), sliding window approaches (adds complexity), or long-context models like Longformer (higher compute cost). In practice, sliding windows with overlapping segments and majority voting work well.
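
The sliding-window approach with majority voting can be sketched as follows. The classifier is passed in as a callable so the example stays self-contained, and whitespace-style string tokens stand in for real tokenizer output. The default window of 510 assumes a 512-token limit minus two special tokens, and the overlap of 128 is an arbitrary illustrative choice.

```python
from collections import Counter
from typing import Callable, Sequence


def sliding_windows(
    tokens: Sequence[str], size: int = 510, overlap: int = 128
) -> list[list[str]]:
    """Split tokens into windows of at most `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window reached the end of the clause
        start += step
    return windows


def classify_long_clause(
    tokens: Sequence[str],
    classify_window: Callable[[list[str]], str],
    size: int = 510,
    overlap: int = 128,
) -> str:
    """Classify every window and return the most common label."""
    labels = [classify_window(w) for w in sliding_windows(tokens, size, overlap)]
    # On a tie, Counter.most_common keeps the label that was seen first.
    return Counter(labels).most_common(1)[0][0]


# Toy demonstration with a stand-in classifier (not a real model).
tokens = [f"t{i}" for i in range(10)]
fake_classifier = lambda w: "termination" if "t3" in w or "t4" in w else "other"
print(len(sliding_windows(tokens, size=4, overlap=2)))                 # 4
print(classify_long_clause(tokens, fake_classifier, size=4, overlap=2))  # termination
```

Averaging the per-window probability vectors is a common alternative to hard voting when the classifier exposes probabilities, as ClauseClassifier.classify does.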

Domain adaptation vs. general models — Legal-BERT outperforms base BERT by 3-8% on legal tasks, but fine-tuning a general model on your specific contract corpus often outperforms a generic legal model. The best results come from domain-adapting first, then fine-tuning.

Confidence thresholds — Setting classification confidence thresholds requires balancing false positives (annoying but safe) against false negatives (dangerous). Most production systems set thresholds at 0.7-0.8 and route low-confidence clauses to human review.
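
A routing rule of this kind fits in a few lines. The sketch below assumes a threshold of 0.75, the midpoint of the range above, and the route names are illustrative; the right value should come from measuring false negatives on your own validation data.

```python
def route_prediction(
    label: str, confidence: float, threshold: float = 0.75
) -> tuple[str, str]:
    """Return (route, label): accept confident predictions, escalate the rest."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a probability between 0 and 1")
    route = "auto" if confidence >= threshold else "human_review"
    return route, label


print(route_prediction("governing_law", 0.97))    # ('auto', 'governing_law')
print(route_prediction("indemnification", 0.62))  # ('human_review', 'indemnification')
```

The input pair is the same shape that ClauseClassifier.predict returns, so the two can be chained directly.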

Multilingual contracts — International deals often contain contracts in multiple languages. Multilingual models like XLM-RoBERTa handle this, but performance drops compared to monolingual models. Some teams run language-specific models in parallel.

The one thing to remember: A production contract analysis pipeline chains document ingestion, clause segmentation, transformer classification, playbook-based risk scoring, and entity extraction into a system that surfaces the 5% of clauses that actually need human attention from documents that are 95% boilerplate.
