Contract Analysis with Python NLP — Deep Dive
Document ingestion and cleaning
Legal documents arrive in messy formats — scanned PDFs, Word files with tracked changes, and HTML exports from contract management systems. A robust ingestion layer handles all of them:
import pdfplumber
from docx import Document
import re


def extract_text_from_pdf(path: str) -> str:
    """Extract text from PDF, handling multi-column layouts."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Use layout extraction for better column handling
            text = page.extract_text(layout=True)
            if text:
                pages.append(text)
    return "\n\n".join(pages)


def extract_text_from_docx(path: str) -> str:
    """Extract text from Word documents, preserving structure."""
    doc = Document(path)
    sections = []
    for para in doc.paragraphs:
        if para.style.name.startswith("Heading"):
            sections.append(f"\n## {para.text}\n")
        else:
            sections.append(para.text)
    return "\n".join(sections)


def normalize_legal_text(text: str) -> str:
    """Normalize common legal formatting issues."""
    # Fix broken hyphenation from PDF extraction
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Normalize section references
    text = re.sub(r"Section\s+(\d+)\.(\d+)", r"Section \1.\2", text)
    # Remove excessive whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
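To make the normalization rules concrete, here is a small self-contained check that repeats normalize_legal_text from above and runs it on an invented snippet containing a broken hyphenation, an irregular section reference, and a run of blank lines. The sample text is made up for illustration only.

```python
import re


def normalize_legal_text(text: str) -> str:
    """Same three rules as the function above."""
    # Rejoin words split by a hyphen at a line break
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse irregular spacing in "Section N.M" references
    text = re.sub(r"Section\s+(\d+)\.(\d+)", r"Section \1.\2", text)
    # Reduce three or more newlines to a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Invented sample resembling raw PDF output
raw = (
    "The Licen-\nsee shall comply with Section   4.2 of this Agree-\nment."
    "\n\n\n\nNotices shall be sent in writing.\n"
)

cleaned = normalize_legal_text(raw)
print(cleaned)
```

One caveat worth knowing: the hyphenation rule cannot tell a line-break hyphen from a real one, so a compound such as "non-compete" that happens to wrap at the hyphen comes out as "noncompete". A dictionary check before joining is one possible refinement.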
OCR-based documents require an additional step with pytesseract or cloud OCR APIs. Quality matters enormously here — a single misread character in a dollar amount or date can cascade through the entire analysis.
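One cheap guard against misread amounts is to flag monetary strings that contain letters where digits are expected. The sketch below is a heuristic, not part of any OCR library: the look-alike character list, the regular expression, and the suspicious_amounts helper are all assumptions made for illustration.

```python
import re

# Letters that OCR output sometimes substitutes for digits (assumed list)
LOOKALIKES = set("OoIlSB")

# A currency marker followed by a run of digits, separators, or look-alike letters
AMOUNT_RE = re.compile(r"(?:USD\s?|\$)([0-9OoIlSB][0-9OoIlSB,.]*)")


def suspicious_amounts(text: str) -> list[str]:
    """Return amount strings that mix digits with digit look-alike letters."""
    flagged = []
    for match in AMOUNT_RE.finditer(text):
        # Drop sentence punctuation captured at the end of the amount
        amount = match.group(1).rstrip(".,")
        has_digit = any(ch.isdigit() for ch in amount)
        has_lookalike = any(ch in LOOKALIKES for ch in amount)
        if has_digit and has_lookalike:
            flagged.append(amount)
    return flagged


# Invented OCR output: the first amount has a capital letter O instead of a zero
sample = "A fee of $1O,000 is due on signing, plus a deposit of USD 2,500."
flagged = suspicious_amounts(sample)
print(flagged)
```

Flagged amounts would then go to a human or to a second OCR pass rather than being corrected automatically, since guessing the intended digit is exactly the kind of error this check is meant to catch.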
Clause segmentation
Splitting a contract into individual clauses is harder than it sounds. Legal documents use nested numbering systems (1, 1.1, 1.1(a), 1.1(a)(i)), and a single “clause” might span multiple paragraphs.
import re
from dataclasses import dataclass


@dataclass
class Clause:
    section_number: str
    heading: str
    text: str
    start_position: int
    end_position: int


def segment_clauses(text: str) -> list[Clause]:
    """Split contract text into individual clauses using section patterns."""
    # Common legal section numbering patterns
    section_pattern = re.compile(
        r"^(\d+\.(?:\d+\.?)*)\s+([A-Z][^\n.]+?)\.?\s*\n",
        re.MULTILINE,
    )
    matches = list(section_pattern.finditer(text))
    clauses = []
    for i, match in enumerate(matches):
        section_num = match.group(1).rstrip(".")
        heading = match.group(2).strip()
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        clause_text = text[start:end].strip()
        clauses.append(Clause(
            section_number=section_num,
            heading=heading,
            text=clause_text,
            start_position=match.start(),
            end_position=end,
        ))
    return clauses
More sophisticated approaches use spaCy’s sentence boundary detection combined with structural cues. Some teams train custom segmentation models on annotated contract data, achieving clause boundary F1 scores above 0.95.
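The pattern in segment_clauses only recognizes dotted numbers, so the nested references mentioned earlier, such as 1.1(a)(i), need separate handling. A minimal sketch of one way to do that follows: it splits a reference into its components and compares component lists to decide whether one section sits under another. The helper names parse_section_number and is_descendant are hypothetical.

```python
import re

# A full reference: dotted numbers followed by zero or more "(x)" groups
_FULL_RE = re.compile(r"\d+(?:\.\d+)*(?:\([a-z]+\))*", re.IGNORECASE)
# One component: a number, or the letters inside a parenthesised group
_TOKEN_RE = re.compile(r"\d+|\(([a-z]+)\)", re.IGNORECASE)


def parse_section_number(ref: str) -> list[str]:
    """Split a reference such as '1.1(a)(i)' into ['1', '1', 'a', 'i']."""
    if not _FULL_RE.fullmatch(ref):
        raise ValueError(f"Unrecognized section reference: {ref!r}")
    parts = []
    for match in _TOKEN_RE.finditer(ref):
        # group(1) is set only for parenthesised components
        parts.append(match.group(1) if match.group(1) else match.group(0))
    return parts


def is_descendant(child: str, ancestor: str) -> bool:
    """True if `child` is nested somewhere below `ancestor`."""
    child_parts = parse_section_number(child)
    ancestor_parts = parse_section_number(ancestor)
    return (
        len(child_parts) > len(ancestor_parts)
        and child_parts[: len(ancestor_parts)] == ancestor_parts
    )


print(parse_section_number("1.1(a)(i)"))
print(is_descendant("1.1(a)(i)", "1.1"))
print(is_descendant("1.10(a)", "1.1"))
```

Comparing component lists rather than raw strings avoids treating 1.10(a) as a child of 1.1, which a simple string prefix test would do. The sketch does not try to decide whether "(i)" is a letter or a Roman numeral; that usually requires looking at neighboring clauses.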
Transformer-based clause classification
The state of the art uses Legal-BERT or similar domain-adapted transformers fine-tuned on labeled clause data:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

CLAUSE_TYPES = [
    "indemnification", "limitation_of_liability", "termination",
    "confidentiality", "governing_law", "assignment",
    "force_majeure", "warranty", "intellectual_property",
    "non_compete", "dispute_resolution", "insurance",
    "data_protection", "payment_terms", "representations",
]


class ClauseClassifier:
    # The default checkpoint is the pretrained Legal-BERT encoder only. Its
    # classification head is randomly initialized, so pass the path of a
    # checkpoint fine-tuned on labeled clauses to get meaningful predictions.
    def __init__(self, model_path: str = "nlpaueb/legal-bert-base-uncased"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, num_labels=len(CLAUSE_TYPES)
        )
        self.model.eval()

    def classify(self, clause_text: str) -> dict[str, float]:
        """Return probability distribution over clause types."""
        inputs = self.tokenizer(
            clause_text,
            max_length=512,
            truncation=True,
            padding=True,
            return_tensors="pt",
        )
        with torch.no_grad():
            outputs = self.model(**inputs)
        probs = torch.softmax(outputs.logits, dim=-1)[0]
        return {
            CLAUSE_TYPES[i]: probs[i].item()
            for i in range(len(CLAUSE_TYPES))
        }

    def predict(self, clause_text: str) -> tuple[str, float]:
        """Return the top clause type and its confidence."""
        scores = self.classify(clause_text)
        best = max(scores, key=scores.get)
        return best, scores[best]
Fine-tuning on the CUAD dataset typically involves:
- Learning rate: 2e-5
- Batch size: 16
- Epochs: 3-5
- Warm-up steps: 10% of total training steps
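As a worked example of what "10% of total training steps" means in practice, the sketch below derives the step counts from the settings above and evaluates a linear warm-up followed by linear decay. The training-set size of 4,000 examples and the linear decay shape are assumptions chosen for illustration; the list above only fixes the learning rate, batch size, epochs, and warm-up fraction.

```python
import math

LEARNING_RATE = 2e-5
BATCH_SIZE = 16
EPOCHS = 3
WARMUP_FRACTION = 0.10


def step_counts(num_examples: int) -> tuple[int, int]:
    """Return (total_steps, warmup_steps) for one training run."""
    steps_per_epoch = math.ceil(num_examples / BATCH_SIZE)
    total_steps = steps_per_epoch * EPOCHS
    warmup_steps = round(total_steps * WARMUP_FRACTION)
    return total_steps, warmup_steps


def lr_at(step: int, total_steps: int, warmup_steps: int) -> float:
    """Linear warm-up to the peak rate, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / (total_steps - warmup_steps)


# Assumed training-set size, for illustration only
total_steps, warmup_steps = step_counts(4000)
print(total_steps, warmup_steps)
print(lr_at(0, total_steps, warmup_steps))
print(lr_at(warmup_steps, total_steps, warmup_steps))
```

With 4,000 examples this gives 250 steps per epoch, 750 steps in total, and 75 warm-up steps, so the learning rate climbs from zero to 2e-5 over the first 75 updates before it starts to fall.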
Performance varies by clause type: governing law clauses reach 97%+ accuracy because they follow rigid patterns (“This Agreement shall be governed by the laws of…”), while indemnification clauses average around 88% due to their structural diversity.
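Seeing this variation in your own data means breaking results down by clause type instead of reporting one overall number. A minimal sketch follows, using invented labels; "accuracy" here is the share of gold clauses of each type that the model labeled correctly, and per_type_accuracy is a hypothetical helper name.

```python
from collections import defaultdict


def per_type_accuracy(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """Fraction of gold clauses of each type that were predicted correctly."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for gold_label, predicted_label in zip(gold, predicted):
        total[gold_label] += 1
        if gold_label == predicted_label:
            correct[gold_label] += 1
    return {label: correct[label] / total[label] for label in total}


# Invented evaluation data, for illustration only
gold = ["governing_law"] * 4 + ["indemnification"] * 4
predicted = ["governing_law"] * 4 + [
    "indemnification",
    "indemnification",
    "limitation_of_liability",
    "indemnification",
]

accuracy = per_type_accuracy(gold, predicted)
print(accuracy)
```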
Risk scoring engine
Risk scoring compares extracted clause language against a playbook of approved positions. Each clause type has preferred, acceptable, and unacceptable variants:
from dataclasses import dataclass
from enum import Enum
from sentence_transformers import SentenceTransformer, util


class RiskLevel(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class PlaybookPosition:
    clause_type: str
    preferred_language: str
    acceptable_language: str
    unacceptable_patterns: list[str]
    risk_factors: list[str]


class RiskScorer:
    def __init__(self):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.playbook: dict[str, PlaybookPosition] = {}

    def load_playbook(self, positions: list[PlaybookPosition]):
        for pos in positions:
            self.playbook[pos.clause_type] = pos

    def score_clause(
        self, clause_text: str, clause_type: str
    ) -> tuple[RiskLevel, list[str]]:
        """Score a clause against the playbook and return risk level + reasons."""
        reasons = []
        has_unacceptable = False
        position = self.playbook.get(clause_type)
        if not position:
            return RiskLevel.MEDIUM, ["No playbook position defined"]

        # Check for unacceptable patterns
        for pattern in position.unacceptable_patterns:
            if pattern.lower() in clause_text.lower():
                has_unacceptable = True
                reasons.append(f"Contains unacceptable pattern: '{pattern}'")

        # Semantic similarity to preferred language
        clause_emb = self.model.encode(clause_text, convert_to_tensor=True)
        preferred_emb = self.model.encode(
            position.preferred_language, convert_to_tensor=True
        )
        similarity = util.cos_sim(clause_emb, preferred_emb).item()
        if similarity < 0.3:
            reasons.append(
                f"Low similarity to preferred language ({similarity:.2f})"
            )

        # Check risk factors
        for factor in position.risk_factors:
            if factor.lower() in clause_text.lower():
                reasons.append(f"Risk factor detected: '{factor}'")

        # Determine overall risk. An explicit flag is used instead of searching
        # the reason strings, which could match a risk factor's text by accident.
        if has_unacceptable:
            level = RiskLevel.CRITICAL
        elif len(reasons) >= 3:
            level = RiskLevel.HIGH
        elif len(reasons) >= 1:
            level = RiskLevel.MEDIUM
        else:
            level = RiskLevel.LOW
        return level, reasons
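The pattern and risk-factor checks above use plain substring tests, which can fire inside longer words or phrases. One possible refinement, sketched here under the assumption that playbook entries are literal phrases, is to require a match on whole words. The contains_phrase helper is hypothetical and the clause text is invented.

```python
import re


def contains_phrase(clause_text: str, phrase: str) -> bool:
    """Case-insensitive match that must not start or end inside a word."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, clause_text, flags=re.IGNORECASE) is not None


# Invented clause, for illustration only
clause = "Supplier accepts unlimited liability for data breaches."

# The substring test matches "limited liability" inside "unlimited liability"
substring_hit = "limited liability" in clause.lower()
# The whole-word test does not
whole_word_hit = contains_phrase(clause, "limited liability")

print(substring_hit, whole_word_hit)
print(contains_phrase(clause, "Unlimited Liability"))
```

Here the substring test reports a hit for "limited liability" on a clause that says the opposite, while the whole-word test only matches "unlimited liability".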
Entity extraction with LexNLP
Legal entities go beyond standard NER. Contracts contain specific constructs that general models miss:
import lexnlp.extract.en.dates as dates_ext
import lexnlp.extract.en.money as money_ext
import lexnlp.extract.en.durations as dur_ext
import lexnlp.extract.en.entities.nltk_re as entity_ext


def extract_legal_entities(text: str) -> dict:
    """Extract legal-specific entities from contract text."""
    return {
        "dates": list(dates_ext.get_dates(text)),
        "monetary_values": list(money_ext.get_money(text)),
        "durations": list(dur_ext.get_durations(text)),
        "companies": list(entity_ext.get_companies(text)),
        # Courts are not extracted here: LexNLP's court extractor lives in
        # lexnlp.extract.en.courts and needs a court configuration list as a
        # second argument, so it is omitted from this function.
    }


# Example usage
text = """
The Licensee shall pay a royalty of USD 500,000 per annum
for a period of five (5) years commencing on January 1, 2025.
Any disputes shall be resolved in the United States District
Court for the Southern District of New York.
"""
entities = extract_legal_entities(text)
# Expected content (exact types and tuple layout vary by LexNLP version):
# dates: January 1, 2025
# monetary_values: 500,000 in USD
# durations: 5 years
Putting it together: the analysis report
A production system combines all components into a structured report:
from dataclasses import dataclass, field
import json


@dataclass
class ClauseAnalysis:
    section: str
    heading: str
    clause_type: str
    classification_confidence: float
    risk_level: str
    risk_reasons: list[str]
    entities: dict
    text_preview: str


@dataclass
class ContractReport:
    filename: str
    total_clauses: int
    critical_findings: int
    high_risk_findings: int
    analyses: list[ClauseAnalysis] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(
            {
                "filename": self.filename,
                "summary": {
                    "total_clauses": self.total_clauses,
                    "critical": self.critical_findings,
                    "high_risk": self.high_risk_findings,
                },
                "findings": [
                    {
                        "section": a.section,
                        "heading": a.heading,
                        "type": a.clause_type,
                        "confidence": round(a.classification_confidence, 3),
                        "risk": a.risk_level,
                        "reasons": a.risk_reasons,
                        "entities": {
                            k: str(v) for k, v in a.entities.items()
                        },
                        "preview": a.text_preview[:200],
                    }
                    for a in self.analyses
                    if a.risk_level in ("critical", "high")
                ],
            },
            indent=2,
        )
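The report takes its summary counts as constructor arguments, so something has to compute them from the per-clause results. A minimal sketch of that step follows, assuming risk levels arrive as the lowercase strings used in RiskLevel; the summarize helper is hypothetical and its keys mirror the "summary" object in to_json.

```python
from collections import Counter


def summarize(risk_levels: list[str]) -> dict[str, int]:
    """Count findings per severity for the report summary."""
    counts = Counter(risk_levels)
    return {
        "total_clauses": len(risk_levels),
        "critical": counts["critical"],
        "high_risk": counts["high"],
    }


# Invented per-clause risk levels, for illustration only
levels = ["low", "critical", "high", "high", "medium"]
summary = summarize(levels)
print(summary)
```

Deriving the counts from the same list that feeds analyses keeps the summary and the findings from drifting apart.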
Tradeoffs and production considerations
Transformer token limits — Legal clauses frequently exceed 512 tokens. Options include truncation (loses context), sliding window approaches (adds complexity), or long-context models like Longformer (higher compute cost). In practice, sliding windows with overlapping segments and majority voting work well.
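The sliding-window idea can be sketched without any model at all. In the example below, tokens are plain strings, classify_window stands in for a real classifier call, and the default window of 512 with a stride of 384 (an overlap of 128 tokens) is an assumption, not a recommendation from any library. A real pipeline would also need to account for the special tokens the tokenizer adds to each window.

```python
from collections import Counter
from typing import Callable, Sequence


def sliding_windows(
    tokens: Sequence[str], window_size: int = 512, stride: int = 384
) -> list[list[str]]:
    """Split tokens into windows; consecutive windows overlap when stride < window_size."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


def classify_long(
    tokens: Sequence[str],
    classify_window: Callable[[list[str]], str],
    window_size: int = 512,
    stride: int = 384,
) -> str:
    """Classify each window and return the most common label."""
    votes = Counter(
        classify_window(window)
        for window in sliding_windows(tokens, window_size, stride)
    )
    return votes.most_common(1)[0][0]


def toy_classifier(window: list[str]) -> str:
    """Stand-in for a model: keys on two marker tokens."""
    if "t3" in window or "t4" in window:
        return "termination"
    return "payment_terms"


tokens = [f"t{i}" for i in range(10)]
windows = sliding_windows(tokens, window_size=4, stride=2)
label = classify_long(tokens, toy_classifier, window_size=4, stride=2)
print(len(windows), label)
```

Ties in the vote are broken here by whichever label was seen first. Averaging the per-window probability distributions instead of counting labels is a common alternative that avoids the issue.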
Domain adaptation vs. general models — Legal-BERT outperforms base BERT by 3-8% on legal tasks, but fine-tuning a general model on your specific contract corpus often outperforms a generic legal model. The best results come from domain-adapting first, then fine-tuning.
Confidence thresholds — Setting classification confidence thresholds requires balancing false positives (annoying but safe) against false negatives (dangerous). Most production systems set thresholds at 0.7-0.8 and route low-confidence clauses to human review.
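Routing on a threshold takes only a few lines. The sketch below assumes the classifier returns a label-to-probability mapping like ClauseClassifier.classify; the threshold of 0.75 is an arbitrary value inside the range mentioned above, and the route helper and the scores are invented for illustration.

```python
REVIEW_THRESHOLD = 0.75  # assumed value within the 0.7-0.8 range


def route(
    scores: dict[str, float], threshold: float = REVIEW_THRESHOLD
) -> tuple[str, str]:
    """Return the top label and whether it can be accepted automatically."""
    label = max(scores, key=scores.get)
    decision = "auto" if scores[label] >= threshold else "human_review"
    return label, decision


# Invented scores, for illustration only
confident = route({"termination": 0.91, "assignment": 0.09})
uncertain = route({"indemnification": 0.55, "limitation_of_liability": 0.45})
print(confident, uncertain)
```

Raising the threshold sends more clauses to reviewers and lowers the chance of a silent misclassification; the right value depends on how costly each kind of error is for the contracts involved.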
Multilingual contracts — International deals often contain contracts in multiple languages. Multilingual models like XLM-RoBERTa handle this, but performance drops compared to monolingual models. Some teams run language-specific models in parallel.
The one thing to remember: A production contract analysis pipeline chains document ingestion, clause segmentation, transformer classification, playbook-based risk scoring, and entity extraction into a system that surfaces the 5% of clauses that actually need human attention from documents that are 95% boilerplate.
See Also
- Python EDiscovery Processing: How Python helps lawyers find the right emails, documents, and messages when companies get sued or investigated
- Python Legal Citation Extraction: How Python finds and understands references to laws, court cases, and regulations buried inside legal documents
- Python Legal Document Parsing: How Python breaks apart complex legal documents into organized, searchable pieces that computers and people can actually use