Named Entity Recognition in Python — Deep Dive
Production NER rarely works well out of the box. This guide covers building, training, and deploying custom entity recognizers for real-world applications.
Quick Start: NER with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk sold $4 billion of Tesla stock on January 15, 2025.")
for ent in doc.ents:
    print(f"{ent.text:25s} {ent.label_:10s} {ent.start_char}-{ent.end_char}")
# Typical output (exact predictions depend on the model version):
# Elon Musk PERSON 0-9
# $4 billion MONEY 15-25
# Tesla ORG 29-34
# January 15, 2025 DATE 44-60
NER with Hugging Face Transformers
from transformers import pipeline
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
results = ner("Marie Curie received the Nobel Prize in Paris in 1903.")
for entity in results:
    print(f"{entity['word']:20s} {entity['entity_group']:6s} {entity['score']:.3f}")
# Marie Curie PER 0.998
# Nobel Prize MISC 0.976
# Paris LOC 0.999
Setting aggregation_strategy="simple" groups adjacent tokens that share an entity type, so sub-word pieces (like “Cu” + “##rie”) are merged back into complete entities.
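To make the merging step concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration, not the transformers implementation: the function name `merge_wordpieces` and the hand-written token and tag lists are assumptions made for this example.

```python
def merge_wordpieces(tokens, tags):
    """Group WordPiece tokens into entity strings using their BIO tags.

    Simplified illustration only; the real pipeline also tracks scores
    and character offsets.
    """
    entities = []   # each item is [text, entity_type]
    current = None  # entity currently being extended, if any
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current = None
            continue
        prefix, ent_type = tag.split("-", 1)
        is_piece = token.startswith("##")
        if current is not None and current[1] == ent_type and (is_piece or prefix == "I"):
            # Continuation: glue sub-word pieces directly, separate whole words with a space
            current[0] += token[2:] if is_piece else " " + token
        else:
            # A new entity starts here
            current = [token[2:] if is_piece else token, ent_type]
            entities.append(current)
    return [(text, ent_type) for text, ent_type in entities]


# Hand-written example input (not real model output)
tokens = ["Marie", "Cu", "##rie", "visited", "Paris"]
tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC"]
print(merge_wordpieces(tokens, tags))
# [('Marie Curie', 'PER'), ('Paris', 'LOC')]
```

Without aggregation, the pipeline returns one result per sub-word token, which is rarely what downstream code wants.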
Training a Custom NER Model with spaCy
Step 1: Prepare Training Data
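spaCy expects training data in its binary .spacy format (a serialized DocBin). Start from character-offset annotations:
import spacy
from pathlib import Path
from spacy.tokens import DocBin
# Each entity is (start_char, end_char, label); end_char is exclusive
training_data = [
    ("Aspirin 500mg twice daily for 7 days", {"entities": [(0, 7, "DRUG"), (8, 13, "DOSAGE"), (14, 25, "FREQUENCY"), (30, 36, "DURATION")]}),
    ("Prescribe Metformin 850mg with meals", {"entities": [(10, 19, "DRUG"), (20, 25, "DOSAGE")]}),
]
nlp = spacy.blank("en")
db = DocBin()
for text, annotations in training_data:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        # char_span returns None when the offsets do not align with token boundaries
        span = doc.char_span(start, end, label=label)
        if span is None:
            print(f"Skipping misaligned span {text[start:end]!r} in {text!r}")
        else:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
# Create the output directory if it does not exist yet
Path("./corpus").mkdir(parents=True, exist_ok=True)
db.to_disk("./corpus/train.spacy")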
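Off-by-one offsets are easy to introduce and are silently dropped when `char_span` returns None, so it can help to validate annotations before building the corpus. The sketch below is one possible check in plain Python; `check_offsets` is a hypothetical helper, and its mid-word test is a heuristic that may flag spans a real tokenizer would accept (for example the number inside "500mg").

```python
def check_offsets(text, entities):
    """Return human-readable problems found in (start, end, label) offsets."""
    problems = []
    for start, end, label in entities:
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: offsets {start}-{end} fall outside the text")
            continue
        snippet = text[start:end]
        if snippet != snippet.strip():
            problems.append(f"{label}: {snippet!r} has leading or trailing whitespace")
        # Heuristic: a span should not begin or end in the middle of a word
        if start > 0 and text[start - 1].isalnum() and text[start].isalnum():
            problems.append(f"{label}: {snippet!r} starts mid-word")
        if end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
            problems.append(f"{label}: {snippet!r} ends mid-word")
    return problems


text = "Prescribe Metformin 850mg with meals"
# Offsets shifted by one character: two problems are reported
for problem in check_offsets(text, [(9, 18, "DRUG")]):
    print(problem)
# Correct offsets: no problems
print(check_offsets(text, [(10, 19, "DRUG"), (20, 25, "DOSAGE")]))
```

Printing `text[start:end]` for every annotation is the quickest manual version of the same check.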
Step 2: Generate and Customize Config
python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency
Key config sections to tune:
[training]
patience = 1600
max_steps = 20000
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 1000
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 128
maxout_pieces = 3
Step 3: Train
python -m spacy train config.cfg \
    --output ./models/custom_ner \
    --paths.train ./corpus/train.spacy \
    --paths.dev ./corpus/dev.spacy \
    --gpu-id 0
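The --gpu-id 0 flag trains on the first GPU; omit it to train on CPU. The dev corpus (./corpus/dev.spacy) is a held-out set built the same way as train.spacy. As a rough guide, training on 1,000 annotated examples takes 10-30 minutes on CPU. Expect F1 scores of roughly 75-85% on domain-specific entities, often improving to 90%+ with 5,000+ examples, though results vary with label complexity and annotation quality.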
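The training command reads separate train and dev corpora. A minimal sketch of one way to split annotated examples before serializing them follows; `split_train_dev`, the 20% dev fraction, and the seed are assumptions for illustration, not spaCy defaults.

```python
import random


def split_train_dev(examples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and split it into (train, dev) lists."""
    if not 0 < dev_fraction < 1:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Placeholder examples in the (text, {"entities": [...]}) format
examples = [(f"text {i}", {"entities": []}) for i in range(10)]
train, dev = split_train_dev(examples)
print(len(train), len(dev))  # 8 2
```

Each half can then be written to its own DocBin file. Keeping the dev set fixed across experiments makes F1 scores comparable between runs.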
Fine-tuning a Transformer NER Model
For maximum accuracy, fine-tune a BERT-based model:
from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    TrainingArguments, Trainer, DataCollatorForTokenClassification
)
from datasets import Dataset
import numpy as np
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
label_list = ["O", "B-DRUG", "I-DRUG", "B-DOSAGE", "I-DOSAGE",
              "B-FREQUENCY", "I-FREQUENCY", "B-DURATION", "I-DURATION"]
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for i, l in enumerate(label_list)}
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True, padding="max_length", max_length=128
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens and padding: -100 is ignored by the loss
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First sub-word token of a word keeps the word's label
                label_ids.append(label[word_idx])
            else:
                # Later sub-word token: use the I- tag if the word's label is B-
                name = id2label[label[word_idx]]
                if name.startswith("B-"):
                    name = "I-" + name[2:]
                label_ids.append(label2id[name])
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized
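model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_list), id2label=id2label, label2id=label2id
)
# Assumption: raw_train and raw_eval (hypothetical names) are datasets.Dataset
# objects with "tokens" (list of words) and "ner_tags" (list of label ids)
# columns, for example built with Dataset.from_dict(...)
train_dataset = raw_train.map(tokenize_and_align_labels, batched=True)
eval_dataset = raw_eval.map(tokenize_and_align_labels, batched=True)
training_args = TrainingArguments(
    output_dir="./ner_model",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)
data_collator = DataCollatorForTokenClassification(tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,  # recent transformers releases prefer processing_class=tokenizer
)
trainer.train()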
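The fine-tuning code expects word-level `tokens` and `ner_tags`, while the spaCy examples earlier use character offsets. The following is a minimal sketch of one way to convert between the two, assuming simple whitespace tokenization; `offsets_to_bio` is a hypothetical helper, and real data with punctuation attached to words would need a proper tokenizer.

```python
import re


def offsets_to_bio(text, entities):
    """Convert (start, end, label) character spans to whitespace tokens and BIO tags."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in entities:
            # A token belongs to an entity when it lies fully inside the span
            if match.start() >= start and match.end() <= end:
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "Aspirin 500mg twice daily for 7 days"
entities = [(0, 7, "DRUG"), (8, 13, "DOSAGE"), (14, 25, "FREQUENCY"), (30, 36, "DURATION")]
tokens, tags = offsets_to_bio(text, entities)
for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")
```

Mapping each tag through `label2id` then gives the integer `ner_tags` column.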
Hybrid Approach: Rules + Model
The most robust production systems combine statistical models with rule-based overrides:
import re
import spacy
from spacy.language import Language
# Two to four capital letters, a hyphen, then four to eight digits
SKU_PATTERN = re.compile(r"\b[A-Z]{2,4}-\d{4,8}\b")
@Language.component("product_sku_ruler")
def product_sku_ruler(doc):
    ents = list(doc.ents)
    for match in SKU_PATTERN.finditer(doc.text):
        # char_span returns None if the match does not align with token boundaries
        span = doc.char_span(match.start(), match.end(), label="PRODUCT_SKU")
        if span is None:
            continue
        # The rule overrides the model: drop any model entity that overlaps the match
        ents = [ent for ent in ents if ent.end <= span.start or ent.start >= span.end]
        ents.append(span)
    doc.ents = sorted(ents, key=lambda e: e.start)
    return doc
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("product_sku_ruler", after="ner")
doc = nlp("Ship SKU AB-12345 to Amazon warehouse in Seattle by Friday.")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# Typical output (the model's own predictions may vary):
# AB-12345: PRODUCT_SKU
# Amazon: ORG
# Seattle: GPE
# Friday: DATE
Rules handle known, regular patterns deterministically. The model handles everything else. In practice this combination usually outperforms either approach alone.
Entity Linking
NER finds mentions (“Apple”); entity linking resolves which entity is meant (Apple Inc. vs. apple the fruit). This is sometimes called entity disambiguation.
import re
# spaCy's entity linker connects to a knowledge base.
# Simplified example using keyword matching instead:
knowledge_base = {
    "Apple Inc.": {"type": "company", "id": "Q312"},
    "apple": {"type": "fruit", "id": "Q89"},
}
COMPANY_CUES = {"stock", "ceo", "revenue", "iphone"}
def link_entity(entity_text, context):
    """Naive linking based on capitalization and context keywords."""
    # Compare whole words so that a cue cannot match inside a longer word
    context_words = set(re.findall(r"\w+", context.lower()))
    if entity_text[:1].isupper() and context_words & COMPANY_CUES:
        return knowledge_base.get("Apple Inc.")
    return knowledge_base.get("apple")
Production entity linking systems use embedding similarity against a knowledge base (Wikipedia, Wikidata) and are available through libraries like REL or spaCy’s built-in EntityLinker component.
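The similarity-based approach can be sketched without any external library. In the example below, a bag-of-words count vector stands in for a learned embedding, and the candidate descriptions are invented for illustration; only the ranking-by-cosine-similarity step reflects what a production linker does.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base: entity id -> short description
CANDIDATES = {
    "Q312": "Apple Inc. technology company that makes the iPhone and Mac computers",
    "Q89": "apple edible fruit that grows on a tree",
}


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a learned encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def link(context):
    """Return the candidate id whose description is most similar to the context."""
    query = embed(context)
    return max(CANDIDATES, key=lambda cid: cosine(query, embed(CANDIDATES[cid])))


print(link("Apple shares rose after the company unveiled a new iPhone"))  # Q312
print(link("She picked a ripe apple from the tree"))                      # Q89
```

Swapping `embed` for a sentence encoder and `CANDIDATES` for a real knowledge base keeps the same structure.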
Evaluation and Error Analysis
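from spacy.scorer import Scorer
from spacy.training import Example
# Assumes nlp is the trained pipeline and test_data uses the same
# (text, {"entities": [...]}) format as the training data
scorer = Scorer()
examples = []
for text, annotations in test_data:
    pred_doc = nlp(text)
    ref_doc = nlp.make_doc(text)
    spans = [ref_doc.char_span(s, e, label=l) for s, e, l in annotations["entities"]]
    # char_span returns None for misaligned offsets, which doc.ents cannot accept
    ref_doc.ents = [span for span in spans if span is not None]
    examples.append(Example(pred_doc, ref_doc))
scores = scorer.score(examples)
print(f"NER precision: {scores['ents_p']:.3f}")
print(f"NER recall: {scores['ents_r']:.3f}")
print(f"NER F1: {scores['ents_f']:.3f}")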
Common Error Categories
- Boundary errors — “New York City” detected as “New York” + “City” (two entities instead of one).
- Type confusion — “Washington” labeled as PERSON instead of GPE.
- Missing entities — uncommon names or novel organizations not seen in training.
- False positives — common words incorrectly tagged (e.g., “May” as a person when it is a month).
Track error categories separately. Boundary errors suggest you need more consistent annotation guidelines. Type confusion means your training data needs more context-dependent examples.
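One way to track these categories is to compare gold and predicted spans directly. The sketch below is an assumption about how the four categories might be operationalized, not a standard algorithm; spans are plain `(start, end, label)` tuples and `categorize_errors` is a hypothetical helper.

```python
def categorize_errors(gold, predicted):
    """Classify span-level errors; spans are (start, end, label) tuples."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    errors = {"boundary": [], "type": [], "missing": [], "false_positive": []}
    gold_set, pred_set = set(gold), set(predicted)
    for g in gold:
        if g in pred_set:
            continue  # exact match, nothing to report
        same_offsets = [p for p in predicted if p[:2] == g[:2]]
        overlapping = [p for p in predicted if overlaps(g, p)]
        if same_offsets:
            errors["type"].append((g, same_offsets[0]))   # right span, wrong label
        elif overlapping:
            errors["boundary"].append((g, overlapping))   # partially found
        else:
            errors["missing"].append(g)                   # not found at all
    for p in predicted:
        # Predictions that touch no gold span at all
        if p not in gold_set and not any(overlaps(p, g) for g in gold):
            errors["false_positive"].append(p)
    return errors


# Offsets refer to: "New York City and Washington back Acme in May"
gold = [(0, 13, "GPE"), (18, 28, "GPE"), (34, 38, "ORG")]
predicted = [(0, 8, "GPE"), (9, 13, "GPE"), (18, 28, "PERSON"), (42, 45, "PERSON")]
errors = categorize_errors(gold, predicted)
print({name: len(items) for name, items in errors.items()})
# {'boundary': 1, 'type': 1, 'missing': 1, 'false_positive': 1}
```

Counting each category per label over the dev set shows where annotation effort would pay off most.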
Production Deployment
Serving with FastAPI
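from fastapi import FastAPI
from pydantic import BaseModel
import spacy
app = FastAPI()
# Load the model once at startup, not per request
nlp = spacy.load("./models/custom_ner/model-best")
class TextRequest(BaseModel):
    text: str
@app.post("/entities")
def extract(request: TextRequest):
    # A Pydantic model makes FastAPI read the text from the JSON body;
    # a bare `text: str` parameter would be parsed as a query parameter
    doc = nlp(request.text)
    return {
        "entities": [
            {"text": ent.text, "label": ent.label_,
             "start": ent.start_char, "end": ent.end_char}
            for ent in doc.ents
        ]
    }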
Performance Benchmarks
| Model | Docs/sec (CPU) | F1 (CoNLL-2003) | Memory |
|---|---|---|---|
| spaCy en_core_web_sm | 10,000 | 0.85 | 50 MB |
| spaCy en_core_web_trf | 150 | 0.90 | 500 MB |
| BERT-base fine-tuned | 100 | 0.92 | 440 MB |
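| Flair (stacked) | 50 | 0.93 | 800 MB |

These figures are rough, order-of-magnitude estimates; throughput depends heavily on hardware, batch size, and document length. spaCy's English pipelines are trained on OntoNotes 5 rather than CoNLL-2003, so their F1 values are not directly comparable with CoNLL-2003 results. Benchmark on your own data before choosing a model.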
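
Published throughput numbers rarely match a specific workload, so it is worth measuring on your own documents. The following is a minimal timing harness in plain Python; `measure_throughput` and `dummy_ner` are hypothetical names, and `dummy_ner` is only a stand-in for a real pipeline call.

```python
import time


def measure_throughput(process, docs, repeats=3):
    """Return the best observed documents-per-second rate over several runs."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in docs:
            process(doc)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(docs) / elapsed)
    return best


# Stand-in for a real pipeline call such as nlp(text)
def dummy_ner(text):
    return [word for word in text.split() if word[:1].isupper()]


docs = ["Ship the order to Seattle by Friday."] * 1000
rate = measure_throughput(dummy_ner, docs)
print(f"{rate:,.0f} docs/sec")
```

With spaCy, passing the whole list through `nlp.pipe(docs)` is usually faster than calling `nlp` on each document, so time the call pattern you plan to deploy.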
Common Pitfalls
- Inconsistent annotation. If annotators disagree on whether “Dr. Smith” includes the title, the model learns noise. Create clear annotation guidelines before labeling.
- Training on one domain, deploying on another. A news-trained NER model will miss “CRISPR-Cas9” in biomedical text. Always evaluate on in-domain data.
- Ignoring entity boundaries. Partial credit feels good in development but masks real problems. Evaluate with exact match F1.
- Not handling overlapping entities. Standard BIO tagging cannot represent overlapping entities (“New York” as both LOC and part of “New York Times” as ORG). If your data has overlaps, use a span-based model instead of sequence labeling.
- Skipping entity linking. Extracting “Apple” 500 times is less useful than knowing 480 refer to the company and 20 to the fruit.
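
The gap between exact-match and partial-credit scoring can be shown with a small span-level scorer. This is a sketch under stated assumptions: `span_f1` is a hypothetical helper, and its partial mode (any overlap with the same label counts as correct) is just one of several possible partial-credit definitions.

```python
def span_f1(gold, predicted, partial=False):
    """Precision, recall, and F1 over (start, end, label) spans.

    partial=False counts only exact matches; partial=True also credits
    predictions that overlap a gold span with the same label.
    """
    def matches(p, g):
        if p == g:
            return True
        return partial and p[2] == g[2] and p[0] < g[1] and g[0] < p[1]

    matched_pred = sum(any(matches(p, g) for g in gold) for p in predicted)
    matched_gold = sum(any(matches(p, g) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The first prediction covers only part of the first gold span
gold = [(0, 13, "GPE"), (18, 28, "GPE")]
predicted = [(0, 8, "GPE"), (18, 28, "GPE")]
print(span_f1(gold, predicted))                # (0.5, 0.5, 0.5)
print(span_f1(gold, predicted, partial=True))  # (1.0, 1.0, 1.0)
```

The same predictions score 0.5 under exact match and 1.0 with partial credit, which is how a boundary problem can stay hidden during development.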
The one thing to remember: Production NER combines statistical models for generalization with rules for known patterns, and the biggest accuracy gains come from high-quality, domain-specific training data — not from switching to a fancier model architecture.