Named Entity Recognition in Python — Deep Dive

Production NER rarely works well out of the box. This guide covers building, training, and deploying custom entity recognizers for real-world applications.

Quick Start: NER with spaCy

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elon Musk sold $4 billion of Tesla stock on January 15, 2025.")

for ent in doc.ents:
    print(f"{ent.text:25s} {ent.label_:10s} {ent.start_char}-{ent.end_char}")
# Typical output (exact spans and labels can vary with the model version):
# Elon Musk                 PERSON     0-9
# $4 billion                MONEY      15-25
# Tesla                     ORG        29-34
# January 15, 2025          DATE       44-60

NER with Hugging Face Transformers

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

results = ner("Marie Curie received the Nobel Prize in Paris in 1903.")
for entity in results:
    print(f"{entity['word']:20s} {entity['entity_group']:6s} {entity['score']:.3f}")
# Marie Curie          PER    0.998
# Nobel Prize          MISC   0.976
# Paris                LOC    0.999

Setting aggregation_strategy="simple" groups adjacent tokens that share an entity type, including sub-word pieces (like “Cu” + “##rie”), back into complete entity spans. The scores shown are illustrative.
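To make the merging step concrete, here is a minimal pure-Python sketch of how token-level BIO predictions might be grouped into entities. It is not the transformers implementation: group_entities, the sample tokens, and the sample tags are hypothetical, and the sketch assumes WordPiece-style "##" continuation markers.

```python
def group_entities(tokens, tags):
    """Group token-level BIO tags into (entity_text, label) pairs."""
    entities = []             # finished (text, label) pairs
    words, label = [], None   # entity currently being built

    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if token.startswith("##") and words:
            # Sub-word continuation: glue it onto the previous piece
            words[-1] += token[2:]
            continue
        if prefix == "I" and tag_label == label:
            # Same entity continues with a new word
            words.append(token)
            continue
        # Anything else closes the entity being built
        if words:
            entities.append((" ".join(words), label))
        if prefix in ("B", "I"):
            words, label = [token], tag_label
        else:
            words, label = [], None

    if words:
        entities.append((" ".join(words), label))
    return entities


tokens = ["Marie", "Cu", "##rie", "received", "the", "Nobel", "Prize", "in", "Paris"]
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-MISC", "I-MISC", "O", "B-LOC"]
print(group_entities(tokens, tags))
# [('Marie Curie', 'PER'), ('Nobel Prize', 'MISC'), ('Paris', 'LOC')]
```

The library's other strategies ("first", "average", "max") additionally decide which sub-word score or label wins for a word; this sketch ignores scores entirely.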

Training a Custom NER Model with spaCy

Step 1: Prepare Training Data

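spaCy v3 trains from its binary .spacy format (a serialized DocBin). Start from character-offset annotations and convert them:

from pathlib import Path

import spacy
from spacy.tokens import DocBin

# Each entity is (start_char, end_char, label); end_char is exclusive.
training_data = [
    ("Aspirin 500mg twice daily for 7 days", {"entities": [(0, 7, "DRUG"), (8, 13, "DOSAGE"), (14, 25, "FREQUENCY"), (30, 36, "DURATION")]}),
    ("Prescribe Metformin 850mg with meals", {"entities": [(10, 19, "DRUG"), (20, 25, "DOSAGE")]}),
]

nlp = spacy.blank("en")
db = DocBin()

for text, annotations in training_data:
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annotations["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets that do not line up with token boundaries yield None;
            # report them instead of silently dropping the annotation.
            print(f"Skipping misaligned span {text[start:end]!r} in {text!r}")
            continue
        ents.append(span)
    doc.ents = ents
    db.add(doc)

# to_disk does not create missing parent directories
Path("./corpus").mkdir(parents=True, exist_ok=True)
db.to_disk("./corpus/train.spacy")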
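Because doc.char_span returns None for offsets that do not match token boundaries, off-by-one annotations are easy to lose without noticing. The following is a rough sketch of a pre-flight check that could run before conversion. find_annotation_problems is a hypothetical helper, and its "mid-word" test is a conservative character heuristic, not spaCy's tokenizer, so it may flag some spans that spaCy would accept.

```python
def find_annotation_problems(text, entities):
    """Return human-readable problems for one annotated example."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities):
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label}: offsets ({start}, {end}) fall outside the text")
            continue
        snippet = text[start:end]
        if snippet != snippet.strip():
            problems.append(f"{label}: span {snippet!r} has leading or trailing whitespace")
        if start > 0 and text[start - 1].isalnum() and text[start].isalnum():
            problems.append(f"{label}: span {snippet!r} starts mid-word")
        if end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
            problems.append(f"{label}: span {snippet!r} ends mid-word")
        if start < previous_end:
            problems.append(f"{label}: span {snippet!r} overlaps the previous entity")
        previous_end = max(previous_end, end)
    return problems


text = "Prescribe Metformin 850mg with meals"
# Offsets shifted one character to the left are caught...
for problem in find_annotation_problems(text, [(9, 18, "DRUG"), (19, 24, "DOSAGE")]):
    print(problem)
# ...while correctly aligned offsets pass cleanly.
print(find_annotation_problems(text, [(10, 19, "DRUG"), (20, 25, "DOSAGE")]))  # []
```

Printing text[start:end] for every annotation is often enough to spot the problem by eye; the helper just automates that inspection.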

Step 2: Generate and Customize Config

python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency

Key config sections to tune (excerpts only; keep the other settings that init config generates in each section):

[training]
patience = 1600
max_steps = 20000

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
size = 1000

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 128
maxout_pieces = 3

Step 3: Train

python -m spacy train config.cfg \
    --output ./models/custom_ner \
    --paths.train ./corpus/train.spacy \
    --paths.dev ./corpus/dev.spacy \
    --gpu-id 0

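The --gpu-id 0 flag trains on the first GPU; omit it (or pass --gpu-id -1) to train on CPU. Build ./corpus/dev.spacy from held-out examples using the same conversion as in Step 1. As a rough guide, training on 1,000 annotated examples takes 10-30 minutes on CPU. Expect F1 scores of around 75-85% on domain-specific entities, often improving to 90%+ with 5,000+ examples, although results depend heavily on the domain and on annotation quality.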

Fine-tuning a Transformer NER Model

For maximum accuracy, fine-tune a BERT-based model:

from transformers import (
    AutoTokenizer, AutoModelForTokenClassification,
    TrainingArguments, Trainer, DataCollatorForTokenClassification
)
from datasets import Dataset
import numpy as np

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

label_list = ["O", "B-DRUG", "I-DRUG", "B-DOSAGE", "I-DOSAGE",
              "B-FREQUENCY", "I-FREQUENCY", "B-DURATION", "I-DURATION"]
label2id = {l: i for i, l in enumerate(label_list)}
id2label = {i: l for i, l in enumerate(label_list)}

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        # No padding here: DataCollatorForTokenClassification pads each batch dynamically
        examples["tokens"], truncation=True, is_split_into_words=True, max_length=128
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
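            else:
                # Sub-word continuation: a B- tag becomes the matching I- tag.
                # In label_list every B- id is odd and its I- tag is the next id.
                orig = label[word_idx]
                label_ids.append(orig + 1 if orig % 2 == 1 else orig)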
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_list), id2label=id2label, label2id=label2id
)

training_args = TrainingArguments(
    output_dir="./ner_model",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

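data_collator = DataCollatorForTokenClassification(tokenizer)

# train_dataset and eval_dataset are not defined in this snippet. They are assumed to be
# datasets.Dataset objects with "tokens" and "ner_tags" columns that have been passed
# through Dataset.map(tokenize_and_align_labels, batched=True).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of processing_class
)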
trainer.train()
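The tokenization function above expects word-level "tokens" and "ner_tags" columns, whereas the spaCy section used character offsets. As a bridge between the two, here is a minimal sketch that converts offset annotations into BIO tags. spans_to_bio is a hypothetical helper, and it assumes whitespace tokenization and entities that start and end on whitespace boundaries.

```python
import re


def spans_to_bio(text, entities):
    """Split text on whitespace and return (tokens, tags) in BIO format."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for ent_start, ent_end, label in entities:
            if tok_start >= ent_start and tok_end <= ent_end:
                # First token of the entity gets B-, later ones get I-
                tag = ("B-" if tok_start == ent_start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = spans_to_bio(
    "Aspirin 500mg twice daily for 7 days",
    [(0, 7, "DRUG"), (8, 13, "DOSAGE"), (14, 25, "FREQUENCY"), (30, 36, "DURATION")],
)
print(tokens)  # ['Aspirin', '500mg', 'twice', 'daily', 'for', '7', 'days']
print(tags)    # ['B-DRUG', 'B-DOSAGE', 'B-FREQUENCY', 'I-FREQUENCY', 'O', 'B-DURATION', 'I-DURATION']
```

Mapping each string tag through label2id then yields the integer ner_tags that tokenize_and_align_labels indexes into.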

Hybrid Approach: Rules + Model

The most robust production systems combine statistical models with rule-based overrides:

import re

import spacy
from spacy.language import Language

# Two to four capital letters, a hyphen, then four to eight digits
SKU_PATTERN = re.compile(r"\b[A-Z]{2,4}-\d{4,8}\b")

@Language.component("product_sku_ruler")
def product_sku_ruler(doc):
    ents = list(doc.ents)
    for match in SKU_PATTERN.finditer(doc.text):
        span = doc.char_span(match.start(), match.end(), label="PRODUCT_SKU")
        if span is None:
            # The match does not align with token boundaries
            continue
        # The rule overrides the model: drop any model entity that overlaps the SKU
        ents = [ent for ent in ents if ent.end <= span.start or ent.start >= span.end]
        ents.append(span)
    doc.ents = sorted(ents, key=lambda e: e.start)
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("product_sku_ruler", after="ner")

doc = nlp("Ship SKU AB-12345 to Amazon warehouse in Seattle by Friday.")
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")
# AB-12345: PRODUCT_SKU
# Amazon: ORG
# Seattle: GPE
# Friday: DATE

Rules handle known, well-structured patterns deterministically, and the model handles everything else. In practice this combination tends to outperform either approach alone.

Entity Linking

NER finds mentions (“Apple”); entity linking resolves which entity is meant (Apple Inc. vs. apple the fruit). This is sometimes called entity disambiguation.

# spaCy's entity linker connects to a knowledge base
# Simplified example using string matching
knowledge_base = {
    "Apple Inc.": {"type": "company", "id": "Q312"},
    "apple": {"type": "fruit", "id": "Q89"},
}

def link_entity(entity_text, context):
    """Naive linking based on capitalization and context."""
    if entity_text[0].isupper() and any(
        word in context.lower() for word in ["stock", "ceo", "revenue", "iphone"]
    ):
        return knowledge_base.get("Apple Inc.")
    return knowledge_base.get("apple")

Production entity linking systems use embedding similarity against a knowledge base (Wikipedia, Wikidata) and are available through libraries like REL or spaCy’s built-in EntityLinker component.
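As a rough illustration of the embedding-similarity idea, the sketch below picks the knowledge-base candidate whose vector is closest to a context vector by cosine similarity. Everything here is an assumption made for illustration: the three-dimensional vectors are invented, link_by_similarity is a hypothetical helper, and real systems use learned embeddings with hundreds of dimensions plus a candidate-generation step.

```python
import math

# Hypothetical toy embeddings keyed by Wikidata-style ids
CANDIDATES = {
    "Q312": {"name": "Apple Inc.", "vector": [0.9, 0.1, 0.0]},
    "Q89": {"name": "apple (fruit)", "vector": [0.1, 0.9, 0.2]},
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def link_by_similarity(context_vector, candidates=CANDIDATES):
    """Return (kb_id, score) for the candidate closest to the context vector."""
    scored = {
        kb_id: cosine(context_vector, candidate["vector"])
        for kb_id, candidate in candidates.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]


# A context vector that, in this toy space, leans towards the "company" direction
kb_id, score = link_by_similarity([0.8, 0.2, 0.1])
print(kb_id, CANDIDATES[kb_id]["name"])  # Q312 Apple Inc.
```

A production system would also apply a score threshold so that mentions with no good candidate are left unlinked rather than forced onto the nearest entry.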

Evaluation and Error Analysis

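from spacy.scorer import Scorer
from spacy.training import Example

# test_data is assumed to use the same (text, {"entities": [...]}) format as training_data
scorer = Scorer()
examples = []

for text, annotations in test_data:
    pred_doc = nlp(text)
    ref_doc = nlp.make_doc(text)
    spans = [ref_doc.char_span(s, e, label=l) for s, e, l in annotations["entities"]]
    # char_span returns None for misaligned offsets, and assigning None to doc.ents raises an error
    ref_doc.ents = [span for span in spans if span is not None]
    examples.append(Example(pred_doc, ref_doc))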

scores = scorer.score(examples)
print(f"NER precision: {scores['ents_p']:.3f}")
print(f"NER recall:    {scores['ents_r']:.3f}")
print(f"NER F1:        {scores['ents_f']:.3f}")

Common Error Categories

  1. Boundary errors — “New York City” detected as “New York” + “City” (two entities instead of one).
  2. Type confusion — “Washington” labeled as PERSON instead of GPE.
  3. Missing entities — uncommon names or novel organizations not seen in training.
  4. False positives — common words incorrectly tagged (e.g., “May” as a person when it is a month).

Track error categories separately. Boundary errors suggest you need more consistent annotation guidelines. Type confusion means your training data needs more context-dependent examples.
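One way to track these categories separately is to compare gold and predicted spans directly. The sketch below is a simplified take, assuming entities are (start_char, end_char, label) tuples. categorize_errors is a hypothetical helper, and its rule for boundary errors (any overlap without identical offsets) is only one of several reasonable definitions.

```python
from collections import Counter


def categorize_errors(gold, predicted):
    """Count NER error types for one document's gold and predicted spans."""
    gold, predicted = set(gold), set(predicted)
    counts = Counter()

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    for pred in predicted - gold:
        partners = [g for g in gold if overlaps(pred, g)]
        if not partners:
            counts["false_positive"] += 1
        elif any(g[:2] == pred[:2] for g in partners):
            counts["type_confusion"] += 1   # same boundaries, different label
        else:
            counts["boundary_error"] += 1   # overlapping but different boundaries
    for g in gold - predicted:
        if not any(overlaps(g, p) for p in predicted):
            counts["missing"] += 1
    return counts


gold = [(0, 13, "GPE"), (20, 30, "GPE"), (40, 45, "ORG")]
predicted = [(0, 8, "GPE"), (9, 13, "GPE"), (20, 30, "PERSON"), (50, 53, "PERSON")]
print(dict(categorize_errors(gold, predicted)))
```

Here the split "New York" + "City" style prediction counts as two boundary errors because the tally is per predicted span; counting per gold span instead is an equally defensible choice.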

Production Deployment

Serving with FastAPI

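from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("./models/custom_ner/model-best")

class TextRequest(BaseModel):
    text: str

# Accept the text in a JSON body; a bare `text: str` parameter would be read from the query string
@app.post("/entities")
def extract(request: TextRequest):
    doc = nlp(request.text)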
    return {
        "entities": [
            {"text": ent.text, "label": ent.label_,
             "start": ent.start_char, "end": ent.end_char}
            for ent in doc.ents
        ]
    }

Performance Benchmarks

Model                    Docs/sec (CPU)   F1 (CoNLL-2003)   Memory
spaCy en_core_web_sm     10,000           0.85              50 MB
spaCy en_core_web_trf    150              0.90              500 MB
BERT-base fine-tuned     100              0.92              440 MB
Flair (stacked)          50               0.93              800 MB

Common Pitfalls

  1. Inconsistent annotation. If annotators disagree on whether “Dr. Smith” includes the title, the model learns noise. Create clear annotation guidelines before labeling.
  2. Training on one domain, deploying on another. A news-trained NER model will miss “CRISPR-Cas9” in biomedical text. Always evaluate on in-domain data.
  3. Ignoring entity boundaries. Partial credit feels good in development but masks real problems. Evaluate with exact match F1.
  4. Not handling overlapping entities. Standard BIO tagging cannot represent overlapping entities (“New York” as both LOC and part of “New York Times” as ORG). If your data has overlaps, use a span-based model instead of sequence labeling.
  5. Skipping entity linking. Extracting “Apple” 500 times is less useful than knowing 480 refer to the company and 20 to the fruit.
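For pitfall 4, it helps to measure how often overlaps actually occur in your annotations before choosing a model type. The following is a small sketch, assuming entities are (start_char, end_char, label) tuples; find_overlaps is a hypothetical helper name.

```python
def find_overlaps(entities):
    """Return pairs of spans that share at least one character."""
    ordered = sorted(entities)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so no later span can overlap `first`
            overlaps.append((first, second))
    return overlaps


text = "I read the New York Times"
entities = [(11, 19, "LOC"), (11, 25, "ORG")]  # "New York" and "New York Times"
for first, second in find_overlaps(entities):
    print(text[first[0]:first[1]], first[2], "overlaps", text[second[0]:second[1]], second[2])
# New York LOC overlaps New York Times ORG
```

If this returns nothing across the whole corpus, BIO sequence labeling is sufficient; if overlaps are frequent, that is a signal to consider a span-based model.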

The one thing to remember: Production NER combines statistical models for generalization with rules for known patterns, and the biggest accuracy gains come from high-quality, domain-specific training data — not from switching to a fancier model architecture.
