spaCy NLP — Deep Dive

Build custom spaCy pipelines, train domain-specific models, and optimize throughput for production NLP systems.

spaCy’s architecture is built around two principles: streaming document processing and component-based pipelines. Understanding both lets you customize it far beyond the default models.

Pipeline Architecture

Every spaCy v3 pipeline is defined by a config.cfg file that specifies which components to load and in what order. You can inspect the resulting component order (the full config is available as nlp.config):

import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

Each component is a callable that receives a Doc and returns it modified. This means you can add, remove, or reorder components:

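# Remove the parser if you only need NER (roughly 30% faster, depending on model and text).
# Note: doc.sents relies on the parser, so add a sentencizer if you still need sentences.
nlp.remove_pipe("parser")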

# Add a custom component
@spacy.Language.component("length_filter")
def length_filter(doc):
    doc._.too_short = len(doc) < 5
    return doc

nlp.add_pipe("length_filter", after="ner")

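Custom attributes (like too_short above) must be registered through spaCy’s extension system before the component first runs; otherwise the assignment to doc._.too_short raises an AttributeError:

from spacy.tokens import Doc
Doc.set_extension("too_short", default=False)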
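
Putting the two snippets together, the following is a minimal sketch of a custom component and its extension. It assumes a blank English pipeline so it runs without downloading a model; the names length_filter_demo and nlp_demo are hypothetical, and force=True is only there so the snippet can be re-run in the same session.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the extension before any component writes to it.
Doc.set_extension("too_short", default=False, force=True)


@Language.component("length_filter_demo")  # hypothetical component name
def length_filter_demo(doc):
    # Flag documents with fewer than five tokens.
    doc._.too_short = len(doc) < 5
    return doc


nlp_demo = spacy.blank("en")  # tokenizer only, no trained components
nlp_demo.add_pipe("length_filter_demo")

short_doc = nlp_demo("Too short.")
long_doc = nlp_demo("This sentence has clearly more than five tokens.")
print(nlp_demo.pipe_names, short_doc._.too_short, long_doc._.too_short)
```

The same component can be added to a trained pipeline with nlp.add_pipe, using before= or after= to control where it runs.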

Efficient Batch Processing

The nlp.pipe() method processes documents in batches, which is significantly faster than calling nlp() in a loop:

texts = ["First document.", "Second document.", ...]  # thousands of texts

# Slow: ~500 docs/sec
docs = [nlp(text) for text in texts]

# Fast: ~2,000 docs/sec (CPU, en_core_web_sm)
docs = list(nlp.pipe(texts, batch_size=256, n_process=4))

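The n_process parameter spawns multiple worker processes. Each gets its own copy of the model, so memory usage multiplies with the number of processes. On a machine with 16 GB RAM and en_core_web_sm, four processes work well. On platforms that start workers with the spawn method (Windows and macOS), put the call under an if __name__ == "__main__": guard. With transformer models, stick to n_process=1 and use GPU batching instead.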
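
Because nlp.pipe() accepts any iterable, texts can be streamed from a generator instead of being held in a list. The sketch below assumes a blank pipeline as a stand-in for a loaded model and a hypothetical read_records() data source; as_tuples=True carries per-document metadata alongside each Doc.

```python
import spacy

nlp_stream = spacy.blank("en")  # tokenizer-only stand-in for a loaded model


def read_records():
    # Hypothetical data source; in practice this could read a file or a queue.
    yield ("First document.", {"id": 1})
    yield ("Second document, slightly longer.", {"id": 2})


token_counts = {}
# as_tuples=True passes the second tuple element through next to each Doc.
for doc, meta in nlp_stream.pipe(read_records(), as_tuples=True, batch_size=2):
    token_counts[meta["id"]] = len(doc)

print(token_counts)
```

Consuming the generator lazily keeps memory flat, since only about one batch of Doc objects is alive at a time.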

Disabling Unused Components

If you only need entities, skip everything else:

docs = nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer"])

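This can substantially raise throughput (up to roughly 2x in favorable cases), because the dependency parser is typically among the most expensive default components. The exact gain depends on the model and the text.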
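
Components can also be switched off temporarily with nlp.select_pipes(), which restores the full pipeline when the block exits. The sketch below is an illustration under assumptions: it uses a blank pipeline with a sentencizer and an entity_ruler as stand-ins for the trained components, and the variable names are hypothetical.

```python
import spacy

nlp_sel = spacy.blank("en")
nlp_sel.add_pipe("sentencizer")
ruler = nlp_sel.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

text = "Apple opened a store. It was busy."

# Inside the block only the entity_ruler runs.
with nlp_sel.select_pipes(enable=["entity_ruler"]):
    active_inside = list(nlp_sel.pipe_names)
    doc_inside = nlp_sel(text)
    ents_inside = [(ent.text, ent.label_) for ent in doc_inside.ents]
    sents_inside = doc_inside.has_annotation("SENT_START")

# After the block every component is active again.
active_after = list(nlp_sel.pipe_names)
sents_after = nlp_sel(text).has_annotation("SENT_START")
print(active_inside, ents_inside, sents_inside, active_after, sents_after)
```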

Training Custom Models

spaCy v3 introduced a configuration-driven training system. The workflow:

1. Annotate data: use Prodigy (Explosion’s annotation tool) or export from Label Studio in spaCy’s JSON format.
2. Generate a config: python -m spacy init config config.cfg --lang en --pipeline ner.
3. Convert data: python -m spacy convert train.json ./corpus. Repeat this for the development set so that both ./corpus/train.spacy and ./corpus/dev.spacy exist.
4. Train: python -m spacy train config.cfg --output ./model --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy.
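
If the annotations already live in Python as character offsets, the .spacy files can also be written directly with DocBin rather than going through spacy convert. This is a minimal sketch, assuming hypothetical example texts and a DRUG label; it writes to a temporary directory only to keep the snippet self-contained.

```python
import pathlib
import tempfile

import spacy
from spacy.tokens import DocBin

nlp_blank = spacy.blank("en")

# Hypothetical annotations: (text, [(start_char, end_char, label), ...])
examples = [
    ("Take aspirin twice daily.", [(5, 12, "DRUG")]),
    ("Ibuprofen reduces fever.", [(0, 9, "DRUG")]),
]

doc_bin = DocBin()
for text, spans in examples:
    doc = nlp_blank.make_doc(text)
    ents = []
    for start, end, label in spans:
        span = doc.char_span(start, end, label=label)
        if span is not None:  # None means the offsets do not align with tokens
            ents.append(span)
    doc.ents = ents
    doc_bin.add(doc)

out_path = pathlib.Path(tempfile.mkdtemp()) / "train.spacy"
doc_bin.to_disk(out_path)

# Read the file back to confirm the entities survived the round trip.
restored = list(DocBin().from_disk(out_path).get_docs(nlp_blank.vocab))
restored_ents = [[(e.text, e.label_) for e in d.ents] for d in restored]
print(restored_ents)
```

Skipping spans where char_span returns None avoids training on misaligned offsets; in a real project those cases are worth logging and reviewing.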

The config file controls everything: model architecture, optimizer, batch size, learning rate schedule, and augmentation. Here is a trimmed example for NER:

[training]
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
maxout_pieces = 2

Transfer Learning with Transformers

For best accuracy, use a transformer backbone:

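python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --gpu

The --gpu flag is what selects a transformer: without it, --optimize accuracy produces a CPU config based on static word vectors. With it, the generated config uses spacy-transformers with a RoBERTa base model. Training requires a GPU; plan for roughly 8 GB of VRAM or more. The resulting model is slower but typically more accurate on complex entities.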
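
Before loading or training a transformer pipeline, it can help to check whether spaCy can actually see a GPU. A minimal sketch: spacy.prefer_gpu() returns True when a GPU was activated and False when spaCy falls back to the CPU, and it should be called before any pipeline is loaded. The commented-out load line assumes the en_core_web_trf package is installed.

```python
import spacy

# Must run before spacy.load(); returns False when no usable GPU is found.
using_gpu = spacy.prefer_gpu()
print("GPU active" if using_gpu else "Falling back to CPU")

# nlp = spacy.load("en_core_web_trf")  # assumes the model package is installed
```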

Custom Entity Patterns with EntityRuler

For known terms, rules beat statistical models on precision:

from spacy.language import Language
from spacy.pipeline import EntityRuler

@Language.factory("drug_ruler")
def create_drug_ruler(nlp, name):
    # Build the ruler directly. Calling nlp.add_pipe() inside a factory would
    # add a second component while this one is still being created.
    ruler = EntityRuler(nlp, name=name)
    patterns = [
        {"label": "DRUG", "pattern": "aspirin"},
        {"label": "DRUG", "pattern": [{"LOWER": "vitamin"}, {"LOWER": "d"}]},
        {"label": "DRUG", "pattern": [{"TEXT": {"REGEX": r"[A-Z]{2,3}-\d{3,5}"}}]},
    ]
    ruler.add_patterns(patterns)
    return ruler

Place the ruler before ner to let rules take priority, or after ner to only fill gaps:

nlp.add_pipe("drug_ruler", before="ner")  # rules win on conflicts
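
The effect of ordering can be sketched without a trained model. In the example below, a first entity_ruler is an assumed stand-in for the ner component, and a second ruler runs after it. By default the later ruler only fills gaps; with overwrite_ents=True it replaces overlapping entities. The labels, component names, and build_pipeline helper are hypothetical.

```python
import spacy

VITAMIN_D = [{"LOWER": "vitamin"}, {"LOWER": "d"}]


def build_pipeline(overwrite):
    nlp = spacy.blank("en")
    # Stand-in for an earlier component (such as ner) that sets entities first.
    first = nlp.add_pipe("entity_ruler", name="first_ruler")
    first.add_patterns([{"label": "SUPPLEMENT", "pattern": VITAMIN_D}])
    # Later ruler: fills gaps by default, overrides when overwrite_ents is True.
    second = nlp.add_pipe(
        "entity_ruler", name="second_ruler", config={"overwrite_ents": overwrite}
    )
    second.add_patterns(
        [
            {"label": "DRUG", "pattern": VITAMIN_D},
            {"label": "DRUG", "pattern": "aspirin"},
        ]
    )
    return nlp


text = "Take vitamin D and aspirin."
fill_gaps = [(e.text, e.label_) for e in build_pipeline(False)(text).ents]
override = [(e.text, e.label_) for e in build_pipeline(True)(text).ents]
print(fill_gaps)
print(override)
```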

Dependency Parsing for Information Extraction

The dependency parser connects each token to its syntactic head. This enables structured extraction without regex:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired Beats Electronics for $3 billion in 2014.")

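# Find acquisition relationships
for token in doc:
    if token.dep_ == "dobj" and token.head.lemma_ == "acquire":
        subjects = [w for w in token.head.children if w.dep_ == "nsubj"]
        # token is only the head of the object phrase, so take the span from
        # its left edge to include compounds such as "Beats"
        obj = doc[token.left_edge.i : token.i + 1]
        if subjects:
            print(f"{subjects[0].text} acquired {obj.text}")
            # Expected with a typical parse: Apple acquired Beats Electronics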

For complex extraction patterns, the DependencyMatcher provides a more declarative approach:

from spacy.matcher import DependencyMatcher

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": "acquire"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object", "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher.add("ACQUISITION", [pattern])
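
To show how the matches are consumed, here is a minimal sketch that runs the same pattern on a hand-annotated Doc. The heads, deps, and lemmas are assumptions standing in for the parser's output, so the example runs without a trained model. Each match is a (match_id, token_ids) pair, with token_ids in the order of the pattern nodes.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

dep_vocab = Vocab()
# Hand-annotated parse (an assumption standing in for the parser's output).
parsed = Doc(
    dep_vocab,
    words=["Apple", "acquired", "Beats", "Electronics"],
    lemmas=["Apple", "acquire", "Beats", "Electronics"],
    heads=[1, 1, 3, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "compound", "dobj"],
)

dep_matcher = DependencyMatcher(dep_vocab)
acquisition = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": "acquire"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object", "RIGHT_ATTRS": {"DEP": "dobj"}},
]
dep_matcher.add("ACQUISITION", [acquisition])

triples = []
for match_id, token_ids in dep_matcher(parsed):
    verb, subject, obj = (parsed[i] for i in token_ids)
    # Extend the object to its left edge so compounds are included.
    obj_phrase = parsed[obj.left_edge.i : obj.i + 1]
    triples.append((subject.text, verb.lemma_, obj_phrase.text))

print(triples)
```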

Vectors and Similarity

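Medium and large models include static word vectors (GloVe-based in older releases). You can compute similarity between documents, spans, or tokens; for a Doc or Span, the vector is the average of its token vectors: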

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I love programming in Python")
doc2 = nlp("Coding with Python is great")
doc3 = nlp("The weather is nice today")

print(doc1.similarity(doc2))  # ~0.90 (exact value varies by model version)
print(doc1.similarity(doc3))  # ~0.45 (exact value varies by model version)
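
To make the computation concrete, the sketch below builds a tiny vocabulary with made-up 3-dimensional vectors (purely illustrative values, not taken from any model) and checks that Doc.similarity() agrees with a hand-written cosine similarity over the averaged token vectors.

```python
import numpy as np
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Made-up toy vectors; real models use a few hundred dimensions.
toy_vectors = {
    "python": [1.0, 0.1, 0.0],
    "coding": [0.9, 0.2, 0.0],
    "weather": [0.0, 0.1, 1.0],
    "sunny": [0.1, 0.0, 0.9],
}

vec_vocab = Vocab()
for word, values in toy_vectors.items():
    vec_vocab.set_vector(word, np.asarray(values, dtype="float32"))

doc_a = Doc(vec_vocab, words=["python", "coding"])
doc_b = Doc(vec_vocab, words=["coding"])
doc_c = Doc(vec_vocab, words=["weather", "sunny"])


def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_related = doc_a.similarity(doc_b)
sim_unrelated = doc_a.similarity(doc_c)
manual_unrelated = cosine(doc_a.vector, doc_c.vector)
print(sim_related > sim_unrelated)
```

Because the document vector is a plain average, long documents tend to drift toward similar vectors, which is one reason similarity scores between whole documents are often higher than intuition suggests.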

For custom domains, you can replace the default vectors:

python -m spacy init vectors en ./custom_vectors.txt ./custom_vectors_model

Production Deployment Patterns

Serving with FastAPI

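from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_core_web_sm")  # load once at startup, not per request

class TextRequest(BaseModel):
    # Request body; a bare `text: str` parameter would be read from the query string
    text: str

@app.post("/entities")
def extract_entities(request: TextRequest):
    doc = nlp(request.text)
    return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]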

Memory Management

spaCy models stay in memory. For multi-model setups:

Load models at startup, not per-request.
Use nlp.max_length (a character limit, 1,000,000 by default) to reject unexpectedly large documents before they consume memory; longer texts raise a ValueError.
For transformer models, call torch.cuda.empty_cache() periodically if GPU memory grows.
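
The length check can be sketched as a small guard in front of the pipeline. This is an illustration under assumptions: the helper name parse_if_small_enough is hypothetical, a blank pipeline stands in for a loaded model, and the limit of 50 characters is deliberately tiny for the demonstration.

```python
import spacy


def parse_if_small_enough(nlp, text):
    """Return a Doc, or None when the text exceeds nlp.max_length characters."""
    if len(text) > nlp.max_length:
        return None
    return nlp(text)


guarded_nlp = spacy.blank("en")
guarded_nlp.max_length = 50  # deliberately tiny limit for the demonstration

ok_doc = parse_if_small_enough(guarded_nlp, "A short request body.")
rejected = parse_if_small_enough(guarded_nlp, "word " * 100)
print(ok_doc is not None, rejected is None)
```

In a web service, the None case would typically map to an HTTP 413 response instead of letting spaCy raise a ValueError deep inside the request handler.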

Benchmarks (en_core_web_sm, single CPU core)

Document Length	Throughput	Latency (p95)
100 words	~8,000 docs/sec	0.3 ms
1,000 words	~800 docs/sec	2.5 ms
10,000 words	~80 docs/sec	25 ms

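Throughput scales nearly linearly with n_process up to the number of physical CPU cores, provided the input is large enough to keep every worker busy. Treat the figures above as rough, hardware-dependent estimates.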
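
Since numbers like these vary with hardware and text, it is worth measuring on your own data. Below is a minimal timing sketch; the helper name measure_throughput and the sample sentence are hypothetical, and a blank pipeline is used so the snippet runs without a model, which means its result is not comparable to the table.

```python
import time

import spacy


def measure_throughput(nlp, texts, batch_size=256):
    """Return documents per second for one pass of nlp.pipe over texts."""
    start = time.perf_counter()
    n_docs = sum(1 for _ in nlp.pipe(texts, batch_size=batch_size))
    elapsed = time.perf_counter() - start
    return n_docs / elapsed


bench_nlp = spacy.blank("en")  # swap in spacy.load(...) for a real measurement
sample_texts = ["This is a short benchmark sentence."] * 500
docs_per_sec = measure_throughput(bench_nlp, sample_texts)
print(f"{docs_per_sec:,.0f} docs/sec")
```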

Common Pitfalls

Loading the model inside a loop. spacy.load() is expensive (100-500 ms). Load once, reuse everywhere.
Using .similarity() with small models. The sm models ship without static word vectors, so similarity scores are unreliable. Use md or lg.
Modifying token text in place. A token’s text and the Doc’s tokenization cannot be edited directly. Use doc.retokenize() to merge or split tokens, custom extensions to attach data, or build a new Doc.
Ignoring the config system. Assembling pipelines by hand in a training script was the spaCy v2 approach. In v3, the config file is the source of truth for reproducible training.
Over-relying on default models for domain text. Medical, legal, and financial text needs fine-tuning. Default models were trained on web text and news.
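
For the token-modification pitfall, the supported route looks roughly like the sketch below: annotations such as the lemma can be assigned, while changes to token boundaries go through doc.retokenize(). The example text and variable names are illustrative, and a blank pipeline is assumed.

```python
import spacy

retok_nlp = spacy.blank("en")
city_doc = retok_nlp("New York is big.")

# Annotations such as the lemma are writable on existing tokens.
city_doc[3].lemma_ = "large"

# Token boundaries change only through the retokenizer.
with city_doc.retokenize() as retokenizer:
    retokenizer.merge(city_doc[0:2], attrs={"LEMMA": "New York"})

tokens_after = [token.text for token in city_doc]
lemmas_after = [token.lemma_ for token in city_doc]
print(tokens_after)
```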

The one thing to remember: spaCy’s power comes from its pipeline architecture — learn to customize, extend, and train pipelines, and you can build NLP systems that are both fast and accurate for your specific domain.
