spaCy NLP — Deep Dive
spaCy’s architecture is built around two principles: streaming document processing and component-based pipelines. Understanding both lets you customize it far beyond the default models.
Pipeline Architecture
Every spaCy pipeline is defined by a config.cfg file that specifies which components to load and in what order. You can inspect the resulting component order on a loaded pipeline:
import spacy
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
Each component is a callable that receives a Doc and returns it modified. This means you can add, remove, or reorder components:
# Remove parser if you only need NER (speeds up processing ~30%)
nlp.remove_pipe("parser")
# Add a custom component
@spacy.Language.component("length_filter")
def length_filter(doc):
    doc._.too_short = len(doc) < 5
    return doc
nlp.add_pipe("length_filter", after="ner")
Custom attributes (like too_short above) must be registered through spaCy’s extension system before the component first runs; otherwise the assignment to doc._.too_short raises an AttributeError:
from spacy.tokens import Doc
Doc.set_extension("too_short", default=False)
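Putting the pieces in a working order, here is a minimal sketch that registers the extension first and then runs the component. It assumes a blank English pipeline (tokenizer only) so that no model download is needed; the component name length_filter_demo and the sample sentences are hypothetical, chosen only for illustration.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the attribute before any component writes to it.
# force=True only makes this snippet safe to re-run in the same session.
Doc.set_extension("too_short", default=False, force=True)


@Language.component("length_filter_demo")  # hypothetical name for this sketch
def length_filter_demo(doc):
    # Flag documents with fewer than five tokens
    doc._.too_short = len(doc) < 5
    return doc


nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline, no trained model
nlp.add_pipe("length_filter_demo")

short_doc = nlp("Too short.")
long_doc = nlp("This sentence has clearly more than five tokens.")
print(short_doc._.too_short, long_doc._.too_short)
```

In a trained pipeline the same component would be added with after="ner" as shown above; the registration step does not change.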
Efficient Batch Processing
The nlp.pipe() method processes documents in batches, which is significantly faster than calling nlp() in a loop:
texts = ["First document.", "Second document.", ...] # thousands of texts
# Slow: ~500 docs/sec
docs = [nlp(text) for text in texts]
# Fast: ~2,000 docs/sec (CPU, en_core_web_sm)
docs = list(nlp.pipe(texts, batch_size=256, n_process=4))
The n_process parameter spawns multiple worker processes. Each gets its own copy of the model, so memory usage multiplies. On a machine with 16 GB RAM and en_core_web_sm, four processes work well. With transformer models, stick to n_process=1 and use GPU batching instead.
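Throughput figures depend heavily on hardware, model, and text length, so it is worth measuring on your own data. The following is a rough sketch of such a measurement, not a rigorous benchmark. It assumes a blank English pipeline and synthetic texts so that it runs without a model; the helper docs_per_second is a hypothetical name. With a tokenizer-only pipeline the gap between the two approaches is likely to be small; swap in your trained pipeline to see a realistic difference.

```python
import time

import spacy

nlp = spacy.blank("en")  # assumption: replace with your trained pipeline
texts = [f"Document number {i} is short." for i in range(2000)]  # synthetic data


def docs_per_second(make_docs):
    """Run make_docs() once and return processed documents per second."""
    start = time.perf_counter()
    count = len(make_docs())
    elapsed = time.perf_counter() - start
    return count / elapsed


# One call per text versus batched streaming through the pipeline
loop_rate = docs_per_second(lambda: [nlp(text) for text in texts])
pipe_rate = docs_per_second(lambda: list(nlp.pipe(texts, batch_size=256)))

print(f"loop: {loop_rate:,.0f} docs/sec")
print(f"pipe: {pipe_rate:,.0f} docs/sec")
```

The sketch leaves n_process at its default of 1; add it to the nlp.pipe call to measure the effect of multiple workers on your machine.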
Disabling Unused Components
If you only need entities, skip everything else:
docs = nlp.pipe(texts, disable=["tagger", "parser", "attribute_ruler", "lemmatizer"])
This can double throughput because dependency parsing is the most expensive default component.
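When components only need to be skipped for part of a program, nlp.select_pipes offers a temporary alternative to passing disable on every call. The sketch below assumes a blank pipeline with two stand-in components (a sentencizer and an entity ruler with one made-up pattern) so that it runs without a trained model.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")            # stand-in for a component you want to skip
ruler = nlp.add_pipe("entity_ruler")   # stand-in for the component you need
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])  # illustrative pattern

# Components are disabled inside the block and restored when it exits
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)
    doc = nlp("Apple is hiring.")

active_after = list(nlp.pipe_names)
print(active_inside)
print(active_after)
print([(ent.text, ent.label_) for ent in doc.ents])
```

With a trained pipeline, the disable list would name the real components, for example the ones listed above.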
Training Custom Models
spaCy v3 introduced a configuration-driven training system. The workflow:
- Annotate data: use Prodigy (Explosion’s annotation tool) or export from Label Studio in spaCy’s JSON format.
- Generate a config: python -m spacy init config config.cfg --lang en --pipeline ner
- Convert data: python -m spacy convert train.json ./corpus
- Train: python -m spacy train config.cfg --output ./model --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
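If your annotations are already available in Python as character offsets, the conversion step can also be done directly with DocBin instead of the convert command. This is a minimal sketch under that assumption; TRAIN_DATA, its two sentences, the DRUG label, and the temporary output path are all made up for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# Hypothetical annotations: (text, [(start_char, end_char, label), ...])
TRAIN_DATA = [
    ("Take aspirin daily.", [(5, 12, "DRUG")]),
    ("Vitamin D helps.", [(0, 9, "DRUG")]),
]

nlp = spacy.blank("en")  # only the tokenizer is needed to build training docs
doc_bin = DocBin()

for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        if span is None:
            # Offsets do not line up with token boundaries; skip this span
            continue
        spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)

# Written to a temporary directory here; a real project would use ./corpus
out_path = Path(tempfile.mkdtemp()) / "train.spacy"
doc_bin.to_disk(out_path)

# Read the file back to confirm the entities survived the round trip
reloaded = list(DocBin().from_disk(out_path).get_docs(nlp.vocab))
print([(ent.text, ent.label_) for d in reloaded for ent in d.ents])
```

Spans that return None from char_span usually point to annotation offsets that cut through a token, which is worth logging rather than silently skipping in a real project.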
The config file controls everything: model architecture, optimizer, batch size, learning rate schedule, and augmentation. Here is a trimmed example for NER:
[training]
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
hidden_width = 64
maxout_pieces = 2
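To see how the stopping settings interact, the [training] values can be parsed and checked in Python. The sketch below assumes thinc's Config class (the parser spaCy configs are built on) and copies only the two sections shown above into a string. The arithmetic is an approximation of the early-stopping behaviour, not a description of the exact training loop.

```python
from thinc.api import Config

# The [training] sections from the trimmed example above
CONFIG_TEXT = """
[training]
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200

[training.optimizer]
@optimizers = "Adam.v1"
learn_rate = 0.001
"""

config = Config().from_str(CONFIG_TEXT)
training = config["training"]

# patience and eval_frequency are both measured in steps, so this is roughly
# how many evaluations in a row may fail to improve before training stops
evals_without_improvement = training["patience"] // training["eval_frequency"]
# Upper bound on the number of evaluations if early stopping never triggers
max_evaluations = training["max_steps"] // training["eval_frequency"]

print(evals_without_improvement, max_evaluations)
print(training["optimizer"]["@optimizers"], training["optimizer"]["learn_rate"])
```

With max_epochs = 0 there is no epoch limit, so in this configuration training ends on either max_steps or patience, whichever is reached first.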
Transfer Learning with Transformers
For best accuracy, use a transformer backbone:
python -m spacy init config config.cfg --lang en --pipeline ner --optimize accuracy --gpu
With --gpu, this generates a config that uses spacy-transformers with a RoBERTa base model; without it, --optimize accuracy produces a CPU pipeline with static word vectors instead. Training requires a GPU and at least 8 GB VRAM. The resulting model is slower but significantly more accurate on complex entities.
Custom Entity Patterns with EntityRuler
For known terms, rules beat statistical models on precision:
from spacy.language import Language
from spacy.pipeline import EntityRuler
@Language.factory("drug_ruler")
def create_drug_ruler(nlp, name):
    # Build the ruler here and return it; nlp.add_pipe() does the adding
    ruler = EntityRuler(nlp, name=name)
    patterns = [
        {"label": "DRUG", "pattern": "aspirin"},
        {"label": "DRUG", "pattern": [{"LOWER": "vitamin"}, {"LOWER": "d"}]},
        {"label": "DRUG", "pattern": [{"TEXT": {"REGEX": r"[A-Z]{2,3}-\d{3,5}"}}]},
    ]
    ruler.add_patterns(patterns)
    return ruler
Place the ruler before ner to let rules take priority, or after ner to only fill gaps:
nlp.add_pipe("drug_ruler", before="ner") # rules win on conflicts
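To check what the patterns actually match, the ruler can be tried on its own. This is a small sketch assuming a blank English pipeline with only the built-in entity_ruler, so the interaction with ner is not shown; the sample sentence is made up.

```python
import spacy

nlp = spacy.blank("en")  # assumption: no statistical ner in this sketch
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Exact phrase match on the token text
    {"label": "DRUG", "pattern": "aspirin"},
    # Token pattern: case-insensitive match on two consecutive tokens
    {"label": "DRUG", "pattern": [{"LOWER": "vitamin"}, {"LOWER": "d"}]},
    # Regex applied to a single token's text
    {"label": "DRUG", "pattern": [{"TEXT": {"REGEX": r"[A-Z]{2,3}-\d{3,5}"}}]},
])

doc = nlp("The patient takes aspirin and Vitamin D, plus compound AB-1234.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

The regex pattern only fires if the tokenizer keeps the code as a single token, so it is worth printing the tokens of a few real examples before relying on it.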
Dependency Parsing for Information Extraction
The dependency parser connects each token to its syntactic head. This enables structured extraction without regex:
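import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired Beats Electronics for $3 billion in 2014.")
# Find acquisition relationships
for token in doc:
    if token.dep_ == "dobj" and token.head.lemma_ == "acquire":
        subjects = [w for w in token.head.children if w.dep_ == "nsubj"]
        # The dobj token is only the head of the object phrase ("Electronics"),
        # so prepend its compound modifiers to recover the full name
        obj = " ".join([w.text for w in token.lefts if w.dep_ == "compound"] + [token.text])
        if subjects:
            print(f"{subjects[0].text} acquired {obj}")
# With the usual parse of this sentence: Apple acquired Beats Electronics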
For complex extraction patterns, the DependencyMatcher provides a more declarative approach:
from spacy.matcher import DependencyMatcher
matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": "acquire"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object", "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher.add("ACQUISITION", [pattern])
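Running the matcher returns, for each match, the token indices in the same order as the pattern entries. The sketch below shows how those indices map back to tokens. To stay runnable without a trained model it assumes a hand-annotated Doc (the heads, dependency labels, and lemmas are written out manually); in practice the doc would come from a pipeline with a parser.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built parse, standing in for the output of a trained parser
doc = Doc(
    nlp.vocab,
    words=["Apple", "acquired", "Beats", "Electronics"],
    lemmas=["Apple", "acquire", "Beats", "Electronics"],
    heads=[1, 1, 3, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "compound", "dobj"],
)

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": "acquire"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "object", "RIGHT_ATTRS": {"DEP": "dobj"}},
]
matcher.add("ACQUISITION", [pattern])

results = []
for match_id, token_ids in matcher(doc):
    # token_ids follow the pattern order: verb, subject, object
    verb, subject, obj = (doc[i] for i in token_ids)
    results.append((subject.text, verb.text, obj.text))
print(results)
```

As with the loop-based version, the matched object is the head token of the phrase, so compound modifiers still need to be collected separately if the full name is required.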
Vectors and Similarity
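Medium and large models include static word vectors. You can compute similarity between documents, spans, or tokens; by default a document's vector is the average of its token vectors, and similarity is the cosine between two such vectors. The scores in the comments below are approximate and vary with the model version: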
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I love programming in Python")
doc2 = nlp("Coding with Python is great")
doc3 = nlp("The weather is nice today")
print(doc1.similarity(doc2)) # ~0.90
print(doc1.similarity(doc3)) # ~0.45
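To make the default computation concrete, here is a plain-Python sketch of averaging word vectors and taking the cosine between the results. The three-dimensional vectors are invented toy values, not real spaCy vectors, and the helper names mean_vector and cosine are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration only
toy_vectors = {
    "python": [0.9, 0.1, 0.0],
    "coding": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.2, 0.9],
    "nice": [0.1, 0.3, 0.7],
}

doc_a = mean_vector([toy_vectors["python"], toy_vectors["coding"]])
doc_b = mean_vector([toy_vectors["weather"], toy_vectors["nice"]])

print(round(cosine(doc_a, doc_a), 3))  # a vector compared with itself
print(round(cosine(doc_a, doc_b), 3))  # two unrelated "documents"
```

Because averaging discards word order, two documents that use similar vocabulary score high even when they say different things, which is worth keeping in mind when interpreting the numbers.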
For custom domains, you can replace the default vectors:
python -m spacy init vectors en ./custom_vectors.txt ./custom_vectors_model
Production Deployment Patterns
Serving with FastAPI
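from fastapi import FastAPI
from pydantic import BaseModel
import spacy
app = FastAPI()
nlp = spacy.load("en_core_web_sm")  # loaded once at startup, shared by all requests
class TextRequest(BaseModel):
    text: str
@app.post("/entities")
def extract_entities(request: TextRequest):
    # Read the text from the JSON body; a bare `text: str` would be a query parameter
    doc = nlp(request.text)
    return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]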
Memory Management
spaCy models stay in memory. For multi-model setups:
- Load models at startup, not per-request.
- Use nlp.max_length to reject unexpectedly large documents before they consume memory.
- For transformer models, call torch.cuda.empty_cache() periodically if GPU memory grows.
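The max_length check can be made explicit in request-handling code. The sketch below assumes a blank pipeline and a deliberately tiny limit so the effect is visible; safe_parse is a hypothetical helper name, and a real service would more likely return an error response than None.

```python
import spacy

nlp = spacy.blank("en")  # assumption: stand-in for the pipeline loaded at startup
# Deliberately tiny limit for the demo; the default is 1,000,000 characters
nlp.max_length = 50


def safe_parse(text):
    """Return a Doc, or None if the text exceeds the configured limit."""
    if len(text) > nlp.max_length:
        # Reject before any processing so oversized input costs nothing
        return None
    return nlp(text)


ok = safe_parse("A short request body.")
rejected = safe_parse("x" * 51)
print(ok is not None, rejected is None)
```

Checking the length up front keeps the decision in your own code; without the check, spaCy raises a ValueError when a text longer than max_length is passed to the pipeline.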
Benchmarks (en_core_web_sm, single CPU core)
| Document Length | Throughput | Latency (p95) |
|---|---|---|
| 100 words | ~8,000 docs/sec | 0.3 ms |
| 1,000 words | ~800 docs/sec | 2.5 ms |
| 10,000 words | ~80 docs/sec | 25 ms |
Throughput scales nearly linearly with n_process up to the number of physical CPU cores.
Common Pitfalls
- Loading the model inside a loop. spacy.load() is expensive (100-500 ms). Load once, reuse everywhere.
- Using .similarity() with small models. The sm model ships without static word vectors, so its similarity scores are unreliable. Use md or lg.
- Modifying tokens directly. A Token is a view into its Doc, and its text cannot be changed in place. Use custom extensions or create new Docs instead.
- Ignoring the config system. Manual pipeline assembly was spaCy v2. In v3, the config file is the source of truth for reproducible training.
- Over-relying on default models for domain text. Medical, legal, and financial text needs fine-tuning. Default models were trained on web text and news.
The one thing to remember: spaCy’s power comes from its pipeline architecture. Learn to customize, extend, and train pipelines, and you can build NLP systems that are both fast and accurate for your specific domain.