spaCy NLP — Core Concepts

Understand spaCy's pipeline architecture, Doc objects, and pre-trained models that power production-grade text analysis in Python.

spaCy is an open-source NLP library designed for production use. Where NLTK prioritizes teaching and breadth, spaCy prioritizes speed, accuracy, and opinionated defaults. It was first released in 2015 by Matthew Honnibal and is now maintained by Explosion.

The Pipeline Model

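When you pass text through spaCy, it flows through a sequence of processing steps called a pipeline. A trained English pipeline such as en_core_web_sm performs these steps, starting with the tokenizer, which always runs first (exact component names and order vary by model and version):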

Tokenizer — splits text into tokens (words, punctuation).
Tagger — assigns part-of-speech tags (noun, verb, adjective).
Parser — determines syntactic dependencies (which word modifies which).
NER — identifies named entities (people, organizations, dates).
Lemmatizer — reduces words to base forms.

Each component reads from and writes to a central Doc object. You load a pipeline (called a “model”) once, then call it on as many texts as you need.
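The load-once, call-many pattern can be sketched as follows. To keep the example runnable without a model download, it assumes a blank English pipeline with only a rule-based sentencizer added; a trained pipeline would list components such as the tagger, parser, and NER instead. The sample texts are made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, writes sentence starts to the Doc

# The tokenizer always runs first and is not listed among the components.
print(nlp.pipe_names)

# Calling the pipeline returns a Doc that every component has written to.
doc = nlp("spaCy is fast. It is built for production.")
sentences = [sent.text for sent in doc.sents]
print(sentences)

# For many texts, nlp.pipe processes them as a stream in batches.
texts = ["First text.", "Second text.", "Third text."]
docs = list(nlp.pipe(texts))
print(len(docs))
```

With a trained pipeline the calls stay the same; only the list of components and the annotations available on the Doc change.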

Doc, Token, and Span

These three objects are the core of spaCy’s data model:

Doc — the full processed text. It holds all tokens in order and provides access to sentences, entities, and noun chunks.
Token — a single word or punctuation mark. Each token carries attributes like .text, .lemma_, .pos_, .dep_, and .ent_type_.
Span — a slice of the Doc, such as a sentence or an entity. Spans behave like mini-documents.

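This design avoids unnecessary copying. Tokens and spans are views into the Doc's underlying data rather than separate objects with their own text, and strings are stored once in a shared vocabulary and referenced by hash. This keeps memory usage low even on large documents.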
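A minimal sketch of how the three objects relate, again assuming a blank English pipeline so that only tokenizer-level attributes are used (attributes like .pos_ or .dep_ need a trained pipeline). The sentence is an arbitrary example.

```python
import spacy

# Assumption: blank pipeline, so only tokenizer-level attributes are populated.
nlp = spacy.blank("en")
doc = nlp("Tokens and spans are views into one Doc.")

# Indexing a Doc gives a Token; slicing gives a Span.
token = doc[0]
span = doc[2:5]

print(token.text, token.i, token.is_alpha)
print(span.text, span.start, span.end)

# Both objects point back to the same Doc instead of holding their own copy.
print(token.doc is doc, span.doc is doc)

# A Span behaves like a small document: it can be iterated and indexed.
print([t.text for t in span])
```

Token indices inside a Span still refer to positions in the parent Doc, which is what makes the "view" description accurate.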

Pre-trained Models

spaCy offers models in multiple sizes:

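sm (small) — fast and compact, lower accuracy, no static word vectors. Good for quick experiments.
md (medium) — balanced, includes a smaller table of 300-dimensional word vectors.
lg (large) — most accurate of the CPU-optimized models; ships a much larger table of 300-dimensional vectors and uses the most disk space and memory of the three.
trf (transformer) — uses a transformer backbone (like RoBERTa). Best accuracy overall. It runs on a CPU, but a GPU is recommended for reasonable speed.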

You install a model as a Python package: python -m spacy download en_core_web_sm. After that, loading it is one line: nlp = spacy.load("en_core_web_sm").
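As a hedged sketch, loading can be wrapped so that a missing model package does not crash a script. The helper name load_pipeline and the fallback to a blank pipeline are assumptions made for this example, not a spaCy convention; in production you would more likely let the error surface.

```python
import spacy


def load_pipeline(name: str = "en_core_web_sm"):
    """Load a trained pipeline, falling back to a blank one if it is not installed.

    Hypothetical helper for illustration only.
    """
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found.
        print(f"{name} is not installed; using a blank English pipeline instead.")
        return spacy.blank("en")


nlp = load_pipeline()
doc = nlp("Hello world")
print([token.text for token in doc])
print(nlp.pipe_names)  # empty for the blank fallback
```

The fallback only provides tokenization, so anything that depends on tags, dependencies, or entities still needs the real model installed.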

Named Entity Recognition

NER is one of spaCy’s strongest features. The default English models recognize entity types like PERSON, ORG, GPE (countries, cities, states), DATE, and MONEY. Accessing entities is straightforward:

doc.ents  → tuple of Span objects
each span has .text and .label_

For domain-specific entities (drug names, legal clauses, product SKUs), you can add a custom NER component or fine-tune the existing model on your annotated data using spaCy’s training system.
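The entity access pattern looks like the sketch below. Note the assumption: the entities here are set by hand on a blank pipeline so the example runs without a trained model. With a pipeline such as en_core_web_sm, the NER component fills doc.ents for you and the reading code is identical.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline with hand-set entities, standing in for a trained NER component.
nlp = spacy.blank("en")
doc = nlp("Apple hired Tim Cook in 1998")

# Span(doc, start, end, label) uses token indices; end is exclusive.
doc.ents = [
    Span(doc, 0, 1, label="ORG"),
    Span(doc, 2, 4, label="PERSON"),
    Span(doc, 5, 6, label="DATE"),
]

# doc.ents is a tuple of Span objects, each with .text and .label_.
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

# The same information is visible per token.
print(doc[2].ent_type_, doc[2].ent_iob_)
```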

Matching and Rule-Based Components

Not everything requires a statistical model. spaCy includes:

Matcher — pattern matching on token attributes (text, POS, shape). Think of it as regex that understands grammar.
PhraseMatcher — fast exact-phrase matching against large term lists. Useful for dictionaries of product names or medical terms.
EntityRuler — adds rule-based entities alongside or instead of the statistical NER.
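A short sketch of the first two matchers, assuming a blank English pipeline and made-up patterns and term lists. The token pattern uses only lexical attributes (LOWER, LIKE_NUM), so it works without a trained model; patterns on POS or dependency labels would need one.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")

# Matcher: one dict per token, each describing the attributes to match.
matcher = Matcher(nlp.vocab)
matcher.add("VERSION", [[{"LOWER": "version"}, {"LIKE_NUM": True}]])

doc = nlp("Upgrade to version 3 before installing version 4.")
version_hits = [doc[start:end].text for _, start, end in matcher(doc)]
print(version_hits)

# PhraseMatcher: patterns are Doc objects; attr="LOWER" makes matching case-insensitive.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["ibuprofen", "folic acid"]  # hypothetical term list
phrase_matcher.add("DRUG", [nlp.make_doc(term) for term in terms])

doc2 = nlp("The patient takes ibuprofen and Folic Acid daily.")
drug_hits = [doc2[start:end].text for _, start, end in phrase_matcher(doc2)]
print(drug_hits)
```

Using nlp.make_doc for the term list runs only the tokenizer, which keeps pattern creation fast when the list is large.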

Combining rules with models is a practical pattern. Rules catch known terms with high, predictable precision, while the model generalizes to unseen variations.
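The rule side of that combination can be sketched with an EntityRuler. This example assumes a blank pipeline and two illustrative patterns; in a trained pipeline you would typically place the ruler relative to the statistical component, for example with nlp.add_pipe("entity_ruler", before="ner").

```python
import spacy

# Assumption: blank pipeline, so the ruler is the only source of entities here.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Patterns can be exact strings or token-attribute lists, as in the Matcher.
ruler.add_patterns([
    {"label": "ORG", "pattern": "Explosion"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "spacy"}]},
])

doc = nlp("Explosion maintains spaCy.")
rule_entities = [(ent.text, ent.label_) for ent in doc.ents]
print(rule_entities)
```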

How It Compares

Feature	spaCy	NLTK	Stanza
Speed	Very fast (Cython)	Slow (pure Python)	Medium
Pre-trained models	Yes, multiple sizes	Limited	Yes
Training system	Built-in CLI	Manual	Built-in
GPU support	Yes (most useful with trf models)	No	Yes
Best for	Production pipelines	Teaching, prototyping	Research accuracy

Common Misunderstanding

People sometimes assume spaCy is a machine learning framework like PyTorch. It is not. spaCy is an NLP library that uses ML models internally. For typical use, you do not write training loops or define neural network layers. Instead, you configure a pipeline, provide annotated data, and run python -m spacy train from the command line.
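The "provide annotated data" step usually means converting annotations into spaCy's binary format. The sketch below assumes character-offset annotations in a made-up TRAIN_DATA list and a hypothetical helper named build_docbin; the file paths in the comments are placeholders.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Hypothetical annotated examples: (text, [(start_char, end_char, label)]).
TRAIN_DATA = [
    ("Aspirin reduces fever.", [(0, 7, "DRUG")]),
    ("Take ibuprofen with food.", [(5, 14, "DRUG")]),
]


def build_docbin(examples):
    """Convert character-offset annotations into a DocBin of annotated Docs."""
    doc_bin = DocBin()
    for text, annotations in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations:
            # char_span returns None if the offsets do not align with token boundaries.
            span = doc.char_span(start, end, label=label)
            if span is not None:
                spans.append(span)
        doc.ents = spans
        doc_bin.add(doc)
    return doc_bin


doc_bin = build_docbin(TRAIN_DATA)
print(len(doc_bin))

# To train, the DocBin would be written to disk and passed to the CLI, e.g.:
#   doc_bin.to_disk("./train.spacy")
#   python -m spacy init config config.cfg --lang en --pipeline ner
#   python -m spacy train config.cfg --output ./output \
#       --paths.train ./train.spacy --paths.dev ./dev.spacy
```

Silently skipping misaligned spans keeps the sketch short; on real data it is worth logging them, since they usually point to annotation or tokenization mismatches.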

The one thing to remember: spaCy gives you a fast, opinionated pipeline that turns raw text into structured data — tokens, tags, entities, and dependencies — ready for whatever your application needs next.
