spaCy NLP — Core Concepts
spaCy is an open-source NLP library designed for production use. Where NLTK prioritizes teaching and breadth, spaCy prioritizes speed, accuracy, and opinionated defaults. It was created by Explosion AI and first released in 2015.
The Pipeline Model
When you pass text through spaCy, it flows through a sequence of processing steps called a pipeline. The default English pipeline includes:
- Tokenizer — splits text into tokens (words, punctuation).
- Tagger — assigns part-of-speech tags (noun, verb, adjective).
- Parser — determines syntactic dependencies (which word modifies which).
- NER — identifies named entities (people, organizations, dates).
- Lemmatizer — reduces words to base forms.
Each component reads from and writes to a shared Doc object. You load a trained pipeline (often loosely called a “model”) once, then call it on as many texts as you need.
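The load-once, call-many pattern can be sketched as follows. This is a minimal illustration, not the default English pipeline: it assumes a blank English pipeline (tokenizer only) with a single rule-based sentencizer added, so it runs without downloading a trained package. With a trained package, nlp.pipe_names would list components such as the tagger, parser, and NER.

```python
import spacy

# Assumption: a blank pipeline is used so no trained package is needed.
# It contains only the tokenizer until components are added.
nlp = spacy.blank("en")

# Add one rule-based component that marks sentence boundaries.
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

texts = [
    "spaCy is fast. It is also opinionated.",
    "Load the pipeline once and reuse it.",
]

# nlp.pipe streams many texts through the same loaded pipeline.
docs = list(nlp.pipe(texts))
for doc in docs:
    print([sent.text for sent in doc.sents])
```

For more than a handful of texts, nlp.pipe is generally preferable to calling nlp(text) in a loop because it processes texts in batches.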
Doc, Token, and Span
These three objects are the core of spaCy’s data model:
- Doc — the full processed text. It holds all tokens in order and provides access to sentences, entities, and noun chunks.
- Token — a single word or punctuation mark. Each token carries attributes like .text, .lemma_, .pos_, .dep_, and .ent_type_.
- Span — a slice of the Doc, such as a sentence or an entity. Spans behave like mini-documents.
This design means spaCy never copies strings unnecessarily. Tokens and spans are views into the same underlying data, which keeps memory usage low even on large documents.
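A small sketch of how the three objects relate, assuming a blank English pipeline. Because no trained components run here, attributes such as .pos_, .dep_, and .ent_type_ are not filled in, so the example sticks to attributes the tokenizer alone provides.

```python
import spacy

nlp = spacy.blank("en")  # assumption: tokenizer only, no trained components
doc = nlp("Apple is looking at buying a startup.")

# A Doc is a sequence of Token objects; token.i is the position in the Doc.
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct)

# Slicing a Doc returns a Span, which records start and end token offsets
# into the parent Doc rather than holding its own copy of the text.
span = doc[0:2]
print(span.text, span.start, span.end)

# Indexing into a Span returns a Token of the parent Doc.
first = span[0]
print(first.text, first.i)
```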
Pre-trained Models
spaCy offers models in multiple sizes:
- sm (small) — fast, lower accuracy, no word vectors. Good for quick experiments.
- md (medium) — balanced, includes 300-dimensional word vectors.
- lg (large) — highest accuracy of the non-transformer models, largest word-vector table, most memory.
- trf (transformer) — uses a transformer backbone (like RoBERTa). Best accuracy but requires a GPU for reasonable speed.
You install a model as a Python package: python -m spacy download en_core_web_sm. After that, loading it is one line: nlp = spacy.load("en_core_web_sm").
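When a script may run on a machine where the package has not been downloaded yet, a guarded load can help. The sketch below uses a hypothetical helper, load_pipeline, which is not part of spaCy; it relies on spacy.load raising OSError when the named package cannot be found, and falls back to a blank English pipeline in that case.

```python
import spacy


def load_pipeline(name: str = "en_core_web_sm"):
    """Hypothetical helper: load a trained pipeline, else fall back to a blank one."""
    try:
        return spacy.load(name)
    except OSError:
        # Raised by spacy.load when the package is not installed.
        # The blank fallback has a tokenizer but no tagger, parser, or NER.
        return spacy.blank("en")


nlp = load_pipeline()
print(nlp.lang, nlp.pipe_names)
```

Whether a silent fallback is appropriate depends on the application; for anything that needs tags or entities, failing loudly is probably the safer choice.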
Named Entity Recognition
NER is one of spaCy’s strongest features. The default model recognizes entity types like PERSON, ORG, GPE (countries/cities), DATE, and MONEY. Accessing entities is straightforward:
doc.ents returns a tuple of Span objects.
Each span exposes .text (the matched text) and .label_ (the entity type as a string).
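The access pattern can be sketched as below. To keep the example runnable without a trained package, it assumes a blank pipeline and sets the entities by hand, which only illustrates the shape of the data. With a trained pipeline such as en_core_web_sm, the NER component fills in doc.ents and the loop stays the same.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # assumption: no statistical NER in this pipeline
doc = nlp("Apple opened an office in Berlin.")

# Hand-set entities stand in for what a trained NER component would predict.
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 5, 6, label="GPE")]

# doc.ents is a tuple of Span objects, each with .text and .label_.
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```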
For domain-specific entities (drug names, legal clauses, product SKUs), you can add a custom NER component or fine-tune the existing model on your annotated data using spaCy’s training system.
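Annotated data for the training system is typically stored in spaCy's binary .spacy format. The sketch below shows one way to build such a file with DocBin, assuming spaCy v3. The TRAIN_DATA examples, the DRUG label, and the output location are all made up for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Hypothetical annotations: (text, [(start_char, end_char, label), ...])
TRAIN_DATA = [
    ("Take ibuprofen twice a day.", [(5, 14, "DRUG")]),
    ("Aspirin can thin the blood.", [(0, 7, "DRUG")]),
]

doc_bin = DocBin()
for text, annotations in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        # char_span returns None when offsets do not align with token boundaries.
        if span is not None:
            spans.append(span)
    doc.ents = spans
    doc_bin.add(doc)

# Assumption: a temporary directory is used here; a real project would
# write to a stable path such as ./train.spacy.
out_path = Path(tempfile.mkdtemp()) / "train.spacy"
doc_bin.to_disk(out_path)

# Read the file back to check that the annotations survived the round trip.
loaded = list(DocBin().from_disk(out_path).get_docs(nlp.vocab))
print([(ent.text, ent.label_) for d in loaded for ent in d.ents])
```

A file written this way can then be passed to spacy train, for example through the --paths.train override, alongside a separate development set.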
Matching and Rule-Based Components
Not everything requires a statistical model. spaCy includes:
- Matcher — pattern matching on token attributes (text, POS, shape). Think of it as regex that understands grammar.
- PhraseMatcher — fast exact-phrase matching against large term lists. Useful for dictionaries of product names or medical terms.
- EntityRuler — adds rule-based entities alongside or instead of the statistical NER.
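The first two matchers can be sketched as follows, assuming spaCy v3 and a blank English pipeline. The pattern names, the example sentence, and the term list are illustrative only. The token pattern uses LOWER and IS_DIGIT because those attributes are available without a trained model; POS-based patterns would need a pipeline with a tagger.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")
doc = nlp("I upgraded from iPhone 11 to iPhone 15 and took ibuprofen.")

# Matcher: each dict describes one token. Here: the word "iphone"
# (case-insensitive) followed by a token made only of digits.
matcher = Matcher(nlp.vocab)
matcher.add("IPHONE_MODEL", [[{"LOWER": "iphone"}, {"IS_DIGIT": True}]])
token_matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(token_matches)

# PhraseMatcher: exact phrases from a term list, compared on the
# lowercase form so "Ibuprofen" and "ibuprofen" both match.
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["ibuprofen", "acetaminophen"]  # hypothetical term list
phrase_matcher.add("DRUG", [nlp.make_doc(term) for term in terms])
phrase_matches = [doc[start:end].text for _, start, end in phrase_matcher(doc)]
print(phrase_matches)
```

Both matchers return (match_id, start, end) tuples, where start and end are token offsets that can be used to slice the Doc into a Span.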
Combining rules with models is a practical pattern. Rules catch known terms predictably and with very high precision (ambiguous terms can still produce false positives); the model generalizes to unseen variations.
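A minimal sketch of rule-based entities, assuming spaCy v3 and a blank English pipeline so that the ruler is the only source of entities. The patterns and labels are illustrative. In a trained pipeline, the ruler would usually be added with before="ner" so that rule matches take priority over the statistical predictions.

```python
import spacy

nlp = spacy.blank("en")

# Add the rule-based entity component by its registered name.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # A string pattern matches the exact phrase.
    {"label": "ORG", "pattern": "Explosion AI"},
    # A token pattern matches attribute by attribute, here case-insensitively.
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc = nlp("Explosion AI has customers in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
```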
How It Compares
| Feature | spaCy | NLTK | Stanza |
|---|---|---|---|
| Speed | Very fast (Cython) | Slow (pure Python) | Medium |
| Pre-trained models | Yes, multiple sizes | Limited | Yes |
| Training system | Built-in CLI | Manual | Built-in |
| GPU support | Yes (trf models) | No | Yes |
| Best for | Production pipelines | Teaching, prototyping | Research accuracy |
Common Misunderstanding
People sometimes assume spaCy is a machine learning framework like PyTorch. It is not. spaCy is an NLP library that happens to use ML models internally. You do not write training loops or define neural network layers. Instead, you configure a pipeline, provide annotated data, and run spacy train from the command line.
The one thing to remember: spaCy gives you a fast, opinionated pipeline that turns raw text into structured data — tokens, tags, entities, and dependencies — ready for whatever your application needs next.