Full-Text Search with Whoosh — Core Concepts
Why use Whoosh
Most applications eventually need search that’s smarter than SQL LIKE '%keyword%'. Users expect to type a few words and get relevant results ranked by quality — the way Google works, but for your app’s data.
Full-text search engines solve this by building inverted indexes — data structures that map every word to the documents containing it. This makes search fast regardless of how many documents you have.
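To make the idea concrete, here is a minimal sketch of an inverted index in plain Python. It is not how Whoosh stores its data; the function names (`build_inverted_index`, `search_all`) and the sample documents are made up for illustration, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, *words):
    """Return IDs of documents that contain every one of the given words."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample data.
documents = {
    1: "Chocolate cake with vanilla frosting",
    2: "Vanilla ice cream",
    3: "Caramel chocolate brownies",
}

inverted = build_inverted_index(documents)
print(sorted(inverted["chocolate"]))                         # [1, 3]
print(sorted(search_all(inverted, "chocolate", "vanilla")))  # [1]
```

A lookup touches only the entries for the query words rather than scanning every document, which is why this structure scales better than a `LIKE` scan.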
Whoosh is a full-text search library written entirely in Python. Unlike Elasticsearch or Solr, it doesn’t require a separate server, Java runtime, or network configuration. You import it, build an index, and search — all within your Python process.
Key concepts
Schema
A schema defines what fields your searchable documents have and how each field should be treated:
- TEXT: Searchable content, tokenized and lowercased by the default analyzer and indexed for full-text queries (stemming requires an analyzer such as StemmingAnalyzer)
- ID: Exact-match field, indexed as a single unit without tokenizing; used for lookups like document IDs or URLs
- KEYWORD: Space- or comma-separated tags or categories
- STORED: Not searchable, just stored for retrieval (like a snippet or date)
- NUMERIC/DATETIME: Typed fields for range queries
Each field can be stored=True (retrievable in results) or not. TEXT fields are always searchable; STORED-only fields are just along for the ride.
Indexing
Indexing is the process of adding documents to the search index. Whoosh reads each document, breaks text fields into tokens (words), applies transformations (lowercasing, plus stemming if the field's analyzer is configured for it), and builds the inverted index on disk.
The index lives in a directory as a set of files. Multiple processes can read the index simultaneously, but only one can write at a time.
Querying
Whoosh supports a rich query language:
- Single term: `chocolate` finds documents containing “chocolate”
- Phrase: `"chocolate cake"` finds the exact phrase
- Boolean: `chocolate AND vanilla` or `chocolate OR caramel`
- Field-specific: `title:cake AND ingredients:chocolate`
- Wildcard: `choco*` matches chocolate, chocoholic, etc.
- Fuzzy: `chokolate~2` matches within an edit distance of 2 (catches typos); this syntax requires adding FuzzyTermPlugin to the query parser
Relevance scoring
Whoosh ranks results using BM25F by default — the same family of algorithms used by modern search engines. Documents score higher when:
- The search term appears frequently in the document
- The search term is rare across all documents (more specific = more relevant)
- The term appears in a field with a higher weight (for example, a title field that has been given a boost over the body field)
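The three factors can be seen in a simplified, textbook-style BM25F score for a single term. This is a sketch under stated assumptions and not Whoosh's exact implementation: the function name `bm25f_term_score`, the IDF variant, and the single document-level length normalization are all simplifications chosen for illustration.

```python
import math


def bm25f_term_score(field_tf, field_boost, doc_len, avg_doc_len,
                     n_docs, docs_with_term, k1=1.2, b=0.75):
    """Simplified BM25F-style score of one term in one document."""
    # Weighted term frequency: occurrences in boosted fields count more.
    weighted_tf = sum(
        field_boost.get(field, 1.0) * tf for field, tf in field_tf.items()
    )
    # Inverse document frequency: rarer terms get a larger weight.
    idf = math.log(
        1.0 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)
    )
    # Length normalization: longer documents need more hits to score the same.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * weighted_tf * (k1 + 1.0) / (weighted_tf + norm)


# Hypothetical boosts: a title match counts twice as much as a body match.
boosts = {"title": 2.0, "body": 1.0}
common = dict(field_boost=boosts, doc_len=100, avg_doc_len=100, n_docs=1000)

once = bm25f_term_score({"body": 1}, docs_with_term=50, **common)
thrice = bm25f_term_score({"body": 3}, docs_with_term=50, **common)
rare = bm25f_term_score({"body": 1}, docs_with_term=5, **common)
in_title = bm25f_term_score({"title": 1}, docs_with_term=50, **common)

print(thrice > once)    # more occurrences score higher
print(rare > once)      # rarer terms score higher
print(in_title > once)  # boosted field scores higher
```

Note that the gain from repeated occurrences flattens out as the count grows, which keeps documents that merely repeat a word from dominating the ranking.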
When to use Whoosh vs. alternatives
| Need | Whoosh | PostgreSQL FTS | Elasticsearch |
|---|---|---|---|
| No external dependencies | ✅ | ✅ (if using PG already) | ❌ |
| Small dataset (<100K docs) | ✅ | ✅ | Overkill |
| Large dataset (millions) | ❌ Slow | ⚠️ OK | ✅ |
| Advanced analytics | ❌ | ❌ | ✅ |
| Embedded/offline apps | ✅ | ❌ | ❌ |
Whoosh is a good fit for desktop applications, small web apps, development prototypes, and any project where adding Elasticsearch would be overengineering.
Common misconception
“Whoosh is too slow for production use.”
For its intended scale (thousands to low hundreds of thousands of documents), Whoosh performs well: a simple query against a 50,000-document index typically returns in single-digit milliseconds, though the exact figure depends on the hardware, the query, and the schema. Problems arise when people try to use it for millions of documents or for high-concurrency web apps with hundreds of simultaneous searches. That’s where Elasticsearch or PostgreSQL full-text search is the better choice.
One thing to remember: Whoosh gives you real search engine functionality — tokenization, stemming, relevance ranking, boolean queries — with zero infrastructure overhead. It’s the right tool when your data fits on one machine and you don’t want to run a search cluster.