Full-Text Search with Whoosh — Core Concepts

Build search functionality into Python applications with Whoosh: schemas, indexing, querying, and relevance ranking.

Why use Whoosh

Most applications eventually need search that’s smarter than SQL LIKE '%keyword%'. Users expect to type a few words and get relevant results ranked by quality — the way Google works, but for your app’s data.

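Full-text search engines solve this by building inverted indexes: data structures that map each word to the documents containing it. Answering a query then means reading the short lists of matching documents for the query's words, rather than scanning the text of every document, so search stays fast as the collection grows.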
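
To make the idea concrete, here is a minimal pure-Python sketch of an inverted index. It is an illustration only, not how Whoosh stores its index; the function name build_inverted_index and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Hypothetical sample data.
docs = {
    1: "chocolate cake recipe",
    2: "vanilla cake with caramel",
    3: "chocolate chip cookies",
}
inverted = build_inverted_index(docs)

# Looking up a word is a dictionary access, not a scan of every document.
chocolate_docs = sorted(inverted["chocolate"])

# An AND query is the intersection of two posting sets.
chocolate_and_cake = sorted(inverted["chocolate"] & inverted["cake"])
```

A real engine adds tokenization rules, term positions for phrase queries, and per-term statistics for ranking, but the word-to-documents mapping is the core.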

Whoosh is a full-text search library written entirely in Python. Unlike Elasticsearch or Solr, it doesn’t require a separate server, Java runtime, or network configuration. You import it, build an index, and search — all within your Python process.

Key concepts

Schema

A schema defines what fields your searchable documents have and how each field should be treated:

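TEXT: Searchable content, tokenized and lowercased for full-text queries (stemmed only if you give the field a stemming analyzer)
ID: Exact-match field, indexed as a single untokenized term; used for lookups like document IDs or URLs
KEYWORD: Space-separated tags or categories (comma-separated if you pass commas=True)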
STORED: Not searchable, just stored for retrieval (like a snippet or date)
NUMERIC/DATETIME: Typed fields for range queries

Most field types accept stored=True, which makes the original value retrievable from search results; by default, TEXT and ID fields are indexed but not stored. STORED fields are the reverse: retrievable in results but never searchable.

Indexing

Indexing is the process of adding documents to the search index. Whoosh reads each document, breaks text fields into tokens (words), applies the field's analyzer (lowercasing and stop-word removal by default, plus stemming if you configure a stemming analyzer), and builds the inverted index on disk.

The index lives in a directory as a set of files. Multiple processes can read the index simultaneously, but only one can write at a time.

Querying

Whoosh supports a rich query language:

Single term: chocolate — finds documents containing “chocolate”
Phrase: "chocolate cake" — finds the exact phrase
Boolean: chocolate AND vanilla, or chocolate OR caramel (terms with no operator between them are combined with AND by default)
Field-specific: title:cake AND ingredients:chocolate
Wildcard: choco* — matches chocolate, chocoholic, etc.
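Fuzzy: chokolate~2 — matches within edit distance of 2 (catches typos); the default query parser does not enable this syntax, so you must add FuzzyTermPlugin to the parser first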

Relevance scoring

Whoosh ranks results using BM25F by default — the same family of algorithms used by modern search engines. Documents score higher when:

The search term appears frequently in the document
The search term is rare across all documents (more specific = more relevant)
The term appears in a field with a higher weight (for example, a title field given a larger field_boost in the schema, so that title matches outrank body matches)
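
The three effects above can be seen in a simplified, single-field BM25 term score. This is a textbook-style sketch, not Whoosh's actual scoring code; the function name, the field_weight parameter (a stand-in for a field boost), and all the numbers are assumptions chosen for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75, field_weight=1.0):
    """Score one term in one document with a simplified BM25 formula."""
    # Rarer terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # A field weight scales the term frequency before saturation is applied.
    weighted_tf = tf * field_weight
    # Longer-than-average documents are penalized through length normalization.
    length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * weighted_tf * (k1 + 1) / (weighted_tf + length_norm)


common = dict(doc_len=100, avg_doc_len=100, n_docs=1000)

baseline = bm25_term_score(tf=1, doc_freq=50, **common)
more_frequent = bm25_term_score(tf=3, doc_freq=50, **common)   # term appears more often
rarer_term = bm25_term_score(tf=1, doc_freq=5, **common)       # term is rarer overall
heavier_field = bm25_term_score(tf=1, doc_freq=50, field_weight=2.0, **common)
```

Each of the three variants scores above the baseline. The term-frequency gain flattens out as tf grows, which is why repeating a word many times does not let a document dominate the ranking.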

When to use Whoosh vs. alternatives

Need	Whoosh	PostgreSQL FTS	Elasticsearch
No external dependencies	✅	✅ (if using PG already)	❌
Small dataset (<100K docs)	✅	✅	Overkill
Large dataset (millions)	❌ Slow	⚠️ OK	✅
Advanced analytics	❌	❌	✅
Embedded/offline apps	✅	❌	❌

Whoosh fits perfectly for desktop applications, small web apps, development prototypes, and any project where adding Elasticsearch would be overengineering.

Common misconception

“Whoosh is too slow for production use.”

For its intended scale (thousands to low hundreds of thousands of documents), Whoosh is fast enough for production. Simple queries against an index of tens of thousands of documents typically return in milliseconds, though exact timings depend on hardware, query complexity, and index layout. Problems arise when people try to use it for millions of documents or for high-concurrency web apps with hundreds of simultaneous searches; that is where Elasticsearch or PostgreSQL full-text search is the better choice.

One thing to remember: Whoosh gives you real search engine functionality — tokenization, stemming, relevance ranking, boolean queries — with zero infrastructure overhead. It’s the right tool when your data fits on one machine and you don’t want to run a search cluster.
