Full-Text Search with Whoosh — Deep Dive
Setting up a Whoosh index
Schema definition
import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID, KEYWORD, DATETIME, NUMERIC, STORED
from whoosh.analysis import StemmingAnalyzer

# Define schema
schema = Schema(
    id=ID(stored=True, unique=True),
    title=TEXT(stored=True, field_boost=2.0, analyzer=StemmingAnalyzer()),
    body=TEXT(stored=True, analyzer=StemmingAnalyzer()),
    tags=KEYWORD(stored=True, commas=True, scorable=True),
    author=ID(stored=True),
    published_at=DATETIME(stored=True),
    word_count=NUMERIC(stored=True),
    snippet=STORED,  # not searchable, just returned in results
)

# Create the index on first run, otherwise reopen the existing one.
# exists_in() also covers a directory that exists but holds no index yet.
index_dir = "search_index"
os.makedirs(index_dir, exist_ok=True)
if index.exists_in(index_dir):
    ix = index.open_dir(index_dir)
else:
    ix = index.create_in(index_dir, schema)
field_boost=2.0 on the title means matches in titles are weighted twice as heavily as body matches in relevance scoring.
Adding documents
from datetime import datetime

writer = ix.writer()
writer.add_document(
    id="article-001",
    title="Understanding Database Indexes",
    body="Database indexes are data structures that improve query performance...",
    tags="databases,performance,sql",
    author="jane_doe",
    published_at=datetime(2026, 3, 15),
    word_count=1500,
    snippet="A guide to database indexing strategies...",
)
writer.add_document(
    id="article-002",
    title="Python Connection Pooling Guide",
    body="Connection pooling reduces database overhead by reusing connections...",
    tags="python,databases,performance",
    author="john_smith",
    published_at=datetime(2026, 3, 20),
    word_count=2200,
    snippet="How to configure connection pools in Python...",
)
writer.commit()
writer.commit() finalizes the index. Until commit, no changes are visible to searchers.
Updating and deleting documents
writer = ix.writer()
# Update by unique field: replaces the existing document with the same id.
# The whole document is replaced, so fields not passed here (snippet) are dropped.
writer.update_document(
    id="article-001",
    title="Understanding Database Indexes (Updated)",
    body="Revised content with new examples...",
    tags="databases,performance,sql,postgresql",
    author="jane_doe",
    published_at=datetime(2026, 3, 28),
    word_count=1800,
)
# Delete by term
writer.delete_by_term("id", "article-002")
writer.commit()
Searching
Basic search
from whoosh.qparser import QueryParser, MultifieldParser

with ix.searcher() as searcher:
    # Search a single field
    parser = QueryParser("body", ix.schema)
    query = parser.parse("database performance")
    results = searcher.search(query, limit=10)
    for hit in results:
        print(f"{hit['title']} (score: {hit.score:.2f})")
        print(f"  Tags: {hit['tags']}")
Multi-field search
with ix.searcher() as searcher:
    parser = MultifieldParser(["title", "body", "tags"], ix.schema)
    query = parser.parse("python database")
    results = searcher.search(query, limit=20)
    # len() reports every matching document, not just the 20 returned
    print(f"Found {len(results)} results")
    for hit in results:
        print(f"  {hit['title']} (score: {hit.score:.2f})")
Advanced query construction
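from datetime import datetime
from whoosh.query import Term, And, Or, DateRange, NumericRange

# Programmatic query building (no parsing needed).
# Term queries are not analyzed, so the text must match the indexed form.
# The title field is stemmed, so pass each word through its analyzer first.
def title_term(word):
    analyzer = ix.schema["title"].analyzer
    return Term("title", next(analyzer(word)).text)

query = And([
    Or([title_term("database"), title_term("index")]),
    Term("tags", "python"),  # KEYWORD field: indexed as written
    DateRange("published_at", datetime(2026, 1, 1), datetime(2026, 12, 31)),
    NumericRange("word_count", 1000, None),  # at least 1000 words
])

with ix.searcher() as searcher:
    results = searcher.search(query, limit=10)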
Highlighting search terms
with ix.searcher() as searcher:
    parser = QueryParser("body", ix.schema)
    query = parser.parse("database indexing")
    results = searcher.search(query)
    for hit in results:
        # Returns excerpts of the stored body with matching terms wrapped in
        # HTML tags (the exact tag and classes depend on the formatter in use)
        highlighted = hit.highlights("body", top=3)
        print(f"{hit['title']}")
        print(f"  ...{highlighted}...")
Custom analyzers
Analyzers control how text is broken into searchable tokens.
Built-in analyzers
from whoosh.analysis import (
    StandardAnalyzer,  # regex tokens + lowercase + stop-word removal
    StemmingAnalyzer,  # + stemming (running → run)
    FancyAnalyzer,     # + splits compound tokens on case changes and punctuation
    NgramAnalyzer,     # character n-grams for partial matching
    SimpleAnalyzer,    # regex tokens + lowercase only
)

# StemmingAnalyzer is a reasonable default for English text
schema = Schema(
    body=TEXT(analyzer=StemmingAnalyzer()),
)
Building a custom analyzer pipeline
from whoosh.analysis import (
    RegexTokenizer,
    LowercaseFilter,
    StopFilter,
    StemFilter,
    CharsetFilter,
)
from whoosh.support.charset import accent_map

# Custom pipeline: tokenize → lowercase → remove accents → remove stops → stem
custom_analyzer = (
    RegexTokenizer(r"\w+")
    | LowercaseFilter()
    | CharsetFilter(accent_map)  # café → cafe
    | StopFilter(lang="en")
    | StemFilter(lang="en")
)

schema = Schema(
    title=TEXT(analyzer=custom_analyzer, stored=True),
    body=TEXT(analyzer=custom_analyzer),
)
N-gram analyzer for autocomplete
from whoosh.analysis import NgramWordAnalyzer

# Generates n-grams for partial matching
ngram_analyzer = NgramWordAnalyzer(minsize=2, maxsize=4)

schema = Schema(
    title=TEXT(stored=True),
    title_ngrams=TEXT(analyzer=ngram_analyzer),  # for autocomplete
)
When indexing, populate both fields:
writer.add_document(
    title="Database Indexing Strategies",
    title_ngrams="Database Indexing Strategies",  # same content, different analyzer
)
Search title_ngrams for autocomplete and title for full searches.
Faceted search
Facets let users filter results by categories:
from whoosh.sorting import FieldFacet

with ix.searcher() as searcher:
    query = parser.parse("database")
    # allow_overlap=True lets a document with several tags count in each group
    tags_facet = FieldFacet("tags", allow_overlap=True)
    results = searcher.search(query, groupedby={"tags": tags_facet})
    # Get facet counts
    tag_groups = results.groups("tags")
    for tag, doc_ids in tag_groups.items():
        print(f"  {tag}: {len(doc_ids)} results")

# Example output (illustrative):
# python: 5 results
# sql: 3 results
# performance: 7 results
Incremental indexing
For large datasets, rebuilding the entire index on every change is wasteful. Whoosh supports incremental updates:
def sync_index(ix, database_records, last_indexed_at):
    """Only index records modified since last sync.

    database_records must contain every current record, because anything
    in the index that is missing from it is treated as deleted.
    """
    writer = ix.writer()
    # Find modified records
    modified = [r for r in database_records if r.updated_at > last_indexed_at]
    for record in modified:
        writer.update_document(
            id=str(record.id),
            title=record.title,
            body=record.content,
            tags=",".join(record.tags),
        )
    # Find deleted records
    indexed_ids = set()
    with ix.searcher() as searcher:
        for doc_num in searcher.document_numbers():
            stored = searcher.stored_fields(doc_num)
            indexed_ids.add(stored["id"])
    current_ids = {str(r.id) for r in database_records}
    deleted_ids = indexed_ids - current_ids
    for doc_id in deleted_ids:
        writer.delete_by_term("id", doc_id)
    # Plain commit: merging every segment on each sync would rewrite the whole
    # index, so leave optimize=True for a periodic maintenance run.
    writer.commit()
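The bookkeeping in `sync_index` (which ids to re-index, which to delete) can be separated from the index calls and tested on its own. A minimal sketch in plain Python; `Record` and `plan_sync` are hypothetical names introduced here, not part of Whoosh:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Record:
    """Hypothetical stand-in for a database row."""
    id: int
    updated_at: datetime

def plan_sync(records, indexed_ids, last_indexed_at):
    """Return (ids to re-index, ids to delete) without touching the index."""
    to_update = sorted(str(r.id) for r in records if r.updated_at > last_indexed_at)
    current_ids = {str(r.id) for r in records}
    to_delete = sorted(set(indexed_ids) - current_ids)
    return to_update, to_delete

records = [
    Record(1, datetime(2026, 3, 15)),  # unchanged since the last sync
    Record(2, datetime(2026, 3, 28)),  # modified after the last sync
]
to_update, to_delete = plan_sync(
    records,
    indexed_ids={"1", "2", "3"},       # "3" no longer exists in the database
    last_indexed_at=datetime(2026, 3, 20),
)
print(to_update, to_delete)  # ['2'] ['3']
```

Keeping this step pure makes it straightforward to unit test the sync rules before wiring them to a writer.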
Integration with Flask
from flask import Flask, request, jsonify
from whoosh import index
from whoosh.qparser import MultifieldParser

app = Flask(__name__)

# Initialize index at startup
ix = index.open_dir("search_index")

@app.route("/search")
def search():
    q = request.args.get("q", "").strip()
    # type=int falls back to the default instead of raising on non-numeric input
    page = max(1, request.args.get("page", 1, type=int))
    per_page = max(1, request.args.get("per_page", 10, type=int))
    if not q:
        return jsonify({"results": [], "total": 0})
    parser = MultifieldParser(["title", "body", "tags"], ix.schema)
    with ix.searcher() as searcher:
        query = parser.parse(q)
        results = searcher.search_page(query, page, pagelen=per_page)
        # Read the hits while the searcher is still open
        items = []
        for hit in results:
            items.append({
                "id": hit["id"],
                "title": hit["title"],
                "snippet": hit.highlights("body", top=2) or hit.get("snippet", ""),
                "score": round(hit.score, 2),
                "tags": hit.get("tags", ""),
            })
        return jsonify({
            "results": items,
            "total": results.total,  # all matches, not just this page
            "page": page,
            "pages": results.pagecount,
        })
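Because `page` and `per_page` come straight from the query string, it can help to bound them before they reach the searcher. The helper below is a sketch in plain Python; `clamp_paging` and the cap of 50 results per page are assumptions made for illustration, not part of Flask or Whoosh.

```python
import math

def clamp_paging(page, per_page, total, max_per_page=50):
    """Return (page, per_page, pages) forced into sensible bounds."""
    per_page = max(1, min(per_page, max_per_page))
    pages = max(1, math.ceil(total / per_page))
    page = max(1, min(page, pages))
    return page, per_page, pages

print(clamp_paging(page=0, per_page=1000, total=95))  # (1, 50, 2)
print(clamp_paging(page=99, per_page=10, total=95))   # (10, 10, 10)
```

Capping `per_page` matters most, since an unbounded value lets a single request ask for highlights over the entire result set.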
Performance considerations
Writer locking
Whoosh allows only one writer at a time. For web applications handling concurrent write requests, use AsyncWriter:
from whoosh.writing import AsyncWriter
writer = AsyncWriter(ix) # queues writes, retries on lock
writer.add_document(title="New Article", body="Content...")
writer.commit()
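AsyncWriter first tries to take the write lock. If the lock is held, it buffers the changes in memory and, on commit(), keeps retrying in a background thread until it gets the lock. Concurrent write attempts therefore do not raise lock errors, but a commit may complete some time after the call returns.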
Index optimization
Over time, Whoosh creates multiple index segments. Merging them improves search performance:
writer = ix.writer()
writer.commit(optimize=True) # merge all segments into one
Run this periodically (e.g., nightly) for indexes with frequent updates.
Memory-mapped files
Whoosh uses memory-mapped files for reading, which means the OS handles caching efficiently. For very large indexes, ensure your system has enough RAM to cache the most-accessed index segments.
Benchmarks
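Indicative figures on a modern machine (M1 Mac, SSD). Treat them as rough orders of magnitude and measure on your own data, since results depend on document size, schema, and analyzer:
- Indexing: ~5,000 documents/second (with stemming analyzer)
- Searching: ~2,000 queries/second on a 100K document index
- Index size: roughly 30-50% of original text size (more when many fields are stored)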
These numbers make Whoosh suitable for single-server applications serving moderate traffic. For high-concurrency production systems, consider PostgreSQL full-text search or Elasticsearch.
One thing to remember: Whoosh gives you a complete search engine in pure Python — schema definition, text analysis, relevance ranking, faceting, and highlighting — with no external dependencies. Use it for applications where search quality matters but running a search cluster doesn’t make sense. When you outgrow it, the concepts (schemas, analyzers, scoring) transfer directly to Elasticsearch or Solr.