Full-Text Search with Whoosh — Deep Dive

Build production-quality search with Whoosh: custom analyzers, incremental indexing, faceted search, and integration with web frameworks.

Setting up a Whoosh index

Schema definition

import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID, KEYWORD, DATETIME, NUMERIC, STORED
from whoosh.analysis import StemmingAnalyzer

# Define schema
schema = Schema(
    id=ID(stored=True, unique=True),
    title=TEXT(stored=True, field_boost=2.0, analyzer=StemmingAnalyzer()),
    body=TEXT(stored=True, analyzer=StemmingAnalyzer()),
    tags=KEYWORD(stored=True, commas=True, scorable=True),
    author=ID(stored=True),
    published_at=DATETIME(stored=True),
    word_count=NUMERIC(stored=True),
    snippet=STORED,  # not searchable, just returned in results
)

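# Create the index on first run, otherwise open the existing one.
# exists_in() checks for an actual index, not just for the directory.
index_dir = "search_index"
os.makedirs(index_dir, exist_ok=True)
if index.exists_in(index_dir):
    ix = index.open_dir(index_dir)
else:
    ix = index.create_in(index_dir, schema)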

field_boost=2.0 on the title means matches in titles are weighted twice as heavily as body matches in relevance scoring.
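
How much a boost moves the final score depends on the scoring model. As a rough illustration only, the sketch below uses the textbook BM25 term formula with the commonly cited defaults k1=1.2 and b=0.75, and assumes the boost acts as a multiplier on the term weight before saturation. That assumption, and the helper name bm25_term_score, are for illustration and are not a statement about Whoosh's internals.

```python
def bm25_term_score(weight, idf=1.0, field_len=100, avg_field_len=100, k1=1.2, b=0.75):
    """Textbook BM25 score for one term; weight is the (possibly boosted) term frequency."""
    length_norm = k1 * ((1 - b) + b * field_len / avg_field_len)
    return idf * (weight * (k1 + 1)) / (weight + length_norm)


# One occurrence in an unboosted field vs. the same occurrence with weight doubled
plain = bm25_term_score(weight=1.0)
boosted = bm25_term_score(weight=2.0)

print(f"plain={plain:.3f} boosted={boosted:.3f}")
```

Under these assumptions the boosted match scores higher, but less than twice as high, because BM25 saturates as the term weight grows.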

Adding documents

from datetime import datetime

writer = ix.writer()

writer.add_document(
    id="article-001",
    title="Understanding Database Indexes",
    body="Database indexes are data structures that improve query performance...",
    tags="databases,performance,sql",
    author="jane_doe",
    published_at=datetime(2026, 3, 15),
    word_count=1500,
    snippet="A guide to database indexing strategies...",
)

writer.add_document(
    id="article-002",
    title="Python Connection Pooling Guide",
    body="Connection pooling reduces database overhead by reusing connections...",
    tags="python,databases,performance",
    author="john_smith",
    published_at=datetime(2026, 3, 20),
    word_count=2200,
    snippet="How to configure connection pools in Python...",
)

writer.commit()

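writer.commit() writes the changes to the index and releases the write lock. Until commit, no changes are visible to searchers. A searcher opened before the commit keeps seeing the old version of the index, so open a new searcher (or call searcher.refresh()) to see the update.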

Updating and deleting documents

writer = ix.writer()

# Update by unique field: replaces the existing document with the same id.
# Fields left out here (snippet) are not carried over from the old document.
writer.update_document(
    id="article-001",
    title="Understanding Database Indexes (Updated)",
    body="Revised content with new examples...",
    tags="databases,performance,sql,postgresql",
    author="jane_doe",
    published_at=datetime(2026, 3, 28),
    word_count=1800,
)

# Delete by term
writer.delete_by_term("id", "article-002")

writer.commit()

Searching

Basic search

from whoosh.qparser import QueryParser, MultifieldParser

with ix.searcher() as searcher:
    # Search a single field
    parser = QueryParser("body", ix.schema)
    query = parser.parse("database performance")
    results = searcher.search(query, limit=10)

    for hit in results:
        print(f"{hit['title']} (score: {hit.score:.2f})")
        print(f"  Tags: {hit['tags']}")

Multi-field search

with ix.searcher() as searcher:
    parser = MultifieldParser(["title", "body", "tags"], ix.schema)
    query = parser.parse("python database")
    results = searcher.search(query, limit=20)

    print(f"Found {len(results)} results")
    for hit in results:
        print(f"  {hit['title']} — score: {hit.score:.2f}")

Advanced query construction

from whoosh.query import Term, And, Or, DateRange, NumericRange
from datetime import datetime

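# Programmatic query building (no parsing needed).
# Term queries bypass the analyzer, so each term must match an indexed token:
# title uses StemmingAnalyzer, which indexes "database" as "databas".
query = And([
    Or([Term("title", "databas"), Term("title", "index")]),
    Term("tags", "python"),
    DateRange("published_at", datetime(2026, 1, 1), datetime(2026, 12, 31)),
    NumericRange("word_count", 1000, None),  # at least 1000 words
])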

with ix.searcher() as searcher:
    results = searcher.search(query, limit=10)

Highlighting search terms

with ix.searcher() as searcher:
    query = parser.parse("database indexing")
    results = searcher.search(query)

    for hit in results:
        # Returns up to 3 excerpts from body, with matching terms wrapped in <b> tags
        highlighted = hit.highlights("body", top=3)
        print(f"{hit['title']}")
        print(f"  ...{highlighted}...")

Custom analyzers

Analyzers control how text is broken into searchable tokens.
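
To make the idea concrete, here is a toy analysis pipeline in plain Python: tokenize, lowercase, drop stop words, then strip a few suffixes. It is a sketch of the concept, not Whoosh's implementation. The stop-word list and the naive_stem rule are deliberately crude assumptions.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}


def tokenize(text):
    # Split on runs of word characters
    return re.findall(r"\w+", text)


def lowercase(tokens):
    return (t.lower() for t in tokens)


def remove_stops(tokens):
    return (t for t in tokens if t not in STOP_WORDS)


def naive_stem(tokens):
    # Crude suffix stripping; real stemmers such as Porter are far more careful
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        yield t


def analyze(text):
    # Each stage consumes the output of the previous one
    return list(naive_stem(remove_stops(lowercase(tokenize(text)))))


tokens = analyze("The indexes are improving query performance")
print(tokens)  # ['index', 'improv', 'query', 'performance']
```

Because the same pipeline runs on both documents and queries, analyze("Indexes") and analyze("index") produce the same token, which is what lets a search for one form find the other.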

Built-in analyzers

from whoosh.analysis import (
    StandardAnalyzer,     # regex tokenizer + lowercase + stop words
    StemmingAnalyzer,     # + stemming (running → run)
    FancyAnalyzer,        # + intra-word splitting (PowerShot → power, shot)
    NgramAnalyzer,        # character n-grams for partial matching
    SimpleAnalyzer,       # regex tokenizer + lowercase, no stop-word removal
)

# StemmingAnalyzer is a sensible default for English text
schema = Schema(
    body=TEXT(analyzer=StemmingAnalyzer()),
)

Building a custom analyzer pipeline

from whoosh.analysis import (
    RegexTokenizer,
    LowercaseFilter,
    StopFilter,
    StemFilter,
    CharsetFilter,
)
from whoosh.support.charset import accent_map

# Custom pipeline: tokenize → lowercase → remove accents → remove stops → stem
custom_analyzer = (
    RegexTokenizer(r"\w+")
    | LowercaseFilter()
    | CharsetFilter(accent_map)      # café → cafe
    | StopFilter(lang="en")
    | StemFilter(lang="en")
)

schema = Schema(
    title=TEXT(analyzer=custom_analyzer, stored=True),
    body=TEXT(analyzer=custom_analyzer),
)

N-gram analyzer for autocomplete

from whoosh.analysis import NgramWordAnalyzer

# Generates n-grams for partial matching
ngram_analyzer = NgramWordAnalyzer(minsize=2, maxsize=4)

schema = Schema(
    title=TEXT(stored=True),
    title_ngrams=TEXT(analyzer=ngram_analyzer),  # for autocomplete
)

When indexing, populate both fields:

writer.add_document(
    title="Database Indexing Strategies",
    title_ngrams="Database Indexing Strategies",  # same content, different analyzer
)

Search title_ngrams for autocomplete and title for full searches.
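
To picture what ends up in title_ngrams, the sketch below generates word n-grams in plain Python with the same minsize and maxsize as above. The function word_ngrams is a hypothetical helper for illustration. Whoosh's own tokenization and token order may differ in detail.

```python
def word_ngrams(text, minsize=2, maxsize=4):
    """Return every substring of length minsize..maxsize from each lowercased word."""
    grams = []
    for word in text.lower().split():
        for size in range(minsize, maxsize + 1):
            for start in range(len(word) - size + 1):
                grams.append(word[start:start + size])
    return grams


grams = word_ngrams("Data")
print(grams)  # ['da', 'at', 'ta', 'dat', 'ata', 'data']

title_grams = word_ngrams("Database Indexing Strategies")
print("dat" in title_grams, "inde" in title_grams)  # True True
```

A partial input such as "dat" or "inde" is itself one of the indexed n-grams, which is why it can match before the user finishes typing. The cost is a larger index, since every word contributes many tokens.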

Faceted search

Facets let users filter results by categories:

from whoosh.sorting import FieldFacet

with ix.searcher() as searcher:
    query = parser.parse("database")
    # allow_overlap=True lets a document with several tags count under each of them
    tags_facet = FieldFacet("tags", allow_overlap=True)
    results = searcher.search(query, groupedby={"tags": tags_facet})

    # Get facet counts
    tag_groups = results.groups("tags")
    for tag, doc_ids in tag_groups.items():
        print(f"  {tag}: {len(doc_ids)} results")
    # Example output (counts are illustrative):
    #   python: 5 results
    #   sql: 3 results
    #   performance: 7 results
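
For a multi-valued field such as tags, a facet count answers "how many matching documents carry this tag", so one document can contribute to several counts. The plain-Python sketch below shows that counting on the two sample articles from earlier. The helper facet_counts is hypothetical and is not Whoosh's grouping code.

```python
from collections import Counter

hits = [
    {"id": "article-001", "tags": "databases,performance,sql"},
    {"id": "article-002", "tags": "python,databases,performance"},
]


def facet_counts(hits, field="tags"):
    """Count how many hits carry each comma-separated value of the field."""
    counts = Counter()
    for hit in hits:
        # set() so a value repeated inside one document is counted once
        counts.update(set(hit[field].split(",")))
    return counts


counts = facet_counts(hits)
for tag, n in sorted(counts.items()):
    print(f"{tag}: {n}")
```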

Incremental indexing

For large datasets, rebuilding the entire index on every change is wasteful. Whoosh supports incremental updates:

def sync_index(ix, database_records, last_indexed_at):
    """Index records modified since the last sync and remove deleted ones."""
    # Collect the ids currently in the index before taking the write lock.
    # all_stored_fields() yields the stored fields of every indexed document.
    with ix.searcher() as searcher:
        indexed_ids = {fields["id"] for fields in searcher.all_stored_fields()}

    # Records that no longer exist in the database
    current_ids = {str(r.id) for r in database_records}
    deleted_ids = indexed_ids - current_ids

    # Records modified since the last sync
    modified = [r for r in database_records if r.updated_at > last_indexed_at]

    writer = ix.writer()

    for record in modified:
        writer.update_document(
            id=str(record.id),
            title=record.title,
            body=record.content,
            tags=",".join(record.tags),
        )

    for doc_id in deleted_ids:
        writer.delete_by_term("id", doc_id)

    writer.commit(optimize=True)  # merge segments for performance
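
The core of an incremental sync is a set comparison, which can be tested without an index at all. The sketch below isolates that step. Record and plan_sync are hypothetical names, and it assumes every record exposes id and updated_at as in the function above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Record:
    """Hypothetical stand-in for a database row."""
    id: int
    updated_at: datetime


def plan_sync(records, indexed_ids, last_indexed_at):
    """Return (ids to re-index, ids to delete from the index)."""
    to_update = sorted(str(r.id) for r in records if r.updated_at > last_indexed_at)
    current_ids = {str(r.id) for r in records}
    to_delete = sorted(set(indexed_ids) - current_ids)
    return to_update, to_delete


records = [
    Record(1, datetime(2026, 3, 1)),   # unchanged since the last sync
    Record(2, datetime(2026, 3, 25)),  # modified after the last sync
]
# Id "3" is still in the index but no longer in the database
to_update, to_delete = plan_sync(records, {"1", "2", "3"}, datetime(2026, 3, 20))
print(to_update, to_delete)  # ['2'] ['3']
```

Keeping the planning step separate from the writer calls makes it easy to log or dry-run a sync before touching the index.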

Integration with Flask

from flask import Flask, request, jsonify
from whoosh import index
from whoosh.qparser import MultifieldParser

app = Flask(__name__)

# Initialize index at startup
ix = index.open_dir("search_index")

@app.route("/search")
def search():
    q = request.args.get("q", "").strip()
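    # type=int falls back to the default on non-numeric input instead of raising
    page = max(1, request.args.get("page", 1, type=int))
    # Cap the page size (100 is an arbitrary limit) so a client cannot request everything
    per_page = min(max(1, request.args.get("per_page", 10, type=int)), 100)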

    if not q:
        return jsonify({"results": [], "total": 0})

    parser = MultifieldParser(["title", "body", "tags"], ix.schema)

    with ix.searcher() as searcher:
        query = parser.parse(q)
        results = searcher.search_page(query, page, pagelen=per_page)

        items = []
        for hit in results:
            items.append({
                "id": hit["id"],
                "title": hit["title"],
                "snippet": hit.highlights("body", top=2) or hit.get("snippet", ""),
                "score": round(hit.score, 2),
                "tags": hit.get("tags", ""),
            })

        return jsonify({
            "results": items,
            "total": results.total,  # total matches across all pages
            "page": page,
            "pages": results.pagecount,
        })
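
The page, per_page and pages values in the response follow simple arithmetic, sketched below in plain Python. The helper page_bounds is hypothetical, and clamping an out-of-range page to the last page is a design choice assumed here, not necessarily what Whoosh does.

```python
import math


def page_bounds(total, page, per_page):
    """Return (pagecount, start, end) for a 1-based page number.

    start and end are offsets into the full result list, with end exclusive.
    """
    pagecount = max(1, math.ceil(total / per_page))
    page = min(max(1, page), pagecount)  # clamp into the valid range
    start = (page - 1) * per_page
    end = min(start + per_page, total)
    return pagecount, start, end


print(page_bounds(25, 3, 10))   # (3, 20, 25)
print(page_bounds(25, 99, 10))  # (3, 20, 25)
print(page_bounds(0, 1, 10))    # (1, 0, 0)
```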

Performance considerations

Writer locking

Whoosh allows only one writer at a time. For web applications handling concurrent write requests, use AsyncWriter:

from whoosh.writing import AsyncWriter

writer = AsyncWriter(ix)  # queues writes, retries on lock
writer.add_document(title="New Article", body="Content...")
writer.commit()

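AsyncWriter tries to acquire the write lock immediately. If another writer holds it, AsyncWriter buffers the operations in memory and, on commit(), starts a background thread that keeps retrying the lock and then replays them. This avoids lock errors from concurrent write attempts, but commit() can return before the changes reach the index.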
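
The retry pattern can be pictured with a toy stand-in built on threading.Lock. This is not AsyncWriter's code. BufferedWriter, sink and delay are hypothetical names, and the list sink stands in for the index.

```python
import threading
import time


class BufferedWriter:
    """Toy writer: buffers documents, then writes them once the lock is free."""

    def __init__(self, lock, sink, delay=0.01):
        self.lock = lock
        self.sink = sink      # list standing in for the index
        self.delay = delay    # seconds to wait between lock attempts
        self.pending = []

    def add_document(self, **fields):
        self.pending.append(fields)

    def commit(self):
        # Return immediately; a worker thread keeps retrying the lock
        worker = threading.Thread(target=self._replay)
        worker.start()
        return worker

    def _replay(self):
        while not self.lock.acquire(blocking=False):
            time.sleep(self.delay)
        try:
            self.sink.extend(self.pending)
            self.pending = []
        finally:
            self.lock.release()


write_lock = threading.Lock()
committed = []

write_lock.acquire()                      # simulate another writer holding the lock
writer = BufferedWriter(write_lock, committed)
writer.add_document(title="New Article")
worker = writer.commit()                  # returns while the lock is still held
seen_before_release = list(committed)     # nothing written yet

write_lock.release()                      # the other writer finishes
worker.join()                             # wait for the buffered write to land
print(seen_before_release, committed)
```

The gap between commit() returning and the document appearing in committed is the trade-off to keep in mind: callers that need read-your-writes behaviour have to wait for the worker.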

Index optimization

Over time, Whoosh creates multiple index segments. Merging them improves search performance:

writer = ix.writer()
writer.commit(optimize=True)  # merge all segments into one

Run this periodically (e.g., nightly) for indexes with frequent updates.

Memory-mapped files

Whoosh uses memory-mapped files for reading, which means the OS handles caching efficiently. For very large indexes, ensure your system has enough RAM to cache the most-accessed index segments.

Benchmarks

Approximate figures on a modern machine (M1 Mac, SSD); actual results vary with document size, analyzer, and query mix:

Indexing: ~5,000 documents/second (with stemming analyzer)
Searching: ~2,000 queries/second on a 100K document index
Index size: roughly 30-50% of original text size

These numbers make Whoosh suitable for single-server applications serving moderate traffic. For high-concurrency production systems, consider PostgreSQL full-text search or Elasticsearch.
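
Throughput figures like these are worth re-measuring on your own data. Below is a minimal timing harness, assuming a callable that runs a single query. queries_per_second and fake_search are hypothetical names, and fake_search only stands in for a real call such as searcher.search(parser.parse(q)).

```python
import time


def queries_per_second(run_query, queries, repeat=3):
    """Run all queries `repeat` times and report throughput for the fastest pass."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        for q in queries:
            run_query(q)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse timers
    return len(queries) / max(best, 1e-9)


corpus = ["database indexes", "connection pooling", "python search"]


def fake_search(q):
    # Stand-in for a real index lookup
    return [doc for doc in corpus if q in doc]


qps = queries_per_second(fake_search, ["database", "python"] * 500)
print(f"{qps:,.0f} queries/second")
```

Taking the fastest of several passes reduces noise from caches warming up and from other processes on the machine.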

One thing to remember: Whoosh gives you a complete search engine in pure Python — schema definition, text analysis, relevance ranking, faceting, and highlighting — with no external dependencies. Use it for applications where search quality matters but running a search cluster doesn’t make sense. When you outgrow it, the concepts (schemas, analyzers, scoring) transfer directly to Elasticsearch or Solr.
