Semantic Search in Python — Deep Dive

Architect production semantic search in Python with bi-encoder/cross-encoder pipelines, query expansion, relevance tuning, multi-modal search, and performance optimization at scale.

Semantic search at production scale requires more than embedding and nearest-neighbor lookup. You need multi-stage retrieval, query understanding, relevance tuning, and operational infrastructure. This guide covers the architecture that powers real search systems in Python.

1) Two-stage retrieval architecture

Production semantic search uses a bi-encoder for fast candidate retrieval followed by a cross-encoder for precise re-ranking:

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np

# Stage 1: Bi-encoder for fast retrieval
bi_encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Stage 2: Cross-encoder for re-ranking
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query: str, corpus_embeddings: np.ndarray, corpus_texts: list[str],
           top_k: int = 100, final_k: int = 10) -> list[dict]:
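    # Fast retrieval. Assumes corpus_embeddings were L2-normalized at index time,
    # so the dot product below equals cosine similarity.
    query_embedding = bi_encoder.encode(query, normalize_embeddings=True)
    scores = np.dot(corpus_embeddings, query_embedding)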
    top_indices = np.argsort(scores)[-top_k:][::-1]
    candidates = [(corpus_texts[i], scores[i]) for i in top_indices]

    # Precise re-ranking
    pairs = [(query, text) for text, _ in candidates]
    rerank_scores = cross_encoder.predict(pairs)

    results = sorted(
        zip([t for t, _ in candidates], rerank_scores),
        key=lambda x: x[1],
        reverse=True,
    )
    return [{"text": text, "score": float(score)} for text, score in results[:final_k]]

The bi-encoder processes query and documents independently (fast, scalable). The cross-encoder processes query-document pairs jointly (slow, accurate). Combining them gives you the best of both.
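
To make the cost structure concrete, here is a minimal, model-free sketch of the same two-stage pattern. The scorers `word_overlap` and `phrase_match` are toy stand-ins invented for this illustration (not real bi-encoder or cross-encoder models); the point is only that the expensive scorer runs on `top_k` candidates, never on the whole corpus.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def two_stage_search(query: str, corpus: list[str], cheap_score: Scorer,
                     expensive_score: Scorer, top_k: int = 100,
                     final_k: int = 10) -> list[tuple[str, float]]:
    """Retrieve top_k candidates with a cheap scorer, then re-rank them with an expensive one."""
    # Stage 1: the cheap scorer sees every document.
    candidates = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:top_k]
    # Stage 2: the expensive scorer only sees the top_k survivors.
    reranked = sorted(
        ((doc, expensive_score(query, doc)) for doc in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:final_k]

# Toy stand-in for the bi-encoder: number of shared words.
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

# Toy stand-in for the cross-encoder: exact phrase match, with a call counter.
calls = {"expensive": 0}

def phrase_match(query: str, doc: str) -> float:
    calls["expensive"] += 1
    return 1.0 if query.lower() in doc.lower() else 0.0

corpus = [
    "reset your password from the login page",
    "password strength requirements",
    "how to reset a router",
    "billing and invoices",
]
results = two_stage_search("reset your password", corpus, word_overlap,
                           phrase_match, top_k=2, final_k=1)
```

With four documents and `top_k=2`, the expensive scorer is called twice rather than four times. At production scale the same ratio is what keeps cross-encoder latency bounded.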

2) Query understanding and expansion

Raw user queries are often short and ambiguous. Improve retrieval by expanding them:

import json

from openai import OpenAI

client = OpenAI()

def expand_query(query: str) -> list[str]:
    """Generate alternative phrasings to improve recall."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Generate 3 alternative phrasings of this search query that might match different relevant documents. Return a JSON object of the form {{"queries": ["...", "...", "..."]}}.

Query: {query}"""
        }],
        temperature=0.3,
        response_format={"type": "json_object"},
    )
    alternatives = json.loads(resp.choices[0].message.content).get("queries", [])
    return [query] + alternatives[:3]

def multi_query_search(query: str, index, top_k: int = 10) -> list[tuple[str, float]]:
    """Search with multiple query variants and merge results."""
    expanded = expand_query(query)
    all_results = {}
    for q in expanded:
        results = index.search(q, top_k=top_k)
        for rank, result in enumerate(results):
            doc_id = result["id"]
            rrf_score = 1 / (60 + rank + 1)
            all_results[doc_id] = all_results.get(doc_id, 0) + rrf_score
    return sorted(all_results.items(), key=lambda x: x[1], reverse=True)[:top_k]

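Query expansion often improves recall. Gains in the 10-25% range are commonly cited, but the effect varies by corpus, so measure it on your own evaluation set. The cost is one additional LLM call per query plus one retrieval pass per variant, which is acceptable for most applications.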
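
The merge step in `multi_query_search` is reciprocal rank fusion (RRF). As a sketch, it can be isolated into a standalone function that takes ranked lists of document IDs. The function name and the sample IDs below are illustrative; the constant 60 matches the one used above.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of doc IDs; each appearance adds 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# One ranked list per query variant (hypothetical IDs).
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_d"],
    ["doc_b", "doc_a"],
]
fused = reciprocal_rank_fusion(rankings)
fused_ids = [doc_id for doc_id, _ in fused]
```

Because RRF uses only ranks, it needs no score normalization across variants. A document that appears in several lists (here `doc_b`) rises above one that ranks first in a single list.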

3) Hybrid search implementation

Combine BM25 keyword search with vector search:

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

class HybridSearcher:
    def __init__(self, documents: list[str], embeddings: np.ndarray, alpha: float = 0.5):
        self.documents = documents
        self.embeddings = embeddings
        self.alpha = alpha  # weight for semantic vs keyword

        # BM25 index
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

        # Embedding model
        self.encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")

    def search(self, query: str, top_k: int = 10) -> list[dict]:
        # Keyword scores
        bm25_scores = self.bm25.get_scores(query.lower().split())
        bm25_norm = bm25_scores / (bm25_scores.max() + 1e-8)

        # Semantic scores
        query_vec = self.encoder.encode(query)
        semantic_scores = np.dot(self.embeddings, query_vec)
        semantic_norm = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min() + 1e-8)

        # Combine
        combined = self.alpha * semantic_norm + (1 - self.alpha) * bm25_norm
        top_indices = np.argsort(combined)[-top_k:][::-1]

        return [
            {"text": self.documents[i], "score": float(combined[i]),
             "semantic": float(semantic_norm[i]), "keyword": float(bm25_norm[i])}
            for i in top_indices
        ]

The alpha parameter controls the balance. Tune it on your evaluation set. For technical documentation, higher keyword weight helps with exact terms. For conversational queries, higher semantic weight helps.
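
One simple way to tune alpha is a grid sweep that picks the value with the best MRR on the evaluation set. The sketch below assumes you have already computed normalized semantic and keyword scores per candidate for each evaluation query. The data layout, function names, and numbers are hypothetical.

```python
def mrr_for_alpha(alpha: float, eval_cases: list[dict]) -> float:
    """Mean reciprocal rank of the relevant document under a given alpha."""
    total = 0.0
    for case in eval_cases:
        combined = {
            doc_id: alpha * case["semantic"][doc_id] + (1 - alpha) * case["keyword"][doc_id]
            for doc_id in case["semantic"]
        }
        ranked = sorted(combined, key=combined.get, reverse=True)
        total += 1.0 / (ranked.index(case["relevant_id"]) + 1)
    return total / len(eval_cases)

def best_alpha(eval_cases: list[dict], grid: list[float]) -> float:
    """Return the first grid value that maximizes MRR."""
    return max(grid, key=lambda alpha: mrr_for_alpha(alpha, eval_cases))

# Hypothetical pre-normalized scores for two evaluation queries.
eval_cases = [
    # Exact-term query: keyword search favors the relevant document.
    {"semantic": {"d1": 0.6, "d2": 0.8}, "keyword": {"d1": 1.0, "d2": 0.1}, "relevant_id": "d1"},
    # Conversational query: semantic search favors the relevant document.
    {"semantic": {"d3": 0.9, "d4": 0.4}, "keyword": {"d3": 0.2, "d4": 0.5}, "relevant_id": "d3"},
]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
chosen = best_alpha(eval_cases, grid)
```

In this toy set, pure keyword (alpha 0) and pure semantic (alpha 1) each miss one query, while the blended settings rank both relevant documents first. On real data, sweep per query category if the categories behave differently.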

4) Multi-modal semantic search

Search across text, images, and code using models that embed different modalities into the same vector space:

from sentence_transformers import SentenceTransformer
from PIL import Image

# CLIP embeds both images and text into the same space
clip_model = SentenceTransformer("clip-ViT-B-32")

def embed_text(text: str) -> list[float]:
    return clip_model.encode(text).tolist()

def embed_image(image_path: str) -> list[float]:
    img = Image.open(image_path)
    return clip_model.encode(img).tolist()

# Now you can search images with text queries and vice versa
# The vectors live in the same space, so cosine similarity works across modalities

For code search, models like CodeBERT or StarEncoder embed code and natural language descriptions into a shared space, enabling “find me a function that sorts by date” to match actual implementations.
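
Whatever the modality pair (text-to-image with CLIP, or text-to-code with a code model), ranking reduces to cosine similarity between the query vector and the stored vectors. The sketch below uses hand-made 3-dimensional vectors as stand-ins for real embeddings. In practice the inputs would be the outputs of functions like `embed_text` and `embed_image`.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_items(query_vec: list[float], item_vecs: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank stored items (images, code snippets, ...) against a query vector."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in item_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (illustrative values only).
text_query_vec = [1.0, 0.0, 0.0]
image_vecs = {
    "car.jpg": [0.1, 0.9, 0.2],
    "cat.jpg": [0.9, 0.1, 0.0],
}
ranked = rank_items(text_query_vec, image_vecs)
```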

5) Relevance feedback and learning to rank

Improve search over time using implicit feedback signals:

import time
from collections import defaultdict

class FeedbackCollector:
    def __init__(self):
        self.feedback_log: list[dict] = []

    def log_search(self, query: str, results: list[str], clicked_index: int | None):
        self.feedback_log.append({
            "query": query,
            "results": results,
            "clicked_index": clicked_index,
            "timestamp": time.time(),
        })

    def compute_click_through_rate(self) -> dict:
        position_clicks = defaultdict(lambda: {"shown": 0, "clicked": 0})
        for entry in self.feedback_log:
            for i in range(len(entry["results"])):
                position_clicks[i]["shown"] += 1
                if entry["clicked_index"] == i:
                    position_clicks[i]["clicked"] += 1
        return {
            pos: data["clicked"] / data["shown"]
            for pos, data in position_clicks.items()
            if data["shown"] > 0
        }

Use click-through data to:

Identify queries with poor results (no clicks or clicks on low-ranked results).
Build training pairs for fine-tuning your embedding model.
Adjust hybrid search weights per query category.
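
As a sketch of the second point, click logs in the `FeedbackCollector` format can be turned into (query, positive, negative) triplets. This uses the "skip-above" heuristic, which is an assumption of this example rather than something the log itself guarantees: the clicked result is treated as a positive, and results shown above it that the user skipped are treated as negatives. Click data is noisy, so filter or aggregate it before fine-tuning on it.

```python
def build_training_triplets(feedback_log: list[dict]) -> list[tuple[str, str, str]]:
    """Turn click logs into (query, clicked_result, skipped_result) triplets."""
    triplets = []
    for entry in feedback_log:
        clicked = entry["clicked_index"]
        if clicked is None:
            continue  # no click: no positive to learn from
        positive = entry["results"][clicked]
        # Results ranked above the click were seen but skipped.
        for skipped in entry["results"][:clicked]:
            triplets.append((entry["query"], positive, skipped))
    return triplets

# Hypothetical log entries in the same shape FeedbackCollector stores.
feedback_log = [
    {"query": "reset password", "results": ["doc_1", "doc_2", "doc_3"], "clicked_index": 2},
    {"query": "billing cycle", "results": ["doc_4", "doc_5"], "clicked_index": None},
    {"query": "api rate limit", "results": ["doc_6", "doc_7"], "clicked_index": 0},
]
triplets = build_training_triplets(feedback_log)
```

A click on the top result yields no triplets under this heuristic, since nothing was skipped above it.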

6) Scaling to millions of documents

At scale, approximate nearest neighbor (ANN) search is necessary:

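import faiss
import numpy as np

def build_ivf_index(embeddings: np.ndarray, nlist: int = 100) -> faiss.IndexIVFFlat:
    dim = embeddings.shape[1]
    quantizer = faiss.IndexFlatIP(dim)
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)

    # FAISS expects contiguous float32. astype copies, so the caller's array
    # is not modified by the in-place normalization below.
    vectors = np.ascontiguousarray(embeddings.astype(np.float32))
    faiss.normalize_L2(vectors)  # unit vectors: inner product == cosine similarity

    # Train the coarse quantizer (here on all vectors; a sample also works), then add
    index.train(vectors)
    index.add(vectors)

    # Set search-time parameters
    index.nprobe = 10  # search 10 of nlist clusters (tradeoff: speed vs recall)
    return index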

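The index above stores full vectors (IVFFlat). Combining IVF with product quantization (faiss.IndexIVFPQ) compresses the vectors and is what makes billion-scale datasets feasible. For most applications (under 10M documents), HNSW provides better recall with acceptable memory usage.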

Sharding strategy: partition by domain or tenant, run separate indexes per shard, merge results at query time. This also enables per-shard model selection.
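
A minimal sketch of the query-time merge, assuming every shard returns its hits already sorted by descending score and that scores are comparable across shards (same model, same metric). The function name and sample hits are illustrative.

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results: list[list[tuple[str, float]]],
                        top_k: int = 10) -> list[tuple[str, float]]:
    """Merge per-shard hit lists (each sorted by score, descending) into a global top_k."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[1], reverse=True)
    # heapq.merge is lazy, so only top_k items are ever pulled from the shards' lists.
    return list(islice(merged, top_k))

# Hypothetical hits from two shards: (doc_id, score).
shard_a = [("a1", 0.9), ("a2", 0.4)]
shard_b = [("b1", 0.8), ("b2", 0.7)]
top_hits = merge_shard_results([shard_a, shard_b], top_k=3)
```

If shards use different embedding models, as per-shard model selection allows, raw scores are no longer comparable. In that case a rank-based merge such as reciprocal rank fusion is the safer choice.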

7) Evaluation and monitoring

Measure search quality continuously:

def evaluate_search(searcher, eval_set: list[dict], k: int = 10) -> dict:
    """eval_set: [{"query": ..., "relevant_ids": [...]}]; each search result must carry an "id" key."""
    mrr_total = 0
    recall_total = 0

    for case in eval_set:
        results = searcher.search(case["query"], top_k=k)
        result_ids = [r["id"] for r in results]

        # Mean Reciprocal Rank
        for rank, rid in enumerate(result_ids):
            if rid in case["relevant_ids"]:
                mrr_total += 1 / (rank + 1)
                break

        # Recall@k
        hits = len(set(result_ids) & set(case["relevant_ids"]))
        recall_total += hits / len(case["relevant_ids"])

    n = len(eval_set)
    # Label the metrics with the k actually used instead of a hard-coded 10
    return {f"mrr@{k}": mrr_total / n, f"recall@{k}": recall_total / n}

Run this evaluation on every model or index change. Alert when metrics drop below baselines. In production, also track query latency p50/p95 and zero-result rates.
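
For the production side, a small in-process tracker is enough to sketch the idea. The `SearchMonitor` class below is hypothetical; a real deployment would more likely export these numbers to a metrics system than keep them in memory.

```python
import statistics

class SearchMonitor:
    """Tracks per-query latency and how often a search returns nothing."""

    def __init__(self) -> None:
        self.latencies_ms: list[float] = []
        self.zero_results = 0

    def record(self, latency_ms: float, num_results: int) -> None:
        self.latencies_ms.append(latency_ms)
        if num_results == 0:
            self.zero_results += 1

    def summary(self) -> dict:
        n = len(self.latencies_ms)
        if n < 2:
            raise ValueError("need at least two recorded queries")
        # 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
        cuts = statistics.quantiles(self.latencies_ms, n=100, method="inclusive")
        return {
            "p50_ms": cuts[49],
            "p95_ms": cuts[94],
            "zero_result_rate": self.zero_results / n,
        }

# Synthetic traffic: latencies of 1..101 ms, every tenth query returns nothing.
monitor = SearchMonitor()
for i in range(1, 102):
    monitor.record(latency_ms=float(i), num_results=0 if i % 10 == 0 else 5)
stats = monitor.summary()
```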

8) Common production pitfalls

Stale embeddings — when content changes, embeddings must be updated. Build incremental re-embedding into your pipeline.
Query-document mismatch — if queries are short questions and documents are long paragraphs, the embedding model may not bridge the gap well. Consider asymmetric models or query expansion.
Ignoring metadata — filtering by category, date, or permissions after vector search wastes compute. Pre-filter when possible.
Over-relying on one metric — MRR measures first-hit quality; recall measures coverage. Track both.
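
For the stale-embeddings pitfall, one common approach is to store a content hash next to each embedding and re-embed only documents whose hash has changed. This is a sketch of that approach, not the only option, and the function names and storage layout are assumptions.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(documents: dict[str, str],
               stored_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (IDs to embed or re-embed, IDs whose embeddings should be deleted)."""
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in documents]
    return to_embed, to_delete

# Hashes saved at the last indexing run (hypothetical state).
stored_hashes = {
    "a": content_hash("old text"),
    "b": content_hash("unchanged text"),
    "gone": content_hash("deleted document"),
}
# Current content.
documents = {"a": "new text", "b": "unchanged text", "c": "brand new document"}
to_embed, to_delete = find_stale(documents, stored_hashes)
```

Only `a` (changed) and `c` (new) go back through the embedding model, `b` is left alone, and the vector for `gone` is removed from the index.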

The one thing to remember: Production semantic search is a multi-stage system — fast bi-encoder retrieval, precise cross-encoder re-ranking, hybrid keyword fusion, and continuous evaluation form the architecture that delivers relevant results at scale.
