Python Write-Behind Cache — Deep Dive

Build a production write-behind cache in Python with background flushing, batch writes, failure recovery, and monitoring hooks.

Architecture

A write-behind cache has three components: the cache itself, a dirty-tracking mechanism, and a background flusher. The flusher runs on a timer or threshold and pushes accumulated changes to the persistent store.

App → Cache (immediate ack) → Dirty Set → Background Flusher → Database

The dirty set tracks which keys have been modified since the last flush. On each flush cycle, the flusher reads all dirty keys from the cache and writes them to the database in a batch.

Implementation

import json
import logging
import threading
import time
from typing import Any, Callable, Dict, Optional

import redis

logger = logging.getLogger(__name__)


class WriteBehindCache:
    """Write-behind cache with background flushing to a persistent store."""

    def __init__(
        self,
        redis_client: redis.Redis,
        flush_fn: Callable[[Dict[str, dict]], None],
        flush_interval: float = 5.0,
        flush_threshold: int = 100,
        namespace: str = "wbc",
    ):
        self.redis = redis_client
        self.flush_fn = flush_fn
        self.flush_interval = flush_interval
        self.flush_threshold = flush_threshold
        self.ns = namespace
        self._dirty_key = f"{namespace}:dirty"
        self._stop = threading.Event()
        self._flusher = threading.Thread(
            target=self._flush_loop, daemon=True, name="wbc-flusher"
        )
        self._flush_count = 0
        self._error_count = 0

    def start(self) -> None:
        """Start the background flusher."""
        self._flusher.start()
        logger.info("Write-behind flusher started (interval=%.1fs)", self.flush_interval)

    def stop(self, timeout: float = 10.0) -> None:
        """Stop the flusher and perform a final flush."""
        self._stop.set()
        self._flusher.join(timeout=timeout)
        self._flush_dirty()  # Final flush on shutdown
        logger.info("Write-behind flusher stopped (total flushes: %d)", self._flush_count)

    def write(self, key: str, data: dict, ttl: int = 3600) -> None:
        """Write to cache immediately, mark as dirty for later flush."""
        cache_key = f"{self.ns}:data:{key}"
        self.redis.setex(cache_key, ttl, json.dumps(data))
        self.redis.sadd(self._dirty_key, key)

        # Check threshold
        dirty_count = self.redis.scard(self._dirty_key)
        if dirty_count >= self.flush_threshold:
            threading.Thread(target=self._flush_dirty, daemon=True).start()

    def read(self, key: str) -> Optional[dict]:
        """Read from cache (always reflects latest writes)."""
        cache_key = f"{self.ns}:data:{key}"
        raw = self.redis.get(cache_key)
        if raw is None:
            return None
        return json.loads(raw)

    def _flush_loop(self) -> None:
        """Background loop that flushes dirty entries periodically."""
        while not self._stop.is_set():
            self._stop.wait(timeout=self.flush_interval)
            if not self._stop.is_set():
                self._flush_dirty()

    def _flush_dirty(self) -> None:
        """Collect dirty keys and flush to the backing store."""
        # Atomically grab all dirty keys and clear the set
        pipe = self.redis.pipeline()
        pipe.smembers(self._dirty_key)
        pipe.delete(self._dirty_key)
        results = pipe.execute()
        dirty_keys: set = results[0]

        if not dirty_keys:
            return

        # Read current values for all dirty keys
        batch: Dict[str, dict] = {}
        for key in dirty_keys:
            key_str = key.decode() if isinstance(key, bytes) else key
            cache_key = f"{self.ns}:data:{key_str}"
            raw = self.redis.get(cache_key)
            if raw is not None:
                batch[key_str] = json.loads(raw)

        if not batch:
            return

        try:
            self.flush_fn(batch)
            self._flush_count += 1
            logger.info("Flushed %d entries to backing store", len(batch))
        except Exception:
            # Re-add keys to dirty set for retry on next cycle
            logger.exception("Flush failed for %d entries, re-queuing", len(batch))
            self._error_count += 1
            if dirty_keys:
                self.redis.sadd(self._dirty_key, *dirty_keys)

    @property
    def stats(self) -> dict:
        """Return operational statistics."""
        return {
            "flush_count": self._flush_count,
            "error_count": self._error_count,
            "pending_dirty": self.redis.scard(self._dirty_key),
        }

Database flush function

The flush_fn callback receives a dictionary of key-value pairs and must persist them. Here’s an example using SQLAlchemy’s bulk operations:

from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session


def create_flush_fn(session_factory):
    """Create a flush function that bulk-upserts to PostgreSQL."""

    def flush(batch: Dict[str, dict]) -> None:
        session: Session = session_factory()
        try:
            rows = [{"id": k, **v} for k, v in batch.items()]
            stmt = insert(PageView.__table__).values(rows)
            stmt = stmt.on_conflict_do_update(
                index_elements=["id"],
                set_={c.name: c for c in stmt.excluded if c.name != "id"},
            )
            session.execute(stmt)
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()

    return flush

Write coalescing

One major advantage of write-behind is automatic write coalescing. If a user updates their profile picture three times in ten seconds, only the final state gets flushed. The dirty set tracks keys, not individual operations, so intermediate values are naturally superseded.

This is particularly powerful for counters:

def increment_view_count(cache: WriteBehindCache, page_id: str) -> None:
    """Increment a page view counter with write-behind."""
    current = cache.read(page_id)
    count = (current.get("views", 0) if current else 0) + 1
    cache.write(page_id, {"page_id": page_id, "views": count})
    # Hundreds of increments may collapse into a single DB write

Failure scenarios

Cache crash before flush

Unflushed data is lost. Mitigation strategies:

Shorter flush intervals — reduces the data-loss window from minutes to seconds.
Redis AOF persistence — with appendfsync everysec, Redis persists writes to disk within one second. On restart, dirty keys survive.
Dual-write fallback — for critical data, fall back to write-through if the flusher reports errors.

Flush function raises an exception

The implementation above re-adds failed keys to the dirty set, ensuring they’re retried on the next cycle. Add exponential backoff if the database is persistently unavailable.

Ordering guarantees

Write-behind does not guarantee ordering. If key A is updated at T1 and T2, and two concurrent flushers grab each update, the database could receive T2 before T1. For ordered writes, use a single-threaded flusher or append a sequence number that the flush function respects.

Monitoring

Track these metrics in production:

Dirty set size — if it grows unboundedly, flushes aren’t keeping up.
Flush duration — rising flush times indicate database strain.
Error count — repeated flush failures need alerts.
Data age — the max time between a write and its flush measures your data-loss window.

# Expose via Prometheus
from prometheus_client import Gauge, Counter

dirty_gauge = Gauge("wbc_dirty_entries", "Pending dirty cache entries")
flush_counter = Counter("wbc_flushes_total", "Total flush operations")
flush_errors = Counter("wbc_flush_errors_total", "Failed flush operations")

Comparison with write-through

Write-through guarantees zero data loss but adds latency to every write. Write-behind minimizes write latency but accepts a data-loss window. In practice, many systems use both: write-through for critical data (payments, user accounts) and write-behind for high-volume, lower-stakes data (analytics, activity feeds).

The one thing to remember: write-behind caching is a powerful tool for absorbing write bursts, but it demands rigorous flush failure handling and monitoring to avoid silent data loss in production.

pythoncachingdata-patterns