Spaced Repetition Algorithms in Python — Deep Dive

Implement SM-2, Leitner, and FSRS from scratch in Python with scheduling math, parameter optimization, and production deployment patterns.

Spaced repetition scheduling is a constrained optimization problem: maximize long-term retention while minimizing total review count. This guide implements three algorithms in Python, covers the underlying mathematics, and addresses production concerns.

SM-2 Implementation

The SM-2 algorithm tracks three values per item:

from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class SM2Item:
    easiness: float = 2.5
    interval: int = 0
    repetitions: int = 0
    next_review: datetime = field(default_factory=datetime.now)

def sm2_update(item: SM2Item, quality: int) -> SM2Item:
    """Update scheduling after a review. Quality: 0-5."""
    if quality < 0 or quality > 5:
        raise ValueError("Quality must be between 0 and 5")

    if quality < 3:
        # Failed recall — reset
        item.repetitions = 0
        item.interval = 1
    else:
        if item.repetitions == 0:
            item.interval = 1
        elif item.repetitions == 1:
            item.interval = 6
        else:
            item.interval = round(item.interval * item.easiness)
        item.repetitions += 1

    # Adjust easiness factor
    item.easiness = max(1.3, item.easiness + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    item.next_review = datetime.now() + timedelta(days=item.interval)
    return item

The easiness adjustment formula penalizes difficult recalls (quality 3) more than easy ones (quality 5). The floor of 1.3 prevents intervals from shrinking to unusable levels. In practice, most items converge to an easiness between 1.8 and 3.0 after 5-10 reviews.

Leitner System Implementation

from collections import defaultdict
from datetime import datetime, timedelta

BOX_INTERVALS = {1: 1, 2: 3, 3: 7, 4: 14, 5: 30}

@dataclass
class LeitnerItem:
    box: int = 1
    next_review: datetime = field(default_factory=datetime.now)

def leitner_update(item: LeitnerItem, correct: bool) -> LeitnerItem:
    if correct:
        item.box = min(item.box + 1, max(BOX_INTERVALS.keys()))
    else:
        item.box = 1
    item.interval = BOX_INTERVALS[item.box]
    item.next_review = datetime.now() + timedelta(days=item.interval)
    return item

The Leitner system is useful as a baseline and for applications where simplicity matters more than precision. Educational apps targeting younger audiences sometimes prefer it because the “box” metaphor is easy to explain.

FSRS: Mathematical Foundation

FSRS models memory with four parameters per item: stability (S), difficulty (D), and two auxiliary state variables. The core equation predicts recall probability at time t after the last review:

R(t) = (1 + t / (9 * S))^(-1)

This is a power-law forgetting curve where S controls the decay rate. Higher stability means slower forgetting. When R drops to the desired retention threshold (default 0.9), a review is due.

The interval calculation inverts this equation:

def fsrs_interval(stability: float, desired_retention: float = 0.9) -> float:
    """Calculate days until recall probability drops to desired_retention."""
    return stability * 9 * (1 / desired_retention - 1)

After each review, stability and difficulty update based on the rating (1=Again, 2=Hard, 3=Good, 4=Easy):

import math

@dataclass
class FSRSItem:
    stability: float = 0.4
    difficulty: float = 5.0
    last_review: datetime = field(default_factory=datetime.now)
    reps: int = 0

# Default FSRS v4 parameters (17 weights)
DEFAULT_WEIGHTS = [
    0.4, 0.6, 2.4, 5.8,   # initial stability per rating
    4.93, 0.94, 0.86, 0.01, # difficulty
    1.49, 0.14, 0.94,       # stability after success
    2.18, 0.05, 0.34, 1.26, # stability after failure
    0.29, 2.61              # short-term stability
]

def fsrs_update(item: FSRSItem, rating: int, weights: list = None) -> FSRSItem:
    """Update FSRS item after review. Rating: 1-4."""
    w = weights or DEFAULT_WEIGHTS

    elapsed = (datetime.now() - item.last_review).days
    retrievability = (1 + elapsed / (9 * item.stability)) ** -1

    if item.reps == 0:
        # First review — use initial stability
        item.stability = w[rating - 1]
        item.difficulty = w[4] - (rating - 3) * w[5]
    else:
        # Update difficulty
        new_d = item.difficulty - w[6] * (rating - 3)
        item.difficulty = max(1, min(10, new_d))

        if rating == 1:
            # Failure: stability decreases
            item.stability = w[11] * (item.difficulty ** (-w[12])) * (
                (item.stability + 1) ** w[13] - 1
            ) * math.exp((1 - retrievability) * w[14])
        else:
            # Success: stability increases
            item.stability = item.stability * (
                1 + math.exp(w[8]) * (11 - item.difficulty) * (item.stability ** (-w[9]))
                * (math.exp((1 - retrievability) * w[10]) - 1)
            )

    item.reps += 1
    item.last_review = datetime.now()
    return item

Parameter Optimization

FSRS achieves its accuracy advantage through personalized parameters. Given a user’s review history, you optimize the 17 weights to minimize log-loss on their actual recall outcomes:

import numpy as np
from scipy.optimize import minimize

def optimize_fsrs_weights(reviews: list[dict]) -> list[float]:
    """
    Optimize FSRS weights from review history.
    Each review: {'item_id': str, 'rating': int, 'elapsed_days': float, 'recalled': bool}
    """
    def loss(weights):
        total_loss = 0.0
        items = {}

        for review in reviews:
            item_id = review['item_id']
            if item_id not in items:
                items[item_id] = {'stability': weights[0], 'difficulty': weights[4], 'reps': 0}

            state = items[item_id]
            elapsed = review['elapsed_days']

            # Predicted recall probability
            predicted = (1 + elapsed / (9 * max(state['stability'], 0.01))) ** -1
            predicted = max(0.001, min(0.999, predicted))

            # Binary cross-entropy
            actual = 1.0 if review['recalled'] else 0.0
            total_loss -= actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted)

        return total_loss / len(reviews)

    result = minimize(loss, DEFAULT_WEIGHTS, method='L-BFGS-B',
                     bounds=[(0.01, 100)] * len(DEFAULT_WEIGHTS))
    return result.x.tolist()

In practice, you need at least 400 reviews for optimization to outperform default parameters. The py-fsrs library provides a production-ready implementation with GPU-accelerated optimization using PyTorch.

Scheduling Engine Architecture

A production scheduler needs more than the algorithm itself:

from datetime import datetime
from typing import Protocol

class SchedulerProtocol(Protocol):
    def next_interval(self, item_state: dict, rating: int) -> timedelta: ...
    def items_due(self, user_id: str, limit: int) -> list[dict]: ...

class ReviewScheduler:
    def __init__(self, algorithm: str = "fsrs", db=None):
        self.db = db
        self.algo = self._load_algorithm(algorithm)

    def get_session(self, user_id: str, max_new: int = 20, max_review: int = 200):
        """Build a study session mixing new and review items."""
        due_items = self.db.get_due_items(user_id, limit=max_review)
        new_items = self.db.get_new_items(user_id, limit=max_new)

        # Interleave: every 5th card is new if available
        session = []
        new_idx = 0
        for i, item in enumerate(due_items):
            if i % 5 == 4 and new_idx < len(new_items):
                session.append(new_items[new_idx])
                new_idx += 1
            session.append(item)

        # Append remaining new items
        session.extend(new_items[new_idx:])
        return session

    def process_review(self, user_id: str, item_id: str, rating: int):
        """Record a review and update scheduling."""
        state = self.db.get_item_state(user_id, item_id)
        new_state = self.algo.update(state, rating)
        self.db.save_item_state(user_id, item_id, new_state)
        self.db.record_review_log(user_id, item_id, rating, state, new_state)

Handling Edge Cases

Overdue reviews: When a user skips days, items become overdue. SM-2 ignores the delay and uses the stored interval for the next calculation. FSRS accounts for it by computing actual retrievability at the delayed review time, which produces more accurate stability updates. An item reviewed 10 days late with a correct answer deserves a bigger stability boost than one reviewed on time.

Timezone handling: Store all review timestamps in UTC. Convert to local time only for display and “day boundary” calculations. Users who travel across time zones should not have their schedules disrupted.

Fuzz factors: Anki adds ±2.5% randomness to intervals to prevent “review bunching” where many items come due on the same day. Implement this as a final step after computing the base interval.

import random

def fuzz_interval(interval_days: int) -> int:
    if interval_days < 3:
        return interval_days
    fuzz = max(1, round(interval_days * 0.05))
    return interval_days + random.randint(-fuzz, fuzz)

Benchmarking Algorithms

Compare algorithms by splitting historical reviews into train/test sets. The metric is weighted log-loss on held-out reviews:

def benchmark_retention_prediction(algorithm, test_reviews):
    predictions = []
    actuals = []
    for review in test_reviews:
        predicted_recall = algorithm.predict_recall(review['state'], review['elapsed'])
        predictions.append(predicted_recall)
        actuals.append(1.0 if review['recalled'] else 0.0)

    from sklearn.metrics import log_loss
    return log_loss(actuals, predictions)

Published benchmarks on the open-source Anki dataset (>100 million reviews) show FSRS achieving 30% lower log-loss than SM-2 and 15% lower than earlier SM variants.

Performance at Scale

For applications with millions of users, the scheduling computation itself is negligible — it is pure arithmetic. The bottleneck is database queries for due items. Index on (user_id, next_review_date) and use a covering index to avoid table lookups. A single PostgreSQL instance can serve scheduling queries for 100,000+ concurrent users if indexes are correct.

Batch parameter optimization jobs should run asynchronously. Schedule nightly optimization for users with sufficient review history, store optimized weights per user, and fall back to default weights for new users.

The one thing to remember: Spaced repetition in Python comes down to a tight loop — predict when recall drops below threshold, schedule review just before that point, update the model from the outcome — and the difference between algorithms lies in how accurately that prediction model captures individual learning patterns.

pythonspaced-repetitionlearningeducation-technology