Database Indexing Strategies — Deep Dive

Design and analyze database indexes in Python applications using EXPLAIN, covering indexes, expression indexes, and index-only scans.

Defining indexes in Python ORMs

SQLAlchemy index definitions

from sqlalchemy import Index, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Order(Base):
    __tablename__ = "orders"

    id: Mapped[int] = mapped_column(primary_key=True)
    customer_id: Mapped[int] = mapped_column(ForeignKey("customers.id"), index=True)
    status: Mapped[str] = mapped_column(String(20))
    created_at: Mapped[datetime] = mapped_column(default=func.now())
    total: Mapped[float] = mapped_column(Numeric(10, 2))

    __table_args__ = (
        # Composite index for common query pattern
        Index("idx_customer_status", "customer_id", "status"),

        # Partial index — only pending orders
        Index(
            "idx_pending_orders",
            "customer_id", "created_at",
            postgresql_where=(status == "pending"),
        ),

        # Expression index — case-insensitive email search
        Index(
            "idx_lower_status",
            func.lower(status),
        ),
    )

Django index definitions

from django.db import models

class Order(models.Model):
    customer = models.ForeignKey(Customer, on_delete=models.CASCADE)
    status = models.CharField(max_length=20)
    created_at = models.DateTimeField(auto_now_add=True)
    total = models.DecimalField(max_digits=10, decimal_places=2)

    class Meta:
        indexes = [
            # Composite index
            models.Index(fields=['customer_id', 'status']),

            # Composite with ordering (descending)
            models.Index(fields=['customer_id', '-created_at']),

            # Partial index
            models.Index(
                fields=['customer_id', 'created_at'],
                condition=models.Q(status='pending'),
                name='idx_pending_orders',
            ),

            # Covering index (PostgreSQL 11+)
            models.Index(
                fields=['customer_id', 'status'],
                include=['total', 'created_at'],
                name='idx_customer_status_covering',
            ),
        ]

Understanding EXPLAIN ANALYZE

The most important tool for indexing decisions is EXPLAIN ANALYZE, which shows the actual execution plan and timing:

Running EXPLAIN from Python

from sqlalchemy import text

with engine.connect() as conn:
    result = conn.execute(text("""
        EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
        SELECT * FROM orders
        WHERE customer_id = 42 AND status = 'pending'
        ORDER BY created_at DESC
        LIMIT 10
    """))
    for row in result:
        print(row[0])

Reading the output

Limit  (cost=0.43..12.86 rows=10 width=64) (actual time=0.031..0.045 rows=10 loops=1)
  ->  Index Scan using idx_customer_status on orders
        (cost=0.43..158.32 rows=127 width=64)
        (actual time=0.029..0.041 rows=10 loops=1)
        Index Cond: ((customer_id = 42) AND (status = 'pending'))
        Buffers: shared hit=5
Planning Time: 0.085 ms
Execution Time: 0.062 ms

Key things to look for:

Index Scan or Index Only Scan — index is being used ✅
Seq Scan — full table scan, likely needs an index ❌
Buffers: shared hit — data found in cache (fast)
actual time — real execution time in milliseconds

Seq Scan warning signs

Seq Scan on orders  (cost=0.00..25000.00 rows=127 width=64)
                     (actual time=150.031..245.612 rows=127 loops=1)
  Filter: ((customer_id = 42) AND (status = 'pending'))
  Rows Removed by Filter: 999873

The database scanned 1 million rows to find 127 matches. An index on (customer_id, status) would eliminate this.

Composite index design

Column ordering strategy

For a composite index (A, B, C), the database can use it for:

WHERE A = ?
WHERE A = ? AND B = ?
WHERE A = ? AND B = ? AND C = ?
WHERE A = ? AND B > ? ORDER BY B
WHERE A = ? ORDER BY B, C

It cannot use it efficiently for:

WHERE B = ? (skips A)
WHERE A = ? AND C = ? (skips B — can only use A portion)
ORDER BY B, C (no leading A filter)

The equality-first, range-last rule

Place equality conditions before range conditions in the index:

# Query: WHERE status = 'pending' AND created_at > '2026-01-01'

# Good: equality column first
Index("idx_good", "status", "created_at")

# Bad: range column first — can't use status for filtering after range scan
Index("idx_bad", "created_at", "status")

Covering indexes for index-only scans

A covering index includes all columns a query needs, allowing the database to answer entirely from the index without touching the table:

-- Query: SELECT total, created_at FROM orders WHERE customer_id = 42
CREATE INDEX idx_covering ON orders (customer_id) INCLUDE (total, created_at);

The INCLUDE columns aren’t searchable, but they’re stored in the index leaf pages. PostgreSQL can satisfy the query with an Index Only Scan — significantly faster for large tables.

# SQLAlchemy
Index("idx_covering", "customer_id",
      postgresql_include=["total", "created_at"])

Advanced index types

Expression indexes

When queries apply functions to columns, regular indexes don’t help:

-- This query can't use a regular index on email
SELECT * FROM users WHERE lower(email) = 'user@example.com';

-- Solution: index the expression
CREATE INDEX idx_lower_email ON users (lower(email));

In SQLAlchemy:

from sqlalchemy import func, Index

Index("idx_lower_email", func.lower(User.email))

JSONB indexes with GIN

For queries inside JSONB columns:

from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy import Index

class Event(Base):
    __tablename__ = "events"

    id: Mapped[int] = mapped_column(primary_key=True)
    data: Mapped[dict] = mapped_column(JSONB)

    __table_args__ = (
        # GIN index for containment queries (@>, ?, ?|, ?&)
        Index("idx_event_data", "data", postgresql_using="gin"),
    )

This enables fast queries like:

# Find events where data contains {"type": "click"}
stmt = select(Event).where(Event.data.contains({"type": "click"}))

Partial indexes for soft-deletes

Applications using soft-delete patterns (setting deleted_at instead of removing rows) benefit enormously from partial indexes:

# Most queries filter for non-deleted rows
Index(
    "idx_active_orders",
    "customer_id", "created_at",
    postgresql_where=text("deleted_at IS NULL"),
)

This index is smaller, faster to query, and cheaper to maintain because it ignores deleted rows entirely.

Index maintenance

Identifying unused indexes

SELECT
    schemaname, tablename, indexname,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

Run this query periodically. Indexes with zero scans since the last statistics reset are candidates for removal.

Identifying missing indexes

SELECT
    relname AS table,
    seq_scan AS sequential_scans,
    seq_tup_read AS rows_scanned,
    idx_scan AS index_scans,
    n_live_tup AS row_count
FROM pg_stat_user_tables
WHERE seq_scan > 100
  AND n_live_tup > 10000
ORDER BY seq_tup_read DESC;

Tables with many sequential scans and large row counts likely need indexes.

Automated index monitoring in Python

from sqlalchemy import text

def check_index_usage(engine, min_size_mb=1):
    with engine.connect() as conn:
        result = conn.execute(text("""
            SELECT indexrelname, idx_scan,
                   pg_size_pretty(pg_relation_size(indexrelid)) as size
            FROM pg_stat_user_indexes
            WHERE pg_relation_size(indexrelid) > :min_bytes
              AND idx_scan = 0
            ORDER BY pg_relation_size(indexrelid) DESC
        """), {"min_bytes": min_size_mb * 1024 * 1024})

        unused = result.fetchall()
        if unused:
            print(f"Found {len(unused)} unused indexes:")
            for idx in unused:
                print(f"  {idx.indexrelname}: {idx.size} (0 scans)")
        return unused

Index bloat and REINDEX

Over time, indexes accumulate dead entries from UPDATEs and DELETEs. PostgreSQL’s autovacuum handles most cleanup, but heavily updated tables may need periodic reindexing:

-- Check index bloat
SELECT
    tablename, indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;

-- Rebuild without locking (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;

Migration considerations

Creating indexes without downtime

Standard CREATE INDEX locks the table for writes. For production databases, always use CONCURRENTLY:

# Alembic migration
from alembic import op

def upgrade():
    op.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS "
        "idx_orders_customer_status ON orders (customer_id, status)"
    )
    # Note: CONCURRENTLY can't run inside a transaction
    # Add `# no-transaction` to the migration or configure env.py

Dropping unused indexes safely

Before dropping an index, verify it’s truly unused:

Reset statistics: SELECT pg_stat_reset()
Wait one full business cycle (at least a week)
Check idx_scan count
If still zero, safe to drop

One thing to remember: The best indexing strategy comes from your actual queries, not theory. Run EXPLAIN ANALYZE on your slowest queries, check pg_stat_user_indexes for unused indexes, and remember that every index you create speeds up reads but slows down writes. Let the data guide your decisions.

pythondatabasesperformance