Database Indexing Strategies — Deep Dive
Defining indexes in Python ORMs
SQLAlchemy index definitions
from sqlalchemy import Index, func
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
class Base(DeclarativeBase):
pass
class Order(Base):
__tablename__ = "orders"
id: Mapped[int] = mapped_column(primary_key=True)
customer_id: Mapped[int] = mapped_column(ForeignKey("customers.id"), index=True)
status: Mapped[str] = mapped_column(String(20))
created_at: Mapped[datetime] = mapped_column(default=func.now())
total: Mapped[float] = mapped_column(Numeric(10, 2))
__table_args__ = (
# Composite index for common query pattern
Index("idx_customer_status", "customer_id", "status"),
# Partial index — only pending orders
Index(
"idx_pending_orders",
"customer_id", "created_at",
postgresql_where=(status == "pending"),
),
# Expression index — case-insensitive email search
Index(
"idx_lower_status",
func.lower(status),
),
)
Django index definitions
from django.db import models
class Order(models.Model):
customer = models.ForeignKey(Customer, on_delete=models.CASCADE)
status = models.CharField(max_length=20)
created_at = models.DateTimeField(auto_now_add=True)
total = models.DecimalField(max_digits=10, decimal_places=2)
class Meta:
indexes = [
# Composite index
models.Index(fields=['customer_id', 'status']),
# Composite with ordering (descending)
models.Index(fields=['customer_id', '-created_at']),
# Partial index
models.Index(
fields=['customer_id', 'created_at'],
condition=models.Q(status='pending'),
name='idx_pending_orders',
),
# Covering index (PostgreSQL 11+)
models.Index(
fields=['customer_id', 'status'],
include=['total', 'created_at'],
name='idx_customer_status_covering',
),
]
Understanding EXPLAIN ANALYZE
The most important tool for indexing decisions is EXPLAIN ANALYZE, which shows the actual execution plan and timing:
Running EXPLAIN from Python
from sqlalchemy import text
with engine.connect() as conn:
result = conn.execute(text("""
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders
WHERE customer_id = 42 AND status = 'pending'
ORDER BY created_at DESC
LIMIT 10
"""))
for row in result:
print(row[0])
Reading the output
Limit (cost=0.43..12.86 rows=10 width=64) (actual time=0.031..0.045 rows=10 loops=1)
-> Index Scan using idx_customer_status on orders
(cost=0.43..158.32 rows=127 width=64)
(actual time=0.029..0.041 rows=10 loops=1)
Index Cond: ((customer_id = 42) AND (status = 'pending'))
Buffers: shared hit=5
Planning Time: 0.085 ms
Execution Time: 0.062 ms
Key things to look for:
- Index Scan or Index Only Scan — index is being used ✅
- Seq Scan — full table scan, likely needs an index ❌
- Buffers: shared hit — data found in cache (fast)
- actual time — real execution time in milliseconds
Seq Scan warning signs
Seq Scan on orders (cost=0.00..25000.00 rows=127 width=64)
(actual time=150.031..245.612 rows=127 loops=1)
Filter: ((customer_id = 42) AND (status = 'pending'))
Rows Removed by Filter: 999873
The database scanned 1 million rows to find 127 matches. An index on (customer_id, status) would eliminate this.
Composite index design
Column ordering strategy
For a composite index (A, B, C), the database can use it for:
WHERE A = ?WHERE A = ? AND B = ?WHERE A = ? AND B = ? AND C = ?WHERE A = ? AND B > ? ORDER BY BWHERE A = ? ORDER BY B, C
It cannot use it efficiently for:
WHERE B = ?(skips A)WHERE A = ? AND C = ?(skips B — can only use A portion)ORDER BY B, C(no leading A filter)
The equality-first, range-last rule
Place equality conditions before range conditions in the index:
# Query: WHERE status = 'pending' AND created_at > '2026-01-01'
# Good: equality column first
Index("idx_good", "status", "created_at")
# Bad: range column first — can't use status for filtering after range scan
Index("idx_bad", "created_at", "status")
Covering indexes for index-only scans
A covering index includes all columns a query needs, allowing the database to answer entirely from the index without touching the table:
-- Query: SELECT total, created_at FROM orders WHERE customer_id = 42
CREATE INDEX idx_covering ON orders (customer_id) INCLUDE (total, created_at);
The INCLUDE columns aren’t searchable, but they’re stored in the index leaf pages. PostgreSQL can satisfy the query with an Index Only Scan — significantly faster for large tables.
# SQLAlchemy
Index("idx_covering", "customer_id",
postgresql_include=["total", "created_at"])
Advanced index types
Expression indexes
When queries apply functions to columns, regular indexes don’t help:
-- This query can't use a regular index on email
SELECT * FROM users WHERE lower(email) = 'user@example.com';
-- Solution: index the expression
CREATE INDEX idx_lower_email ON users (lower(email));
In SQLAlchemy:
from sqlalchemy import func, Index
Index("idx_lower_email", func.lower(User.email))
JSONB indexes with GIN
For queries inside JSONB columns:
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy import Index
class Event(Base):
__tablename__ = "events"
id: Mapped[int] = mapped_column(primary_key=True)
data: Mapped[dict] = mapped_column(JSONB)
__table_args__ = (
# GIN index for containment queries (@>, ?, ?|, ?&)
Index("idx_event_data", "data", postgresql_using="gin"),
)
This enables fast queries like:
# Find events where data contains {"type": "click"}
stmt = select(Event).where(Event.data.contains({"type": "click"}))
Partial indexes for soft-deletes
Applications using soft-delete patterns (setting deleted_at instead of removing rows) benefit enormously from partial indexes:
# Most queries filter for non-deleted rows
Index(
"idx_active_orders",
"customer_id", "created_at",
postgresql_where=text("deleted_at IS NULL"),
)
This index is smaller, faster to query, and cheaper to maintain because it ignores deleted rows entirely.
Index maintenance
Identifying unused indexes
SELECT
schemaname, tablename, indexname,
idx_scan AS times_used,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
Run this query periodically. Indexes with zero scans since the last statistics reset are candidates for removal.
Identifying missing indexes
SELECT
relname AS table,
seq_scan AS sequential_scans,
seq_tup_read AS rows_scanned,
idx_scan AS index_scans,
n_live_tup AS row_count
FROM pg_stat_user_tables
WHERE seq_scan > 100
AND n_live_tup > 10000
ORDER BY seq_tup_read DESC;
Tables with many sequential scans and large row counts likely need indexes.
Automated index monitoring in Python
from sqlalchemy import text
def check_index_usage(engine, min_size_mb=1):
with engine.connect() as conn:
result = conn.execute(text("""
SELECT indexrelname, idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > :min_bytes
AND idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC
"""), {"min_bytes": min_size_mb * 1024 * 1024})
unused = result.fetchall()
if unused:
print(f"Found {len(unused)} unused indexes:")
for idx in unused:
print(f" {idx.indexrelname}: {idx.size} (0 scans)")
return unused
Index bloat and REINDEX
Over time, indexes accumulate dead entries from UPDATEs and DELETEs. PostgreSQL’s autovacuum handles most cleanup, but heavily updated tables may need periodic reindexing:
-- Check index bloat
SELECT
tablename, indexname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
idx_scan
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC;
-- Rebuild without locking (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;
Migration considerations
Creating indexes without downtime
Standard CREATE INDEX locks the table for writes. For production databases, always use CONCURRENTLY:
# Alembic migration
from alembic import op
def upgrade():
op.execute(
"CREATE INDEX CONCURRENTLY IF NOT EXISTS "
"idx_orders_customer_status ON orders (customer_id, status)"
)
# Note: CONCURRENTLY can't run inside a transaction
# Add `# no-transaction` to the migration or configure env.py
Dropping unused indexes safely
Before dropping an index, verify it’s truly unused:
- Reset statistics:
SELECT pg_stat_reset() - Wait one full business cycle (at least a week)
- Check
idx_scancount - If still zero, safe to drop
One thing to remember: The best indexing strategy comes from your actual queries, not theory. Run EXPLAIN ANALYZE on your slowest queries, check pg_stat_user_indexes for unused indexes, and remember that every index you create speeds up reads but slows down writes. Let the data guide your decisions.
See Also
- Python Aioredis Understand Aioredis through a practical analogy so your Python decisions become faster and clearer.
- Python Alembic Understand Alembic through a practical analogy so your Python decisions become faster and clearer.
- Python Asyncpg Database asyncpg is the fastest way for Python to talk to PostgreSQL without making your program sit around waiting.
- Python Asyncpg Understand Asyncpg through a practical analogy so your Python decisions become faster and clearer.
- Python Cassandra Python Understand Cassandra Python through a practical analogy so your Python decisions become faster and clearer.