Dynamic Programming — Deep Dive

Advanced DP techniques in Python: bitmask DP, interval DP, state compression, and optimization strategies for competitive and production code.

Technical framing

Dynamic programming reduces exponential-time problems to polynomial time by exploiting overlapping subproblems. The art lies in state design: choosing the minimal representation that captures all decision-relevant information. A poorly chosen state leads to unnecessary dimensions and prohibitive memory usage. A well-chosen state collapses the problem into something tractable.

This deep dive covers advanced DP patterns that go beyond textbook examples — the techniques that separate fluent problem-solvers from beginners.

State design principles

The state must satisfy the Markov property

The optimal decision at any state must depend only on the current state, not on how you arrived there. If your recurrence needs to “remember” the path, your state is missing a dimension.

Minimize state dimensions

Every dimension multiplies the table size. For a 2D problem with dimensions m × n, adding a third dimension k creates m × n × k states. Ask: “Can I fold this information into fewer dimensions?”

Common state patterns

Pattern	State representation	Example
Linear	`dp[i]` = answer for prefix of length i	Longest increasing subsequence
Two-sequence	`dp[i][j]` = answer for first i of seq1, first j of seq2	Edit distance
Knapsack	`dp[i][w]` = answer using first i items with capacity w	0/1 knapsack
Interval	`dp[i][j]` = answer for subarray from i to j	Matrix chain multiplication
Bitmask	`dp[mask]` = answer for subset represented by bitmask	Traveling salesman
Tree	`dp[node][state]` = answer for subtree rooted at node	Tree coloring

Advanced patterns

Edit distance (Levenshtein)

The foundation of spell checkers, DNA alignment, and fuzzy matching:

def edit_distance(s1: str, s2: str) -> int:
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],      # delete
                    dp[i][j - 1],      # insert
                    dp[i - 1][j - 1]   # replace
                )
    return dp[m][n]

Space-optimized to O(n) since each row only depends on the previous row:

def edit_distance_optimized(s1: str, s2: str) -> int:
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                curr[j] = prev[j - 1]
            else:
                curr[j] = 1 + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[n]

Longest increasing subsequence (LIS) with patience sorting

The classic O(n²) DP solution:

def lis_quadratic(arr: list[int]) -> int:
    n = len(arr)
    dp = [1] * n
    for i in range(1, n):
        for j in range(i):
            if arr[j] < arr[i]:
                dp[i] = max(dp[i], dp[j] + 1)
    return max(dp)

The O(n log n) solution using binary search (patience sorting):

import bisect

def lis_nlogn(arr: list[int]) -> int:
    tails = []  # tails[i] = smallest tail of all increasing subsequences of length i+1
    for num in arr:
        pos = bisect.bisect_left(tails, num)
        if pos == len(tails):
            tails.append(num)
        else:
            tails[pos] = num
    return len(tails)

The tails array maintains the invariant that it is always sorted, enabling binary search. Each element is processed in O(log n).

Bitmask DP: Traveling Salesman Problem

For problems involving subsets of N elements (N ≤ 20), represent subsets as bitmasks:

from functools import lru_cache

def tsp(dist: list[list[float]]) -> float:
    n = len(dist)
    ALL_VISITED = (1 << n) - 1
    
    @lru_cache(maxsize=None)
    def solve(mask: int, pos: int) -> float:
        if mask == ALL_VISITED:
            return dist[pos][0]  # return to start
        
        best = float('inf')
        for city in range(n):
            if mask & (1 << city):
                continue
            new_mask = mask | (1 << city)
            cost = dist[pos][city] + solve(new_mask, city)
            best = min(best, cost)
        return best
    
    return solve(1, 0)  # start at city 0, mark it visited

State space: O(2ⁿ × n). This is exponential but dramatically better than the brute-force O(n!) permutation check.

Interval DP: Matrix chain multiplication

Determine the optimal parenthesization to minimize scalar multiplications:

def matrix_chain_order(dimensions: list[int]) -> int:
    n = len(dimensions) - 1  # number of matrices
    dp = [[0] * n for _ in range(n)]
    
    for length in range(2, n + 1):  # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            dp[i][j] = float('inf')
            for k in range(i, j):
                cost = (dp[i][k] + dp[k + 1][j] +
                        dimensions[i] * dimensions[k + 1] * dimensions[j + 1])
                dp[i][j] = min(dp[i][j], cost)
    
    return dp[0][n - 1]

The key insight: iterate by chain length (small intervals before large ones) so that all subproblems are solved before they are needed.

DP on trees

For tree-structured problems, process children before parents (post-order):

def max_independent_set(adj: dict, root: int) -> int:
    """Maximum weight independent set on a tree."""
    dp_include = {}
    dp_exclude = {}
    
    def dfs(node: int, parent: int):
        dp_include[node] = 1  # include this node
        dp_exclude[node] = 0  # exclude this node
        
        for child in adj.get(node, []):
            if child == parent:
                continue
            dfs(child, node)
            dp_include[node] += dp_exclude[child]
            dp_exclude[node] += max(dp_include[child], dp_exclude[child])
    
    dfs(root, -1)
    return max(dp_include[root], dp_exclude[root])

Optimization techniques

Monotonic queue optimization

When the DP transition looks like dp[i] = min(dp[j] + cost(j, i)) for j in a sliding window, a monotonic deque reduces the inner loop from O(k) to O(1) amortized:

from collections import deque

def max_sum_subarray_with_constraint(arr: list[int], k: int) -> int:
    """Max subarray sum where subarray length is at most k."""
    n = len(arr)
    prefix = [0] * (n + 1)
    for i in range(n):
        prefix[i + 1] = prefix[i] + arr[i]
    
    dq = deque([0])
    result = float('-inf')
    
    for i in range(1, n + 1):
        # Remove elements outside window
        while dq and dq[0] < i - k:
            dq.popleft()
        result = max(result, prefix[i] - prefix[dq[0]])
        # Maintain monotonically increasing prefix sums
        while dq and prefix[dq[-1]] >= prefix[i]:
            dq.pop()
        dq.append(i)
    
    return result

Convex hull trick

When the DP transition has the form dp[i] = min(dp[j] + b[j] * a[i]) where a is sorted, the convex hull trick reduces the transition from O(n) to O(1) amortized. This is advanced but crucial for competitive programming problems with 10⁵+ states.

Python-specific considerations

lru_cache vs. manual table

# lru_cache: clean, Pythonic, good for top-down
@lru_cache(maxsize=None)
def dp(i, j):
    ...

# Manual table: faster for bottom-up, controllable memory
dp = [[0] * (n + 1) for _ in range(m + 1)]

For problems with > 10⁶ states, manual tables are significantly faster because lru_cache has per-call overhead from hashing and dictionary lookups.

Recursion limits

Python’s default recursion limit is 1000. For top-down DP on large inputs:

import sys
sys.setrecursionlimit(10**6)  # use with caution

Or convert to iterative bottom-up, which has no depth limit and is generally 2-5x faster in Python.

NumPy acceleration for grid DP

For 2D grid problems, NumPy can parallelize transitions:

import numpy as np

def min_path_sum_numpy(grid):
    dp = np.array(grid, dtype=np.int64)
    dp[0] = np.cumsum(dp[0])
    dp[:, 0] = np.cumsum(dp[:, 0])
    for i in range(1, dp.shape[0]):
        for j in range(1, dp.shape[1]):
            dp[i, j] += min(dp[i-1, j], dp[i, j-1])
    return int(dp[-1, -1])

For truly large grids, the row-by-row computation can use vectorized operations on each row.

Debugging DP

Print the table — visualize the DP table for small inputs. Most bugs become obvious.
Verify base cases — wrong base cases propagate errors through the entire table.
Check transition direction — bottom-up must process dependencies before dependents.
Test against brute force — for small inputs, verify DP results against exhaustive search.

def debug_dp_table(dp, row_labels=None, col_labels=None):
    import pandas as pd
    df = pd.DataFrame(dp, index=row_labels, columns=col_labels)
    print(df.to_string())

One thing to remember: The hardest part of DP is not coding the solution — it is defining the state. Spend 80% of your time on “what does dp[i][j] represent?” and the recurrence and code will follow naturally.

pythonalgorithmsoptimization