Convolution Operations — Deep Dive

Advanced convolution in Python: FFT-based methods, depthwise and dilated convolutions, im2col optimization, and building custom convolution layers from scratch.

Computational complexity analysis

Direct 1D convolution of a signal of length N with a kernel of length K costs O(NK) multiplications and additions. For 2D convolution of an H×W image with a K×K kernel, the cost is O(HWK²).

When K is large (above ~50 for 1D, or ~11 for 2D), FFT-based convolution wins. The cost becomes O(N log N) regardless of K, because convolution in the time domain equals element-wise multiplication in the frequency domain.

from scipy.signal import fftconvolve, convolve

import numpy as np
signal = np.random.randn(100_000)
kernel = np.random.randn(5_000)

# FFT method — O(N log N)
result_fft = fftconvolve(signal, kernel, mode='same')

# Direct method — O(NK), much slower for large kernels
# result_direct = convolve(signal, kernel, mode='same')

SciPy’s choose_conv_method picks automatically, but explicit choice avoids the overhead of the heuristic for performance-critical code.

The im2col trick

Deep learning frameworks convert 2D convolution into matrix multiplication using the image to column transformation. For each position where the kernel is applied, a column is extracted containing all elements under the kernel. These columns form a large matrix that is multiplied by the flattened kernel:

def im2col(image, kh, kw, stride=1):
    """Convert image patches to columns for matrix-based convolution."""
    h, w = image.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    
    cols = np.zeros((kh * kw, out_h * out_w))
    idx = 0
    for i in range(0, h - kh + 1, stride):
        for j in range(0, w - kw + 1, stride):
            patch = image[i:i+kh, j:j+kw].flatten()
            cols[:, idx] = patch
            idx += 1
    return cols

def conv2d_im2col(image, kernel, stride=1):
    """2D convolution via im2col + matrix multiply."""
    kh, kw = kernel.shape
    cols = im2col(image, kh, kw, stride)
    result_flat = kernel.flatten() @ cols
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    return result_flat.reshape(out_h, out_w)

This approach lets GPUs use their highly optimized GEMM (General Matrix Multiply) routines. The tradeoff is memory: im2col expands the input by a factor of K² (a 3×3 kernel means 9× memory expansion).

Winograd convolution

For small kernels (3×3 is the most common in CNNs), Winograd’s algorithm reduces the number of multiplications by trading them for cheaper additions. For 3×3 kernels with stride 1, Winograd achieves ~2.25× fewer multiplications compared to direct computation.

Modern deep learning compilers (TVM, TensorRT) automatically apply Winograd when beneficial. In Python, you get this for free when using PyTorch or TensorFlow with cuDNN.

Depthwise separable convolution

Standard convolution with C_in input channels, C_out output channels, and K×K kernel costs O(H × W × K² × C_in × C_out). Depthwise separable convolution splits this into two steps:

Depthwise — one K×K kernel per input channel: O(H × W × K² × C_in)
Pointwise — a 1×1 convolution to mix channels: O(H × W × C_in × C_out)

Total cost reduction: K²/(K² + C_out). For a 3×3 kernel and 256 output channels, that is ~8–9× fewer operations. MobileNet and EfficientNet use this extensively.

import torch
import torch.nn as nn

# Standard convolution: 256 → 256, 3×3
standard = nn.Conv2d(256, 256, 3, padding=1)
# Parameters: 256 * 256 * 3 * 3 = 589,824

# Depthwise separable
depthwise = nn.Conv2d(256, 256, 3, padding=1, groups=256)  # groups=C_in
pointwise = nn.Conv2d(256, 256, 1)
# Parameters: 256 * 3 * 3 + 256 * 256 = 2,304 + 65,536 = 67,840

Dilated (atrous) convolution

Dilated convolution inserts gaps between kernel elements, expanding the receptive field without increasing parameters or reducing resolution:

# Standard 3×3 kernel sees a 3×3 region
conv_standard = nn.Conv2d(64, 64, 3, padding=1, dilation=1)

# Dilated 3×3 with dilation=2 sees a 5×5 region
conv_dilated = nn.Conv2d(64, 64, 3, padding=2, dilation=2)

# Dilated with dilation=4 sees a 9×9 region
conv_wide = nn.Conv2d(64, 64, 3, padding=4, dilation=4)

Stacking dilated convolutions with exponentially increasing dilation (1, 2, 4, 8, 16) gives a receptive field that grows exponentially while parameters grow linearly. WaveNet uses this for audio generation.

Transposed convolution (deconvolution)

Transposed convolution upsamples feature maps — the “reverse” of strided convolution. It is used in:

Image segmentation (U-Net decoder)
Super-resolution
Generative models (GAN generators)

# Upsample 2× using transposed convolution
upsample = nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1)
# Input: (batch, 256, 16, 16) → Output: (batch, 128, 32, 32)

Transposed convolution can produce checkerboard artifacts due to uneven overlap. Alternatives: nearest-neighbor upsampling followed by a regular convolution, or sub-pixel convolution (pixel shuffle).

Grouped convolution

Groups partition input and output channels into independent groups, each convolved separately. AlexNet introduced this out of necessity (to split across two GPUs); ResNeXt showed it improves accuracy.

# 4 groups: 256 channels split into 4 groups of 64
grouped = nn.Conv2d(256, 256, 3, padding=1, groups=4)
# Parameters: 4 * (64 * 64 * 3 * 3) = 147,456 (vs 589,824 standard)

Building a minimal CNN from scratch with NumPy

def conv_forward(X, W, b, stride=1, pad=0):
    """
    X: input (N, C_in, H, W)
    W: filters (C_out, C_in, KH, KW)
    b: biases (C_out,)
    """
    N, C_in, H, W_in = X.shape
    C_out, _, KH, KW = W.shape
    
    # Pad input
    X_pad = np.pad(X, ((0,0), (0,0), (pad,pad), (pad,pad)))
    
    H_out = (H + 2*pad - KH) // stride + 1
    W_out = (W_in + 2*pad - KW) // stride + 1
    out = np.zeros((N, C_out, H_out, W_out))
    
    for n in range(N):
        for f in range(C_out):
            for i in range(H_out):
                for j in range(W_out):
                    h_start = i * stride
                    w_start = j * stride
                    receptive_field = X_pad[n, :, h_start:h_start+KH, w_start:w_start+KW]
                    out[n, f, i, j] = np.sum(receptive_field * W[f]) + b[f]
    return out

This is O(N × C_out × H_out × W_out × C_in × KH × KW) — the naive approach. Production code uses im2col + GEMM or Winograd.

Profiling convolution in PyTorch

import torch
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1),
    nn.ReLU(),
)

x = torch.randn(16, 3, 224, 224, device='cuda')

with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
    for _ in range(10):
        model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

This reveals whether cuDNN selected Winograd, FFT, or implicit GEMM for each layer — and whether memory format (NCHW vs NHWC) matters for your hardware.

One thing to remember: Convolution is fundamentally a matrix operation — understanding im2col, Winograd, and FFT-based approaches explains why different kernel sizes and channel counts have such different performance characteristics on real hardware.

pythonmathsignal-processingdeep-learningperformance