Convolutional Neural Networks — Deep Dive
The Convolution Operation
For a 2D input $I$ and filter $K$ of size $k \times k$, the convolution at position $(i,j)$ is:
$$(I * K)(i,j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, j+n) \cdot K(m,n)$$
In CNNs this is technically cross-correlation (filters aren’t flipped), but the distinction doesn’t matter in practice since filters are learned.
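The sum above can be sketched directly in plain Python. This is a minimal, unoptimized illustration that assumes a single channel, no padding, and stride 1; the function name `cross_correlate2d` is hypothetical and not taken from any library.

```python
def cross_correlate2d(image, kernel):
    """Valid (no padding, stride 1) 2D cross-correlation on nested lists."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Sum over the k x k window anchored at (i, j); the kernel is not flipped
            for m in range(k):
                for n in range(k):
                    acc += image[i + m][j + n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = cross_correlate2d(image, kernel)  # [[-4.0, -4.0], [-4.0, -4.0]]
```

Each output value here is the top-left pixel of the window minus its bottom-right pixel, which is why every entry comes out as -4 for this evenly increasing image.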
Output dimensions with input $H \times W$, filter $k \times k$, padding $p$, stride $s$:
$$H_{out} = \lfloor \frac{H + 2p - k}{s} \rfloor + 1, \quad W_{out} = \lfloor \frac{W + 2p - k}{s} \rfloor + 1$$
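As a quick check of the formula, here is a small helper (the name `conv_output_size` is an illustrative assumption) that computes one spatial dimension. Integer floor division plays the role of the floor brackets.

```python
def conv_output_size(size, k, p=0, s=1):
    """Output length along one spatial axis for kernel k, padding p, stride s."""
    return (size + 2 * p - k) // s + 1


same_size = conv_output_size(32, k=3, p=1, s=1)     # 32: 3x3 kernel with padding 1 preserves size
downsampled = conv_output_size(224, k=7, p=3, s=2)  # 112: 7x7 kernel, stride 2 halves the size
```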
For a conv layer with $C_{in}$ input channels and $C_{out}$ output filters of size $k \times k$:
- Parameters: $C_{out} \times C_{in} \times k \times k + C_{out}$ (weights + biases)
- FLOPs: $2 \times C_{out} \times C_{in} \times k^2 \times H_{out} \times W_{out}$
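The two counts above are simple enough to compute by hand, but a short sketch makes the bookkeeping explicit. The function names are hypothetical, and the FLOP count follows the convention used above of counting each multiply-add as two operations and ignoring the bias add.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Weights plus (optionally) one bias per output filter."""
    weights = c_out * c_in * k * k
    return weights + (c_out if bias else 0)


def conv_flops(c_in, c_out, k, h_out, w_out):
    """Two operations (multiply and add) per weight per output position."""
    return 2 * c_out * c_in * k * k * h_out * w_out


# Example: a 3x3 conv from 3 to 64 channels producing a 32x32 output map
params = conv_params(3, 64, 3)         # 1,792
flops = conv_flops(3, 64, 3, 32, 32)   # 3,538,944
```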
Receptive Fields
The receptive field of a neuron in layer $l$ is the region of input pixels that can influence its activation. For a single conv layer with kernel size $k$ and stride 1, the receptive field is $k \times k$.
With $L$ stacked layers of kernel size $k$, the theoretical receptive field grows as $1 + L(k-1)$. With stride $s > 1$ or pooling, it grows faster.
This creates a design tension: networks need large receptive fields to see large objects, but very deep stacks of small kernels were difficult to train before ResNet-style skip connections.
Dilated (atrous) convolutions solve this efficiently. A dilation rate $d$ inserts $d-1$ zeros between filter elements, expanding the receptive field without adding parameters:
$$(I *_d K)(i,j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i + dm, j + dn) \cdot K(m,n)$$
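With dilation rates $\{1, 2, 4, 8\}$ across successive stride-1 layers, each with kernel size 3, the receptive field grows as $1 + \sum_l d_l(k-1)$, which gives $1 + 2(1 + 2 + 4 + 8) = 31$ pixels per side, compared with 9 for four undilated layers. DeepLab (Google’s semantic segmentation architecture) uses dilated convolutions extensively to keep feature maps at a higher resolution than a strided backbone would.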
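Receptive field growth can be traced layer by layer with the standard recurrence: each layer adds $d(k-1)$ input steps, scaled by the product of all earlier strides. The sketch below assumes square kernels and one value per side; `receptive_field` and the `(kernel, stride, dilation)` tuple layout are illustrative choices, not a library API.

```python
def receptive_field(layers):
    """Receptive field (pixels per side) of a stack of conv layers.

    layers: list of (kernel, stride, dilation) tuples, ordered input to output.
    """
    rf, jump = 1, 1  # jump = spacing, in input pixels, between adjacent outputs
    for k, s, d in layers:
        rf += d * (k - 1) * jump
        jump *= s
    return rf


plain = receptive_field([(3, 1, 1)] * 4)                          # 9
dilated = receptive_field([(3, 1, d) for d in (1, 2, 4, 8)])      # 31
strided = receptive_field([(3, 2, 1), (3, 2, 1), (3, 1, 1)])      # 15
```

Four plain 3×3 layers reach only 9 pixels per side, while the same four layers with dilations 1, 2, 4, 8 reach 31 with an identical parameter count.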
Depthwise Separable Convolutions
Standard convolutions mix spatial filtering and channel mixing simultaneously. Depthwise separable convolutions split these into two steps:
- Depthwise convolution: Apply one $k \times k$ filter per input channel independently (spatial mixing, no channel mixing)
- Pointwise convolution: Apply $1 \times 1$ filters across channels (channel mixing, no spatial mixing)
Parameter reduction: from $C_{out} \times C_{in} \times k^2$ to $C_{in} \times k^2 + C_{out} \times C_{in}$.
For $k=3$, $C_{in} = C_{out} = 256$: standard needs 589,824 params; depthwise separable needs $2{,}304 + 65{,}536 = 67{,}840$ params (~8.7x reduction). FLOP savings are similar.
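The arithmetic can be reproduced with a few lines of Python. This sketch follows the two formulas above and omits biases; the function names are hypothetical.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (biases omitted)."""
    return c_out * c_in * k * k


def separable_conv_params(c_in, c_out, k):
    """Weights of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = c_in * k * k    # one k x k filter per input channel
    pointwise = c_out * c_in    # 1 x 1 filters that mix channels
    return depthwise + pointwise


standard = standard_conv_params(256, 256, 3)     # 589,824
separable = separable_conv_params(256, 256, 3)   # 67,840
reduction = standard / separable                 # about 8.7
```

Most of the remaining cost sits in the pointwise step, so the savings approach $k^2$ as the channel count grows.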
MobileNet (Howard et al., 2017) is built almost entirely on depthwise separable convolutions, reaching accuracy competitive with much larger models at a fraction of the compute, which made it practical on mobile devices. MobileNetV2 added inverted residuals and linear bottlenecks. MobileNetV3 used neural architecture search to optimize the design further.
Residual Connections
ResNet (He et al., 2015) introduced the residual block:
$$y = F(x, \{W_i\}) + x$$
Where $F$ is the residual mapping (typically two conv-BN-ReLU layers) and the $+x$ is the identity skip connection. The key insight: it’s easier to learn $F(x) = H(x) - x$ (the residual) than to directly learn the desired mapping $H(x)$.
In practice, this greatly mitigates the vanishing gradient problem in very deep networks. The skip connection provides a gradient highway:
$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + 1\right)$$
The constant 1 ensures gradients can flow even when $\partial F / \partial x \approx 0$.
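A toy scalar example shows the effect of the $+1$ term across many layers. The sketch below assumes a stand-in residual branch $F(x) = w \tanh(x)$ with a small weight $w$; this branch is purely hypothetical and only chosen so that $\partial F / \partial x$ is small, as in the vanishing-gradient regime.

```python
import math


def stacked_input_gradient(x, w, depth, use_skip=True):
    """Gradient of the final output with respect to the input of a scalar stack.

    Each layer computes F(x) = w * tanh(x), a hypothetical stand-in for the
    residual branch, with or without the identity skip connection.
    """
    grad = 1.0
    for _ in range(depth):
        local = w * (1.0 - math.tanh(x) ** 2)  # dF/dx at the current input
        if use_skip:
            grad *= local + 1.0                # chain rule: dF/dx + 1
            x = w * math.tanh(x) + x
        else:
            grad *= local                      # chain rule: dF/dx only
            x = w * math.tanh(x)
    return grad


plain_grad = stacked_input_gradient(0.5, w=0.1, depth=20, use_skip=False)
skip_grad = stacked_input_gradient(0.5, w=0.1, depth=20, use_skip=True)
```

Without the skip, every layer multiplies the gradient by at most 0.1, so after 20 layers it is no larger than about $10^{-20}$. With the skip, every factor is at least 1, so the gradient reaching the input never shrinks.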
Bottleneck residual blocks use $1 \times 1$ → $3 \times 3$ → $1 \times 1$ convolution sequences, reducing computation in the middle $3 \times 3$ layer by first projecting to fewer channels.
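To get a feel for the savings, the sketch below compares weight counts for a block of two $3 \times 3$ convolutions with a bottleneck block at the same outer width. The channel counts (256 outer, 64 inner) are an assumption chosen for illustration, biases and normalization parameters are ignored, and the function names are hypothetical.

```python
def conv_weights(c_in, c_out, k):
    """Weights of a k x k convolution (biases omitted)."""
    return c_out * c_in * k * k


def basic_block_weights(channels):
    """Two 3x3 convolutions at full width."""
    return 2 * conv_weights(channels, channels, 3)


def bottleneck_block_weights(channels, reduced):
    """1x1 reduce, 3x3 at reduced width, 1x1 expand."""
    return (conv_weights(channels, reduced, 1)
            + conv_weights(reduced, reduced, 3)
            + conv_weights(reduced, channels, 1))


basic = basic_block_weights(256)                 # 1,179,648
bottleneck = bottleneck_block_weights(256, 64)   # 69,632
```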
Dense connections (DenseNet, Huang et al., 2017) extend this: each layer receives feature maps from all preceding layers:
$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$
This improves gradient flow further and encourages feature reuse, achieving strong accuracy with far fewer parameters than ResNet.
Batch Normalization in CNNs
Batch normalization (Ioffe & Szegedy, 2015) is typically applied after the convolution and before the activation. For each channel, statistics are computed across the batch dimension and all spatial positions:
$$\hat{x}_{i,j,c} = \frac{x_{i,j,c} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}$$
$$y_{i,j,c} = \gamma_c \hat{x}_{i,j,c} + \beta_c$$
Where $\mu_c$ and $\sigma_c^2$ are computed over the batch and spatial dimensions for channel $c$. Learnable $\gamma_c$ and $\beta_c$ allow the network to undo normalization if needed.
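The per-channel statistics are easy to get wrong, so here is a minimal training-mode sketch in plain Python. It assumes the input is a nested list indexed as `x[n][c][i][j]` (batch, channel, row, column) and uses the biased variance; the function name `batch_norm_2d` is hypothetical, and running statistics for inference are left out.

```python
def batch_norm_2d(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over nested lists indexed x[n][c][i][j]."""
    n_batch, n_ch = len(x), len(x[0])
    out = [[None] * n_ch for _ in range(n_batch)]
    for c in range(n_ch):
        # Pool every value of channel c across the batch and all spatial positions
        values = [v for n in range(n_batch) for row in x[n][c] for v in row]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        scale = gamma[c] / (var + eps) ** 0.5
        for n in range(n_batch):
            out[n][c] = [[(v - mu) * scale + beta[c] for v in row]
                         for row in x[n][c]]
    return out


# Two samples, one channel, a 1x2 feature map each
x = [[[[1.0, 2.0]]], [[[3.0, 4.0]]]]
y = batch_norm_2d(x, gamma=[1.0], beta=[0.0])
flat = [v for sample in y for row in sample[0] for v in row]
```

With `gamma=1` and `beta=0`, the values in `flat` have mean close to 0 and variance close to 1, since the statistics are shared across both samples.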
BatchNorm’s practical effects:
- Allows much higher learning rates (without it, gradients explode or vanish faster)
- Acts as a regularizer (the noise from mini-batch statistics reduces overfitting slightly)
- Reduces sensitivity to initialization
The downside: BatchNorm behaves differently at train vs. inference time (it uses running statistics at inference), and its batch statistics become noisy when training with very small batches. In that regime, Group Normalization or Layer Normalization (which normalize within each sample rather than across the batch) is often preferred.
Comparison: CNNs vs. Vision Transformers
The Vision Transformer (ViT, Dosovitskiy et al., 2020) showed that pure attention could match CNN performance on ImageNet — given enough data. The comparison:
| Property | CNN | ViT |
|---|---|---|
| Inductive bias | Strong (locality, translation equivariance) | Weak |
| Data efficiency | High — works on small datasets | Low — needs large datasets or pretraining |
| Receptive field | Local, grows with depth | Global from layer 1 |
| Computational scaling | $O(n)$ in the number of pixels $n$ | $O(n^2)$ in the number of tokens $n$ with standard attention |
| Transfer learning | Good | Excellent (very large pretrained models transfer well) |
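
The scaling row can be made concrete with a rough count. The sketch below assumes a patch size of 16 (a common ViT setting, used here only for illustration) and counts token pairs as a proxy for standard attention cost; the function names are hypothetical.

```python
def num_tokens(image_size, patch_size=16):
    """Number of patch tokens for a square image (class token ignored)."""
    return (image_size // patch_size) ** 2


def attention_pairs(image_size, patch_size=16):
    """Token pairs scored by standard self-attention, per head and per layer."""
    n = num_tokens(image_size, patch_size)
    return n * n


tokens_224 = num_tokens(224)        # 196
tokens_448 = num_tokens(448)        # 784
pairs_224 = attention_pairs(224)    # 38,416
pairs_448 = attention_pairs(448)    # 614,656
```

Doubling the image side quadruples the token count but multiplies the attention pair count by 16, whereas the cost of a convolution grows only with the number of output positions, that is, by 4.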
Hybrid models (e.g., CvT, CoAtNet) use convolutions in early stages (where spatial locality matters) and attention in later stages (where global context matters). These often achieve the best results.
Efficient attention variants (Swin Transformer uses windowed attention; MobileViT mixes local conv with global attention) have reduced ViT’s quadratic complexity, making transformers competitive even on resource-constrained hardware.
Modern Training Techniques
Strong augmentation regimes dramatically affect final accuracy:
- CutMix: Paste patches from one image onto another with mixed labels
- MixUp: Linearly interpolate between two training examples and their labels
- RandAugment: Apply a few randomly chosen augmentations at a shared magnitude, replacing a learned policy with two tunable hyperparameters
- AugReg (Steiner et al., 2021): Augmentation and regularization tuned together
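
Of the techniques listed, MixUp is the simplest to sketch. The version below works on flat lists of floats with one-hot labels and draws the mixing weight from a Beta(alpha, alpha) distribution; the default `alpha=0.2` and the function name are assumptions for illustration, not a recommended setting.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with a shared weight lam."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)  # seeded generator so the example is repeatable
x_mix, y_mix, lam = mixup([0.0, 1.0], [1.0, 0.0],
                          [1.0, 0.0], [0.0, 1.0], rng=rng)
```

The mixed label stays a valid probability vector because both inputs are weighted by `lam` and `1 - lam`. CutMix follows the same idea but swaps a rectangular patch instead of blending every pixel.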
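EfficientNetV2 (Tan & Le, 2021) combined training-aware architecture search with progressive learning, which increases image resolution and regularization strength together during training, and showed that this training-aware approach could train faster and reach better accuracy than its purely architecture-scaled predecessor, EfficientNet.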
Hardware Considerations
Modern GPUs and TPUs have specialized hardware for convolutions:
- cuDNN’s implicit GEMM algorithm converts conv to matrix multiplication, leveraging tensor cores
- Winograd convolution reduces arithmetic for 3×3 kernels at the cost of additional memory
- For mobile: kernels hand-tuned with ARM’s NEON SIMD instructions are commonly used to accelerate depthwise separable convolutions
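
The idea of lowering a convolution to a matrix multiplication can be shown with an explicit `im2col` step. Note that this is the explicit variant, which materializes the patch matrix; implicit GEMM avoids building that matrix, but the arithmetic is the same. The sketch assumes a single channel, no padding, and stride 1, and the function names are hypothetical.

```python
def im2col(image, k):
    """Unroll every k x k window of a 2D image into one row of a matrix."""
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    cols = []
    for i in range(out_h):
        for j in range(out_w):
            cols.append([image[i + m][j + n]
                         for m in range(k) for n in range(k)])
    return cols, out_h, out_w


def conv_as_gemm(image, kernel):
    """Cross-correlation computed as (patch matrix) x (flattened kernel)."""
    k = len(kernel)
    cols, out_h, out_w = im2col(image, k)
    flat_kernel = [kernel[m][n] for m in range(k) for n in range(k)]
    flat_out = [sum(a * b for a, b in zip(row, flat_kernel)) for row in cols]
    # Reshape the flat result back into an out_h x out_w map
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
gemm_result = conv_as_gemm(image, kernel)  # [[-4, -4], [-4, -4]]
```

With many output filters, the flattened kernels form the columns of a second matrix, and the whole layer becomes one large matrix product that maps well onto matrix-multiply hardware.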
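At inference time, quantization (INT8 instead of FP32) and pruning (zeroing small weights) can, in favorable cases, yield 4–8x speedups with <1% accuracy loss, although the actual gain depends on the model and on hardware support for low-precision or sparse arithmetic. Tools like TensorRT, TFLite, and ONNX Runtime automate much of this workflow.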
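As a rough illustration of what INT8 quantization does to a weight tensor, the sketch below uses symmetric per-tensor quantization with a single scale factor. This is one simple scheme chosen for clarity, not a description of how any of the tools above implement it, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)       # q == [50, -127, 0, 100]
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is at most half the scale, which is why small weights such as 0.003 collapse to zero. Real toolchains typically refine this with per-channel scales and calibration data.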
One thing to remember: CNNs dominate wherever local structure matters, but the architectural innovations that made them trainable (residual connections, batch normalization, depthwise separability) are now principles applied across all of deep learning, not just image models.