Convolutional Neural Networks — Deep Dive

Full technical treatment of CNNs: receptive fields, dilated convolutions, depthwise separable convolutions, batch norm interaction, and the shift toward Vision Transformers.

The Convolution Operation

For a 2D input $I$ and filter $K$ of size $k \times k$, the convolution at position $(i,j)$ is:

$$(I * K)(i,j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i+m, j+n) \cdot K(m,n)$$

In CNNs this is technically cross-correlation (filters aren’t flipped), but the distinction doesn’t matter in practice since filters are learned.
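To make the sum concrete, here is a minimal pure-Python sketch of the operation, assuming no padding, stride 1, and a square kernel. The function names `cross_correlate2d` and `flip_kernel` are illustrative, not taken from any library. It also shows the flipped-kernel variant, which is what a textbook convolution would compute.

```python
def cross_correlate2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation on lists of lists."""
    H, W = len(image), len(image[0])
    k = len(kernel)  # assumes a square k x k kernel
    out = []
    for i in range(H - k + 1):
        row = []
        for j in range(W - k + 1):
            acc = 0
            for m in range(k):
                for n in range(k):
                    acc += image[i + m][j + n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out


def flip_kernel(kernel):
    """Rotate the kernel by 180 degrees; correlating with it is a true convolution."""
    return [row[::-1] for row in kernel[::-1]]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

corr = cross_correlate2d(image, kernel)               # [[-4, -4], [-4, -4]]
conv = cross_correlate2d(image, flip_kernel(kernel))  # [[4, 4], [4, 4]]
```

The two results differ only because this kernel is not symmetric. A learned filter would simply converge to the flipped weights, which is why the distinction rarely matters.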

Output dimensions with input $H \times W$, filter $k \times k$, padding $p$, stride $s$:

$$H_{out} = \lfloor \frac{H + 2p - k}{s} \rfloor + 1, \quad W_{out} = \lfloor \frac{W + 2p - k}{s} \rfloor + 1$$
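The output-size formula is easy to check with a small helper. This is a sketch with a hypothetical function name; the layer settings in the examples are common choices picked for illustration.

```python
def conv_output_size(size, k, p=0, s=1):
    """Output length along one axis: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1


# 3x3 kernel with padding 1 and stride 1 preserves the size.
same = conv_output_size(32, k=3, p=1, s=1)      # 32
# 7x7 kernel, padding 3, stride 2 halves a 224-pixel axis.
halved = conv_output_size(224, k=7, p=3, s=2)   # 112
# The floor matters when the stride does not divide evenly.
floored = conv_output_size(6, k=3, p=0, s=2)    # 2
```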

For a conv layer with $C_{in}$ input channels and $C_{out}$ output filters of size $k \times k$:

Parameters: $C_{out} \times C_{in} \times k \times k + C_{out}$ (weights + biases)
FLOPs: $2 \times C_{out} \times C_{in} \times k^2 \times H_{out} \times W_{out}$
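As a quick sanity check, the two counts can be computed directly. The helper names are hypothetical, and the example layer (3 input channels, 64 filters of size 3, a 224×224 output) is an assumed configuration chosen for illustration. The factor of 2 counts each multiply-add as two operations, as in the formula above.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Weights plus optional biases for a standard conv layer."""
    return c_out * c_in * k * k + (c_out if bias else 0)


def conv_flops(c_in, c_out, k, h_out, w_out):
    """Multiply-adds counted as 2 FLOPs each; bias additions are ignored."""
    return 2 * c_out * c_in * k * k * h_out * w_out


params = conv_params(c_in=3, c_out=64, k=3)                      # 1792
flops = conv_flops(c_in=3, c_out=64, k=3, h_out=224, w_out=224)  # 173408256
```

Note that the parameter count is independent of the input resolution, while FLOPs scale linearly with the number of output positions.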

Receptive Fields

The receptive field of a neuron in layer $l$ is the region of input pixels that can influence its activation. For a single conv layer with kernel size $k$ and stride 1, the receptive field is $k \times k$.

With $L$ stacked layers of kernel size $k$, the theoretical receptive field grows as $1 + L(k-1)$. With stride $s > 1$ or pooling, it grows faster.
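The effect of stride can be made precise with a standard recurrence: each layer adds $(k-1)$ times the current spacing between adjacent outputs (measured in input pixels), and that spacing is multiplied by the layer's stride. The sketch below assumes layers are described only by kernel size and stride; the function name is illustrative.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered input to output."""
    rf = 1    # a single input pixel sees itself
    jump = 1  # spacing, in input pixels, between adjacent outputs of the current layer
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf


# Three stacked 3x3 convs, stride 1: matches 1 + L(k - 1) = 7.
stacked = receptive_field([(3, 1), (3, 1), (3, 1)])       # 7
# Two 3x3 convs without pooling.
no_pool = receptive_field([(3, 1), (3, 1)])               # 5
# Same two convs with a 2x2, stride-2 pooling layer in between.
with_pool = receptive_field([(3, 1), (2, 2), (3, 1)])     # 8
```

Everything after a strided layer contributes twice as many input pixels per kernel tap, which is why downsampling grows the receptive field so quickly.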

This creates a design tension: networks need large receptive fields to see large objects, but very deep stacks of small kernels were hard to train before ResNet-style skip connections.

Dilated (atrous) convolutions solve this efficiently. A dilation rate $d$ inserts $d-1$ zeros between filter elements, expanding the receptive field without adding parameters:

$$(I *_d K)(i,j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I(i + dm, j + dn) \cdot K(m,n)$$

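With dilation rates $\{1, 2, 4, 8\}$ across four layers, each with kernel size 3, the receptive field is $1 + \sum_l d_l(k-1) = 1 + 2(1 + 2 + 4 + 8) = 31$, compared with 9 for four undilated layers. DeepLab (Google’s semantic segmentation architecture) uses dilated convolutions extensively to keep feature maps at a higher resolution than a conventionally strided backbone would.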
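The dilated sum can be sketched in a few lines of pure Python, assuming no padding, stride 1, and a square kernel. The function name is hypothetical. The kernel is never actually filled with zeros; the input is simply sampled every $d$ pixels.

```python
def dilated_correlate2d(image, kernel, d=1):
    """Valid, stride-1 dilated cross-correlation on lists of lists."""
    H, W = len(image), len(image[0])
    k = len(kernel)
    span = d * (k - 1) + 1  # effective extent of the dilated kernel
    out = []
    for i in range(H - span + 1):
        row = []
        for j in range(W - span + 1):
            acc = 0
            for m in range(k):
                for n in range(k):
                    acc += image[i + d * m][j + d * n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out


# 5x5 image holding the values 0..24 in row-major order.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
# 3x3 kernel that adds the top-left and bottom-right taps.
kernel = [[1, 0, 0],
          [0, 0, 0],
          [0, 0, 1]]

dense = dilated_correlate2d(image, kernel, d=1)   # 3x3 output, taps 2 pixels apart
sparse = dilated_correlate2d(image, kernel, d=2)  # 1x1 output, taps 4 pixels apart
```

With `d=2` the same nine weights cover a 5×5 window, so `sparse` is `[[24]]` (pixel 0 plus pixel 24).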

Depthwise Separable Convolutions

Standard convolutions mix spatial filtering and channel mixing simultaneously. Depthwise separable convolutions split these into two steps:

Depthwise convolution: Apply one $k \times k$ filter per input channel independently (spatial mixing, no channel mixing)
Pointwise convolution: Apply $1 \times 1$ filters across channels (channel mixing, no spatial mixing)
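The two steps can be sketched in pure Python. This is a minimal illustration, assuming a channels-first layout (`x[c][i][j]`), no padding, and stride 1; all function names are hypothetical.

```python
def correlate2d(plane, kernel):
    """Valid, stride-1 cross-correlation of one channel with one k x k kernel."""
    k = len(kernel)
    H, W = len(plane), len(plane[0])
    return [[sum(plane[i + m][j + n] * kernel[m][n]
                 for m in range(k) for n in range(k))
             for j in range(W - k + 1)]
            for i in range(H - k + 1)]


def depthwise(x, dw_kernels):
    """One spatial filter per input channel; channels never mix."""
    return [correlate2d(plane, kern) for plane, kern in zip(x, dw_kernels)]


def pointwise(x, pw_weights):
    """1x1 conv: pw_weights[o][c] mixes input channel c into output channel o."""
    H, W = len(x[0]), len(x[0][0])
    return [[[sum(w * x[c][i][j] for c, w in enumerate(w_o))
              for j in range(W)]
             for i in range(H)]
            for w_o in pw_weights]


def depthwise_separable(x, dw_kernels, pw_weights):
    return pointwise(depthwise(x, dw_kernels), pw_weights)


ones3 = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
x = [[[1] * 3 for _ in range(3)],   # channel 0: all ones
     [[2] * 3 for _ in range(3)]]   # channel 1: all twos
y = depthwise_separable(x, [ones3, ones3], [[1, 1], [1, -1]])
# Depthwise gives 9 and 18; pointwise mixes them into 27 and -9.
```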

Parameter reduction: from $C_{out} \times C_{in} \times k^2$ to $C_{in} \times k^2 + C_{out} \times C_{in}$.

For $k=3$, $C_{in} = C_{out} = 256$: standard needs $256 \times 256 \times 9 = 589{,}824$ params; depthwise separable needs $256 \times 9 + 256 \times 256 = 67{,}840$ params (~8.7x reduction). FLOP savings are similar.
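The parameter formulas above can be evaluated for other kernel sizes to see how the saving scales. This is a small sketch with hypothetical helper names; biases are ignored, as in the formulas. Algebraically the reduction factor is $C_{out} k^2 / (k^2 + C_{out})$, which approaches $k^2$ when $C_{out}$ is much larger than $k^2$.

```python
def standard_params(c_in, c_out, k):
    """Weights of a standard conv layer (no biases)."""
    return c_out * c_in * k * k


def separable_params(c_in, c_out, k):
    """Depthwise weights plus pointwise weights (no biases)."""
    return c_in * k * k + c_out * c_in


def reduction_factor(c_in, c_out, k):
    return standard_params(c_in, c_out, k) / separable_params(c_in, c_out, k)


for k in (3, 5, 7):
    print(f"k={k}: {standard_params(256, 256, k)} vs "
          f"{separable_params(256, 256, k)} "
          f"({reduction_factor(256, 256, k):.2f}x)")
```

Larger kernels benefit more, since the expensive $k^2$ term is paid once per input channel instead of once per input-output channel pair.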

MobileNet (Howard et al., 2017) is built almost entirely on depthwise separable convolutions (only the first layer is a standard convolution), giving competitive accuracy at a fraction of the compute and making it practical on mobile devices. MobileNetV2 added inverted residuals and linear bottlenecks. MobileNetV3 used neural architecture search to optimize the design further.

Residual Connections

ResNet (He et al., 2015) introduced the residual block:

$$y = F(x, \{W_i\}) + x$$

Where $F$ is the residual mapping (typically two conv-BN-ReLU layers) and the $+x$ is the identity skip connection. The key insight: it’s easier to learn $F(x) = H(x) - x$ (the residual) than to directly learn the desired mapping $H(x)$.

In practice, this greatly mitigates the vanishing gradient problem in very deep networks. The skip connection provides a gradient highway:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + 1\right)$$

The constant 1 ensures gradients can flow even when $\partial F / \partial x \approx 0$.
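A scalar toy example makes the gradient highway visible. The sketch below assumes a made-up residual branch $F(x) = w \tanh(x)$ with a tiny weight, so that $\partial F / \partial x \approx 0$, and compares finite-difference gradients with and without the skip connection. All names and values are illustrative.

```python
import math


def F(x, w):
    """Toy residual branch; a stand-in for the conv-BN-ReLU stack."""
    return w * math.tanh(x)


def dF_dx(x, w):
    return w * (1.0 - math.tanh(x) ** 2)


def residual_block(x, w):
    return F(x, w) + x


def numerical_grad(f, x, eps=1e-6):
    """Central finite difference."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x, w = 0.5, 1e-4  # tiny w, so the branch passes almost no gradient

plain_grad = numerical_grad(lambda v: F(v, w), x)              # close to 0
skip_grad = numerical_grad(lambda v: residual_block(v, w), x)  # close to 1
```

Here `skip_grad` matches `dF_dx(x, w) + 1` up to numerical error: the branch contributes almost nothing, and the identity path carries the gradient through.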

Bottleneck residual blocks use $1 \times 1$ → $3 \times 3$ → $1 \times 1$ convolution sequences, reducing computation in the middle $3 \times 3$ layer by first projecting to fewer channels.
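A rough weight count shows why the bottleneck is cheaper. The sketch assumes a 256-channel block that projects down to 64 channels in the middle; these widths are illustrative choices, and biases and BatchNorm parameters are ignored. The helper names are hypothetical.

```python
def conv_weights(c_in, c_out, k):
    return c_in * c_out * k * k


def basic_block_weights(channels):
    """Two 3x3 convs at full width."""
    return 2 * conv_weights(channels, channels, 3)


def bottleneck_weights(channels, reduced):
    """1x1 reduce, 3x3 at reduced width, 1x1 expand."""
    return (conv_weights(channels, reduced, 1)
            + conv_weights(reduced, reduced, 3)
            + conv_weights(reduced, channels, 1))


basic = basic_block_weights(256)             # 1179648
bottleneck = bottleneck_weights(256, 64)     # 69632
```

Under these assumptions the bottleneck uses roughly 17x fewer weights, because the expensive $3 \times 3$ layer only ever sees 64 channels.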

Dense connections (DenseNet, Huang et al., 2017) extend this: each layer receives feature maps from all preceding layers:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

This improves gradient flow further and encourages feature reuse, achieving strong accuracy with far fewer parameters than ResNet.
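The concatenation pattern can be sketched with plain lists standing in for feature maps. This is a toy illustration: each "feature map" is a list of per-channel scalars, and the layer functions are placeholders rather than real conv-BN-ReLU stacks. The names `dense_block` and `dense_input_channels`, and the widths used, are assumptions for the example.

```python
def dense_block(x0, layers):
    """Each layer H_l sees the concatenation of all earlier outputs."""
    features = [x0]
    for H_l in layers:
        concatenated = [v for f in features for v in f]  # channel-wise concat
        features.append(H_l(concatenated))
    return [v for f in features for v in f]


def dense_input_channels(c0, growth_rate, num_layers):
    """Input width seen by each layer when every layer emits growth_rate channels."""
    return [c0 + l * growth_rate for l in range(num_layers)]


# Toy layers that each emit one channel: the sum of everything they receive.
toy_layers = [lambda feats: [sum(feats)], lambda feats: [sum(feats)]]
out = dense_block([1, 2], toy_layers)         # [1, 2, 3, 6]

widths = dense_input_channels(64, 32, 4)      # [64, 96, 128, 160]
```

Because earlier features are passed along unchanged, each layer only needs to add a small number of new channels, which is where the parameter savings come from.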

Batch Normalization in CNNs

Batch normalization (Ioffe & Szegedy, 2015) is typically applied after the convolution and before the activation. For each channel, statistics are taken across the spatial positions and the batch dimension:

$$\hat{x}_{i,j,c} = \frac{x_{i,j,c} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}}$$

$$y_{i,j,c} = \gamma_c \hat{x}_{i,j,c} + \beta_c$$

Where $\mu_c$ and $\sigma_c^2$ are computed over the batch and spatial dimensions for channel $c$. Learnable $\gamma_c$ and $\beta_c$ allow the network to undo normalization if needed.
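A training-mode forward pass can be sketched in pure Python. The example assumes a batch-first, channels-first layout (`x[n][c][i][j]`) and the biased variance estimate; the function name is illustrative and running statistics are left out for brevity.

```python
import math


def batch_norm(x, gamma, beta, eps=1e-5):
    """Per-channel normalization over batch and spatial dimensions."""
    N, C = len(x), len(x[0])
    H, W = len(x[0][0]), len(x[0][0][0])
    count = N * H * W
    out = [[[[0.0] * W for _ in range(H)] for _ in range(C)] for _ in range(N)]
    for c in range(C):
        vals = [x[n][c][i][j]
                for n in range(N) for i in range(H) for j in range(W)]
        mu = sum(vals) / count
        var = sum((v - mu) ** 2 for v in vals) / count  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for n in range(N):
            for i in range(H):
                for j in range(W):
                    x_hat = (x[n][c][i][j] - mu) * inv_std
                    out[n][c][i][j] = gamma[c] * x_hat + beta[c]
    return out


# Batch of 2 samples, 1 channel, 1x2 maps.
x = [[[[1.0, 2.0]]], [[[3.0, 4.0]]]]
y = batch_norm(x, gamma=[2.0], beta=[3.0])
flat = [v for sample in y for row in sample[0] for v in row]
```

After normalization the channel has mean `beta` and standard deviation close to `gamma` (slightly below, because of `eps`).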

BatchNorm’s practical effects:

Allows much higher learning rates (without it, gradients explode or vanish faster)
Acts as a regularizer (the noise from mini-batch statistics reduces overfitting slightly)
Reduces sensitivity to initialization

The downside: BatchNorm behaves differently at train vs. inference time (it uses running statistics at inference), and its batch statistics become noisy when training with very small batches. In that regime, Layer Normalization (normalizing across channels for each sample) or Group Normalization is preferred.
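The train/inference split can be illustrated with a single-channel toy layer. This sketch assumes an exponential moving average with momentum 0.1 and omits the learnable scale and shift; the class name and defaults are hypothetical and do not mirror any particular framework's implementation.

```python
import math


class BatchNorm1Channel:
    """Toy single-channel BatchNorm without gamma/beta."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.running_mean = 0.0
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps

    def __call__(self, values, training):
        if training:
            # Use batch statistics and update the running averages.
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Use the stored running statistics; the batch content is irrelevant.
            mean, var = self.running_mean, self.running_var
        return [(v - mean) / math.sqrt(var + self.eps) for v in values]


bn = BatchNorm1Channel()
train_out = bn([10.0, 12.0], training=True)   # roughly [-1, 1]
eval_out = bn([10.0], training=False)         # uses running stats, not the batch

# A batch of one in training mode has zero variance, so the output collapses to 0.
single = BatchNorm1Channel()([10.0], training=True)
```

The last line shows the small-batch failure mode in its extreme form: with one value per channel there is nothing to normalize against.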

Comparison: CNNs vs. Vision Transformers

The Vision Transformer (ViT, Dosovitskiy et al., 2020) showed that pure attention could match CNN performance on ImageNet — given enough data. The comparison:

Property	CNN	ViT
Inductive bias	Strong (locality, translation equivariance)	Weak
Data efficiency	High — works on small datasets	Low — needs large datasets or pretraining
Receptive field	Local, grows with depth	Global from layer 1
Computational scaling	$O(n)$ in the number of pixels $n$	$O(n^2)$ in the number of tokens $n$ with standard attention
Transfer learning	Good	Excellent (very large pretrained models transfer well)
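The scaling row can be made concrete by counting work as resolution grows. The sketch below assumes square images and 16×16 patches for the ViT; it counts attention pairs per layer and conv output positions only, ignoring channel widths and constants. The function names are illustrative.

```python
def num_tokens(image_size, patch_size):
    """Number of non-overlapping patches in a square image."""
    return (image_size // patch_size) ** 2


def attention_pairs(image_size, patch_size):
    """Entries in one full self-attention matrix: tokens squared."""
    n = num_tokens(image_size, patch_size)
    return n * n


def conv_positions(image_size):
    """Output positions of a stride-1, size-preserving convolution."""
    return image_size * image_size


tokens_224 = num_tokens(224, 16)          # 196
pairs_224 = attention_pairs(224, 16)      # 38416
pairs_448 = attention_pairs(448, 16)      # 614656

attention_growth = pairs_448 / pairs_224                   # 16.0
conv_growth = conv_positions(448) / conv_positions(224)    # 4.0
```

Doubling the side length quadruples the convolution work but multiplies the attention matrix by 16, which is the motivation for windowed and other efficient attention variants.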

Hybrid models (e.g., CvT, CoAtNet) use convolutions in early stages (where spatial locality matters) and attention in later stages (where global context matters). These hybrids often report a better accuracy-compute trade-off than pure CNNs or pure transformers.

Efficient attention variants (Swin Transformer uses windowed attention; MobileViT mixes local conv with global attention) have reduced ViT’s quadratic complexity, making transformers competitive even on resource-constrained hardware.

Modern Training Techniques

Strong augmentation regimes dramatically affect final accuracy:

CutMix: Paste patches from one image onto another with mixed labels
MixUp: Linearly interpolate between two training examples and their labels
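RandAugment: Apply a fixed number of randomly chosen augmentation operations at a shared magnitude, leaving only two hyperparameters to tune
AugReg (Steiner et al., 2021): Augmentation and regularization tuned together, studied for ViT training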
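MixUp and CutMix from the list above are simple enough to sketch on plain lists. The functions below are illustrative, with hypothetical names and signatures. The mixing weight and the patch location are passed in explicitly to keep the example deterministic; in practice the weight is usually sampled from a Beta distribution, for which the standard library offers `random.betavariate`.

```python
def mixup(x1, y1, x2, y2, lam):
    """Blend two flattened inputs and their one-hot labels with weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def cutmix(img1, y1, img2, y2, top, left, h, w):
    """Paste an h x w patch of img2 onto img1; mix labels by pasted area."""
    H, W = len(img1), len(img1[0])
    bottom, right = min(top + h, H), min(left + w, W)  # clip to the image
    mixed = [row[:] for row in img1]
    for i in range(top, bottom):
        for j in range(left, right):
            mixed[i][j] = img2[i][j]
    pasted_area = (bottom - top) * (right - left)
    lam = 1 - pasted_area / (H * W)  # fraction of img1 that survives
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return mixed, y


mx, my = mixup([0.0, 0.0], [1.0, 0.0], [4.0, 8.0], [0.0, 1.0], lam=0.75)
# mx == [1.0, 2.0], my == [0.75, 0.25]

zeros = [[0] * 4 for _ in range(4)]
ones = [[1] * 4 for _ in range(4)]
cm, cy = cutmix(zeros, [1.0, 0.0], ones, [0.0, 1.0], top=0, left=0, h=2, w=2)
# A 2x2 patch covers a quarter of the image, so cy == [0.75, 0.25]
```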

EfficientNetV2 (Tan & Le, 2021) showed that training-aware design, in particular progressively increasing image resolution and regularization strength together during training, can speed up training and improve accuracy beyond what architecture changes alone achieve.

Hardware Considerations

Modern GPUs and TPUs have specialized hardware for convolutions:

cuDNN’s implicit GEMM algorithm converts conv to matrix multiplication, leveraging tensor cores
Winograd convolution reduces arithmetic for 3×3 kernels at the cost of additional memory
For mobile: inference libraries use ARM NEON SIMD instructions to hand-optimize depthwise and pointwise convolution kernels
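The idea of turning a convolution into a matrix multiplication can be shown with an explicit im2col step. This is a single-channel sketch with illustrative function names, assuming no padding and stride 1. It materializes the patch matrix, whereas an implicit GEMM forms the same operands on the fly without storing them.

```python
def im2col(image, k):
    """Unroll every k x k patch into one row of a patch matrix."""
    H, W = len(image), len(image[0])
    rows = []
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            rows.append([image[i + m][j + n]
                         for m in range(k) for n in range(k)])
    return rows  # shape: (H_out * W_out) x (k * k)


def conv_as_matmul(image, kernels):
    """Patch matrix times flattened filters; one output column per filter."""
    k = len(kernels[0])
    patches = im2col(image, k)
    flat = [[kern[m][n] for m in range(k) for n in range(k)]
            for kern in kernels]
    return [[sum(p * w for p, w in zip(patch, f)) for f in flat]
            for patch in patches]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [[[1, 0], [0, -1]],   # diagonal difference
           [[1, 1], [1, 1]]]    # 2x2 box sum

out = conv_as_matmul(image, kernels)
# Rows are output positions in row-major order, columns are filters:
# [[-4, 12], [-4, 16], [-4, 24], [-4, 28]]
```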

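At inference time, quantization (INT8 instead of FP32) and pruning (zeroing small weights) can yield speedups of up to 4–8x in favorable cases, often with <1% accuracy loss, though the gain depends on the model and on hardware support for low-precision or sparse kernels. Tools like TensorRT, TFLite, and ONNX Runtime automate much of the quantization workflow.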
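Both ideas are simple at their core. The sketch below assumes symmetric per-tensor INT8 quantization and a fixed magnitude threshold for pruning; the function names are hypothetical, and real toolchains add calibration, per-channel scales, and fine-tuning on top.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    return [v * scale for v in q]


def magnitude_prune(weights, threshold):
    """Zero out weights whose absolute value is below the threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]


weights = [0.5, -1.27, 0.01, 1.0]

q, scale = quantize_int8(weights)        # q == [50, -127, 1, 100]
restored = dequantize(q, scale)          # each value within scale / 2 of the original
pruned = magnitude_prune(weights, 0.05)  # [0.5, -1.27, 0.0, 1.0]
```

The rounding error per weight is bounded by half the scale, which is why tensors with a few large outliers quantize poorly: the outliers inflate the scale for every other weight.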

One thing to remember: CNNs dominate wherever local structure matters, but the architectural innovations that made them trainable (residual connections, batch normalization, depthwise separability) are now principles applied across all of deep learning, not just image models.
