Convolutional Neural Networks — Core Concepts

The architecture that revolutionized computer vision in 2012 and still powers most image AI today — how filters, pooling, and depth work together.

The Moment Everything Changed

In 2012, a deep CNN called AlexNet entered the ImageNet competition, an annual contest to classify images into 1,000 categories using about 1.2 million training images. The other leading entries relied on traditional computer vision techniques built on hand-engineered features. AlexNet won by a margin so large (top-5 error of 15.3% vs. the second place’s 26.2%) that the computer vision field abandoned most of its prior approaches within a few years.

That moment — sometimes called the ImageNet moment — launched the deep learning era. CNNs are still the foundation of most image-related AI, including medical imaging, autonomous driving, satellite analysis, and face recognition.

The Core Idea: Convolution

A convolution is a mathematical operation where you slide a small matrix (the filter or kernel) over a larger one (the image) and, at each position, multiply the overlapping values element-wise and sum the results. The output is a feature map: a new representation of the image that highlights wherever that filter’s pattern appeared.

A 3×3 filter that detects horizontal edges might look like:

-1 -1 -1
 0  0  0
 1  1  1

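When this slides over an image (multiplying and summing without flipping the kernel, as deep learning libraries do), positions with a dark-top, bright-bottom transition produce high values in the feature map, and the reverse transition produces strongly negative values. Flat regions without an edge produce values near zero.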
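The sliding multiply-then-sum can be sketched in a few lines of plain Python. This is a minimal illustration, not how a real framework implements it: the function name `cross_correlate2d`, the tiny 6×4 test image, and the choice of no padding and stride 1 are all assumptions made for the example.

```python
def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and multiply-then-sum."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

# The horizontal-edge filter from the text.
edge_kernel = [
    [-1, -1, -1],
    [ 0,  0,  0],
    [ 1,  1,  1],
]

# Hypothetical 6x4 image: dark rows (0) on top, bright rows (9) below.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
]

feature_map = cross_correlate2d(image, edge_kernel)
for row in feature_map:
    print(row)
# [0, 0]
# [27, 27]
# [27, 27]
# [0, 0]
```

With the kernel applied as written (no flip), the response is positive where the rows below are brighter than the rows above, negative for the opposite transition, and zero in flat regions.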

The key insight: instead of designing filters by hand, CNNs learn the filters from data. Backpropagation adjusts the filter values until they detect features that help with the final task (e.g., classification). The network discovers what patterns matter.

CNN Architecture: Layer by Layer

Convolutional Layers

These apply multiple learned filters to the input. If you have 64 filters on a 224×224 image, you get 64 feature maps — each highlighting different patterns. As you go deeper, filters become more complex: early layers detect edges and textures, middle layers detect parts (eyes, wheels), deep layers detect whole concepts (faces, cars).
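To make the shapes concrete, here is a small sketch that computes the output dimensions of a convolutional layer using the standard size formula. The helper name `conv_output_shape` and the `stride` and `padding` parameters are assumptions for illustration; the text above does not specify either setting.

```python
def conv_output_shape(height, width, num_filters, kernel_size, stride=1, padding=0):
    """Return (channels, height, width) of a conv layer's output."""
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    return (num_filters, out_h, out_w)

# 64 filters of size 3x3 on a 224x224 input, as in the text.
# padding=1 is an assumption; it keeps the spatial size unchanged.
print(conv_output_shape(224, 224, num_filters=64, kernel_size=3, padding=1))
# (64, 224, 224)

# Without padding, each map shrinks by kernel_size - 1 in both dimensions.
print(conv_output_shape(224, 224, num_filters=64, kernel_size=3))
# (64, 222, 222)
```

Either way, the number of output channels equals the number of filters: one feature map per filter.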

Activation Functions

After each convolution, a non-linear activation (most commonly ReLU: $\max(0, x)$, or a close variant) is applied. Without non-linearities, stacking multiple layers would be mathematically equivalent to a single linear layer, because a composition of linear maps is itself linear.
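The collapse of stacked linear layers can be checked numerically. The sketch below uses two small made-up weight matrices (`W1`, `W2`) and omits biases; the values are arbitrary and chosen only so the arithmetic is easy to follow.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def relu(vector):
    """Apply max(0, x) element-wise."""
    return [max(0, x) for x in vector]

# Hypothetical weights for two layers (biases omitted for brevity).
W1 = [[1, -2], [3, 1]]
W2 = [[2, 1], [1, 1]]
x = [1, 2]

# Two linear layers applied in sequence...
two_linear = matvec(W2, matvec(W1, x))
# ...give the same result as one layer whose weights are the product W2 * W1.
one_merged = matvec(matmul(W2, W1), x)
print(two_linear, one_merged)  # [-1, 2] [-1, 2]

# With ReLU between the layers, the merged single layer no longer matches.
with_relu = matvec(W2, relu(matvec(W1, x)))
print(with_relu)  # [5, 5]
```

The ReLU zeroes out the negative intermediate value, which is something no single weight matrix can reproduce for every input.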

Pooling Layers

After convolution + activation, the feature maps are downsampled using max pooling or average pooling. In 2×2 max pooling with stride 2, each 2×2 region is replaced by its maximum value.

Pooling serves two purposes:

Reduces spatial dimensions (saving computation for deeper layers)
Introduces a degree of translation invariance (a feature detected 1 pixel to the left still produces a strong pooled output)
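Both effects can be seen in a small sketch of 2×2 max pooling with stride 2. The function name `max_pool2d` and the two 4×4 feature maps are made up for illustration; the second map is the first with its strongest activation moved one pixel to the left.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Replace each size x size window with its maximum value."""
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        out.append(row)
    return out

# Hypothetical 4x4 feature map with one strong activation (9).
fmap = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 2, 0],
]
# The same map with the strong activation shifted one pixel to the left.
shifted = [
    [0, 0, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 2, 0],
]

pooled = max_pool2d(fmap)
pooled_shifted = max_pool2d(shifted)
print(pooled)          # [[9, 0], [0, 2]]
print(pooled_shifted)  # [[9, 0], [0, 2]]
```

The 4×4 map shrinks to 2×2, and the shift leaves the pooled output unchanged. The invariance is only partial: a shift that moves the activation into a neighboring pooling window would change the result.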

Fully Connected Layers

After several conv+pool blocks, the feature maps are flattened into a vector and fed into regular fully connected layers. These layers combine the detected features to make the final prediction.

Softmax Output

For classification, the final layer typically uses softmax to turn the network’s raw scores into a probability distribution over classes. For a 1,000-class problem like ImageNet, the output is a 1,000-dimensional vector of non-negative values summing to 1.
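As a rough sketch, softmax exponentiates each raw score and divides by the total. The three example scores below are made up, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something the text prescribes.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max avoids overflow in exp() without changing the result.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(logits)
print(probs)       # largest score gets the largest probability
print(sum(probs))  # approximately 1.0, up to floating-point rounding
```

The ordering of the scores is preserved, so the predicted class is simply the index of the largest probability.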

Key Architectures Timeline

AlexNet (2012): First deep CNN to dominate ImageNet. 8 learned layers (5 convolutional, 3 fully connected), trained on 2 GPUs. Popularized ReLU activations and dropout, both of which had been proposed shortly before.

VGG (2014): Oxford’s entry — very deep but simple (16–19 layers), using only 3×3 convolutions. Remains widely used as a feature extractor.

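ResNet (2015): Microsoft Research’s entry introduced residual connections (skip connections), which add a block’s input to its output so that gradients have a direct path through the network. This made it practical to train networks of 152 layers and beyond, where plain deep networks had become harder to optimize. Won ImageNet 2015. Residual blocks are now everywhere.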

EfficientNet (2019): Google’s approach of scaling depth, width, and input resolution together using a single compound coefficient. It reached state-of-the-art ImageNet accuracy at the time with far fewer parameters and FLOPs than earlier models of similar accuracy.

ConvNeXt (2022): Meta’s “modernized” CNN that borrowed design principles from Vision Transformers — competitive with ViT on many benchmarks, suggesting CNNs aren’t obsolete.

Why CNNs Beat Fully Connected Networks for Images

A 224×224 RGB image has about 50,000 pixels, or roughly 150,000 input values across its three color channels. A fully connected network connecting those to even 1,000 hidden units would require about 150 million parameters in just the first layer, before learning anything useful.

CNNs use parameter sharing: every position in the image uses the same filter weights. A 3×3 filter has only 9 weights per input channel regardless of image size. This is why CNNs can handle high-resolution images with far fewer parameters while also generalizing better (a learned filter works anywhere in the image, not just where training examples showed the pattern).
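The gap is easy to verify with a few lines of arithmetic. This sketch assumes an RGB input (3 channels) and, for the convolutional side, a first layer of 64 filters of size 3×3 with one bias each; the filter count is borrowed from the earlier example and the bias term is an assumption.

```python
# Fully connected: every input value connects to every hidden unit.
height, width, channels = 224, 224, 3
hidden_units = 1000
fc_weights = height * width * channels * hidden_units

# Convolutional: 64 filters, each 3x3 across 3 input channels, plus one bias each.
num_filters, kernel = 64, 3
conv_params = num_filters * (kernel * kernel * channels + 1)

print(f"{fc_weights:,}")   # 150,528,000
print(f"{conv_params:,}")  # 1,792
```

The convolutional count does not depend on `height` or `width` at all, which is the point of parameter sharing.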

Common Misconception: CNNs Aren’t Just for Images

While they were developed for images, convolutions apply to any data with local spatial or temporal structure:

1D CNNs for audio and time series (WaveNet uses them for speech synthesis)
3D CNNs for video analysis (analyzing spatiotemporal patterns across frames)
Graph CNNs for molecular data and social networks (adapted convolution for non-grid structures)
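The 1D case is the simplest to sketch: the same slide-and-sum operation runs along a sequence instead of across an image. The function name `cross_correlate1d`, the toy signal, and the difference filter below are all made up for illustration and are not taken from WaveNet or any specific model.

```python
def cross_correlate1d(signal, kernel):
    """Slide `kernel` along `signal` (no padding, stride 1) and multiply-then-sum."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

# Hypothetical time series with a step up from 1 to 5.
signal = [1, 1, 1, 5, 5, 5]
# A difference filter: responds where the value rises from left to right.
kernel = [-1, 0, 1]

response = cross_correlate1d(signal, kernel)
print(response)  # [0, 4, 4, 0]
```

The filter fires only around the step, which is the 1D analogue of the edge detector used on images.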

One thing to remember: CNNs work because nearby pixels (or data points) are related — and parameter sharing lets a small learned filter apply that relationship everywhere at once.
