Computer Vision — Deep Dive

The technical architecture, training tricks, and unsolved problems behind modern vision systems — from attention mechanisms and ViTs to the adversarial fragility nobody talks about enough.

What Actually Happens During Inference

When an image passes through a CNN, the math is deceptively simple at each step. The complexity comes from depth and scale.

Convolution in Practice

Given an input feature map H × W × C (height × width × channels) and a filter of size k × k × C, the output at position (i, j) is:

output[i, j] = bias + sum over (p, q, c) of: filter[p, q, c] * input[i+p, j+q, c]

With a stride of 1 and “same” padding, the spatial dimensions are preserved; “valid” (no) padding shrinks each of them by k - 1. Modern frameworks (PyTorch, TensorFlow, JAX) typically hand convolutions to vendor libraries such as cuDNN, which often lower them to matrix multiplication (im2col or implicit GEMM), alongside Winograd and FFT variants. That lowering is why convolutions map so efficiently to GPU tensor cores.
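
To make the formula concrete, here is a minimal pure-Python sketch of a single-filter convolution with stride 1 and valid padding. The function name `conv2d_valid` and the nested-list layout are assumptions made for illustration; real frameworks use far faster kernels.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Naive single-filter convolution with stride 1 and no padding.

    image:  H x W x C nested lists
    kernel: k x k x C nested lists
    Returns an (H-k+1) x (W-k+1) nested list.
    """
    H, W, C = len(image), len(image[0]), len(image[0][0])
    k = len(kernel)
    out = []
    for i in range(H - k + 1):
        row = []
        for j in range(W - k + 1):
            acc = bias  # the bias is added once per output position
            for p in range(k):
                for q in range(k):
                    for c in range(C):
                        acc += kernel[p][q][c] * image[i + p][j + q][c]
            row.append(acc)
        out.append(row)
    return out


# 3x3 single-channel image and a 2x2 kernel of ones (sums each window)
image = [[[1.0], [2.0], [3.0]],
         [[4.0], [5.0], [6.0]],
         [[7.0], [8.0], [9.0]]]
kernel = [[[1.0], [1.0]],
          [[1.0], [1.0]]]
result = conv2d_valid(image, kernel, bias=0.5)
print(result)  # [[12.5, 16.5], [24.5, 28.5]]
```

The 3×3 input shrinks to 2×2, which is the k - 1 reduction that valid padding implies.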

For a typical ResNet-50:

Input: 224×224×3 (RGB)
After first conv (7×7, stride 2): 112×112×64
After max pool: 56×56×64
Through 4 residual stages: 7×7×2048
Global average pool: 1×1×2048
FC head: 2048 → 1000 classes

Total: ~25M parameters, ~4 GFLOPs per image.
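
The spatial sizes in that walkthrough follow from the standard output-size formula. The sketch below traces them under assumed settings that match common implementations but are not stated above: padding 3 for the 7×7 conv, a 3×3 max pool with padding 1, and each of the last three stages downsampling once with a stride-2, padding-1, 3×3 operation.

```python
def conv_out(n, k, stride, pad):
    """Output length along one axis: floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1


after_conv = conv_out(224, k=7, stride=2, pad=3)        # first conv
after_pool = conv_out(after_conv, k=3, stride=2, pad=1)  # max pool

# Assumption: the first residual stage keeps the resolution and
# each of the remaining three stages halves it once.
final = after_pool
for _ in range(3):
    final = conv_out(final, k=3, stride=2, pad=1)

fc_params = 2048 * 1000 + 1000  # weights plus biases of the FC head

print(after_conv, after_pool, final, fc_params)  # 112 56 7 2049000
```

The FC head alone accounts for about 2M of the ~25M parameters; the rest sit in the convolutional stages.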

Architecture Evolution: The Moments That Actually Mattered

AlexNet (2012) — The Inflection Point

Krizhevsky, Sutskever, and Hinton’s AlexNet won ImageNet 2012 with a 15.3% top-5 error rate versus second place at 26.2%. This wasn’t incremental — it was a discontinuity.

What made it work:

ReLU activations instead of tanh/sigmoid, enabling faster training in deep nets
Dropout regularization to reduce overfitting (0.5 dropout on FC layers)
Data augmentation: random crops, horizontal flips, PCA-based color perturbation
Two-GPU training to fit 60M parameters into the limited memory of 2012 GTX 580s

The architectural choices matter less than the lesson: scale + GPU + data = qualitative improvement.

VGG (2014) — Depth Over Width

Oxford’s VGGNet showed that stacking multiple 3×3 convolutions (rather than single large filters) consistently improved accuracy, and that simply making networks deeper worked. VGG-16 and VGG-19 became the backbone of countless downstream tasks for years because they were simple enough to fine-tune.

Two 3×3 convolutions have the same receptive field as one 5×5, with fewer parameters and one extra non-linearity. That detail gets rediscovered every few years.
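
A quick arithmetic check of that claim, as a sketch. It assumes the same number of input and output channels in every layer (64 here, chosen arbitrarily) and ignores biases.

```python
def conv_weights(k, channels):
    """Weights in a k x k conv with `channels` in and out (biases ignored)."""
    return k * k * channels * channels


def stacked_receptive_field(k, layers):
    """Receptive field of `layers` stacked stride-1 k x k convolutions."""
    return 1 + layers * (k - 1)


C = 64  # illustrative channel count
two_3x3 = 2 * conv_weights(3, C)
one_5x5 = conv_weights(5, C)

print(stacked_receptive_field(3, 2), stacked_receptive_field(5, 1))  # 5 5
print(two_3x3, one_5x5)  # 73728 102400
```

The ratio is 18 to 25 regardless of the channel count, so the stacked version saves 28% of the weights at the same receptive field.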

ResNet (2015) — Solving Gradient Vanishing at Scale

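He et al.’s residual connections solved a core problem: very deep plain networks were performing worse than shallower ones, even on the training set. The paper calls this the degradation problem and argues it is an optimization difficulty rather than overfitting or, with batch normalization in place, simply vanishing gradients. The solution was embarrassingly simple in hindsight: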

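import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block (two-conv variant; assumes equal input and output channels)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity  # <-- the residual connection
        return self.relu(out)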

The skip connection lets gradients flow directly back through the network, making 50-, 101-, even 152-layer networks trainable. An ensemble of ResNets hit 3.57% top-5 error on the ImageNet test set, better than the human-estimated error of ~5%. By one benchmark, machines had surpassed humans at image classification.

(Caveat: “human error” was estimated on a subset of the test set. The claim is valid but narrower than the headline suggests.)

EfficientNet (2019) — Compound Scaling

Google’s EfficientNet paper asked a simple question: if you’re going to scale up a network, how should you balance width, depth, and resolution? The answer was a neural architecture search (NAS)-derived base model combined with a compound scaling coefficient. EfficientNet-B7 achieved 84.3% ImageNet top-1 accuracy while being 8.4x smaller and 6.1x faster at inference than GPipe, the best ConvNet at the time.
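
Compound scaling can be sketched in a few lines. The base coefficients below (1.2 for depth, 1.1 for width, 1.15 for resolution) are the grid-search values reported in the EfficientNet paper, not something stated above, and the released B1 to B7 models round the results, so treat the outputs as illustrative.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(phi):
    """Multipliers for depth, width and input resolution at coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    """FLOPs grow linearly with depth and quadratically with width and resolution."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    scale = compound_scale(phi)
    print(phi, {k: round(v, 3) for k, v in scale.items()},
          round(flops_multiplier(phi), 2))
```

The coefficients were chosen so that the product in `flops_multiplier` is close to 2, meaning each increment of phi roughly doubles the compute budget.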

Vision Transformers: Attention for Images

In 2020, Google Brain published “An Image is Worth 16×16 Words” (ViT). The key idea: stop using convolutions entirely and apply the Transformer architecture from NLP directly to images.

How ViT Works

Patch embedding: split the image into 16×16 pixel patches, flatten each patch, linearly project it to a dimension D (e.g., 768)
Positional encoding: add learned position embeddings so the model knows where each patch came from
Class token: prepend a special [CLS] token whose final state represents the whole image
Transformer encoder: standard multi-head self-attention + MLP blocks, N times
Classification head: MLP on top of the [CLS] output
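
The first three steps determine the sequence the Transformer actually sees. A small sketch of the bookkeeping, assuming a 224×224 RGB input (the function name `vit_sequence` is made up for illustration):

```python
def vit_sequence(image_size=224, patch_size=16, channels=3, embed_dim=768):
    """Token count and patch dimensions for a ViT-style patch embedding."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    patch_dim = patch_size * patch_size * channels  # flattened patch length
    return {
        "num_patches": num_patches,
        "seq_len": num_patches + 1,             # +1 for the [CLS] token
        "patch_dim": patch_dim,
        "proj_weights": patch_dim * embed_dim,  # linear projection, no bias
    }


info = vit_sequence()
print(info)
# {'num_patches': 196, 'seq_len': 197, 'patch_dim': 768, 'proj_weights': 589824}
```

With 16×16 RGB patches, the flattened patch length happens to equal the 768-dimensional embedding, so the projection is a square matrix in this configuration.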

Self-attention lets every patch attend to every other patch globally — a property convolutions lack. A CNN only sees the local neighborhood at each layer, building up global context gradually through depth.
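
A minimal single-head sketch of that global mixing, in pure Python. As a loudly stated simplification, queries, keys, and values are the tokens themselves; a real ViT applies learned linear projections first and runs several heads in parallel.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention without learned projections."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Score this token against every token in the sequence
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all tokens
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, tokens))
                        for i in range(d)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy "patch" embeddings
outputs, weights = self_attention(tokens)
```

Every row of `weights` is strictly positive across all positions, which is the "every patch attends to every other patch" property; the cost is that the weight matrix grows quadratically with the number of patches.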

The catch: ViT needs a lot of data to train from scratch. Trained on ImageNet alone (about 1.3M images), it underperforms comparable ResNets. Pre-trained on JFT-300M (300M internal Google images), it matches or beats the best CNNs of the time.

This led to the general rule: convolutions have a useful inductive bias (local spatial structure) that helps in low-data regimes. Attention is more powerful but more data-hungry.

Hybrid Approaches

DINO, Swin Transformer, and ConvNeXt all explore the middle ground:

Swin Transformer: hierarchical windowed attention (local patches attend locally, cross-window attention via shifted windows)
ConvNeXt: “modernized” CNN design that borrows ideas from ViT without using attention — competitive with Swin at the same compute budget
DINO: self-supervised ViT training that produces features useful for segmentation without any segmentation supervision — a striking emergent property

Object Detection Architectures

Classification says “what.” Detection says “what + where.”

Two-Stage Detectors (R-CNN Family)

A Region Proposal Network (RPN) suggests candidate bounding boxes
Features for each proposal get cropped, resized to a fixed size, and passed through a head that classifies the box and refines its coordinates

Faster R-CNN (2015) unified both stages in a single network sharing a backbone. It’s still widely used in applications where accuracy matters more than speed (medical imaging, satellite analysis).

One-Stage Detectors (YOLO, SSD, RetinaNet)

Instead of first proposing regions then classifying them, one-stage detectors predict bounding boxes and class probabilities directly from the full image in a single pass.

YOLO (You Only Look Once) is the poster child. The original design divides the image into a grid, and each cell predicts boxes and confidences directly. YOLOv8 (2023) achieves 50+ mAP on COCO at 80+ FPS on a modern GPU, depending on model size and hardware, which is fast enough for real-time video.

The tradeoff: one-stage detectors historically struggled with small objects (a gap now largely closed by feature pyramid networks) and with dense, overlapping instances.

Non-Maximum Suppression (NMS)

Most detectors produce many overlapping boxes. NMS is the post-processing step that, when two boxes have high IoU (Intersection over Union), keeps only the higher-confidence one. It sounds simple but has real edge cases: two people standing close together can yield two correct boxes with high IoU, and NMS will suppress one of them, losing a true detection. Soft-NMS and learned NMS variants attempt to fix this.
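
A sketch of greedy NMS in pure Python, assuming boxes in (x1, y1, x2, y2) format and a single class. The 0.5 threshold is a common default rather than a value taken from the text.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box is dropped because its IoU with the first is 81/119, about 0.68. If those two boxes were two different people standing close together, the same rule would discard a correct detection, which is the edge case described above.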

Segmentation: Pixel-Level Understanding

Semantic segmentation assigns a class label to every pixel. Instance segmentation additionally distinguishes which object instance each pixel belongs to.

FCN (Fully Convolutional Networks, 2015) removed the FC layers from classification networks and used upsampling to produce pixel-wise outputs. Every pixel gets a class prediction.

U-Net (2015) added skip connections between corresponding encoder and decoder layers — crucial for high-resolution medical imaging where precise boundaries matter. It became the de facto standard for biomedical segmentation and remains competitive today.

Mask R-CNN (2017) extended Faster R-CNN with a parallel branch that outputs a binary mask for each detected instance. It added RoIAlign (replacing the imprecise RoIPool operation) to fix quantization errors in the spatial correspondence between feature map and original image.
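
The quantization issue is easy to see on a toy feature map. The sketch below contrasts snapping a fractional coordinate to a cell (a simplified stand-in for RoIPool) with the bilinear sampling that RoIAlign relies on. It handles a single sample point with in-bounds coordinates only; the real operators pool several such points per output bin.

```python
import math


def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D feature map at fractional (y, x)."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(fmap) - 1)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fmap = [[0.0, 10.0],
        [20.0, 30.0]]

quantized = fmap[int(0.5)][int(0.5)]       # snaps (0.5, 0.5) to cell (0, 0)
aligned = bilinear_sample(fmap, 0.5, 0.5)  # blends the four neighbors
print(quantized, aligned)  # 0.0 15.0
```

Snapping discards the sub-cell offset entirely, which matters when one feature-map cell covers many input pixels and the goal is a pixel-accurate mask.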

Self-Supervised Learning: The Data Bottleneck Workaround

Labeled data is expensive. Self-supervised learning (SSL) trains on unlabeled images by creating pretext tasks with automatically generated labels.

SimCLR / MoCo: contrastive learning. Augmented versions of the same image should be close in embedding space; different images should be far apart. The model learns representations without labels.
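
The contrastive objective can be sketched for a single anchor as an InfoNCE-style loss. This is a simplification: SimCLR's actual NT-Xent loss is computed symmetrically over every pair in a batch, and the temperature of 0.1 is an illustrative choice.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

# An augmented view that lands near the anchor gives a low loss;
# one that lands far away gives a higher loss.
loss_near = info_nce(anchor, [0.9, 0.1], negatives)
loss_far = info_nce(anchor, [0.1, 0.9], negatives)
print(loss_near < loss_far)  # True
```

Minimizing this loss pulls the two views of one image together and pushes the other images away, with no labels involved.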

MAE (Masked Autoencoders, He et al. 2021): randomly mask 75% of image patches, feed only the visible patches to the ViT encoder, and train a lightweight decoder to reconstruct the masked pixels. This learns excellent representations, and the high masking ratio is part of why: with only a quarter of the image visible, the model can’t cheat by interpolating neighbors, and the encoder processes far fewer tokens.
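
The masking step itself is simple. A sketch, assuming 196 patches (a 224×224 image with 16×16 patches) and a hypothetical helper named `random_mask`:

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into (visible, masked) lists."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(indices[:num_keep])
    masked = sorted(indices[num_keep:])
    return visible, masked


visible, masked = random_mask(196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

Only the 49 visible patches would be passed to the encoder; the 147 masked positions are the reconstruction targets.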

The representations from MAE fine-tune to competitive classification, detection, and segmentation performance with far less labeled data. This is the direction the field is moving: massive SSL pre-training, small labeled fine-tuning.

Adversarial Robustness: The Unsolved Problem

In 2013, Szegedy et al. discovered that adding carefully crafted noise to an image, imperceptible to humans, could cause a classifier to output incorrect labels with high confidence. These adversarial examples remain a fundamental unresolved problem.

Why it matters: the perturbations aren’t random. They’re computed by maximizing the model’s loss via gradient ascent, so they’re optimized against the model itself. An attacker who knows your model’s architecture and weights can construct inputs that will reliably fool it, and such inputs often transfer to similar models the attacker has never seen.
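
A toy illustration of why a gradient-guided perturbation beats random noise, using a linear classifier in pure Python. This is a single-step, FGSM-style sketch; the weights, input, and epsilon are made up for the example.

```python
def score(w, b, x):
    """Linear classifier: a positive score means class 1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(v):
    return (v > 0) - (v < 0)


def sign_attack(w, x, epsilon):
    """Move each input coordinate by epsilon against the score's gradient.

    For a linear model the gradient of the score with respect to x is w.
    """
    return [xi - epsilon * sign(wi) for wi, xi in zip(w, x)]


w, b = [1.0, -1.0, 1.0, -1.0], 0.0
x = [0.6, 0.4, 0.6, 0.4]

clean = score(w, b, x)  # about 0.4, so class 1
x_adv = sign_attack(w, x, epsilon=0.15)
adversarial = score(w, b, x_adv)  # about -0.2, so the prediction flips
print(clean > 0, adversarial > 0)  # True False
```

Each coordinate moves by only 0.15, but every move pushes the score in the same direction, so the total shift is epsilon times the sum of the absolute weights. In an image with hundreds of thousands of pixels, the per-pixel change can be tiny and still add up.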

PGD Attack (Projected Gradient Descent):

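import torch

def pgd_attack(model, criterion, x, y, epsilon, step_size, num_steps):
    """Untargeted L-infinity PGD; assumes inputs are scaled to [0, 1]."""
    x_adv = x.clone().detach()
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = criterion(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss along the sign of the gradient
        x_adv = x_adv.detach() + step_size * grad.sign()
        # Project back into the ε-ball around x (elementwise)
        x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)
        x_adv = torch.clamp(x_adv, 0, 1)  # keep a valid image
    return x_adv.detach()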

Adversarial training (exposing the model to adversarial examples during training) is the most effective defense but costs 3–10x compute and typically reduces clean accuracy by several points.

No architecture or training procedure has achieved both high clean accuracy and high robust accuracy at scale, whether certified or empirical. RobustBench tracks the empirical state of the art; as of early 2026, the best models achieve roughly 70% robust accuracy on CIFAR-10 under standard white-box attacks, compared to ~95% clean accuracy on the same dataset.

Failure Modes in Production

Most CV papers report a single accuracy number. Production systems care about a different distribution of problems:

Long-tail failure: a model trained on 1,000 categories fails silently on class 1,001. In autonomous driving, this means an object class the training distribution didn’t cover — a child riding a cart, a horse on the road.
Temporal distribution shift: a model trained in summer fails in winter (different lighting, snow on lane markings, shorter days).
Domain gap: a model trained on internet photos fails on thermal infrared or satellite imagery, not because it needs more data but because the pixel distribution is a different beast entirely.
Spurious correlations: a model trained on chest X-rays learned that the metal tag often found in female patients’ clothing correlated with a certain diagnosis, not because of biology but because of tagging convention at the data collection hospital. It performed worse on male patients not out of any bias against them but because it had learned the wrong feature entirely.

Solving these requires out-of-distribution detection, uncertainty quantification, and systematic dataset auditing — none of which are solved problems.
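
As a sense of what the simplest form of out-of-distribution detection looks like, here is a sketch of the maximum-softmax-probability baseline: flag any prediction whose top probability falls below a threshold. The 0.7 threshold and the function names are arbitrary choices for illustration, and this is a baseline, not a solution.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def flag_low_confidence(logits, threshold=0.7):
    """Return (flagged, confidence) using the max softmax probability."""
    confidence = max(softmax(logits))
    return confidence < threshold, confidence


peaked_flag, peaked_conf = flag_low_confidence([8.0, 1.0, 0.5])
flat_flag, flat_conf = flag_low_confidence([1.1, 1.0, 0.9])
print(peaked_flag, flat_flag)  # False True
```

The weakness is the one this section describes: a model can produce sharply peaked logits on an input it has never seen anything like, and this check will pass it through.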

One Thing to Remember

The gap between “achieves state-of-the-art on ImageNet” and “reliable in production” is enormous. Modern vision models are extraordinarily powerful pattern-matchers that can fail catastrophically on inputs only slightly outside their training distribution, and they’ll do it confidently, with no signal that anything is wrong. Building trustworthy vision systems means solving the distribution gap, not just improving benchmark numbers.
