Computer Vision — Core Concepts

How machines learn to interpret images — the core techniques behind everything from facial recognition to cancer detection, without the PhD.

From Pixels to Meaning

A digital image is a matrix of numbers. Each pixel holds a value (or three values for RGB color). There’s no inherent “chair” or “face” in there — just 0–255 values arranged in a grid. Computer vision is the discipline of writing software that extracts meaning from that grid.
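To make the "grid of numbers" concrete, here is a minimal sketch using plain Python lists. The tiny 2x3 image and the `to_grayscale` helper are made up for illustration; real code would normally use an array library such as NumPy.

```python
# A tiny 2x3 RGB "image": rows of pixels, each pixel is (red, green, blue),
# and every channel value is an integer in 0..255.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(0, 0, 0), (128, 128, 128), (255, 255, 255)],
]

height = len(image)
width = len(image[0])


def to_grayscale(rgb_image):
    """Collapse each RGB pixel to one brightness value by averaging its channels."""
    return [[sum(pixel) // 3 for pixel in row] for row in rgb_image]


gray = to_grayscale(image)
print(height, width)  # 2 3
print(gray)           # [[85, 85, 85], [0, 128, 255]]
```

Nothing in either grid says "red square" or "white square". Any meaning has to be computed from the numbers.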

For most of computing history, this was done by hand. Engineers wrote explicit rules: “detect edges here, look for circular shapes there, match against templates.” It was painstaking, brittle, and deeply limited.

Then neural networks happened.

The Old Way vs. The New Way

Before 2012, most computer vision pipelines looked like this:

Feature engineering — humans manually designed detectors (edge filters, color histograms, texture descriptors)
Feature extraction — run the image through those detectors
Classification — a simple ML model decides what category the features belong to
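The three steps above can be sketched in a few lines. This is an illustrative toy, not a historical pipeline: the brightness histogram stands in for a hand-designed feature, a nearest-centroid rule stands in for the "simple ML model", and the images and the "night"/"day" labels are hypothetical.

```python
def brightness_histogram(image, bins=4):
    """Hand-crafted feature: fraction of pixels whose brightness falls in each bin."""
    counts = [0] * bins
    total = 0
    for row in image:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def nearest_centroid(feature, centroids):
    """Simple classifier: pick the label whose reference feature is closest."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(feature, centroids[label]))


# Hypothetical grayscale training images (values 0..255), one per class.
dark_image = [[10, 20], [30, 40]]
bright_image = [[200, 220], [240, 250]]
centroids = {
    "night": brightness_histogram(dark_image),
    "day": brightness_histogram(bright_image),
}

# Feature extraction, then classification, on a new image.
query = [[15, 25], [60, 35]]
prediction = nearest_centroid(brightness_histogram(query), centroids)
print(prediction)  # night
```

The classifier only ever sees the histogram, so everything depends on whether a human picked a feature that separates the classes.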

The problem: features designed for recognizing cars didn’t work for faces. Features designed for faces didn’t work for X-rays. Every new domain required new hand-crafted features. Teams of engineers spent months on this.

In 2012, AlexNet, a deep convolutional neural network trained on 1.2 million images, entered the ImageNet competition and posted a top-5 error rate of about 15%, against roughly 26% for the second-place entry. The gap was so large it looked like a mistake. It wasn’t. The deep learning era of computer vision had started.

How Convolutional Neural Networks See

Many modern vision systems are built on Convolutional Neural Networks (CNNs). Here’s how they work.

Filters and Feature Maps

Instead of a human writing “look for vertical edges,” a CNN learns its own detectors. The key operation is convolution: a small grid of weights (called a filter or kernel) slides across the image. At each position, every weight is multiplied by the pixel value beneath it, and the products are summed into a single number. The grid of those numbers is the filter’s feature map.
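A minimal sketch of that sliding multiply-and-sum, in plain Python with no padding and a stride of 1. The kernel weights here are hand-picked purely for illustration; in a CNN they would be learned during training. (Strictly speaking this is cross-correlation, which is the operation deep learning libraries usually call convolution.)

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Multiply each weight by the pixel under it, then sum the products.
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += kernel[i][j] * image[top + i][left + j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Grayscale image: dark on the left, bright on the right (a vertical edge).
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

# Illustrative weights that respond to a left-to-right jump in brightness.
vertical_edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge_kernel)
print(feature_map)  # [[0, 510, 0], [0, 510, 0]]
```

The output is large only where the window straddles the edge and zero over the flat regions, which is what "detecting" a feature means here.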

Early layers learn to detect simple things: edges, corners, color gradients. Deeper layers combine those into more complex patterns: textures, shapes, object parts. The deepest layers encode abstract concepts like “face” or “car wheel.”

No human programmed these filters. They emerged from training on labeled examples.

Pooling: Seeing the Forest, Not the Trees

After convolution, a “pooling” step shrinks the feature maps — summarizing regions instead of tracking every pixel. This makes the model less sensitive to where exactly something appears. Whether the cat’s eye is at pixel (120, 340) or (122, 338) doesn’t matter.
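As a rough sketch, here is non-overlapping 2x2 max pooling, one common pooling variant (assumed here; the text does not name a specific one). The two small feature maps are invented to show the same activation landing in slightly different places.

```python
def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each size x size block."""
    pooled = []
    for top in range(0, len(feature_map) - size + 1, size):
        row = []
        for left in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[top + i][left + j]
                     for i in range(size) for j in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled


# The same strong activation (9) at two slightly different positions.
a = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
b = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(max_pool(a))  # [[9, 0], [0, 0]]
print(max_pool(b))  # [[9, 0], [0, 0]]
```

Both inputs pool to the same 2x2 output. The tolerance is limited: it only holds while the shift stays inside one pooling block.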

Fully Connected Layers

At the end of a CNN, the feature maps are flattened into a single vector and passed through fully connected (dense) layers, which output a score for each class, usually reported as probabilities: “74% cat, 20% dog, 6% other.”
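Percentages like these typically come from a softmax over the final layer's raw scores. The sketch below assumes softmax (the text above does not name the function), and the three scores are hypothetical values chosen to land near the example.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the max does not change the result but avoids overflow.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["cat", "dog", "other"]
scores = [2.0, 0.7, -0.5]  # hypothetical raw outputs of the last dense layer
probabilities = softmax(scores)

for label, p in zip(labels, probabilities):
    print(f"{label}: {p:.0%}")
# cat: 74%
# dog: 20%
# other: 6%
```

Softmax always produces numbers that sum to 1, so a high percentage reflects the model's relative preference among its known classes, not a guarantee that the answer is right.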

The Main Tasks in Computer Vision

Not everything is “what’s in this photo.” There are several distinct problem types:

Task	What it does	Example
Image Classification	“What is this?”	“This photo contains a cat”
Object Detection	“What’s here, and where?”	Draw boxes around every car in an image
Semantic Segmentation	“Label every pixel”	Color roads blue, buildings gray, sky cyan
Face Recognition	“Whose face is this?”	Unlocking your phone
Pose Estimation	“How is this person standing?”	Fitness apps tracking your squat form
OCR	“What does this text say?”	Scanning receipts, reading license plates

Each of these has different architectures and training approaches, though they all lean on the same CNN backbone ideas.
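The tasks differ mostly in the shape of their output. As one illustration, a semantic segmentation model produces a class id for every pixel, which can then be colored for display as in the table's example. The class ids, `PALETTE`, and `label_map` below are hypothetical.

```python
# Hypothetical class ids mapped to the display colors from the table's example.
PALETTE = {
    0: (0, 0, 255),      # road -> blue
    1: (128, 128, 128),  # building -> gray
    2: (0, 255, 255),    # sky -> cyan
}


def colorize(label_map):
    """Convert a per-pixel class-id grid into an RGB image for display."""
    return [[PALETTE[class_id] for class_id in row] for row in label_map]


# A 3x4 segmentation output: one class id per pixel.
label_map = [
    [2, 2, 2, 2],
    [1, 1, 2, 2],
    [0, 0, 0, 0],
]

colored = colorize(label_map)
print(colored[0][0])  # (0, 255, 255)
print(colored[2][0])  # (0, 0, 255)
```

A classifier, by contrast, would return a single label for the whole image, and a detector would return a list of boxes with labels.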

What Models Actually Need to Learn

The biggest misconception about computer vision: people think the model “sees” the way we do. It doesn’t.

A model trained on ImageNet (a dataset of everyday objects) will be unreliable on medical images without fine-tuning. It has no concept of “lung” or “tumor” because none were in its training data. A model reported to achieve “dermatologist-level accuracy” on skin cancer detection got there by training on tens of thousands of labeled dermoscopy images. It didn’t generalize from dogs and cars.

This specialization is both a feature (models can become extremely accurate in narrow domains) and a limitation (every new domain requires new labeled data, which is expensive to collect).

The Data Problem

Getting labeled data is the biggest bottleneck in real-world computer vision.

For ImageNet, people on Amazon Mechanical Turk spent years manually labeling images — for less than $0.01 per label. The entire 14-million-image dataset reportedly cost around $2.5 million to annotate. Medical imaging is worse: each X-ray annotation requires an actual radiologist, which costs real money and time.

This is why techniques like:

Data augmentation (rotating, flipping, cropping images to create synthetic variety)
Transfer learning (starting from a pre-trained model instead of from scratch)
Self-supervised learning (learning from unlabeled images, for example by predicting masked regions)

…are such active research areas. They all exist to reduce the labeled-data requirement.
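Of the three, data augmentation is the easiest to show in a few lines. The sketch below uses plain Python lists and a made-up 3x3 image; the function names are illustrative, and real pipelines normally rely on an image or tensor library for these transforms.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image a quarter turn clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def random_crop(image, crop_h, crop_w, rng):
    """Cut out a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

rng = random.Random(0)  # seeded so the run is repeatable
augmented = [
    horizontal_flip(image),
    rotate_90_clockwise(image),
    random_crop(image, 2, 2, rng),
]
print(augmented[0])  # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
print(augmented[1])  # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
```

Each variant keeps the original label, so one annotated image yields several training examples. Whether a given transform is safe depends on the task: flipping is harmless for cats but not for reading text.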

Common Misconception: More Accuracy = Production Ready

Benchmark accuracy (on a test dataset) can look great while real-world performance is poor. Why?

Distribution shift: if your training photos were taken in good lighting and your deployment environment is a warehouse with flickering fluorescents, accuracy drops
Adversarial fragility: inputs with small, deliberately crafted changes can make a model misclassify an image that a human would still read correctly at a glance
Rare class performance: a model that’s 98% accurate overall might be only 60% accurate on the rare class you actually care about

Evaluating computer vision models for production requires much more than a single accuracy number.
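The rare-class point is easy to check with arithmetic. The sketch below uses a hypothetical inspection dataset (950 "ok" parts, 50 "defect" parts) to show how 98% overall accuracy can coexist with 60% accuracy on the class that matters.

```python
def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """For each true class, the fraction of its examples the model got right."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical test set: 950 "ok" parts and 50 "defect" parts.
y_true = ["ok"] * 950 + ["defect"] * 50
# The model gets every "ok" right but misses 20 of the 50 defects.
y_pred = ["ok"] * 950 + ["defect"] * 30 + ["ok"] * 20

print(accuracy(y_true, y_pred))          # 0.98
print(recall_per_class(y_true, y_pred))  # {'ok': 1.0, 'defect': 0.6}
```

Reporting per-class numbers alongside the headline accuracy is one simple way to surface this kind of gap before deployment.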

One Thing to Remember

A computer vision model doesn’t understand what it sees — it recognizes patterns that correlate with labels it was trained on. That distinction matters enormously when you’re deciding whether to trust it with a medical diagnosis, a self-driving decision, or a bail recommendation.
