Data Augmentation in Python — Deep Dive
Image Augmentation with Albumentations
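Albumentations is often faster than torchvision transforms and other PIL-based pipelines because it operates on NumPy arrays and uses OpenCV under the hood. The actual speedup depends on the transform and the image size, so benchmark on your own data.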
Basic Pipeline
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.15, rotate_limit=15, p=0.5),
    A.GaussNoise(var_limit=(10, 50), p=0.3),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])
image = cv2.imread("photo.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
augmented = transform(image=image)["image"]
Advanced Transforms
train_transform = A.Compose([
    A.RandomResizedCrop(height=224, width=224, scale=(0.8, 1.0)),
    A.OneOf([
        A.GaussianBlur(blur_limit=7),
        A.MedianBlur(blur_limit=7),
        A.MotionBlur(blur_limit=7),
    ], p=0.3),
    A.CoarseDropout(
        max_holes=8, max_height=32, max_width=32,
        min_holes=1, min_height=8, min_width=8,
        fill_value=0, p=0.5,
    ),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])
CoarseDropout is the Albumentations counterpart of Cutout and Random Erasing: with fill_value=0 it masks rectangular patches with zeros, forcing the model to rely on multiple regions of the image instead of a single discriminative patch.
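To make the masking idea concrete, here is a minimal NumPy sketch of the same effect. It is not the Albumentations implementation: the function name `coarse_dropout_sketch` and its parameters are made up for illustration, and it uses fixed-size square holes where the library samples hole counts and sizes from ranges.

```python
import numpy as np

def coarse_dropout_sketch(image, num_holes=4, hole_size=8, fill_value=0, seed=None):
    """Return a copy of `image` (H, W, C) with square patches overwritten.

    Hypothetical helper for illustration only; assumes hole_size <= H and W.
    """
    rng = np.random.default_rng(seed)
    out = image.copy()  # leave the caller's array untouched
    height, width = out.shape[:2]
    for _ in range(num_holes):
        # Pick the top-left corner so the hole fits entirely inside the image.
        y = int(rng.integers(0, height - hole_size + 1))
        x = int(rng.integers(0, width - hole_size + 1))
        out[y:y + hole_size, x:x + hole_size] = fill_value
    return out

# Usage: mask an all-white 64x64 RGB image.
image = np.full((64, 64, 3), 255, dtype=np.uint8)
masked = coarse_dropout_sketch(image, num_holes=4, hole_size=8, seed=0)
```

Holes may overlap, so the number of masked pixels can be smaller than num_holes times the hole area.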
Bounding Box and Segmentation Augmentation
For object detection, transforms must also adjust bounding boxes:
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomResizedCrop(height=512, width=512, scale=(0.5, 1.0)),
], bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]))
# bboxes: list of [x_min, y_min, x_max, y_max] in pixels (pascal_voc format)
# labels: one class label per box, passed under the name listed in label_fields
augmented = transform(image=image, bboxes=bboxes, class_labels=labels)
aug_image = augmented["image"]
aug_bboxes = augmented["bboxes"]
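To see why boxes have to be transformed together with the image, consider what a horizontal flip does to a single pascal_voc box. The sketch below shows only the geometry; `hflip_bbox_pascal_voc` is a hypothetical helper, not a function from Albumentations.

```python
def hflip_bbox_pascal_voc(bbox, image_width):
    """Mirror a (x_min, y_min, x_max, y_max) pixel box across the vertical axis."""
    x_min, y_min, x_max, y_max = bbox
    # The old right edge becomes the new left edge, and vice versa.
    return (image_width - x_max, y_min, image_width - x_min, y_max)

box = (10, 20, 50, 80)
flipped = hflip_bbox_pascal_voc(box, image_width=100)
print(flipped)  # (50, 20, 90, 80)

# Flipping twice recovers the original box.
restored = hflip_bbox_pascal_voc(flipped, image_width=100)
```

If the image were flipped without this adjustment, every box would point at the mirror image of its object.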
CutMix and MixUp
MixUp
Blends two images and their labels linearly:
import numpy as np
import torch
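def mixup(images, labels, alpha=0.2):
    # Sample the mixing coefficient; cast to a plain float before tensor math
    lam = float(np.random.beta(alpha, alpha))
    batch_size = images.size(0)
    # Pair each sample with a randomly chosen partner from the same batch
    index = torch.randperm(batch_size, device=images.device)
    mixed_images = lam * images + (1 - lam) * images[index]
    label_a, label_b = labels, labels[index]
    return mixed_images, label_a, label_b, lam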
# In training loop:
# mixed, y_a, y_b, lam = mixup(images, labels)
# loss = lam * criterion(output, y_a) + (1 - lam) * criterion(output, y_b)
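MixUp was introduced by Zhang et al. (2018). Reported gains on CIFAR-10 and ImageNet are typically in the range of 0.5-1.5 percentage points of accuracy, though the benefit depends on the architecture, the training schedule, and the choice of alpha.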
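The two-term loss in the training-loop comment is equivalent to a single cross-entropy against a blended (soft) label, which is what "blending the labels linearly" means in practice. The pure-Python check below uses made-up probabilities and a hypothetical `cross_entropy` helper, so treat it as an illustration, not training code.

```python
import math

def cross_entropy(probs, target_dist):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs) if t > 0)

probs = [0.7, 0.2, 0.1]   # hypothetical model output for one mixed image
y_a = [1.0, 0.0, 0.0]     # one-hot label of the original image
y_b = [0.0, 1.0, 0.0]     # one-hot label of its shuffled partner
lam = 0.3

# Form used in the training loop: weight the two hard-label losses.
two_term_loss = lam * cross_entropy(probs, y_a) + (1 - lam) * cross_entropy(probs, y_b)

# Equivalent form: one loss against the linearly blended label.
soft_target = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
soft_label_loss = cross_entropy(probs, soft_target)
```

The two values agree because cross-entropy is linear in the target distribution; the two-term form is convenient because it works with integer class labels.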
CutMix
Replaces a rectangular region of one image with a patch from another:
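def cutmix(images, labels, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    batch_size = images.size(0)
    index = torch.randperm(batch_size, device=images.device)
    _, _, H, W = images.shape
    # Side lengths are chosen so the patch covers a (1 - lam) fraction of the image
    cut_ratio = np.sqrt(1 - lam)
    cut_h = int(H * cut_ratio)
    cut_w = int(W * cut_ratio)
    # Random patch center; the box is clipped to the image borders
    cx = np.random.randint(W)
    cy = np.random.randint(H)
    x1 = np.clip(cx - cut_w // 2, 0, W)
    y1 = np.clip(cy - cut_h // 2, 0, H)
    x2 = np.clip(cx + cut_w // 2, 0, W)
    y2 = np.clip(cy + cut_h // 2, 0, H)
    # Work on a copy so the caller's batch is not modified in place
    images = images.clone()
    images[:, :, y1:y2, x1:x2] = images[index, :, y1:y2, x1:x2]
    # Clipping can shrink the patch, so recompute lam from the area actually pasted
    lam = float(1 - ((x2 - x1) * (y2 - y1) / (H * W)))
    return images, labels, labels[index], lam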
CutMix forces the model to make predictions from partial information, improving localization and robustness.
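The final lam adjustment in cutmix is easy to overlook. The sketch below isolates the box arithmetic in pure Python (the helper name `cutmix_box` is made up here) and shows that a patch centered near a border gets clipped, so the label weight has to follow the area that was actually pasted.

```python
import math

def cutmix_box(height, width, lam, cx, cy):
    """Return the clipped patch box and the lam implied by its real area."""
    cut_ratio = math.sqrt(1.0 - lam)
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    x1 = min(max(cx - cut_w // 2, 0), width)
    y1 = min(max(cy - cut_h // 2, 0), height)
    x2 = min(max(cx + cut_w // 2, 0), width)
    y2 = min(max(cy + cut_h // 2, 0), height)
    pasted_area = (x2 - x1) * (y2 - y1)
    return (x1, y1, x2, y2), 1.0 - pasted_area / (height * width)

# Patch centered in the image: nothing is clipped, lam is unchanged.
center_box, center_lam = cutmix_box(100, 100, lam=0.75, cx=50, cy=50)
print(center_box, center_lam)  # (25, 25, 75, 75) 0.75

# Patch centered on the top-left corner: three quarters of it fall outside.
corner_box, corner_lam = cutmix_box(100, 100, lam=0.75, cx=0, cy=0)
print(corner_box, corner_lam)  # (0, 0, 25, 25) 0.9375
```

Without the correction, the corner case would weight the second label at 0.25 even though only about 6% of the pixels come from the second image.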
AutoAugment and RandAugment
AutoAugment
Google’s AutoAugment uses reinforcement learning to search for a high-performing augmentation policy on a given dataset. The policies learned for ImageNet, CIFAR-10, and SVHN are available in torchvision as presets:
from torchvision import transforms
auto_augment = transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET)
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    auto_augment,
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
RandAugment
A simpler alternative that randomly applies N transforms at magnitude M, requiring only two hyperparameters:
rand_augment = transforms.RandAugment(num_ops=2, magnitude=9)
RandAugment achieves comparable results to AutoAugment without the expensive search phase.
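The idea behind the two hyperparameters can be sketched in a few lines. This is a toy version, not torchvision's implementation: the three operations, the `max_magnitude=30` scale, and the linear mapping from magnitude to strength are all assumptions made for illustration, and images are taken to be float arrays in [0, 1].

```python
import random
import numpy as np

def adjust_brightness(img, strength):
    return np.clip(img * (1.0 + strength), 0.0, 1.0)

def adjust_contrast(img, strength):
    mean = img.mean()
    return np.clip((img - mean) * (1.0 + strength) + mean, 0.0, 1.0)

def translate_x(img, strength):
    # Wrap-around shift; a real implementation would pad instead.
    shift = int(round(strength * img.shape[1]))
    return np.roll(img, shift, axis=1)

OPS = [adjust_brightness, adjust_contrast, translate_x]

def rand_augment_sketch(img, num_ops=2, magnitude=9, max_magnitude=30, rng=None):
    """Apply `num_ops` randomly chosen ops, all at the same shared magnitude."""
    rng = rng or random.Random()
    strength = magnitude / max_magnitude  # one global knob for every op
    applied = []
    for op in rng.choices(OPS, k=num_ops):  # sampled with replacement
        img = op(img, strength)
        applied.append(op.__name__)
    return img, applied

img = np.linspace(0.0, 1.0, 16).reshape(4, 4)
out, applied = rand_augment_sketch(img, num_ops=2, magnitude=9, rng=random.Random(0))
```

Because every operation shares the same magnitude, tuning reduces to a small grid over num_ops and magnitude instead of a policy search.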
Text Augmentation with nlpaug
import nlpaug.augmenter.word as naw
# Synonym replacement via WordNet
syn_aug = naw.SynonymAug(aug_src="wordnet", aug_p=0.1)
augmented = syn_aug.augment("The quick brown fox jumps over the lazy dog")
# Contextual word insertion via BERT
bert_aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="insert", aug_p=0.1)
augmented = bert_aug.augment("Machine learning models need diverse training data")
# Back-translation (English -> German -> English)
back_trans = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)
augmented = back_trans.augment("Data augmentation improves generalization")
For production NLP pipelines, combine multiple augmenters:
import nlpaug.flow as naf
flow = naf.Sometimes([
    naw.SynonymAug(aug_src="wordnet", aug_p=0.1),
    naw.RandomWordAug(action="swap", aug_p=0.1),
    naw.RandomWordAug(action="delete", aug_p=0.05),
], aug_p=0.5)  # aug_p=0.5: each augmenter in the flow runs with probability 0.5
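For intuition about what the word-level steps do, here is a dependency-free sketch of random swap, random delete, and a probabilistic flow. The function names are hypothetical and the logic is deliberately simplified; in particular, the gating rule (each step fires independently with probability p) is an assumption of this sketch, so check the nlpaug documentation for the library's exact behavior.

```python
import random

def random_swap(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def random_delete(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]

def sometimes(tokens, augmenters, p, rng):
    """Run each augmenter independently with probability p (sketch assumption)."""
    for augment in augmenters:
        if rng.random() < p:
            tokens = augment(tokens)
    return tokens

rng = random.Random(0)
tokens = "the quick brown fox jumps".split()
swapped = random_swap(tokens, rng)
deleted = random_delete(tokens, p=0.3, rng=rng)
result = sometimes(
    tokens,
    [lambda t: random_swap(t, rng), lambda t: random_delete(t, 0.3, rng)],
    p=0.5,
    rng=rng,
)
```

Swaps keep the vocabulary intact while perturbing word order; deletions shorten the sentence, which is why the delete probability is usually kept small.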
Tabular Augmentation with SMOTE
For imbalanced classification on structured data:
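import numpy as np
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.model_selection import train_test_split
# X, y: feature matrix and class labels; stratify=y keeps the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)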
# SMOTE creates synthetic minority samples by interpolating between neighbors
smote = SMOTE(sampling_strategy="auto", k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"Before: {dict(zip(*np.unique(y_train, return_counts=True)))}")
print(f"After: {dict(zip(*np.unique(y_resampled, return_counts=True)))}")
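The interpolation step mentioned in the comment above can be written out directly. The sketch below generates a single synthetic point; `smote_sample_sketch` is a hypothetical helper, it uses a brute-force neighbor search, and it assumes the minority set has more than k distinct rows. It is not how imbalanced-learn is implemented internally.

```python
import numpy as np

def smote_sample_sketch(X_minority, k=5, seed=None):
    """Create one synthetic sample on the segment between a point and a neighbor."""
    rng = np.random.default_rng(seed)
    i = int(rng.integers(len(X_minority)))
    x = X_minority[i]
    # Brute-force nearest neighbors within the minority class.
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbor_ids = np.argsort(dists)[1:k + 1]  # position 0 is x itself
    neighbor = X_minority[rng.choice(neighbor_ids)]
    gap = rng.random()  # uniform in [0, 1)
    return x + gap * (neighbor - x)

# Four minority points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_point = smote_sample_sketch(X_min, k=2, seed=0)
```

Every synthetic point lies on a line segment between two real minority samples, which explains both the appeal of the method and its main risk: if the minority class is not locally convex, the new points can land in regions where the class never occurs.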
SMOTE Variants
- BorderlineSMOTE: Only generates samples near the decision boundary, which is where the model needs the most help.
- ADASYN: Generates more samples for minority instances that are harder to classify.
- SMOTENC: Handles datasets with both numerical and categorical features.
from imblearn.over_sampling import BorderlineSMOTE
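# A float sampling_strategy is the minority/majority ratio after resampling; it is only valid for binary targets
bsmote = BorderlineSMOTE(sampling_strategy=0.5, random_state=42)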
X_res, y_res = bsmote.fit_resample(X_train, y_train)
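Critical warning: Always apply SMOTE after splitting, and only to the training split. Synthetic samples are interpolated from real ones, so resampling before the split lets points derived from test rows end up in the training set and inflates the test score. The same rule applies inside cross-validation: resample within each training fold, never on the full dataset.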
Augmentation in PyTorch DataLoaders
Integrate augmentation into the training loop:
from torch.utils.data import Dataset, DataLoader
class AugmentedDataset(Dataset):
    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        label = self.labels[idx]
        if self.transform:
            augmented = self.transform(image=image)
            image = augmented["image"]
        return image, label
train_dataset = AugmentedDataset(train_images, train_labels, transform=train_transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
Each call to __getitem__ applies random augmentation, so every epoch sees different variations.
Measuring Augmentation Effectiveness
Track these indicators to know if augmentation is helping:
- Train-validation gap: Augmentation should shrink the gap (reduce overfitting).
- Validation accuracy trend: Should improve or plateau, not decrease.
- Per-class metrics: Check that augmentation helps weak classes without hurting strong ones.
# Compare training curves with and without augmentation
# If the gap between train and val loss shrinks, augmentation is working
# If val loss increases, augmentation may be too aggressive
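The comparison described in the comments above can be reduced to one number per run. The helper below is a small sketch; the function name and the per-epoch losses are invented purely for illustration and are not results from any experiment.

```python
def generalization_gap(train_losses, val_losses, last_n=3):
    """Mean (val - train) loss over the final `last_n` epochs."""
    gaps = [v - t for t, v in zip(train_losses, val_losses)][-last_n:]
    return sum(gaps) / len(gaps)

# Hypothetical loss histories, made up for this example.
baseline = {"train": [0.90, 0.50, 0.25, 0.10], "val": [0.95, 0.70, 0.65, 0.70]}
augmented = {"train": [1.00, 0.70, 0.50, 0.40], "val": [1.00, 0.75, 0.60, 0.55]}

gap_base = generalization_gap(baseline["train"], baseline["val"])
gap_aug = generalization_gap(augmented["train"], augmented["val"])
print(f"baseline gap: {gap_base:.2f}, augmented gap: {gap_aug:.2f}")
```

A smaller gap is only good news if validation loss itself has not gotten worse, so read the gap together with the final validation metric.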
Tradeoffs
| Method | Domain | Pros | Cons |
|---|---|---|---|
| Geometric transforms | Images | Simple, fast, well-understood | Limited diversity |
| CutMix / MixUp | Images | Strong regularization, easy to implement | Blends can be unnatural |
| AutoAugment | Images | Learned policy, strong reported results | Expensive search, domain-specific |
| Synonym / back-translation | Text | Preserves meaning | Can introduce errors |
| SMOTE | Tabular | Addresses class imbalance | Can create unrealistic samples |
| GANs / VAEs | Any | High diversity | Hard to train, mode collapse risk |
One thing to remember: The best augmentation strategy is domain-specific and empirically validated — start simple, measure the impact on validation metrics, and escalate complexity only when the data demands it.