PyTorch Custom Datasets — Core Concepts

Learn to build PyTorch Dataset and DataLoader pipelines that handle real-world data formats efficiently.

Why Custom Datasets Exist

PyTorch ships with built-in datasets like MNIST and CIFAR-10 for learning and benchmarking. But real projects use proprietary data: medical images with DICOM metadata, multi-language text corpora, time-series sensor logs, or video frames with bounding box annotations. No framework can anticipate every format. Instead, PyTorch gives you an interface — torch.utils.data.Dataset — and lets you fill in the specifics.

The Dataset Contract

Every custom dataset implements two methods:

__len__ — returns the total number of samples
__getitem__ — takes an index and returns one sample (input and label)

That’s it. This simplicity is intentional. It means PyTorch’s DataLoader can handle batching, shuffling, and parallel loading without knowing anything about your data format.
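As a minimal sketch of that contract, the toy class below implements both methods. The name SquaresDataset and its data are hypothetical, chosen only so the example runs without any files on disk.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Toy map-style dataset (hypothetical): sample i is the pair (i, i**2)."""

    def __init__(self, n: int):
        self.n = n

    def __len__(self) -> int:
        # Total number of samples.
        return self.n

    def __getitem__(self, idx: int):
        # Raise IndexError for out-of-range indices so plain iteration stops cleanly.
        if not 0 <= idx < self.n:
            raise IndexError(idx)
        x = torch.tensor([float(idx)])       # input
        y = torch.tensor([float(idx ** 2)])  # label
        return x, y


dataset = SquaresDataset(10)
x, y = dataset[3]
```

Because the class only promises a length and indexed access, the same object can be handed to a DataLoader unchanged.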

Map-Style vs Iterable-Style

PyTorch supports two dataset patterns:

Type	Base Class	Use When
Map-style	`Dataset`	Data fits on disk with random access (images, CSVs, databases)
Iterable-style	`IterableDataset`	Data is streamed (real-time feeds, huge files that don’t support seeking)

Map-style datasets are far more common. They support shuffling natively because any item can be accessed by index. Iterable datasets require special handling for shuffling and multi-worker loading.
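For contrast, here is a rough sketch of an iterable-style dataset. CountStream is a hypothetical stand-in for a real stream. The striding over worker IDs is one common way to avoid every worker yielding the same items, not the only one.

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class CountStream(IterableDataset):
    """Hypothetical stream that yields the integers in [start, end)."""

    def __init__(self, start: int, end: int):
        self.start = start
        self.end = end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield the whole range.
            return iter(range(self.start, self.end))
        # Multi-worker loading: each worker takes every num_workers-th item,
        # so the workers do not duplicate each other's data.
        return iter(range(self.start + info.id, self.end, info.num_workers))


stream = CountStream(0, 6)
batches = [b.tolist() for b in DataLoader(stream, batch_size=2)]
```

There is no __len__ or __getitem__ here, which is why shuffling has to be done inside the stream itself (for example with a shuffle buffer) rather than by the DataLoader.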

How DataLoader Connects

The DataLoader wraps your dataset and handles:

Batching: Groups individual samples into tensors of shape (batch_size, …)
Shuffling: Randomizes order each epoch for better generalization
Parallel loading: Spawns multiple worker processes to load data while the GPU trains
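Collation: Combines the samples of a batch using a collate function; the default stacks same-shaped tensors, and a custom collate_fn is needed for samples of varying sizes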

A typical setup looks like:

Dataset (your code) → DataLoader (PyTorch) → Training Loop

The DataLoader calls your __getitem__ repeatedly across workers, collates results, and yields batches.
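A small sketch of that flow, using the built-in TensorDataset with made-up data as a placeholder for your own dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten samples with two features each, plus one integer label per sample.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# shuffle=True draws a new random order every time the loader is iterated.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for xb, yb in loader:  # stand-in for the training loop
    batch_shapes.append((tuple(xb.shape), tuple(yb.shape)))
```

With 10 samples and batch_size=4, the last batch holds only 2 samples; passing drop_last=True would discard it instead.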

Transforms: The Preprocessing Pipeline

Raw data rarely matches what a model expects. Transforms bridge that gap:

Resize images to a fixed dimension
Normalize pixel values to zero mean and unit variance
Convert text to token IDs
Augment data with random crops, flips, or noise

Transforms are typically applied inside __getitem__, making each sample self-contained. For image tasks, torchvision.transforms provides composable operations. For text, you’d use a tokenizer from Hugging Face or a custom function.
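One way this can look is sketched below. To stay free of extra dependencies it uses a plain Python function as the transform rather than torchvision.transforms. The names TransformedDataset and normalize are hypothetical.

```python
import torch
from torch.utils.data import Dataset


def normalize(x: torch.Tensor) -> torch.Tensor:
    # Shift to zero mean and scale to unit variance (per sample).
    return (x - x.mean()) / (x.std() + 1e-8)


class TransformedDataset(Dataset):
    """Hypothetical dataset that applies an optional transform per sample."""

    def __init__(self, data: torch.Tensor, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> torch.Tensor:
        sample = self.data[idx]
        if self.transform is not None:
            # The transform runs every time the sample is requested.
            sample = self.transform(sample)
        return sample


raw = torch.tensor([[0.0, 10.0, 20.0], [5.0, 5.0, 35.0]])
ds = TransformedDataset(raw, transform=normalize)
```

Accepting the transform as a constructor argument keeps the dataset reusable: training and validation can share the class while passing different transforms.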

Common Misconception

People often load the entire dataset into memory inside __init__. For small datasets this works, but once the data is larger than available RAM, it causes out-of-memory errors. The more scalable pattern is to store file paths or metadata in __init__ and load individual samples lazily in __getitem__. This way, memory holds only the batches currently being prepared or consumed (a few per worker when prefetching is active), not the whole dataset.
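The lazy pattern might be sketched as follows. The class name LazyTensorFiles and the one-tensor-per-.pt-file layout are assumptions made for illustration; a real dataset would substitute its own file format and decoding step.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import Dataset


class LazyTensorFiles(Dataset):
    """Hypothetical dataset: one tensor per .pt file, read only on access."""

    def __init__(self, root):
        # Only the file paths are kept in memory.
        self.paths = sorted(Path(root).glob("*.pt"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        # The file is read here, when the sample is actually needed.
        return torch.load(self.paths[idx])


# Write three small sample files so the sketch is runnable.
tmp_dir = tempfile.mkdtemp()
for i in range(3):
    torch.save(torch.full((2,), float(i)), Path(tmp_dir) / f"sample_{i}.pt")

lazy_ds = LazyTensorFiles(tmp_dir)
```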

Real-World Patterns

Multi-modal data: A single __getitem__ can return an image tensor, a text embedding, and a numerical label. The collate function handles combining them into aligned batches.
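A hedged sketch of such a collate function follows. The dataset ToyMultiModal, the dictionary keys, and the use of variable-length token IDs in place of text embeddings are all assumptions made for the example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToyMultiModal(Dataset):
    """Hypothetical dataset returning an image, token IDs, and a label."""

    def __len__(self) -> int:
        return 4

    def __getitem__(self, idx: int):
        image = torch.zeros(3, 8, 8)       # fixed-size image tensor
        tokens = torch.arange(1, idx + 2)  # idx + 1 token IDs, so lengths vary
        label = idx % 2
        return {"image": image, "tokens": tokens, "label": label}


def collate(samples):
    # Images share a shape, so they can be stacked directly.
    images = torch.stack([s["image"] for s in samples])
    # Token sequences differ in length, so pad them with 0 to the longest one.
    tokens = pad_sequence(
        [s["tokens"] for s in samples], batch_first=True, padding_value=0
    )
    labels = torch.tensor([s["label"] for s in samples])
    return {"image": images, "tokens": tokens, "label": labels}


loader = DataLoader(ToyMultiModal(), batch_size=4, collate_fn=collate)
batch = next(iter(loader))
```

The default collate function would fail here because the token tensors have different lengths; the custom one decides how each modality is aligned.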

On-the-fly augmentation: Random transforms in __getitem__ mean each epoch sees slightly different data, which improves generalization. This is why augmentation belongs in the dataset, not in a preprocessing script.

Caching: For expensive preprocessing (like tokenization), cache results to disk on first access. Libraries like Hugging Face’s datasets use memory-mapped Apache Arrow files for this, so cached results are read lazily from disk at close to in-memory speed.
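A simple hand-rolled version of disk caching could look like the sketch below. It is not how Hugging Face’s library works internally. CachedDataset, its per-index file naming, and the character-code "tokenizer" are hypothetical placeholders.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import Dataset


class CachedDataset(Dataset):
    """Hypothetical dataset that caches preprocessed samples on disk."""

    def __init__(self, texts, cache_dir):
        self.texts = texts
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.compute_calls = 0  # counts how often the slow path runs

    def _expensive(self, text: str) -> torch.Tensor:
        # Placeholder for costly preprocessing such as tokenization.
        self.compute_calls += 1
        return torch.tensor([ord(c) for c in text])

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, idx: int) -> torch.Tensor:
        path = self.cache_dir / f"{idx}.pt"
        if path.exists():
            # Cache hit: skip the preprocessing entirely.
            return torch.load(path)
        result = self._expensive(self.texts[idx])
        torch.save(result, path)
        return result


cached_ds = CachedDataset(["ab", "cde"], tempfile.mkdtemp())
first = cached_ds[0]   # computed and written to disk
second = cached_ds[0]  # read back from the cache
```

With several DataLoader workers, two processes could try to write the same cache file at once, so a production version would likely write to a temporary file and rename it, or build the cache in a separate pass.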

Performance Tips

Set num_workers > 0 in DataLoader (typically 4–8 on modern hardware)
Use pin_memory=True when training on GPU — it speeds up CPU-to-GPU transfer
Profile your data loading by timing a full pass over the DataLoader on its own (or with torch.profiler) to ensure the GPU isn’t waiting for data
Pre-compute expensive transforms and save results if augmentation isn’t needed
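One rough way to check whether loading is the bottleneck is to time a pass over the DataLoader with no model attached, as sketched here. The helper time_loader and the random data are hypothetical, and num_workers is left at 0 so the snippet runs anywhere; in practice you would compare several values.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Random placeholder data: 256 samples with 16 features each.
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=0,                         # try e.g. 2, 4, 8 and compare timings
    pin_memory=torch.cuda.is_available(),  # only useful when a GPU is present
)


def time_loader(loader):
    """Iterate once over the loader; return (batch count, elapsed seconds)."""
    start = time.perf_counter()
    num_batches = 0
    for _ in loader:
        num_batches += 1
    return num_batches, time.perf_counter() - start


num_batches, seconds = time_loader(loader)
print(f"{num_batches} batches in {seconds:.4f}s")
```

If this pass takes about as long as a full training epoch, the model is probably waiting on data, and more workers or cached preprocessing are worth trying.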

The one thing to remember: A custom dataset is just two methods — length and get-item — but getting them right determines whether your training pipeline is a bottleneck or a smooth conveyor belt.
