PyTorch Custom Datasets — Core Concepts

Why Custom Datasets Exist

PyTorch ships with built-in datasets like MNIST and CIFAR-10 for learning and benchmarking. But real projects use proprietary data: medical images with DICOM metadata, multi-language text corpora, time-series sensor logs, or video frames with bounding box annotations. No framework can anticipate every format. Instead, PyTorch gives you an interface — torch.utils.data.Dataset — and lets you fill in the specifics.

The Dataset Contract

Every custom dataset implements two methods:

  • __len__ — returns the total number of samples
  • __getitem__ — takes an index and returns one sample (input and label)

That’s it. This simplicity is intentional. It means PyTorch’s DataLoader can handle batching, shuffling, and parallel loading without knowing anything about your data format.
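As a minimal sketch of that contract, the toy dataset below maps each number to its square. The class name SquaresDataset and its contents are hypothetical, chosen only to show the two methods in the smallest possible form.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: the input is x, the label is x squared."""

    def __init__(self, n_samples: int):
        self.inputs = torch.arange(n_samples, dtype=torch.float32)

    def __len__(self) -> int:
        # Total number of samples
        return len(self.inputs)

    def __getitem__(self, index: int):
        # One (input, label) pair for the requested index
        x = self.inputs[index]
        return x, x ** 2


squares = SquaresDataset(10)
x, y = squares[3]  # indexing calls __getitem__
```

Because the class exposes only a length and an indexed lookup, it can be handed to a DataLoader unchanged.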

Map-Style vs Iterable-Style

PyTorch supports two dataset patterns:

Type            Base Class       Use When
Map-style       Dataset          Data fits on disk with random access (images, CSVs, databases)
Iterable-style  IterableDataset  Data is streamed (real-time feeds, huge files that don’t support seeking)

Map-style datasets are far more common. They support shuffling natively because any item can be accessed by index. Iterable datasets require special handling for shuffling and multi-worker loading.
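The sketch below shows one common way to handle the multi-worker case for an iterable-style dataset: each worker reads its own slice of the stream so that no item is yielded twice. RangeStream is a hypothetical stand-in for a real stream, and the striding scheme is one option among several.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class RangeStream(IterableDataset):
    """Hypothetical stream that yields the integers in [start, end)."""

    def __init__(self, start: int, end: int):
        self.start = start
        self.end = end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this process yields everything
            worker_id, num_workers = 0, 1
        else:
            worker_id, num_workers = info.id, info.num_workers
        # Each worker takes every num_workers-th item, starting at its own id
        for i in range(self.start + worker_id, self.end, num_workers):
            yield torch.tensor(i)


# num_workers defaults to 0, so the whole stream is read in this process
stream_loader = DataLoader(RangeStream(0, 10), batch_size=4)
stream_batches = [batch.tolist() for batch in stream_loader]
```

Passing shuffle=True to a DataLoader that wraps an IterableDataset raises an error, so any shuffling (for example, a shuffle buffer) has to live inside __iter__.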

How DataLoader Connects

The DataLoader wraps your dataset and handles:

  • Batching: Groups individual samples into tensors of shape (batch_size, …)
  • Shuffling: Randomizes order each epoch for better generalization
  • Parallel loading: Spawns multiple worker processes to load data while the GPU trains
  • Collation: Combines samples into a batch using a collate function; the default stacks same-shaped tensors, while samples of varying sizes need a custom collate_fn

A typical setup looks like:

Dataset (your code) → DataLoader (PyTorch) → Training Loop

The DataLoader calls your __getitem__ repeatedly across workers, collates results, and yields batches.
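A rough sketch of that flow, using TensorDataset with random data as a placeholder for a custom dataset. The sizes and the empty loop body are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 100 samples with 8 features each, plus binary labels
features = torch.randn(100, 8)
labels = torch.randint(0, 2, (100,))
toy_dataset = TensorDataset(features, labels)

# num_workers=0 loads in the main process; raise it for parallel loading
loader = DataLoader(toy_dataset, batch_size=32, shuffle=True, num_workers=0)

batch_shapes = []
for epoch in range(2):
    for batch_features, batch_labels in loader:
        # A real loop would run the forward pass, loss, and backward pass here
        if epoch == 0:
            batch_shapes.append(tuple(batch_features.shape))
```

With 100 samples and a batch size of 32, the last batch of each epoch holds only 4 samples; passing drop_last=True would discard it.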

Transforms: The Preprocessing Pipeline

Raw data rarely matches what a model expects. Transforms bridge that gap:

  • Resize images to a fixed dimension
  • Normalize pixel values to zero mean and unit variance
  • Convert text to token IDs
  • Augment data with random crops, flips, or noise

Transforms are typically applied inside __getitem__, making each sample self-contained. For image tasks, torchvision.transforms provides composable operations. For text, you’d use a tokenizer from Hugging Face or a custom function.
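The following sketch shows the transform-in-__getitem__ pattern with a hand-written callable instead of torchvision, so it runs with torch alone. Normalize and TransformedDataset are hypothetical names, and the mean and std values are placeholders that map the range 0 to 255 onto -1 to 1.

```python
import torch
from torch.utils.data import Dataset


class Normalize:
    """Hypothetical transform: shift and scale by a fixed mean and std."""

    def __init__(self, mean: float, std: float):
        self.mean = mean
        self.std = std

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean) / self.std


class TransformedDataset(Dataset):
    def __init__(self, data: torch.Tensor, transform=None):
        self.data = data
        self.transform = transform

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> torch.Tensor:
        sample = self.data[index]
        # The transform runs per sample, at access time
        if self.transform is not None:
            sample = self.transform(sample)
        return sample


raw = torch.tensor([[0.0, 255.0], [127.5, 127.5]])
normalized_ds = TransformedDataset(raw, transform=Normalize(mean=127.5, std=127.5))
first = normalized_ds[0]
```

Taking the transform as a constructor argument makes it easy to use different pipelines for training and validation with the same dataset class.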

Common Misconception

People often load the entire dataset into memory inside __init__. For small datasets this works, but once the data no longer fits in RAM, it causes out-of-memory errors. The correct pattern is to store file paths or metadata in __init__ and load individual samples lazily in __getitem__. This way, only the samples currently being loaded (a few batches' worth when workers prefetch) are in memory at any time, not the whole dataset.
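A minimal sketch of the lazy pattern, assuming each sample is stored as its own .pt file in one directory. The file layout and the class name LazyTensorFileDataset are hypothetical; the temporary directory exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import Dataset


class LazyTensorFileDataset(Dataset):
    def __init__(self, root):
        # Only the file paths are held in memory, not the data
        self.paths = sorted(Path(root).glob("*.pt"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, index: int) -> torch.Tensor:
        # The file is read only when this sample is requested
        return torch.load(self.paths[index])


# Write a few small sample files to a temporary directory for the demo
tmp_dir = tempfile.mkdtemp()
for i in range(5):
    torch.save(torch.full((3,), float(i)), Path(tmp_dir) / f"sample_{i}.pt")

lazy_ds = LazyTensorFileDataset(tmp_dir)
sample = lazy_ds[2]
```

The same shape applies to images or audio: replace torch.load with whatever reader matches the file format.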

Real-World Patterns

Multi-modal data: A single __getitem__ can return an image tensor, a text embedding, and a numerical label. The collate function handles combining them into aligned batches.
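Here is one possible sketch of a multi-modal sample and a matching collate function. It assumes the text arrives as variable-length token IDs rather than fixed-size embeddings, so the collate function pads them; the dataset, the field names, and the padding value 0 are all hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class ToyMultiModalDataset(Dataset):
    def __len__(self) -> int:
        return 4

    def __getitem__(self, index: int) -> dict:
        return {
            "image": torch.zeros(3, 8, 8),          # placeholder image
            "tokens": torch.arange(1, index + 2),   # index + 1 token IDs
            "label": index % 2,
        }


def collate_multimodal(samples: list) -> dict:
    # Images share a shape, so they can be stacked directly
    images = torch.stack([s["image"] for s in samples])
    # Token sequences differ in length, so pad them to the longest one
    tokens = pad_sequence(
        [s["tokens"] for s in samples], batch_first=True, padding_value=0
    )
    labels = torch.tensor([s["label"] for s in samples])
    return {"image": images, "tokens": tokens, "label": labels}


mm_loader = DataLoader(
    ToyMultiModalDataset(), batch_size=4, collate_fn=collate_multimodal
)
mm_batch = next(iter(mm_loader))
```

When every field already has a fixed shape, the default collate function handles dictionaries like this on its own and no custom collate_fn is needed.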

On-the-fly augmentation: Random transforms in __getitem__ mean each epoch sees slightly different data, which improves generalization. This is why augmentation belongs in the dataset, not in a preprocessing script.
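To illustrate, the sketch below applies a random horizontal flip at access time, so repeated reads of the same index can return different views. RandomFlipDataset and the flip probability are hypothetical; a real image pipeline would more likely use a library transform.

```python
import torch
from torch.utils.data import Dataset


class RandomFlipDataset(Dataset):
    def __init__(self, images: torch.Tensor, flip_prob: float = 0.5):
        self.images = images
        self.flip_prob = flip_prob

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, index: int) -> torch.Tensor:
        image = self.images[index]
        # The random draw happens on every access, not once at construction
        if torch.rand(1).item() < self.flip_prob:
            image = torch.flip(image, dims=[-1])  # flip along the last axis
        return image


base_images = torch.arange(12, dtype=torch.float32).reshape(3, 2, 2)
aug_ds = RandomFlipDataset(base_images)
views = [aug_ds[0] for _ in range(20)]  # a mix of flipped and unflipped views
```

A preprocessing script would have fixed one of these views permanently, which is the difference the paragraph above describes.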

Caching: For expensive preprocessing (like tokenization), cache results to disk on first access. Libraries like datasets from Hugging Face use memory-mapped files for this, combining lazy loading with near-instant access.
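A simple version of cache-on-first-access can be sketched with one file per sample. This is not how the Hugging Face datasets library implements its cache; the class, the character-code "tokenizer", and the file naming are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import torch
from torch.utils.data import Dataset


class CachedTextDataset(Dataset):
    def __init__(self, texts, cache_dir):
        self.texts = texts
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.compute_calls = 0  # counts cache misses in this process

    def _expensive_preprocess(self, text: str) -> torch.Tensor:
        # Stand-in for real tokenization: one integer per character
        self.compute_calls += 1
        return torch.tensor([ord(c) for c in text])

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, index: int) -> torch.Tensor:
        cache_path = self.cache_dir / f"{index}.pt"
        if cache_path.exists():
            # Cache hit: skip the expensive step
            return torch.load(cache_path)
        result = self._expensive_preprocess(self.texts[index])
        torch.save(result, cache_path)
        return result


cached_ds = CachedTextDataset(["hi", "there"], tempfile.mkdtemp())
first_read = cached_ds[0]   # computed and written to disk
second_read = cached_ds[0]  # read back from disk
```

With several workers, two processes could try to write the same cache file at once, so a production version would write to a temporary file and rename it, or fill the cache in a separate pass before training.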

Performance Tips

  • Set num_workers > 0 in DataLoader (typically 4–8 on modern hardware)
  • Use pin_memory=True when training on GPU — it speeds up CPU-to-GPU transfer
  • Profile your data loading by timing iteration over the DataLoader on its own (for example with time.perf_counter or torch.profiler) to ensure the GPU isn’t waiting for data
  • Pre-compute expensive transforms and save results if augmentation isn’t needed
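One rough way to check whether loading is the bottleneck is to iterate over the DataLoader without any model and measure the time per batch. The helper time_loader below is a hypothetical utility, not a PyTorch API, and the dataset is random placeholder data.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_loader(loader, max_batches: int = 50) -> float:
    """Return the average seconds per batch spent fetching from the loader."""
    start = time.perf_counter()
    n = 0
    for n, _ in enumerate(loader, start=1):
        if n >= max_batches:
            break
    elapsed = time.perf_counter() - start
    return elapsed / max(n, 1)


timing_ds = TensorDataset(torch.randn(256, 16), torch.zeros(256))
timing_loader = DataLoader(
    timing_ds,
    batch_size=32,
    num_workers=0,                         # try larger values and compare
    pin_memory=torch.cuda.is_available(),  # only useful when a GPU is present
)
seconds_per_batch = time_loader(timing_loader)
```

If this number is larger than the time one training step takes on the GPU, the loader is the limiting factor and more workers or cheaper transforms are worth trying.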

The one thing to remember: A custom dataset is just two methods — length and get-item — but getting them right determines whether your training pipeline is a bottleneck or a smooth conveyor belt.
