h5py for HDF5 Files — ELI5
Imagine you have a filing cabinet. Inside it are drawers, and inside each drawer are folders, and inside each folder are pages full of numbers. You can open any drawer, pull out any folder, and read any page — without dumping the entire cabinet onto the floor.
HDF5 is a file format that works like that digital filing cabinet. Instead of spreading your data across hundreds of separate files, you put everything into one big file with folders inside it. Each folder (called a “group”) can hold datasets (large arrays of numbers, often with more than two dimensions) plus labels (called “attributes”) that describe what the numbers mean.
h5py is the Python library that opens, reads, and writes these HDF5 files.
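To make the filing cabinet concrete, here is a minimal sketch of writing and then reading a small HDF5 file with h5py. The file name, the group name "weather", the dataset name "temperature", and the "units" label are made up for illustration, and the file is written to a temporary folder.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file, group, and dataset names, chosen only for this sketch.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "cabinet.h5")

# Write: one file (the cabinet), one group (a folder), one dataset (the pages).
with h5py.File(path, "w") as f:
    weather = f.create_group("weather")
    temps = weather.create_dataset("temperature", data=np.array([20.5, 21.0, 19.8]))
    temps.attrs["units"] = "celsius"  # a label (attribute) describing the numbers

# Read: open the cabinet and pull out one folder's contents.
with h5py.File(path, "r") as f:
    temps = f["weather/temperature"]  # path-style lookup, like folders on disk
    values = temps[:]                 # copy the numbers into a NumPy array
    units = temps.attrs["units"]

print(values, units)
```

The `with` blocks close the file automatically, which is why `values` is copied out with `[:]` before the block ends.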
Why not just use a CSV or a spreadsheet?
- Size. CSV files struggle with millions of rows. HDF5 handles billions of numbers in a single file — satellite images, physics experiments, genomic data.
- Speed. Reading one column from a CSV means scanning the whole file. HDF5 jumps straight to the piece you need, like flipping to a bookmark.
- Organization. A CSV is one flat table. An HDF5 file can hold dozens of related datasets, each with metadata, all organized in a tree of folders.
- Compression. HDF5 can compress data internally, shrinking files without losing precision.
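The compression point can be tried out with a small experiment. The sketch below writes the same table twice, once as-is and once with h5py's built-in gzip option, then compares file sizes. The file and dataset names are hypothetical, and the data is deliberately repetitive so the effect is easy to see; real data may shrink much less.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names; the data is deliberately repetitive
# so that compression has something to squeeze.
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")
packed_path = os.path.join(tmp_dir, "packed.h5")

data = np.tile(np.arange(1000, dtype=np.int32), (1000, 1))  # 1000 x 1000 table

with h5py.File(plain_path, "w") as f:
    f.create_dataset("readings", data=data)

with h5py.File(packed_path, "w") as f:
    # gzip is lossless: the numbers come back exactly as they went in.
    f.create_dataset("readings", data=data, compression="gzip")

with h5py.File(packed_path, "r") as f:
    restored = f["readings"][:]

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print(plain_size, packed_size, np.array_equal(data, restored))
```

Reading the compressed file needs no extra step: h5py decompresses behind the scenes when you ask for the data.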
Real example: NASA distributes data from many of its Earth-observing satellites as HDF5 files, and NetCDF-4, a format used widely in climate science, is built on top of HDF5. Scientists choose it because it keeps very large datasets organized and accessible in one place.
With h5py, reading a specific piece from a 10 GB file is just as easy as reading a tiny file — you never have to load the whole thing into memory.
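Here is a rough sketch of what "reading only the piece you need" looks like. The table is far smaller than 10 GB so the example runs quickly, and the file and dataset names are invented for illustration, but the slicing works the same way on a much larger file.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, used only for this sketch.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "big_table.h5")

# Build a 1000 x 1000 table holding the numbers 0, 1, 2, ... in row order.
with h5py.File(path, "w") as f:
    table_data = np.arange(1_000_000, dtype=np.int64).reshape(1000, 1000)
    f.create_dataset("table", data=table_data)

with h5py.File(path, "r") as f:
    table = f["table"]               # a handle to the data on disk
    shape = table.shape              # known without copying the numbers
    one_column = table[:, 3]         # copies just column 3 into memory
    small_block = table[10:12, 0:3]  # copies just a 2 x 3 corner

print(shape, one_column[:2], small_block)
```

Slicing an h5py dataset uses the same syntax as slicing a NumPy array, and only the selected values are returned as an array.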
The one thing to remember: h5py lets Python use HDF5 files as organized, high-speed filing cabinets for enormous datasets — reading only the piece you need without loading everything.