h5py for HDF5 Files — Core Concepts

Learn how h5py organizes, compresses, and provides random access to large scientific datasets stored in HDF5 format.

Why HDF5 matters

Scientific and engineering data often exceeds what simple formats can handle. A single climate simulation might produce 500 GB of output. A microscopy experiment generates 3D image stacks. A machine learning pipeline stores model weights, training logs, and evaluation metrics. These datasets share common needs: large size, internal organization, partial reads, and metadata.

HDF5 (Hierarchical Data Format version 5) originated at the National Center for Supercomputing Applications (NCSA) and is now maintained by The HDF Group. It was designed specifically for these requirements and is used at organizations such as NASA and CERN, as well as in genomics and finance. h5py gives Python direct access to HDF5’s capabilities through a clean, dict-like interface.

Key concepts

Groups — folders in the file

An HDF5 file contains groups organized in a tree, like directories in a filesystem. The root group / is created automatically. You create nested groups to organize related data: /experiment1/raw/, /experiment1/processed/, etc.
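As a minimal sketch of the group hierarchy described above, the snippet below creates nested groups and walks them with dictionary-style access. The file name and the use of a temporary directory are assumptions made for illustration only.

```python
import os
import tempfile

import h5py

# Hypothetical file location: a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "groups_demo.h5")

with h5py.File(path, "w") as f:
    # Missing intermediate groups ("experiment1") are created automatically.
    f.create_group("experiment1/raw")
    f.create_group("experiment1/processed")

    # Groups behave like dictionaries whose keys are member names.
    top_level = list(f.keys())
    children = sorted(f["experiment1"].keys())

    # Every object knows its absolute path inside the file.
    raw_name = f["experiment1/raw"].name
```

Path-style keys such as "experiment1/raw" and chained lookups such as f["experiment1"]["raw"] refer to the same group.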

Datasets — arrays of numbers

A dataset is an N-dimensional array stored inside a group. It behaves like a NumPy array but lives on disk. You can read slices without loading the full array — critical when the dataset is larger than available RAM.

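Every dataset has a data type (int32, float64, etc.) that is set at creation time and cannot be changed. The shape is also fixed by default; only chunked datasets created with a maximum shape (the maxshape argument in h5py) can be resized later.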
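The following sketch shows a partial read: opening a dataset returns a lightweight handle, and only the requested slice is read from disk. The dataset name "measurements", the array contents, and the file path are placeholders chosen for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location for illustration.
path = os.path.join(tempfile.mkdtemp(), "dataset_demo.h5")

# Write a 1000 x 50 array; shape and dtype are taken from the NumPy array.
data = np.arange(1000 * 50, dtype="float64").reshape(1000, 50)
with h5py.File(path, "w") as f:
    f.create_dataset("measurements", data=data)

with h5py.File(path, "r") as f:
    dset = f["measurements"]          # a handle; no array data is read yet
    shape, dtype = dset.shape, dset.dtype

    # Slicing reads only the requested region and returns a NumPy array.
    block = dset[10:20, :5]
```

The same slicing syntax works whether the dataset holds fifty thousand values or fifty billion; only the selected region has to fit in memory.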

Attributes — metadata

Both groups and datasets can carry attributes — small pieces of metadata. Store units, descriptions, timestamps, processing parameters, or anything else that helps interpret the data. Attributes are stored alongside the data, so the file is self-describing.
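A short sketch of attaching and reading attributes through the .attrs mapping follows. The attribute names ("units", "sensor_id", "description") and their values are invented for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location for illustration.
path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("temperature", data=np.array([20.5, 21.0, 19.8]))

    # Attributes on a dataset: small values that describe the data.
    dset.attrs["units"] = "degC"
    dset.attrs["sensor_id"] = 7

    # The file's root group can carry attributes too.
    f.attrs["description"] = "Hypothetical example file"

with h5py.File(path, "r") as f:
    units = f["temperature"].attrs["units"]
    sensor_id = int(f["temperature"].attrs["sensor_id"])
    attr_names = sorted(f["temperature"].attrs.keys())
    description = f.attrs["description"]
```

Attributes are meant for small metadata; anything large belongs in a dataset of its own.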

Chunking

By default, HDF5 stores a dataset as one contiguous block. That layout is efficient for full reads but slow for access patterns that cut across the storage order, and it cannot be compressed or resized. Chunking splits the dataset into smaller blocks (chunks) that can be read independently. This enables:

Reading a column without scanning every row
Compression per chunk (different chunks can compress at different ratios)
Resizing the dataset by appending new chunks
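The points above can be sketched with a chunked, resizable dataset. The chunk shape (50, 8), the dataset name, and the fill values are arbitrary choices for this example, not tuning recommendations.

```python
import os
import tempfile

import h5py

# Hypothetical file location for illustration.
path = os.path.join(tempfile.mkdtemp(), "chunks_demo.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset(
        "timeseries",
        shape=(100, 8),
        maxshape=(None, 8),   # None marks the first axis as unlimited
        chunks=(50, 8),       # each chunk holds 50 rows of 8 values
        dtype="float32",
    )
    dset[:] = 1.0

    # Grow the dataset along the unlimited axis, then fill the new rows.
    dset.resize((150, 8))
    dset[100:] = 2.0

    chunk_shape = dset.chunks
    new_shape = dset.shape

    # Read a single column; HDF5 fetches the chunks that contain it.
    column = dset[:, 3]
```

Chunk shape is a trade-off: a chunk is always read or written as a whole, so chunks shaped like the typical read pattern tend to perform best.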

Compression

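HDF5 supports transparent compression through per-chunk filters. h5py exposes gzip, LZF (a fast filter that ships with h5py itself), and SZIP (available only if the underlying HDF5 build includes it). Compressed datasets look the same to your code: h5py decompresses on the fly when you read a slice. The achievable ratio depends heavily on the data: smooth or sparse arrays often shrink several-fold, while noisy floating-point data may barely compress. LZF decompresses fast enough that I/O-bound reads can actually speed up.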
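The sketch below stores the same array uncompressed, with gzip, and with LZF, then compares on-disk sizes. The test array is deliberately sparse so that it compresses well; real data will behave differently. Dataset names, chunk shape, and gzip level 4 are assumptions for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location for illustration.
path = os.path.join(tempfile.mkdtemp(), "compression_demo.h5")

# Mostly zeros, with every tenth row filled: an easy case for compression.
data = np.zeros((1000, 1000), dtype="float64")
data[::10] = np.arange(1000)

with h5py.File(path, "w") as f:
    f.create_dataset("plain", data=data)
    f.create_dataset("gz", data=data, chunks=(100, 1000),
                     compression="gzip", compression_opts=4)
    f.create_dataset("lzf", data=data, chunks=(100, 1000),
                     compression="lzf")

# Reopen so that all chunks have been flushed before measuring sizes.
with h5py.File(path, "r") as f:
    plain_bytes = f["plain"].id.get_storage_size()
    gz_bytes = f["gz"].id.get_storage_size()
    lzf_bytes = f["lzf"].id.get_storage_size()
    gz_filter = f["gz"].compression

    # Reading is unchanged: decompression happens transparently.
    roundtrip = f["gz"][0:20, :5]
```

Note that files written with LZF may not be readable by HDF5 tools that lack that filter, so gzip is the safer choice when portability matters.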

A typical workflow

Create — Open a new HDF5 file and define groups and datasets.
Write — Store NumPy arrays as datasets, add attributes for metadata.
Read — Open the file, navigate to the desired dataset, read a slice.
Analyze — Process the slice in memory using NumPy, pandas, or other tools.
Update — Append new data to resizable datasets, update attributes.
Share — Send one self-contained file instead of a folder of CSVs.
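A compact sketch of the create, write, read, analyze, and update steps listed above. The group name "run1", the dataset name "signal", and the "n_batches" attribute are hypothetical; the point is the choice of file mode ("w", "r", "a") at each stage.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file location for illustration.
path = os.path.join(tempfile.mkdtemp(), "workflow_demo.h5")

# Create and write: mode "w" makes a new file (overwriting any existing one).
with h5py.File(path, "w") as f:
    run = f.create_group("run1")
    dset = run.create_dataset("signal", data=np.arange(10.0),
                              maxshape=(None,), chunks=True)
    dset.attrs["n_batches"] = 1

# Read and analyze: mode "r" is read-only; only the slice is loaded.
with h5py.File(path, "r") as f:
    first_half_mean = float(f["run1/signal"][:5].mean())

# Update: mode "a" opens the existing file for reading and writing.
with h5py.File(path, "a") as f:
    dset = f["run1/signal"]
    old_len = dset.shape[0]
    dset.resize((old_len + 5,))
    dset[old_len:] = np.arange(10.0, 15.0)
    dset.attrs["n_batches"] = 2

    final_len = dset.shape[0]
    n_batches = int(dset.attrs["n_batches"])
    total = float(dset[:].sum())
```

The resulting file at path is the single self-contained artifact that would be shared in the final step.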

HDF5 vs. other formats

Feature	CSV	Parquet	HDF5	NetCDF	Zarr
Hierarchical groups	No	No	Yes	Limited	Yes
N-dimensional arrays	No	No	Yes	Yes	Yes
Partial reads	No	Yes (columns)	Yes (any slice)	Yes	Yes
Compression	External	Built-in	Built-in	Built-in	Built-in
Self-describing	No	Schema only	Full metadata	Full metadata	Full metadata
Cloud-native	N/A	Good	Poor	Poor	Excellent

HDF5 excels at on-disk random access to large arrays with rich metadata. Its weakness is cloud storage — it was designed for POSIX filesystems, not object stores. Zarr is the cloud-native alternative with a similar data model.

Common misconception

People sometimes think HDF5 is a database. It is not — it has no query language, no transactions, and no concurrent write support (by default). It is an array storage format with filesystem-like organization. If you need SQL queries or multi-user writes, use a database. If you need fast, organized storage for large arrays, HDF5 is ideal.

The one thing to remember: h5py combines the simplicity of a Python dictionary interface with the power of HDF5’s hierarchical, compressed, random-access storage, making it the go-to tool for working with large scientific datasets in Python.
