Pandas I/O Optimization — Core Concepts

Why this matters

Data loading is often the slowest part of a Pandas workflow. A 2GB CSV file might take 60 seconds to load, while the same data in Parquet loads in 3 seconds. Multiply that by every notebook restart, every pipeline run, and every teammate — optimizing I/O pays for itself quickly.

File format comparison

CSV (text-based)

The universal exchange format. Every value is stored as human-readable text. Pandas must parse text into types, handle encoding, and deal with quoting rules. Slow to read and write, large on disk, but works everywhere.

Parquet (columnar binary)

Data is stored by column, compressed, with embedded type information. Pandas can read only the columns you request (column pruning), so the data for columns you don’t touch is never decompressed. Reads are typically 5-20x faster than CSV, and files are 3-10x smaller on disk.
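A minimal sketch of column pruning, assuming the optional PyArrow (or fastparquet) package is installed. The function names, file path, and column names are hypothetical and only illustrate the calls involved.

```python
import pandas as pd


def save_as_parquet(df: pd.DataFrame, path: str) -> None:
    """Write a DataFrame to a Parquet file (needs pyarrow or fastparquet)."""
    # index=False avoids storing the default row index as an extra column.
    df.to_parquet(path, index=False)


def load_columns(path: str, columns: list[str]) -> pd.DataFrame:
    """Read only the requested columns from a Parquet file."""
    # Columns not listed here are skipped rather than loaded and dropped.
    return pd.read_parquet(path, columns=columns)
```

For example, save_as_parquet(df, "sales.parquet") followed by load_columns("sales.parquet", ["region", "amount"]) would return a two-column DataFrame without loading the rest of the file's columns.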

Feather (columnar binary)

Designed for fast inter-process data transfer. Extremely fast read/write but less compressed than Parquet. Best for temporary files — saving a DataFrame for another script to pick up.
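A sketch of the hand-off pattern described above, assuming PyArrow is installed (Pandas relies on it for Feather). The function names and the path are hypothetical.

```python
import pandas as pd


def hand_off(df: pd.DataFrame, path: str) -> None:
    """Producer script: dump a DataFrame to a temporary Feather file."""
    # Some pandas versions reject a non-default index here; calling
    # df.reset_index() first is one way around that.
    df.to_feather(path)


def pick_up(path: str) -> pd.DataFrame:
    """Consumer script: load the DataFrame written by hand_off()."""
    return pd.read_feather(path)
```

One script would call hand_off(df, "stage1.feather") and the next would call pick_up("stage1.feather"), with column types preserved across the hand-off.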

HDF5

Hierarchical format that supports appending data and partial reads. More complex setup than Parquet but useful for write-heavy workflows where you add data incrementally.
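A sketch of incremental appends and partial reads, assuming the optional PyTables package (imported as tables) is installed, which Pandas needs for HDF5. The function names, path, and key are hypothetical.

```python
import pandas as pd


def append_batch(path: str, key: str, batch: pd.DataFrame) -> None:
    """Append a batch of rows to a dataset inside an HDF5 file."""
    # format="table" is the appendable layout; the default "fixed"
    # layout cannot be appended to.
    batch.to_hdf(path, key=key, format="table", append=True)


def read_rows(path: str, key: str, start: int, stop: int) -> pd.DataFrame:
    """Read rows [start, stop) without loading the whole dataset."""
    return pd.read_hdf(path, key=key, start=start, stop=stop)
```

Each call to append_batch("events.h5", "events", batch) adds rows to the same dataset, and read_rows("events.h5", "events", 0, 1000) pulls back only the first thousand.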

Quick wins for CSV

When you’re stuck with CSV files, these parameters make the biggest difference:

  • usecols: Read only the columns you need. Skipping unused columns saves both time and memory.
  • dtype: Specify column types upfront. Without this, Pandas has to infer each column’s type from the parsed values, which costs extra time and often settles on wider types than the data needs.
  • nrows: Read a subset for exploration. Don’t load 10 million rows when 1,000 will do for testing.
  • parse_dates: Tell Pandas which columns are dates during reading, not after.
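The four parameters can be combined in a single read_csv call. The sketch below uses a small in-memory CSV as a stand-in for a large file; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data standing in for a large CSV file on disk.
csv_text = (
    "order_id,region,amount,created_at,notes\n"
    "1,north,10.5,2024-01-01,first\n"
    "2,south,20.0,2024-01-02,second\n"
    "3,north,7.25,2024-01-03,third\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    # Skip the "notes" column entirely.
    usecols=["order_id", "region", "amount", "created_at"],
    # Declare types upfront instead of relying on inference.
    dtype={"order_id": "int32", "region": "category", "amount": "float32"},
    # Convert this column to datetimes while reading.
    parse_dates=["created_at"],
    # Read only the first two data rows.
    nrows=2,
)

print(df.dtypes)
print(len(df), "rows loaded")
```

With a real file, io.StringIO(csv_text) would be replaced by the file path.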

The dtype trick

Pandas defaults to int64 and float64 for numbers, each of which uses 8 bytes per value. If an integer column only contains values 0-100, int8 (1 byte, range -128 to 127) works fine, which is an 8x memory reduction for that column.

Common downcasts:

  • Integers 0-255 → uint8 (1 byte)
  • Integers within ±32,767 → int16 (2 bytes); within roughly ±2.1 billion → int32 (4 bytes)
  • Floats that don’t need full precision → float32
  • Low-cardinality strings → category
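The downcasts listed above can be checked with memory_usage. This is an illustrative sketch on a made-up DataFrame; the actual savings depend on the data.

```python
import pandas as pd

# Hypothetical data: 1,000 rows with small integers, floats, and few distinct strings.
df = pd.DataFrame({
    "age": [23, 45, 31, 67] * 250,
    "score": [0.5, 1.25, 3.0, 2.75] * 250,
    "country": ["US", "DE", "US", "FR"] * 250,
})

before = df.memory_usage(deep=True).sum()

small = df.assign(
    # Picks the smallest unsigned integer type that holds every value.
    age=pd.to_numeric(df["age"], downcast="unsigned"),
    # Halves the bytes per value at the cost of some precision.
    score=df["score"].astype("float32"),
    # Stores each distinct string once plus a small integer code per row.
    country=df["country"].astype("category"),
)

after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} bytes before, {after} bytes after")
```

Passing deep=True makes memory_usage count the actual string contents, which is where the category conversion saves the most.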

Common misconception

“Parquet is only for big data tools like Spark.” Pandas has excellent Parquet support via the PyArrow or fastparquet backends. Once one of them is installed, any DataFrame can be saved as Parquet with a single to_parquet() call. There’s no infrastructure requirement beyond that; it’s just a file format.

Decision guide

Situation                           Recommended format
Sharing with non-technical users    CSV
Intermediate pipeline storage       Parquet
Temporary between scripts           Feather
Appending data over time            HDF5 or Parquet partitions
Archival storage                    Parquet (compressed)
Data under 10MB                     CSV is fine

One thing to remember: The single biggest I/O optimization is switching from CSV to Parquet for any data you read more than once. It’s usually a one-line change (to_parquet / read_parquet) for a 5-20x read speedup.
