Pandas I/O Optimization — Core Concepts
Why this matters
Data loading is often the slowest part of a Pandas workflow. A 2GB CSV file might take 60 seconds to load, while the same data in Parquet loads in 3 seconds. Multiply that by every notebook restart, every pipeline run, and every teammate — optimizing I/O pays for itself quickly.
File format comparison
CSV (text-based)
The universal exchange format. Every value is stored as human-readable text. Pandas must parse text into types, handle encoding, and deal with quoting rules. Slow to read and write, large on disk, but works everywhere.
Parquet (columnar binary)
Data stored by column, compressed, with embedded type information. Pandas can read only the columns you ask for (column pruning, via the columns argument of read_parquet) and skips decompression for columns you don’t touch. Typically 5-20x faster than CSV for reads and 3-10x smaller on disk, depending on the data and compression codec.
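Column pruning can be sketched as follows. This is a minimal example, not a benchmark: the function name, file name, and column names are hypothetical, and it assumes the optional pyarrow (or fastparquet) package is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def parquet_column_pruning_demo() -> pd.DataFrame:
    """Write a small DataFrame to Parquet, then read back only two columns."""
    # Hypothetical sample data; the column names are illustrative only.
    df = pd.DataFrame(
        {
            "user_id": [1, 2, 3],
            "amount": [9.99, 24.50, 3.75],
            "notes": ["first order", "gift", "refund pending"],
        }
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "orders.parquet"
        # Needs pyarrow or fastparquet; pandas raises ImportError otherwise.
        df.to_parquet(path, index=False)
        # columns= asks the reader to load only the listed columns.
        subset = pd.read_parquet(path, columns=["user_id", "amount"])
    return subset
```

Calling parquet_column_pruning_demo() returns a DataFrame with only user_id and amount; the notes column is never loaded.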
Feather (columnar binary)
Designed for fast inter-process data transfer. Extremely fast read/write but less compressed than Parquet. Best for temporary files — saving a DataFrame for another script to pick up.
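A rough sketch of the hand-off pattern, assuming pyarrow is installed (Feather support in Pandas depends on it). The function name, file name, and columns are made up for illustration; in practice the write and the read would live in two different scripts.

```python
import tempfile
from pathlib import Path

import pandas as pd


def feather_handoff_demo() -> tuple[pd.DataFrame, pd.DataFrame]:
    """Save a DataFrame to Feather and load it back, as a second script would."""
    # Hypothetical intermediate result produced by a first script.
    original = pd.DataFrame({"step": [1, 2, 3], "loss": [0.9, 0.5, 0.25]})
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "handoff.feather"
        # Writer side: dump the DataFrame to a temporary file.
        original.to_feather(path)
        # Reader side: pick the file up again.
        restored = pd.read_feather(path)
    return original, restored
```

The numeric columns should come back with the same values and dtypes, which is the main appeal over passing CSV between scripts.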
HDF5
Hierarchical format that supports appending data and partial reads. More complex setup than Parquet but useful for write-heavy workflows where you add data incrementally.
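Incremental writes and partial reads might look like the sketch below. It assumes the optional PyTables package (imported as tables) is installed; the function name, key, and column names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def hdf5_append_demo() -> pd.DataFrame:
    """Append two batches to one HDF5 key, then read back a slice of rows."""
    # Hypothetical batches arriving at different times.
    batch_1 = pd.DataFrame({"sensor": [1, 2], "reading": [0.5, 0.7]})
    batch_2 = pd.DataFrame({"sensor": [3, 4], "reading": [0.9, 1.1]})
    with tempfile.TemporaryDirectory() as tmp:
        path = str(Path(tmp) / "readings.h5")
        # format="table" is the layout that supports appending and row slicing.
        batch_1.to_hdf(path, key="readings", format="table")
        batch_2.to_hdf(path, key="readings", format="table", append=True)
        # Partial read: stored rows 1 and 2 only (stop is exclusive).
        partial = pd.read_hdf(path, key="readings", start=1, stop=3)
    return partial
```

The slice spans the boundary between the two batches, so it contains the last row of the first batch and the first row of the second.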
Quick wins for CSV
When you’re stuck with CSV files, these parameters make the biggest difference:
- usecols: Read only the columns you need. Skipping unused columns saves both time and memory.
- dtype: Specify column types upfront. Without this, Pandas has to infer each column’s type from the data, which costs extra time and often picks wider types than you need.
- nrows: Read a subset for exploration. Don’t load 10 million rows when 1,000 will do for testing.
- parse_dates: Tell Pandas which columns are dates during reading, not after.
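The four parameters above can be combined in a single read_csv call. The sketch below uses a small in-memory CSV so it runs without any file on disk; the column names and values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
CSV_TEXT = """order_id,customer,amount,created_at,comment
1,alice,9.99,2024-01-05,first order
2,bob,24.50,2024-01-06,gift
3,alice,3.75,2024-01-07,refund pending
"""

df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    usecols=["order_id", "amount", "created_at"],  # skip unused columns
    dtype={"order_id": "int32", "amount": "float32"},  # no type inference needed
    parse_dates=["created_at"],  # parse dates while reading
    nrows=2,  # only the first two data rows
)

print(df.dtypes)
print(df)
```

With a real file, replace io.StringIO(CSV_TEXT) with the file path; the other arguments stay the same.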
The dtype trick
Pandas defaults to int64 and float64 for numbers, which use 8 bytes per value. If your integer column only contains values 0-100, int8 (1 byte, range -128 to 127) works fine: an 8x memory reduction for that column.
Common downcasts:
- Integers 0-255 → uint8 (1 byte)
- Small integers → int16 or int32
- Floats that don’t need full precision → float32
- Low-cardinality strings → category
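A minimal sketch of these downcasts on a made-up DataFrame, comparing memory before and after. The column names and values are hypothetical; on real data, check value ranges before narrowing a type.

```python
import pandas as pd

# Hypothetical data: 1,000 rows with default int64, float64, and string columns.
df = pd.DataFrame(
    {
        "age": [23, 45, 31, 67] * 250,  # all values fit in 0-255
        "score": [0.5, 1.25, 3.0, 2.75] * 250,  # float32 precision is enough here
        "country": ["DE", "FR", "DE", "US"] * 250,  # only three distinct values
    }
)
before = df.memory_usage(deep=True).sum()

small = df.assign(
    # downcast="unsigned" picks the smallest unsigned type that holds the values.
    age=pd.to_numeric(df["age"], downcast="unsigned"),
    score=df["score"].astype("float32"),
    country=df["country"].astype("category"),
)
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"bytes before: {before}, bytes after: {after}")
```

The same dtypes can also be passed to read_csv through its dtype argument, so the wide types are never allocated in the first place.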
Common misconception
“Parquet is only for big data tools like Spark.” Pandas has excellent Parquet support via the PyArrow or fastparquet backends. Almost any DataFrame can be saved as Parquet with a single to_parquet() call, as long as one of those packages is installed. There’s no infrastructure requirement; it’s just a file format.
Decision guide
| Situation | Recommended format |
|---|---|
| Sharing with non-technical users | CSV |
| Intermediate pipeline storage | Parquet |
| Temporary between scripts | Feather |
| Appending data over time | HDF5 or Parquet partitions |
| Archival storage | Parquet (compressed) |
| Data under 10MB | CSV is fine |
One thing to remember: The single biggest I/O optimization is switching from CSV to Parquet for any data you read more than once. It’s usually a one-line change (to_parquet / read_parquet) for a read speedup that is often in the 5-20x range.