Pandas I/O Optimization — Deep Dive
Technical foundation
Pandas I/O is backed by different engines depending on the format:
- CSV: C parser (default, fast) or Python parser (flexible, slower)
- Parquet: PyArrow (recommended) or fastparquet
- Feather: PyArrow
- HDF5: PyTables
- Excel: openpyxl (xlsx) or xlrd (xls)
The choice of engine, combined with format-specific parameters, determines read/write performance by orders of magnitude.
CSV optimization deep dive
The C parser vs Python parser
# Default: C parser (fast)
df = pd.read_csv("data.csv", engine="c")
# Python parser: needed for certain edge cases
# - sep longer than 1 character (and not \s+)
# - skipfooter parameter
# - certain encoding edge cases
df = pd.read_csv("data.csv", engine="python", sep="::")
The C parser is typically 5-10x faster. Use it whenever possible.
Specifying dtypes
dtypes = {
"user_id": "int32",
"age": "int8",
"revenue": "float32",
"country": "category",
"status": "category",
"is_active": "bool",
}
df = pd.read_csv(
"users.csv",
dtype=dtypes,
usecols=list(dtypes.keys()), # Only read needed columns
parse_dates=["created_at", "last_login"],
)
Memory impact of dtype specification
# Without dtype specification (default)
df_default = pd.read_csv("large_file.csv")
print(f"Default: {df_default.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# With optimized dtypes
df_optimized = pd.read_csv("large_file.csv", dtype=dtypes)
print(f"Optimized: {df_optimized.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# Typical result: 3-8x memory reduction
Chunked reading for large files
# Process file in chunks when it doesn't fit in memory
chunks = []
for chunk in pd.read_csv("huge_file.csv", chunksize=100_000, dtype=dtypes):
# Process each chunk
filtered = chunk[chunk["revenue"] > 0]
chunks.append(filtered)
result = pd.concat(chunks, ignore_index=True)
For aggregations, process each chunk and combine results:
total_revenue = 0
row_count = 0
for chunk in pd.read_csv("huge_file.csv", chunksize=500_000):
total_revenue += chunk["revenue"].sum()
row_count += len(chunk)
average_revenue = total_revenue / row_count
CSV write optimization
# Fastest CSV write
df.to_csv("output.csv", index=False)
# Compressed CSV (slower write, smaller file)
df.to_csv("output.csv.gz", index=False, compression="gzip")
# Float precision control (reduces file size for floats)
df.to_csv("output.csv", index=False, float_format="%.4f")
Parquet optimization
Column pruning
# Read only needed columns — skips decompressing unused columns
df = pd.read_parquet("data.parquet", columns=["user_id", "revenue", "date"])
This is Parquet’s killer feature. A file with 200 columns doesn’t slow down when you only need 3 — the other 197 are never touched.
Row group filtering
# Filter rows during read (pushdown predicate)
import pyarrow.parquet as pq
table = pq.read_table(
"data.parquet",
filters=[
("year", "=", 2024),
("revenue", ">", 1000)
],
columns=["user_id", "revenue"]
)
df = table.to_pandas()
With pushdown predicates, Parquet skips entire row groups that don’t match the filter, based on min/max statistics stored in the file metadata.
Compression choices
# Snappy (default): fast compression/decompression, moderate ratio
df.to_parquet("data.snappy.parquet", compression="snappy")
# Gzip: slower but smaller files (good for archival)
df.to_parquet("data.gzip.parquet", compression="gzip")
# Zstd: good balance of speed and compression
df.to_parquet("data.zstd.parquet", compression="zstd")
# No compression: fastest I/O, largest files
df.to_parquet("data.none.parquet", compression=None)
Partitioned Parquet
# Write partitioned by column values
df.to_parquet("data_dir/", partition_cols=["year", "month"])
# Creates: data_dir/year=2024/month=1/part-0.parquet, etc.
# Read with automatic partition filtering
df = pd.read_parquet("data_dir/", filters=[("year", "=", 2024)])
# Only reads 2024 partitions — ignores all other year directories
Partitioning is essential for datasets that grow over time. New data goes into new partitions without rewriting existing files.
Feather for speed
# Write — extremely fast
df.to_feather("data.feather")
# Read — extremely fast
df = pd.read_feather("data.feather")
# With column selection
df = pd.read_feather("data.feather", columns=["id", "value"])
Feather is the fastest format for full-DataFrame reads because it uses Arrow’s in-memory format directly. The tradeoff: less compression than Parquet, so files are larger on disk.
Benchmark comparison
Typical benchmarks for a 1GB CSV file (5 million rows, 50 columns):
| Operation | CSV | Parquet (snappy) | Feather | HDF5 |
|---|---|---|---|---|
| Read (all columns) | 25s | 2.5s | 1.5s | 3s |
| Read (5 columns) | 20s | 0.5s | 0.8s | 2.5s |
| Write | 30s | 4s | 2s | 5s |
| File size | 1000 MB | 200 MB | 600 MB | 400 MB |
Exact numbers vary by data content, hardware, and compression settings.
PyArrow as backend
Since Pandas 2.0, you can use PyArrow as the compute backend instead of NumPy:
# Read directly into Arrow-backed DataFrame
df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")
# Benefits: native string type (not object), better null handling,
# potentially faster operations on string-heavy data
print(df.dtypes)
# Columns show types like: string[pyarrow], int64[pyarrow]
Automatic dtype optimization
def optimize_dtypes(df):
"""Downcast numeric columns to smallest sufficient type."""
for col in df.select_dtypes(include=["int64"]).columns:
df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include=["float64"]).columns:
df[col] = pd.to_numeric(df[col], downcast="float")
for col in df.select_dtypes(include=["object"]).columns:
n_unique = df[col].nunique()
n_total = len(df)
if n_unique / n_total < 0.5: # Less than 50% unique
df[col] = df[col].astype("category")
return df
# Apply after reading
df = pd.read_csv("data.csv")
df = optimize_dtypes(df)
# Then save optimized version
df.to_parquet("data.parquet")
Real-world pipeline pattern
from pathlib import Path
RAW_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")
def ingest_csv_to_parquet(csv_path, parquet_path, dtypes=None):
"""One-time conversion from CSV to optimized Parquet."""
df = pd.read_csv(csv_path, dtype=dtypes)
df = optimize_dtypes(df)
df.to_parquet(parquet_path, compression="zstd", index=False)
csv_size = csv_path.stat().st_size / 1e6
parquet_size = parquet_path.stat().st_size / 1e6
print(f"Converted: {csv_size:.1f} MB CSV → {parquet_size:.1f} MB Parquet "
f"({parquet_size/csv_size:.1%})")
# Convert once
ingest_csv_to_parquet(
RAW_DIR / "sales.csv",
PROCESSED_DIR / "sales.parquet"
)
# Read fast from Parquet for all subsequent analysis
df = pd.read_parquet(PROCESSED_DIR / "sales.parquet", columns=["region", "revenue"])
Edge cases
Excel files: read_excel is slow for large files. If possible, save as CSV first or use openpyxl in read-only mode. For files over 100K rows, almost always convert to Parquet.
JSON: read_json with lines=True for JSONL format is much faster than nested JSON. For large JSON files, use chunksize parameter.
SQL databases: read_sql with chunksize prevents loading entire tables into memory. Use server-side filtering (WHERE clauses) instead of loading everything and filtering in Pandas.
Remote files: Pandas supports reading from S3 (s3://), GCS (gs://), and HTTP URLs directly. For repeated access, download to local Parquet first.
One thing to remember: The I/O optimization hierarchy is: (1) choose the right format (Parquet over CSV), (2) read only needed columns, (3) specify dtypes, (4) use chunked processing for files that don’t fit in memory. Each step compounds — together they can turn a 2-minute load into a 1-second load.
See Also
- Python Bokeh Get an intuitive feel for Bokeh so Python behavior stops feeling unpredictable.
- Python Numpy Advanced Indexing How to cherry-pick exactly the data you want from a NumPy array using lists, masks, and fancy tricks.
- Python Numpy Broadcasting Rules How NumPy magically makes different-sized arrays work together without you writing any loops.
- Python Numpy Einsum One tiny function that replaces dozens of NumPy operations — once you learn its shorthand, array math becomes a breeze.
- Python Numpy Fft Spectral How NumPy breaks apart a signal into its hidden frequencies — like separating a chord into individual notes.