Pandas I/O Optimization — Deep Dive

Technical foundation

Pandas I/O is backed by different engines depending on the format:

  • CSV: C parser (default, fast) or Python parser (flexible, slower)
  • Parquet: PyArrow (recommended) or fastparquet
  • Feather: PyArrow
  • HDF5: PyTables
  • Excel: openpyxl (xlsx) or xlrd (xls)

The choice of engine, combined with format-specific parameters, determines read/write performance by orders of magnitude.

CSV optimization deep dive

The C parser vs Python parser

# Default: C parser (fast)
df = pd.read_csv("data.csv", engine="c")

# Python parser: needed for certain edge cases
# - sep longer than 1 character (and not \s+)
# - skipfooter parameter
# - certain encoding edge cases
df = pd.read_csv("data.csv", engine="python", sep="::")

The C parser is typically 5-10x faster. Use it whenever possible.

Specifying dtypes

dtypes = {
    "user_id": "int32",
    "age": "int8",
    "revenue": "float32",
    "country": "category",
    "status": "category",
    "is_active": "bool",
}

df = pd.read_csv(
    "users.csv",
    dtype=dtypes,
    usecols=list(dtypes.keys()),  # Only read needed columns
    parse_dates=["created_at", "last_login"],
)

Memory impact of dtype specification

# Without dtype specification (default)
df_default = pd.read_csv("large_file.csv")
print(f"Default: {df_default.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# With optimized dtypes
df_optimized = pd.read_csv("large_file.csv", dtype=dtypes)
print(f"Optimized: {df_optimized.memory_usage(deep=True).sum() / 1e6:.1f} MB")

# Typical result: 3-8x memory reduction

Chunked reading for large files

# Process file in chunks when it doesn't fit in memory
chunks = []
for chunk in pd.read_csv("huge_file.csv", chunksize=100_000, dtype=dtypes):
    # Process each chunk
    filtered = chunk[chunk["revenue"] > 0]
    chunks.append(filtered)

result = pd.concat(chunks, ignore_index=True)

For aggregations, process each chunk and combine results:

total_revenue = 0
row_count = 0
for chunk in pd.read_csv("huge_file.csv", chunksize=500_000):
    total_revenue += chunk["revenue"].sum()
    row_count += len(chunk)

average_revenue = total_revenue / row_count

CSV write optimization

# Fastest CSV write
df.to_csv("output.csv", index=False)

# Compressed CSV (slower write, smaller file)
df.to_csv("output.csv.gz", index=False, compression="gzip")

# Float precision control (reduces file size for floats)
df.to_csv("output.csv", index=False, float_format="%.4f")

Parquet optimization

Column pruning

# Read only needed columns — skips decompressing unused columns
df = pd.read_parquet("data.parquet", columns=["user_id", "revenue", "date"])

This is Parquet’s killer feature. A file with 200 columns doesn’t slow down when you only need 3 — the other 197 are never touched.

Row group filtering

# Filter rows during read (pushdown predicate)
import pyarrow.parquet as pq

table = pq.read_table(
    "data.parquet",
    filters=[
        ("year", "=", 2024),
        ("revenue", ">", 1000)
    ],
    columns=["user_id", "revenue"]
)
df = table.to_pandas()

With pushdown predicates, Parquet skips entire row groups that don’t match the filter, based on min/max statistics stored in the file metadata.

Compression choices

# Snappy (default): fast compression/decompression, moderate ratio
df.to_parquet("data.snappy.parquet", compression="snappy")

# Gzip: slower but smaller files (good for archival)
df.to_parquet("data.gzip.parquet", compression="gzip")

# Zstd: good balance of speed and compression
df.to_parquet("data.zstd.parquet", compression="zstd")

# No compression: fastest I/O, largest files
df.to_parquet("data.none.parquet", compression=None)

Partitioned Parquet

# Write partitioned by column values
df.to_parquet("data_dir/", partition_cols=["year", "month"])
# Creates: data_dir/year=2024/month=1/part-0.parquet, etc.

# Read with automatic partition filtering
df = pd.read_parquet("data_dir/", filters=[("year", "=", 2024)])
# Only reads 2024 partitions — ignores all other year directories

Partitioning is essential for datasets that grow over time. New data goes into new partitions without rewriting existing files.

Feather for speed

# Write — extremely fast
df.to_feather("data.feather")

# Read — extremely fast
df = pd.read_feather("data.feather")

# With column selection
df = pd.read_feather("data.feather", columns=["id", "value"])

Feather is the fastest format for full-DataFrame reads because it uses Arrow’s in-memory format directly. The tradeoff: less compression than Parquet, so files are larger on disk.

Benchmark comparison

Typical benchmarks for a 1GB CSV file (5 million rows, 50 columns):

OperationCSVParquet (snappy)FeatherHDF5
Read (all columns)25s2.5s1.5s3s
Read (5 columns)20s0.5s0.8s2.5s
Write30s4s2s5s
File size1000 MB200 MB600 MB400 MB

Exact numbers vary by data content, hardware, and compression settings.

PyArrow as backend

Since Pandas 2.0, you can use PyArrow as the compute backend instead of NumPy:

# Read directly into Arrow-backed DataFrame
df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")

# Benefits: native string type (not object), better null handling,
# potentially faster operations on string-heavy data
print(df.dtypes)
# Columns show types like: string[pyarrow], int64[pyarrow]

Automatic dtype optimization

def optimize_dtypes(df):
    """Downcast numeric columns to smallest sufficient type."""
    for col in df.select_dtypes(include=["int64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")

    for col in df.select_dtypes(include=["float64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")

    for col in df.select_dtypes(include=["object"]).columns:
        n_unique = df[col].nunique()
        n_total = len(df)
        if n_unique / n_total < 0.5:  # Less than 50% unique
            df[col] = df[col].astype("category")

    return df

# Apply after reading
df = pd.read_csv("data.csv")
df = optimize_dtypes(df)
# Then save optimized version
df.to_parquet("data.parquet")

Real-world pipeline pattern

from pathlib import Path

RAW_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")

def ingest_csv_to_parquet(csv_path, parquet_path, dtypes=None):
    """One-time conversion from CSV to optimized Parquet."""
    df = pd.read_csv(csv_path, dtype=dtypes)
    df = optimize_dtypes(df)
    df.to_parquet(parquet_path, compression="zstd", index=False)

    csv_size = csv_path.stat().st_size / 1e6
    parquet_size = parquet_path.stat().st_size / 1e6
    print(f"Converted: {csv_size:.1f} MB CSV → {parquet_size:.1f} MB Parquet "
          f"({parquet_size/csv_size:.1%})")

# Convert once
ingest_csv_to_parquet(
    RAW_DIR / "sales.csv",
    PROCESSED_DIR / "sales.parquet"
)

# Read fast from Parquet for all subsequent analysis
df = pd.read_parquet(PROCESSED_DIR / "sales.parquet", columns=["region", "revenue"])

Edge cases

Excel files: read_excel is slow for large files. If possible, save as CSV first or use openpyxl in read-only mode. For files over 100K rows, almost always convert to Parquet.

JSON: read_json with lines=True for JSONL format is much faster than nested JSON. For large JSON files, use chunksize parameter.

SQL databases: read_sql with chunksize prevents loading entire tables into memory. Use server-side filtering (WHERE clauses) instead of loading everything and filtering in Pandas.

Remote files: Pandas supports reading from S3 (s3://), GCS (gs://), and HTTP URLs directly. For repeated access, download to local Parquet first.

One thing to remember: The I/O optimization hierarchy is: (1) choose the right format (Parquet over CSV), (2) read only needed columns, (3) specify dtypes, (4) use chunked processing for files that don’t fit in memory. Each step compounds — together they can turn a 2-minute load into a 1-second load.

pythonpandasdata-science

See Also