Pandas Categorical Data — Deep Dive
Technical foundation
A Pandas Categorical stores data as two arrays: a categories index (the unique values) and a codes array (integer positions into the categories). The codes array uses the smallest signed integer type that can represent every category position: int8 for up to roughly 127 categories, int16 for up to roughly 32,767, and so on. The code -1 represents missing data (NaN).
This encoding is similar to dictionary encoding in columnar storage formats like Parquet and Apache Arrow.
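To make the two-array layout concrete, here is a minimal sketch on made-up data. It inspects the categories and codes of a small Series and then rebuilds the values by hand; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value
s = pd.Series(["active", "pending", None, "active"], dtype="category")

print(s.cat.categories.tolist())  # ['active', 'pending']
print(s.cat.codes.tolist())       # [0, 1, -1, 0]
print(s.cat.codes.dtype)          # int8

# Rebuild the values from the two arrays; code -1 maps to NaN
categories = s.cat.categories.to_numpy(dtype=object)
codes = s.cat.codes.to_numpy()
decoded = np.where(codes == -1, np.nan, categories[codes])
```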
Creating categorical columns
Automatic discovery
import pandas as pd
# From existing data — categories inferred automatically
df["status"] = df["status"].astype("category")
print(df["status"].cat.categories)
# Index(['active', 'inactive', 'pending'], dtype='object')
Explicit categories
# Define categories including values not yet in the data
status_type = pd.CategoricalDtype(
    categories=["pending", "active", "inactive", "archived"],
    ordered=False
)
df["status"] = df["status"].astype(status_type)
# "archived" is a valid category even if no rows have it yet
Ordered categories
size_type = pd.CategoricalDtype(
    categories=["XS", "S", "M", "L", "XL", "XXL"],
    ordered=True
)
df["size"] = df["size"].astype(size_type)
# Now comparisons work
df[df["size"] > "M"] # L, XL, XXL only
df["size"].min() # XS
df.sort_values("size") # Sorts by defined order, not alphabetically
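As a self-contained check of the ordered behavior, the following sketch uses a small made-up Series of sizes rather than a real DataFrame.

```python
import pandas as pd

size_type = pd.CategoricalDtype(
    categories=["XS", "S", "M", "L", "XL", "XXL"],
    ordered=True,
)
sizes = pd.Series(["M", "XS", "XL", "S"]).astype(size_type)

larger = sizes[sizes > "M"].tolist()       # ['XL']
smallest = sizes.min()                     # 'XS'
sorted_sizes = sizes.sort_values().tolist()  # ['XS', 'S', 'M', 'XL']
```

Sorting alphabetically would have put "M" first; the ordered dtype uses the declared order instead.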
The .cat accessor
Categorical columns expose a .cat accessor with specialized methods:
# View internals
df["color"].cat.categories # The unique values
df["color"].cat.codes # Integer codes per row
df["color"].cat.ordered # True if ordered
# Modify categories
df["color"] = df["color"].cat.rename_categories({"red": "crimson"})
df["color"] = df["color"].cat.add_categories(["purple", "teal"])
df["color"] = df["color"].cat.remove_unused_categories()
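# Reorder (must list exactly the existing categories)
df["size"] = df["size"].cat.reorder_categories(["XS", "S", "M", "L", "XL", "XXL"], ordered=True)
# Replace the category set; values outside the new set become NaN
df["size"] = df["size"].cat.set_categories(["S", "M", "L"], ordered=True)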
Memory benchmarks
import numpy as np
# Generate test data
n_rows = 5_000_000
categories = [f"category_{i}" for i in range(20)]
data = np.random.choice(categories, size=n_rows)
# String column
df_str = pd.DataFrame({"col": data})
print(f"String: {df_str.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# String: ~330 MB
# Categorical column
df_cat = pd.DataFrame({"col": pd.Categorical(data)})
print(f"Categorical: {df_cat.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# Categorical: ~5.8 MB (57x reduction)
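The absolute savings grow with the number of rows, and the savings per row grow with the average string length (longer strings = more savings per row). Cardinality matters too: each distinct value is stored once in the categories index, so the fewer unique values relative to rows, the larger the reduction.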
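The effect of string length and cardinality can be checked with a small experiment. The helper below, memory_ratio, is a hypothetical function written for this illustration (not part of Pandas); it builds synthetic labels and returns how many times smaller the categorical version is.

```python
import numpy as np
import pandas as pd

def memory_ratio(n_rows, n_unique, str_len, seed=0):
    """Return object-column bytes divided by categorical-column bytes."""
    rng = np.random.default_rng(seed)
    # Zero-padded numeric labels give strings of a fixed, known length
    labels = [f"{i:0{str_len}d}" for i in range(n_unique)]
    values = pd.Series(rng.choice(labels, size=n_rows), dtype=object)
    as_str = values.memory_usage(deep=True)
    as_cat = values.astype("category").memory_usage(deep=True)
    return as_str / as_cat

# Same cardinality, different string lengths
short_strings = memory_ratio(50_000, n_unique=20, str_len=5)
long_strings = memory_ratio(50_000, n_unique=20, str_len=40)

# Same string length, different cardinality
low_card = memory_ratio(50_000, n_unique=20, str_len=10)
high_card = memory_ratio(50_000, n_unique=25_000, str_len=10)

print(f"short: {short_strings:.1f}x, long: {long_strings:.1f}x")
print(f"low cardinality: {low_card:.1f}x, high cardinality: {high_card:.1f}x")
```

Exact ratios vary with the Pandas and Python versions, so treat the printed numbers as indicative rather than as a benchmark.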
Groupby performance with categoricals
import time
n = 2_000_000
df = pd.DataFrame({
    "group_str": np.random.choice(["alpha", "beta", "gamma", "delta"], n),
    "value": np.random.randn(n)
})
df["group_cat"] = df["group_str"].astype("category")
# String groupby
start = time.perf_counter()
df.groupby("group_str")["value"].mean()
str_time = time.perf_counter() - start
# Categorical groupby
start = time.perf_counter()
df.groupby("group_cat")["value"].mean()
cat_time = time.perf_counter() - start
# Categorical is typically 1.5-3x faster for groupby
observed parameter in groupby
When grouping by a categorical column, Pandas can include categories with no data:
df["rating"] = pd.Categorical(
    df["rating"],
    categories=[1, 2, 3, 4, 5],
    ordered=True
)
# Include all categories, even those with zero rows
counts = df.groupby("rating", observed=False).size()
# Returns: 1→50, 2→120, 3→300, 4→200, 5→0
# Only include categories present in the data (the default from Pandas 3.0; Pandas 2.1+ warns when observed is omitted)
counts = df.groupby("rating", observed=True).size()
# Returns: 1→50, 2→120, 3→300, 4→200
Pass observed=False explicitly when you need a consistent output shape, for example in dashboards, reports, or downstream systems that expect every category to appear.
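A small runnable sketch of the difference, using made-up ratings in which the values 3 and 5 never occur:

```python
import pandas as pd

ratings = pd.Categorical([1, 2, 2, 4], categories=[1, 2, 3, 4, 5], ordered=True)
df = pd.DataFrame({"rating": ratings})

# Every declared category appears, with zero for the unused ones
all_levels = df.groupby("rating", observed=False).size()
# Only categories that occur in the data
seen_only = df.groupby("rating", observed=True).size()

print(all_levels.tolist())  # [1, 2, 0, 1, 0]
print(seen_only.tolist())   # [1, 2, 1]
```

Passing observed explicitly in both calls keeps the result independent of the Pandas version's default.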
Data validation with categoricals
# Define allowed values
valid_statuses = pd.CategoricalDtype(categories=["new", "processing", "shipped", "delivered"])
# This works
df["status"] = pd.Categorical(["new", "shipped", "delivered"], dtype=valid_statuses)
# Invalid values become NaN (not an error by default)
original_data = pd.Series(["new", "INVALID", "shipped"])
status = original_data.astype(valid_statuses)
# "INVALID" → NaN
# To enforce strict validation, look for values that became NaN during conversion
invalid_mask = status.isna() & original_data.notna()
if invalid_mask.any():
    bad_values = original_data[invalid_mask].unique()
    raise ValueError(f"Invalid status values: {bad_values}")
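The same idea can be wrapped in a reusable helper. The function below, to_strict_categorical, is a hypothetical name chosen for this sketch. It checks membership before converting, so it does not rely on invalid values being silently turned into NaN.

```python
import pandas as pd

def to_strict_categorical(values, allowed):
    """Convert to categorical, raising if any non-missing value is not allowed."""
    values = pd.Series(values)
    unknown = values[values.notna() & ~values.isin(allowed)]
    if not unknown.empty:
        raise ValueError(f"Invalid values: {list(unknown.unique())}")
    return values.astype(pd.CategoricalDtype(categories=allowed))

allowed = ["new", "processing", "shipped", "delivered"]

# Valid input (missing values are passed through as NaN)
ok = to_strict_categorical(["new", "shipped", None], allowed)

# Invalid input is rejected instead of becoming NaN
try:
    to_strict_categorical(["new", "INVALID"], allowed)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```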
Categorical with Parquet and Arrow
Categorical data maps naturally to dictionary encoding in Parquet:
# Save — categorical preserved automatically
df.to_parquet("data.parquet")
# Load — categorical restored automatically
df_loaded = pd.read_parquet("data.parquet")
# Categorical columns remain categorical
When reading CSV files, you can specify categorical columns upfront:
df = pd.read_csv("data.csv", dtype={"region": "category", "status": "category"})
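read_csv also accepts a CategoricalDtype in the dtype mapping, which lets you fix the category set at load time. The sketch below reads from an in-memory string instead of a file; the column names and values are made up.

```python
import io
import pandas as pd

csv_text = "region,status\nnorth,active\nsouth,pending\nnorth,active\n"

# Explicit category set for one column, inferred categories for the other
status_type = pd.CategoricalDtype(["pending", "active", "inactive"], ordered=False)
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"region": "category", "status": status_type},
)

print(df["region"].cat.categories.tolist())  # ['north', 'south']
print(df["status"].cat.categories.tolist())  # ['pending', 'active', 'inactive']
```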
Edge cases and pitfalls
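String operations don't preserve the categorical dtype. When the categories are strings, .str methods such as .str.upper() work, but they return an ordinary string column rather than a categorical. To transform the labels and stay categorical, rename the categories instead:
# Loses the categorical dtype: df["cat_col"].str.upper()
# Keeps the categorical dtype: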
df["cat_col"] = df["cat_col"].cat.rename_categories(str.upper)
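The following sketch compares the two approaches on a toy Series. In recent Pandas versions the .str accessor is available when the categories are strings, but its result is not categorical.

```python
import pandas as pd

s = pd.Series(["red", "blue", "red"], dtype="category")

# .str works on string categories but returns a plain (non-categorical) Series
via_str = s.str.upper()

# rename_categories touches only the category labels and keeps the dtype
via_rename = s.cat.rename_categories(str.upper)

print(isinstance(via_str.dtype, pd.CategoricalDtype))     # False
print(isinstance(via_rename.dtype, pd.CategoricalDtype))  # True
```

Note that rename_categories raises if the mapping produces duplicate labels, for example when "Red" and "red" are both categories.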
Merging categorical columns. When merging two DataFrames on categorical key columns, the result stays categorical only if both sides have identical categories (and the same ordered flag). If one DataFrame has categories ["A", "B"] and the other has ["B", "C"], Pandas falls back to a non-categorical dtype for the merged key. Unify the categories beforehand:
all_cats = pd.CategoricalDtype(categories=["A", "B", "C"])
df1["key"] = df1["key"].astype(all_cats)
df2["key"] = df2["key"].astype(all_cats)
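A runnable sketch of both cases, using two tiny made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"key": pd.Categorical(["A", "B"]), "x": [1, 2]})
df2 = pd.DataFrame({"key": pd.Categorical(["B", "C"]), "y": [3, 4]})

# Different category sets: the merged key is no longer categorical
mismatched = df1.merge(df2, on="key")

# Shared dtype on both sides: the merged key stays categorical
all_cats = pd.CategoricalDtype(categories=["A", "B", "C"])
unified = df1.assign(key=df1["key"].astype(all_cats)).merge(
    df2.assign(key=df2["key"].astype(all_cats)), on="key"
)

print(isinstance(mismatched["key"].dtype, pd.CategoricalDtype))  # False
print(isinstance(unified["key"].dtype, pd.CategoricalDtype))     # True
```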
Adding new values. You can’t assign a value that isn’t in the categories. Add the category first:
df["status"] = df["status"].cat.add_categories(["cancelled"])
df.loc[mask, "status"] = "cancelled"
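The sketch below shows the failure and the fix on a toy Series. The exact exception type has varied across Pandas versions, so the example catches both TypeError and ValueError as a precaution.

```python
import pandas as pd

s = pd.Series(["new", "shipped", "new"], dtype="category")

# Assigning an unknown value is rejected
try:
    s.iloc[0] = "cancelled"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# After registering the category, the same assignment succeeds
s = s.cat.add_categories(["cancelled"])
s.iloc[0] = "cancelled"

print(rejected)    # True
print(s.tolist())  # ['cancelled', 'shipped', 'new']
```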
Integer categories vs codes. If your categories are integers (like rating 1-5), don’t confuse the category values with the internal codes. Category 5 might have code 4 (zero-indexed).
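A short illustration of the difference on made-up ratings. To get the numbers back, convert the values rather than reading the codes.

```python
import pandas as pd

ratings = pd.Series(
    pd.Categorical([5, 1, 3, 5], categories=[1, 2, 3, 4, 5], ordered=True)
)

print(ratings.tolist())            # [5, 1, 3, 5]  (category values)
print(ratings.cat.codes.tolist())  # [4, 0, 2, 4]  (positions in the categories)

# Convert the values, not the codes, to recover the original integers
as_int = ratings.astype(int)
```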
Decision framework
Is the column mostly unique values? → Don't use categorical
Does the column have < 1% unique values relative to rows? → Use categorical
Do you need ordered comparisons (< > min max)? → Use ordered categorical
Are you reading Parquet that was written from a categorical column? → Categorical is restored automatically
Are you doing repeated groupby on this column? → Categorical speeds it up
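The cardinality rule of thumb above can be turned into a quick check. The helper suggest_categorical below is a hypothetical function for illustration; the 1% threshold mirrors the guideline in this section and is an assumption, not a Pandas default.

```python
import pandas as pd

def suggest_categorical(series, max_unique_ratio=0.01):
    """Return True if unique values make up less than max_unique_ratio of the rows."""
    n_rows = len(series)
    if n_rows == 0:
        return False
    return series.nunique(dropna=True) / n_rows < max_unique_ratio

# 2 unique values in 1,000 rows: a good candidate
low_cardinality = pd.Series(["a", "b"] * 500)
# Every value unique: not a candidate
mostly_unique = pd.Series(range(1000)).astype(str)

print(suggest_categorical(low_cardinality))  # True
print(suggest_categorical(mostly_unique))    # False
```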
One thing to remember: Converting low-cardinality string columns (like "country" or "status") to categorical is one of the easiest performance wins in Pandas: often a single line of code for a 10-50x memory reduction and measurably faster groupby.