Pandas Categorical Data — ELI5

Imagine you run a pizza shop and you track what size each customer orders: Small, Medium, or Large. Every single order has one of those three words written next to it.

Now, you have a million orders. That’s the word “Medium” written out hundreds of thousands of times. Every single letter stored separately. That’s wasteful — like writing the full word “pepperoni” on a million receipts instead of just writing “P” and keeping a note that says “P means pepperoni.”

That’s exactly what categorical data does. Instead of storing “Medium” hundreds of thousands of times, Pandas stores a small number for each order and keeps a tiny dictionary: 0 = Small, 1 = Medium, 2 = Large (Pandas starts counting at 0). Every “Medium” order becomes just the number 1. Same information, way less space.
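To see the savings for yourself, here is a minimal sketch. The order data is made up for illustration, and the exact byte counts depend on your Pandas version, so only the comparison matters. Note that Pandas numbers its codes starting from 0.

```python
import pandas as pd

# Made-up order data: one million rows, only three distinct sizes.
sizes = pd.Series(["Small", "Medium", "Large", "Medium"] * 250_000)

# Plain text: every row points at its own Python string.
as_text = sizes.astype("object")

# Categorical: one small integer per row plus a three-entry lookup table.
size_type = pd.CategoricalDtype(categories=["Small", "Medium", "Large"])
as_category = sizes.astype(size_type)

text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"text:     {text_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")

# The "tiny dictionary" and the codes stored for the first four orders.
print(list(as_category.cat.categories))        # ['Small', 'Medium', 'Large']
print(as_category.cat.codes.head(4).tolist())  # [0, 1, 2, 1]
```

The categories are listed explicitly here so the codes match the Small, Medium, Large order. If you just call `astype("category")`, Pandas builds the dictionary itself, in sorted order for values like these.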

But there’s a second superpower. Regular text only knows alphabetical order. Is “Medium” bigger than “Small”? Your computer can’t tell, because to it they’re just letters. Going by the alphabet, “Large” even comes first. But with categories, you can tell Pandas “Small comes before Medium comes before Large.” Now Pandas can sort correctly and compare sizes.
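Here is a small sketch of that ordering in action, using an ordered categorical type. The four orders are invented for the example.

```python
import pandas as pd

# Made-up orders for illustration.
orders = pd.Series(["Large", "Small", "Medium", "Small"])

# Plain text sorts by the alphabet, so "Large" lands first.
text_sorted = orders.sort_values().tolist()
print(text_sorted)  # ['Large', 'Medium', 'Small', 'Small']

# Tell Pandas the real order: Small < Medium < Large.
size_type = pd.CategoricalDtype(
    categories=["Small", "Medium", "Large"], ordered=True
)
sized = orders.astype(size_type)

# Sorting now follows the order we declared.
size_sorted = sized.sort_values().tolist()
print(size_sorted)  # ['Small', 'Small', 'Medium', 'Large']

# Comparisons work too: which orders are bigger than Small?
bigger_than_small = (sized > "Small").tolist()
print(bigger_than_small)  # [True, False, True, False]

# And so does asking for the largest size ordered.
largest = sized.max()
print(largest)  # Large
```

Without `ordered=True`, the categories still save space, but a comparison like `sized > "Small"` raises an error because Pandas has no order to go by.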

Think of it like a filing cabinet with labeled folders. Without categories, every paper gets filed individually and you search through everything. With categories, papers go into pre-made folders, so pulling out everything in one category is much quicker.

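One thing to remember: categorical data works best when a column repeats the same few values over and over. The fewer unique values compared to total rows, the bigger the benefit. If nearly every value is different, like a customer’s name, the dictionary ends up almost as long as the list itself and you save little or nothing.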
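One rough way to check a column before converting it is to compare its number of unique values to its number of rows. The sketch below does that for two made-up columns. The helper name `unique_ratio` is hypothetical, not part of Pandas, and the byte counts depend on your Pandas version.

```python
import pandas as pd

def unique_ratio(column: pd.Series) -> float:
    """Share of rows holding a distinct value (hypothetical helper).

    Close to 0 means lots of repeats; close to 1 means almost all unique.
    """
    return column.nunique() / len(column)

# Made-up columns: one repeats three values, one never repeats.
repeated = pd.Series(["Small", "Medium", "Large"] * 1000)
distinct = pd.Series([f"order-{i}" for i in range(3000)])

report = {}
for name, column in [("repeated", repeated), ("distinct", distinct)]:
    text_bytes = column.astype("object").memory_usage(deep=True)
    category_bytes = column.astype("category").memory_usage(deep=True)
    report[name] = (unique_ratio(column), text_bytes, category_bytes)
    print(f"{name}: ratio={report[name][0]:.3f}, "
          f"text={text_bytes:,} bytes, category={category_bytes:,} bytes")
```

The repeated column should shrink a lot. The all-unique column gains little or nothing, because its dictionary has to hold every value anyway.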
