Pandas Merge & Join Strategies — ELI5
Imagine you have two lists. One list has all your friends’ names and their favorite colors. The other list has your friends’ names and their birthdays. You want to make one big list that has names, colors, AND birthdays together.
That’s a merge — you match rows from two lists using something they share (the name).
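Here is a small sketch of that idea in pandas. The friends, colors, and birthdays are made up for illustration, and the two lists are assumed to share a column called `name`.

```python
import pandas as pd

# Made-up example data: two lists that share a "name" column.
colors = pd.DataFrame({
    "name": ["Alex", "Sam", "Jo"],
    "color": ["blue", "green", "red"],
})
birthdays = pd.DataFrame({
    "name": ["Alex", "Sam", "Jo"],
    "birthday": ["March 3", "July 9", "May 21"],
})

# Match rows from both lists wherever the name is the same.
combined = colors.merge(birthdays, on="name")
print(combined)
```

Every friend appears on both lists here, so the combined list has one row per friend with all three columns.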
But here’s where it gets interesting. What if your friend Alex is on the first list but not on the second? You have three choices:
- Keep only matches: If Alex isn’t on both lists, leave Alex out entirely. You only get people who appear in both places.
- Keep everyone from the first list: Alex stays, but their birthday column says “unknown” since that info wasn’t on the second list.
- Keep absolutely everyone: Anyone from either list shows up, with blanks wherever information is missing.
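In pandas, these three choices roughly correspond to the `how` argument of `merge`: `"inner"`, `"left"`, and `"outer"`. The sketch below uses made-up data where Alex is only on the color list and Jo is only on the birthday list.

```python
import pandas as pd

# Made-up data: Alex is missing from birthdays, Jo is missing from colors.
colors = pd.DataFrame({"name": ["Alex", "Sam"], "color": ["blue", "green"]})
birthdays = pd.DataFrame({"name": ["Sam", "Jo"], "birthday": ["July 9", "May 21"]})

# Keep only matches: just Sam, who is on both lists.
inner = colors.merge(birthdays, on="name", how="inner")

# Keep everyone from the first list: Alex and Sam; Alex's birthday is missing (NaN).
left = colors.merge(birthdays, on="name", how="left")

# Keep absolutely everyone: Alex, Sam, and Jo, with NaN wherever info is missing.
outer = colors.merge(birthdays, on="name", how="outer")

print(len(inner), len(left), len(outer))
```

The “unknown” and blank entries show up in pandas as `NaN`, its marker for a missing value.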
There’s one more tricky situation. What if your friend Sam appears twice on the birthday list — maybe they celebrate both their real birthday and their adoption day. When you merge, Sam’s row from the color list gets copied to match BOTH birthday rows. Now you have two Sam rows instead of one. Your list just got bigger.
This row multiplication surprises people the most. You expected the same number of rows, but the merge created more because of duplicate matches.
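A minimal sketch of the Sam situation, with made-up dates. It also shows one possible safeguard: the `validate` argument of `merge`, which asks pandas to raise an error when the keys are not unique the way you expected.

```python
import pandas as pd

# Made-up data: Sam appears twice on the birthday list.
colors = pd.DataFrame({"name": ["Alex", "Sam"], "color": ["blue", "green"]})
birthdays = pd.DataFrame({
    "name": ["Alex", "Sam", "Sam"],
    "birthday": ["March 3", "July 9", "October 2"],
})

# Sam's color row is copied once for each matching birthday row.
merged = colors.merge(birthdays, on="name", how="left")
print(len(colors), len(merged))  # 2 rows go in, 3 come out

# Optional safeguard: ask pandas to check that each name appears once per list.
try:
    colors.merge(birthdays, on="name", how="left", validate="one_to_one")
    duplicates_caught = False
except pd.errors.MergeError:
    duplicates_caught = True
print(duplicates_caught)
```

Comparing the row count before and after a merge is a cheap way to notice this kind of surprise early.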
One thing to remember: Merging can make your data bigger (from duplicate matches), make it smaller (from unmatched rows), or leave it the same size. It all depends on how the matching keys line up between your two tables.
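One way to see how the keys line up is the `indicator=True` option of `merge`, which adds a `_merge` column saying where each row came from. The data below is made up for illustration.

```python
import pandas as pd

# Made-up data: Alex is only in colors, Jo is only in birthdays, Sam is listed twice.
colors = pd.DataFrame({"name": ["Alex", "Sam"], "color": ["blue", "green"]})
birthdays = pd.DataFrame({
    "name": ["Sam", "Sam", "Jo"],
    "birthday": ["July 9", "October 2", "May 21"],
})

# indicator=True labels each row as "left_only", "right_only", or "both".
checked = colors.merge(birthdays, on="name", how="outer", indicator=True)

# Count how many rows fall into each group.
counts = checked["_merge"].value_counts().to_dict()
print(counts)
```

Rows marked `left_only` or `right_only` are the unmatched ones, and more `both` rows than you expected usually points to duplicate keys.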
See Also
- Python Bokeh Get an intuitive feel for Bokeh so Python behavior stops feeling unpredictable.
- Python Numpy Advanced Indexing How to cherry-pick exactly the data you want from a NumPy array using lists, masks, and fancy tricks.
- Python Numpy Broadcasting Rules How NumPy magically makes different-sized arrays work together without you writing any loops.
- Python Numpy Einsum One tiny function that replaces dozens of NumPy operations — once you learn its shorthand, array math becomes a breeze.
- Python Numpy Fft Spectral How NumPy breaks apart a signal into its hidden frequencies — like separating a chord into individual notes.