Spatial Joins Performance — Core Concepts

Understand how R-tree indexes, predicate pushdown, and data partitioning make GeoPandas spatial joins scale from thousands to millions of geometries.

A spatial join matches rows from two geographic datasets based on their spatial relationship — “which points fall inside which polygons?” or “which lines intersect which boundaries?” In Python, GeoPandas provides the primary interface, but understanding the underlying mechanics is essential for performance.

Basic spatial join

import geopandas as gpd

points = gpd.read_file("restaurants.geojson")    # 500K points
polygons = gpd.read_file("neighborhoods.geojson") # 2K polygons

joined = gpd.sjoin(points, polygons, predicate="within")

sjoin returns each matched point annotated with the attributes of the polygon it falls within. Points that match no polygon are dropped (inner join by default).

How the spatial index works

GeoPandas uses an STRtree (Sort-Tile-Recursive) spatial index from Shapely. The process:

Build index — the right GeoDataFrame’s bounding boxes are loaded into an R-tree structure. This takes O(n log n) time.
Query — for each geometry on the left side, the index finds candidate matches whose bounding boxes overlap. This is O(log n) per query.
Refine — candidates are tested with the exact geometric predicate (within, intersects, contains). This is the expensive part, but only runs on a small fraction of pairs.

Without index: 500K × 2K = 1 billion comparisons
With index:    500K × ~3 candidates avg = 1.5 million comparisons
Reduction:     ~670× fewer exact comparisons
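
The build, query, and refine steps can be sketched directly with Shapely's STRtree. This is a minimal illustration, assuming Shapely 2.x (where query returns integer positions) and made-up grid data; it is not the code path GeoPandas itself runs, only the same idea on a small scale.

```python
import random

from shapely import STRtree
from shapely.geometry import Point, box

random.seed(0)

# Hypothetical data: a 10 x 10 grid of unit squares and 1,000 random points.
polygons = [box(x, y, x + 1, y + 1) for x in range(10) for y in range(10)]
points = [Point(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(1000)]

# 1. Build: bulk-load the polygon bounding boxes into the tree.
tree = STRtree(polygons)

candidate_pairs = 0
matches = []
for i, pt in enumerate(points):
    # 2. Query: positions of polygons whose bounding box intersects the point's.
    candidates = tree.query(pt)
    candidate_pairs += len(candidates)
    # 3. Refine: run the exact predicate only on those candidates.
    for j in candidates:
        if pt.within(polygons[j]):
            matches.append((i, int(j)))

brute_force_pairs = len(points) * len(polygons)
print(f"brute force: {brute_force_pairs} exact tests")
print(f"with index:  {candidate_pairs} exact tests")
print(f"matches:     {len(matches)}")
```

The candidate count, not the product of the two table sizes, is what determines how many exact predicate tests run.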

Predicate types

Predicate	Meaning	Use case
`intersects`	Geometries share any space	Default, most inclusive
`within`	Left is completely inside right	Points in polygons
`contains`	Right is completely inside left	Polygons containing points
`crosses`	Geometries cross each other	Roads crossing rivers
`overlaps`	Partial overlap between same-dimension geometries	Overlapping zones

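Choose the predicate that matches the question you are asking. The bounding-box phase produces candidates either way; the predicate decides which of them survive the refine step, so a stricter predicate such as within returns fewer rows than intersects (for example, it excludes points that lie exactly on a polygon boundary).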
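
A small comparison makes the difference between predicates concrete. The sketch below uses hypothetical toy data (one square zone and three points, no CRS set) and only shows how the row set changes with the predicate.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical toy data: one square zone and three points.
zones = gpd.GeoDataFrame({"zone": ["A"]}, geometry=[box(0, 0, 2, 2)])
pts = gpd.GeoDataFrame(
    {"name": ["inside", "on_edge", "outside"]},
    geometry=[Point(1, 1), Point(2, 1), Point(5, 5)],
)

# Same inputs, different predicates on the left (points) side.
results = {}
for predicate in ("intersects", "within"):
    joined = gpd.sjoin(pts, zones, predicate=predicate)
    results[predicate] = sorted(joined["name"])
    print(predicate, results[predicate])

# "contains" reads from the left side, so the polygons go first.
containing = gpd.sjoin(zones, pts, predicate="contains")
print("contains", sorted(containing["name"]))
```

The point on the zone's edge matches intersects but not within, because within requires the point to lie in the polygon's interior.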

Spatial join types

# Inner join (default): only matching rows
inner = gpd.sjoin(points, polygons, how="inner", predicate="within")

# Left join: keep all points, NaN where no match
left = gpd.sjoin(points, polygons, how="left", predicate="within")

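# Points matching multiple polygons appear as multiple rows.
# To keep one row per point, de-duplicate on the left index
# (dropping duplicate geometries would also merge distinct points at the same location):
# joined = joined[~joined.index.duplicated(keep="first")]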
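
The effect of the join type and of overlapping polygons is easy to see on a toy dataset. The names and coordinates below are hypothetical; the sketch assumes the default behaviour in which the result keeps the left GeoDataFrame's index.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical toy data: two overlapping zones and three points.
zones = gpd.GeoDataFrame(
    {"zone": ["A", "B"]},
    geometry=[box(0, 0, 2, 2), box(1, 1, 3, 3)],
)
pts = gpd.GeoDataFrame(
    {"name": ["only_a", "both", "neither"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 1.5), Point(9, 9)],
)

inner = gpd.sjoin(pts, zones, how="inner", predicate="within")
left = gpd.sjoin(pts, zones, how="left", predicate="within")

print(len(inner))  # "both" appears twice, "neither" is dropped
print(len(left))   # "neither" is kept with a missing zone

# Keep one row per original point by de-duplicating on the left index.
one_per_point = left[~left.index.duplicated(keep="first")]
print(one_per_point[["name", "zone"]])
```

Which duplicate to keep is a modelling decision; keep="first" is used here only to keep the sketch short.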

Performance factors

Geometry complexity

A polygon with 10 vertices is cheap to test. A polygon with 100,000 vertices (detailed coastline) is expensive. Simplify complex geometries before joining:

# Tolerance is in CRS units: 50 means 50 meters only in a metric projected CRS
polygons["geometry"] = polygons.geometry.simplify(tolerance=50)
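
To get a feel for how much simplification removes, the sketch below builds a hypothetical "detailed" polygon (a circle of radius 1000 approximated with 10,000 vertices) and compares vertex counts before and after. The shape and the tolerance are illustrative assumptions, not a recommendation for real boundary data.

```python
import math

from shapely.geometry import Polygon

# Hypothetical detailed polygon: a circle of radius 1000 with 10,000 vertices.
n = 10_000
detailed = Polygon(
    [
        (1000 * math.cos(2 * math.pi * k / n), 1000 * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]
)

# Tolerance is expressed in the same units as the coordinates.
simplified = detailed.simplify(50)

before = len(detailed.exterior.coords)
after = len(simplified.exterior.coords)
area_change = abs(simplified.area - detailed.area) / detailed.area

print(f"vertices: {before} -> {after}")
print(f"relative area change: {area_change:.2%}")
```

Simplification moves boundaries, so features close to an edge can change which polygon they match. Check the area change (or a sample of join results) against what your analysis can tolerate.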

CRS matters

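Always ensure both GeoDataFrames use the same CRS; joining mismatched coordinate systems produces wrong or empty results. The spatial index also works on unprojected data (EPSG:4326), but distances and tolerances are then expressed in degrees, so reproject to a metric CRS such as UTM before using simplify tolerances or distance thresholds.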

# EPSG:32618 is UTM zone 18N; pick the zone that covers your data
points = points.to_crs(epsg=32618)
polygons = polygons.to_crs(epsg=32618)
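
A defensive pattern is to compare the two CRS before joining and reproject only when they differ. The sketch below uses hypothetical coordinates near New York City; the variable names are illustrative.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical point stored as longitude/latitude.
pts = gpd.GeoDataFrame(
    {"name": ["a"]}, geometry=[Point(-73.98, 40.75)], crs="EPSG:4326"
)

# Hypothetical zone that has already been projected to UTM zone 18N.
zones = gpd.GeoDataFrame(
    {"zone": ["Z"]}, geometry=[box(-74.1, 40.6, -73.8, 40.9)], crs="EPSG:4326"
).to_crs(epsg=32618)

# Reproject the left side only if the two CRS differ.
if pts.crs != zones.crs:
    pts = pts.to_crs(zones.crs)

joined = gpd.sjoin(pts, zones, predicate="within")
print(pts.crs.to_epsg(), len(joined))
```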

Index creation

GeoPandas builds the spatial index lazily on first access. For repeated joins with the same right-side data, the index is reused automatically:

# Force index creation
polygons.sindex  # builds STRtree
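
The index can also be queried directly, which is useful for inspecting how many candidates the bounding-box phase produces. This is a small sketch on hypothetical boxes, assuming a recent GeoPandas where has_sindex and sindex.query are available.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical zones: two adjacent squares and one far away.
zones = gpd.GeoDataFrame(
    {"zone": ["A", "B", "C"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 5, 6, 6)],
)

print(zones.has_sindex)  # reports whether the index has been built yet
sindex = zones.sindex    # first access builds the index
print(zones.has_sindex)

# Positions of zones whose bounding box intersects the query geometry's.
query_geom = box(0.5, 0.25, 1.5, 0.75)
hits = sindex.query(query_geom)
hit_zones = sorted(zones.iloc[hits]["zone"])
print(hit_zones)
```

The query returns integer positions, so use iloc rather than loc to look up the matching rows.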

Nearest join

When you need “find the closest feature” instead of “find features that overlap”:

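# Nearest restaurant to each park
# (parks and restaurants are GeoDataFrames in the same projected CRS;
# max_distance and dist_m are in CRS units, here meters)
nearest = gpd.sjoin_nearest(
    parks, restaurants, distance_col="dist_m", max_distance=1000
)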

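sjoin_nearest also uses the spatial index, returning the closest feature within the optional max_distance threshold. With the default inner join, left rows with no feature inside that threshold are dropped, and equidistant nearest features produce one row each.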
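
A self-contained version of the nearest join, with hypothetical parks and restaurants placed in UTM zone 18N so that distances come out in meters:

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical data in a projected CRS (UTM zone 18N), so units are meters.
parks = gpd.GeoDataFrame(
    {"park": ["P1", "P2"]},
    geometry=[Point(500000, 4500000), Point(510000, 4500000)],
    crs="EPSG:32618",
)
restaurants = gpd.GeoDataFrame(
    {"restaurant": ["R1", "R2"]},
    geometry=[Point(500300, 4500400), Point(500900, 4500000)],
    crs="EPSG:32618",
)

nearest = gpd.sjoin_nearest(
    parks, restaurants, distance_col="dist_m", max_distance=1000
)
print(nearest[["park", "restaurant", "dist_m"]])
```

P1 is 500 m from R1 and 900 m from R2, so it is paired with R1. P2 has no restaurant within 1000 m and does not appear in the result.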

Common misconception

Many people blame GeoPandas for slow spatial joins when the real problem is geometry complexity. A million simple points joining against simplified polygons takes seconds. The same join against unsimplified coastline polygons with millions of vertices can take minutes. Simplify first, join second.

The one thing to remember: Spatial join performance is dominated by two factors. The spatial index eliminates more than 99% of pair comparisons, and geometry simplification reduces the cost of the comparisons that remain.
