Spatial Joins Performance — Deep Dive

Optimize spatial joins at scale: R-tree tuning, Dask-GeoPandas parallelism, DuckDB spatial, prepared geometries, and million-to-million join strategies.

Spatial joins are the most performance-sensitive operation in geospatial Python. Moving from simple sjoin calls to million-to-million geometry matching requires understanding index internals, geometry preparation, partitioning strategies, and alternative engines.

R-tree index internals

GeoPandas delegates spatial indexing to Shapely’s STRtree (Sort-Tile-Recursive tree). The STR algorithm:

Sorts geometries by the x-coordinate of their bounding-box centroid
Packs them into leaf nodes (default fanout ≈ 10)
Recursively builds parent nodes by sorting y-coordinates
Result: a balanced tree where bounding boxes at each level tightly enclose their children

Query mechanics

from shapely import STRtree

tree = STRtree(polygon_geometries)

# Query: which polygons might contain this point?
candidates_idx = tree.query(point, predicate="intersects")
# Returns array of indices into polygon_geometries

The predicate parameter in Shapely 2.0 enables index-level predicate pushdown. With predicate="contains", the STRtree performs both the bounding-box check AND the exact geometry test in C, avoiding Python overhead for each candidate.

Bulk query for join operations

from shapely import STRtree
import numpy as np

tree = STRtree(right_geoms)

# Query all left geometries at once
left_idx, right_idx = tree.query(left_geoms, predicate="intersects")
# Returns two arrays of matching indices

This bulk query is what GeoPandas sjoin uses internally. The entire loop — build tree, query all geometries, return index pairs — runs in C with no per-geometry Python calls.

Prepared geometries

For repeated tests against the same geometry (e.g., checking millions of points against one complex polygon), GEOS “prepares” the geometry by pre-computing internal acceleration structures:

from shapely import prepare

complex_polygon = polygons.geometry.iloc[0]
prepare(complex_polygon)  # pre-computes edge index

# Subsequent operations are 10-100× faster
results = [complex_polygon.contains(p) for p in points]

Shapely 2.0 automatically prepares geometries during bulk operations, but explicit preparation helps when you control the loop.

GeoPandas sjoin optimization checklist

import geopandas as gpd

# 1. Project to meters (tighter bounding boxes, faster index)
left = left.to_crs(epsg=32618)
right = right.to_crs(epsg=32618)

# 2. Simplify complex geometries on the right side
right["geometry"] = right.geometry.simplify(tolerance=10)  # meters

# 3. Use the most restrictive predicate
joined = gpd.sjoin(left, right, predicate="within")  # not "intersects"

# 4. Reduce columns before join (less memory, faster copy)
right_slim = right[["geometry", "region_id", "name"]]
joined = gpd.sjoin(left, right_slim, predicate="within")

Benchmarks: join strategies compared

Testing 1M points against 10K polygons (avg 100 vertices each):

Strategy	Time	Memory	Notes
GeoPandas sjoin (default)	8s	2 GB	STRtree + vectorized
GeoPandas sjoin (unsimplified, 10K vertices/polygon)	180s	2 GB	Geometry test dominates
Shapely bulk query (manual)	5s	1.5 GB	Skip DataFrame overhead
Dask-GeoPandas (4 workers)	3s	4 GB	Parallel partitions
DuckDB Spatial	2s	0.5 GB	Columnar + native
PostGIS (indexed)	1.5s	N/A	Server-side R-tree
H3 pre-bucketing + sjoin	1s	1 GB	Index elimination

Dask-GeoPandas for parallel joins

import dask_geopandas

# Partition left GeoDataFrame spatially
dask_left = dask_geopandas.from_geopandas(left, npartitions=16)
dask_left = dask_left.spatial_shuffle()  # Hilbert-curve partitioning

# Spatial join across partitions
result = dask_geopandas.sjoin(dask_left, right, predicate="within")
result_gdf = result.compute()

Hilbert-curve spatial shuffling ensures partitions are spatially coherent, so each worker’s join only touches relevant right-side geometries.

DuckDB Spatial extension

import duckdb

duckdb.install_extension("spatial")
duckdb.load_extension("spatial")

result = duckdb.sql("""
    SELECT p.*, n.name as neighborhood
    FROM read_parquet('points.parquet') p
    JOIN read_parquet('neighborhoods.parquet') n
    ON ST_Within(
        ST_Point(p.lng, p.lat),
        ST_GeomFromWKB(n.geometry)
    )
""").df()

DuckDB processes the join in a columnar, vectorized engine and can handle datasets larger than memory by spilling to disk.

H3 pre-bucketing strategy

The fastest approach for point-in-polygon joins: pre-index both sides with H3:

import h3

# 1. Assign each point to an H3 cell
points["h3"] = points.apply(
    lambda r: h3.latlng_to_cell(r.lat, r.lng, res=7), axis=1
)

# 2. Assign each polygon to the H3 cells it covers
polygon_cells = {}
for idx, row in polygons.iterrows():
    cells = h3.polygon_to_cells(row.geometry.__geo_interface__, res=7)
    for cell in cells:
        polygon_cells.setdefault(cell, []).append(idx)

# 3. Join via H3 cell lookup (O(1) per point)
def find_polygon(row):
    candidates = polygon_cells.get(row.h3, [])
    for cand_idx in candidates:
        if polygons.geometry.iloc[cand_idx].contains(
            Point(row.lng, row.lat)
        ):
            return cand_idx
    return None

points["polygon_idx"] = points.apply(find_polygon, axis=1)

This eliminates the R-tree entirely. The H3 lookup narrows candidates to 1-3 polygons per point, and the contains check runs only on those.

Handling edge cases

Many-to-many joins

When geometries overlap (e.g., overlapping administrative zones), a single point matches multiple polygons:

joined = gpd.sjoin(points, overlapping_zones, predicate="within")
# Result has more rows than left input

# Keep only the first match
joined = joined.drop_duplicates(subset=points.index.name or "index")

Near-boundary points

Points exactly on a polygon boundary may or may not match depending on the predicate. intersects includes boundary; within uses the DE-9IM “TF**FFF” pattern which excludes boundary points in some GEOS versions. For robustness, buffer polygons slightly:

polygons["geometry"] = polygons.geometry.buffer(0.001)  # tiny buffer

Invalid geometries

Invalid geometries (self-intersecting polygons) cause silent failures or wrong results:

# Check validity
invalid = polygons[~polygons.is_valid]
print(f"{len(invalid)} invalid geometries")

# Fix
polygons["geometry"] = polygons.geometry.make_valid()

Always validate before joining.

Memory optimization

For very large joins that exceed available memory:

# Process in spatial tiles
from shapely.geometry import box

bounds = left.total_bounds  # [minx, miny, maxx, maxy]
tile_size = 0.1  # degrees

results = []
for x in np.arange(bounds[0], bounds[2], tile_size):
    for y in np.arange(bounds[1], bounds[3], tile_size):
        tile = box(x, y, x + tile_size, y + tile_size)
        left_chunk = left[left.intersects(tile)]
        right_chunk = right[right.intersects(tile)]
        if len(left_chunk) > 0 and len(right_chunk) > 0:
            chunk_result = gpd.sjoin(left_chunk, right_chunk, predicate="within")
            results.append(chunk_result)

final = pd.concat(results).drop_duplicates()

This trades computation time for memory by processing the join in spatial tiles.

Production pipeline pattern

Raw CSV/Parquet (lat, lng, attributes)
  → Load into DuckDB or Pandas
  → Assign H3 cells (coarse pre-filter)
  → Group by H3 cell
  → For each group: GeoPandas sjoin against clipped polygons
  → Concatenate results
  → Write to Parquet with H3 cell as partition key

This hybrid approach combines H3’s O(1) lookup with GeoPandas’ exact geometry testing, scaling to billion-row datasets on a single machine.

The one thing to remember: The biggest spatial join performance wins come before the join itself — project to meters, simplify geometries, pre-bucket with H3, and choose the right engine (GeoPandas for convenience, DuckDB for scale, PostGIS for concurrent access).

pythonspatial-joinsgeospatialperformance