Spatial Joins Performance — Deep Dive
Spatial joins are the most performance-sensitive operation in geospatial Python. Moving from simple sjoin calls to million-to-million geometry matching requires understanding index internals, geometry preparation, partitioning strategies, and alternative engines.
R-tree index internals
GeoPandas delegates spatial indexing to Shapely’s STRtree (Sort-Tile-Recursive tree). The STR algorithm:
- Sorts geometries by the x-coordinate of their bounding-box centroid
- Packs them into leaf nodes (default fanout ≈ 10)
- Recursively builds parent nodes by sorting y-coordinates
- Result: a balanced tree where bounding boxes at each level tightly enclose their children
Query mechanics
from shapely import STRtree
tree = STRtree(polygon_geometries)
# Query: which polygons might contain this point?
candidates_idx = tree.query(point, predicate="intersects")
# Returns array of indices into polygon_geometries
The predicate parameter in Shapely 2.0 enables index-level predicate pushdown. With predicate="contains", the STRtree performs both the bounding-box check AND the exact geometry test in C, avoiding Python overhead for each candidate.
Bulk query for join operations
from shapely import STRtree
import numpy as np
tree = STRtree(right_geoms)
# Query all left geometries at once
left_idx, right_idx = tree.query(left_geoms, predicate="intersects")
# Returns two arrays of matching indices
This bulk query is what GeoPandas sjoin uses internally. The entire loop — build tree, query all geometries, return index pairs — runs in C with no per-geometry Python calls.
Prepared geometries
For repeated tests against the same geometry (e.g., checking millions of points against one complex polygon), GEOS “prepares” the geometry by pre-computing internal acceleration structures:
from shapely import prepare
complex_polygon = polygons.geometry.iloc[0]
prepare(complex_polygon) # pre-computes edge index
# Subsequent operations are 10-100× faster
results = [complex_polygon.contains(p) for p in points]
Shapely 2.0 automatically prepares geometries during bulk operations, but explicit preparation helps when you control the loop.
GeoPandas sjoin optimization checklist
import geopandas as gpd
# 1. Project to meters (tighter bounding boxes, faster index)
left = left.to_crs(epsg=32618)
right = right.to_crs(epsg=32618)
# 2. Simplify complex geometries on the right side
right["geometry"] = right.geometry.simplify(tolerance=10) # meters
# 3. Use the most restrictive predicate
joined = gpd.sjoin(left, right, predicate="within") # not "intersects"
# 4. Reduce columns before join (less memory, faster copy)
right_slim = right[["geometry", "region_id", "name"]]
joined = gpd.sjoin(left, right_slim, predicate="within")
Benchmarks: join strategies compared
Testing 1M points against 10K polygons (avg 100 vertices each):
| Strategy | Time | Memory | Notes |
|---|---|---|---|
| GeoPandas sjoin (default) | 8s | 2 GB | STRtree + vectorized |
| GeoPandas sjoin (unsimplified, 10K vertices/polygon) | 180s | 2 GB | Geometry test dominates |
| Shapely bulk query (manual) | 5s | 1.5 GB | Skip DataFrame overhead |
| Dask-GeoPandas (4 workers) | 3s | 4 GB | Parallel partitions |
| DuckDB Spatial | 2s | 0.5 GB | Columnar + native |
| PostGIS (indexed) | 1.5s | N/A | Server-side R-tree |
| H3 pre-bucketing + sjoin | 1s | 1 GB | Index elimination |
Dask-GeoPandas for parallel joins
import dask_geopandas
# Partition left GeoDataFrame spatially
dask_left = dask_geopandas.from_geopandas(left, npartitions=16)
dask_left = dask_left.spatial_shuffle() # Hilbert-curve partitioning
# Spatial join across partitions
result = dask_geopandas.sjoin(dask_left, right, predicate="within")
result_gdf = result.compute()
Hilbert-curve spatial shuffling ensures partitions are spatially coherent, so each worker’s join only touches relevant right-side geometries.
DuckDB Spatial extension
import duckdb
duckdb.install_extension("spatial")
duckdb.load_extension("spatial")
result = duckdb.sql("""
SELECT p.*, n.name as neighborhood
FROM read_parquet('points.parquet') p
JOIN read_parquet('neighborhoods.parquet') n
ON ST_Within(
ST_Point(p.lng, p.lat),
ST_GeomFromWKB(n.geometry)
)
""").df()
DuckDB processes the join in a columnar, vectorized engine and can handle datasets larger than memory by spilling to disk.
H3 pre-bucketing strategy
The fastest approach for point-in-polygon joins: pre-index both sides with H3:
import h3
# 1. Assign each point to an H3 cell
points["h3"] = points.apply(
lambda r: h3.latlng_to_cell(r.lat, r.lng, res=7), axis=1
)
# 2. Assign each polygon to the H3 cells it covers
polygon_cells = {}
for idx, row in polygons.iterrows():
cells = h3.polygon_to_cells(row.geometry.__geo_interface__, res=7)
for cell in cells:
polygon_cells.setdefault(cell, []).append(idx)
# 3. Join via H3 cell lookup (O(1) per point)
def find_polygon(row):
candidates = polygon_cells.get(row.h3, [])
for cand_idx in candidates:
if polygons.geometry.iloc[cand_idx].contains(
Point(row.lng, row.lat)
):
return cand_idx
return None
points["polygon_idx"] = points.apply(find_polygon, axis=1)
This eliminates the R-tree entirely. The H3 lookup narrows candidates to 1-3 polygons per point, and the contains check runs only on those.
Handling edge cases
Many-to-many joins
When geometries overlap (e.g., overlapping administrative zones), a single point matches multiple polygons:
joined = gpd.sjoin(points, overlapping_zones, predicate="within")
# Result has more rows than left input
# Keep only the first match
joined = joined.drop_duplicates(subset=points.index.name or "index")
Near-boundary points
Points exactly on a polygon boundary may or may not match depending on the predicate. intersects includes boundary; within uses the DE-9IM “TF**FFF” pattern which excludes boundary points in some GEOS versions. For robustness, buffer polygons slightly:
polygons["geometry"] = polygons.geometry.buffer(0.001) # tiny buffer
Invalid geometries
Invalid geometries (self-intersecting polygons) cause silent failures or wrong results:
# Check validity
invalid = polygons[~polygons.is_valid]
print(f"{len(invalid)} invalid geometries")
# Fix
polygons["geometry"] = polygons.geometry.make_valid()
Always validate before joining.
Memory optimization
For very large joins that exceed available memory:
# Process in spatial tiles
from shapely.geometry import box
bounds = left.total_bounds # [minx, miny, maxx, maxy]
tile_size = 0.1 # degrees
results = []
for x in np.arange(bounds[0], bounds[2], tile_size):
for y in np.arange(bounds[1], bounds[3], tile_size):
tile = box(x, y, x + tile_size, y + tile_size)
left_chunk = left[left.intersects(tile)]
right_chunk = right[right.intersects(tile)]
if len(left_chunk) > 0 and len(right_chunk) > 0:
chunk_result = gpd.sjoin(left_chunk, right_chunk, predicate="within")
results.append(chunk_result)
final = pd.concat(results).drop_duplicates()
This trades computation time for memory by processing the join in spatial tiles.
Production pipeline pattern
Raw CSV/Parquet (lat, lng, attributes)
→ Load into DuckDB or Pandas
→ Assign H3 cells (coarse pre-filter)
→ Group by H3 cell
→ For each group: GeoPandas sjoin against clipped polygons
→ Concatenate results
→ Write to Parquet with H3 cell as partition key
This hybrid approach combines H3’s O(1) lookup with GeoPandas’ exact geometry testing, scaling to billion-row datasets on a single machine.
The one thing to remember: The biggest spatial join performance wins come before the join itself — project to meters, simplify geometries, pre-bucket with H3, and choose the right engine (GeoPandas for convenience, DuckDB for scale, PostGIS for concurrent access).
See Also
- Python Adaptive Learning Systems How Python builds learning apps that adjust to each student like a personal tutor who knows exactly what you need next.
- Python Airflow Learn Airflow as a timetable manager that makes sure data tasks run in the right order every day.
- Python Altair Learn Altair through the idea of drawing charts by describing rules, not by hand-placing every visual element.
- Python Automated Grading How Python grades homework and exams automatically, from simple answer keys to understanding written essays.
- Python Batch Vs Stream Processing Batch processing is like doing laundry once a week; stream processing is like a self-cleaning shirt that cleans itself constantly.