Spatial Joins Performance — Core Concepts
A spatial join matches rows from two geographic datasets based on their spatial relationship — “which points fall inside which polygons?” or “which lines intersect which boundaries?” In Python, GeoPandas provides the primary interface, but understanding the underlying mechanics is essential for performance.
Basic spatial join
import geopandas as gpd
points = gpd.read_file("restaurants.geojson") # 500K points
polygons = gpd.read_file("neighborhoods.geojson") # 2K polygons
joined = gpd.sjoin(points, polygons, predicate="within")
sjoin returns each matching point annotated with the attributes of the polygon it falls within. Points that match no polygon are dropped, because sjoin performs an inner join by default.
How the spatial index works
GeoPandas uses an STRtree (Sort-Tile-Recursive) spatial index from Shapely. The process:
- Build index — the right GeoDataFrame’s bounding boxes are loaded into an R-tree structure. This takes O(n log n) time.
- Query — for each geometry on the left side, the index finds candidate matches whose bounding boxes overlap. This is O(log n) per query.
- Refine — candidates are tested with the exact geometric predicate (within, intersects, contains). This is the expensive part, but only runs on a small fraction of pairs.
Without index: 500K × 2K = 1 billion comparisons
With index: 500K × ~3 candidates avg = 1.5 million comparisons
Speedup: ~670× fewer exact comparisons
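The filter-and-refine idea above can be sketched in plain Python. This is only an illustration, not how GeoPandas or Shapely implement it: a linear scan over bounding boxes stands in for the STRtree query, and the names `bbox`, `point_in_ring`, and `filter_refine_join` are hypothetical helpers invented for this example.

```python
import random


def bbox(ring):
    """Return (minx, miny, maxx, maxy) for a list of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_ring(pt, ring):
    """Exact test: ray casting (even-odd rule) over every edge of the ring."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_refine_join(points, rings):
    """Return (matches, exact_tests) for a points-in-polygons join."""
    boxes = [bbox(r) for r in rings]  # "build" step: one box per polygon
    matches, exact_tests = [], 0
    for i, (x, y) in enumerate(points):
        # Filter: cheap bounding-box check (a real index avoids this full scan).
        candidates = [
            j for j, (minx, miny, maxx, maxy) in enumerate(boxes)
            if minx <= x <= maxx and miny <= y <= maxy
        ]
        # Refine: the expensive exact test runs only on the candidates.
        for j in candidates:
            exact_tests += 1
            if point_in_ring((x, y), rings[j]):
                matches.append((i, j))
    return matches, exact_tests


# Synthetic data: a 4 x 4 grid of diamond-shaped polygons and 200 random points.
random.seed(0)
rings = [
    [(cx - 4, cy), (cx, cy - 4), (cx + 4, cy), (cx, cy + 4)]
    for cx in range(5, 45, 10)
    for cy in range(5, 45, 10)
]
points = [(random.uniform(0, 40), random.uniform(0, 40)) for _ in range(200)]

matches, exact_tests = filter_refine_join(points, rings)
print(f"{exact_tests} exact tests instead of {len(points) * len(rings)}")
```

The diamonds are chosen so that their bounding boxes contain points that the polygons themselves do not, which is exactly the case the refine step exists to reject.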
Predicate types
| Predicate | Meaning | Use case |
|---|---|---|
| intersects | Geometries share any space | Default, most inclusive |
| within | Left is completely inside right | Points in polygons |
| contains | Right is completely inside left | Polygons containing points |
| crosses | Geometries cross each other | Roads crossing rivers |
| overlaps | Partial overlap between same-dimension geometries | Overlapping zones |
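Choose the most restrictive predicate that fits your use case. The bounding-box phase produces the same candidates either way, but a stricter exact test discards more of them, so fewer unwanted matches reach the result.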
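A point sitting exactly on a polygon boundary is a quick way to see how the predicates differ. The sketch below uses Shapely directly (the geometry library underneath GeoPandas); the square and the two points are made-up test geometries.

```python
from shapely.geometry import Point, box

square = box(0, 0, 10, 10)   # axis-aligned square polygon
inside = Point(5, 5)         # strictly in the interior
on_edge = Point(0, 5)        # lies on the boundary

for name, pt in [("inside", inside), ("on_edge", on_edge)]:
    print(
        name,
        "intersects:", pt.intersects(square),
        "within:", pt.within(square),
        "contained:", square.contains(pt),
    )
```

The boundary point intersects the square but is not within it, so a join with predicate="within" would drop it while predicate="intersects" would keep it.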
Spatial join types
# Inner join (default): only matching rows
inner = gpd.sjoin(points, polygons, how="inner", predicate="within")
# Left join: keep all points, NaN where no match
left = gpd.sjoin(points, polygons, how="left", predicate="within")
# A point matching multiple polygons appears as multiple rows.
# To keep one row per point, deduplicate on the left index, which sjoin preserves:
# joined[~joined.index.duplicated(keep="first")]
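The difference between the join types is easiest to see on a tiny synthetic dataset. In this sketch the column names (`name`, `zone`), the coordinates, and the choice of EPSG:32618 are all assumptions made for illustration; deduplicating on the left index is one possible way to keep a single row per point.

```python
import geopandas as gpd
from shapely.geometry import Point, box

points = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(1, 1), Point(5, 5), Point(20, 20)],
    crs="EPSG:32618",
)
polygons = gpd.GeoDataFrame(
    {"zone": ["west", "big"]},
    geometry=[box(0, 0, 2, 2), box(0, 0, 10, 10)],  # "west" lies inside "big"
    crs="EPSG:32618",
)

# Inner join: point "c" matches nothing and is dropped.
inner = gpd.sjoin(points, polygons, how="inner", predicate="within")

# Left join: point "c" is kept, with missing values in the polygon columns.
left = gpd.sjoin(points, polygons, how="left", predicate="within")

# Point "a" falls in both polygons, so it appears twice; keep one row per point.
first_match = left[~left.index.duplicated(keep="first")]

print(len(inner), len(left), len(first_match))
```

Which of the two zones survives for point "a" depends on row order, so a real pipeline would usually sort or rank the matches before deduplicating.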
Performance factors
Geometry complexity
A polygon with 10 vertices is cheap to test. A polygon with 100,000 vertices (detailed coastline) is expensive. Simplify complex geometries before joining:
polygons["geometry"] = polygons.geometry.simplify(tolerance=50) # tolerance is in CRS units: 50 m only in a metric projected CRS
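To get a feel for what simplification does to vertex counts, the sketch below simplifies a single Shapely polygon. The buffered point is a stand-in for a detailed boundary, and the radius of 1000 and tolerance of 50 are illustrative values, assumed to be in meters.

```python
from shapely.geometry import Point

# A circle approximated by many short segments stands in for a detailed polygon.
detailed = Point(0, 0).buffer(1000)

# Simplify with a tolerance of 50 units (same units as the coordinates).
simplified = detailed.simplify(50)

before = len(detailed.exterior.coords)
after = len(simplified.exterior.coords)
area_kept = simplified.area / detailed.area

print(f"vertices: {before} -> {after}, area kept: {area_kept:.1%}")
```

Checking the retained area is a cheap sanity test: if it drops sharply, the tolerance is probably too coarse for a points-in-polygons join near the boundaries.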
CRS matters
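Always ensure both GeoDataFrames use the same CRS: sjoin compares raw coordinates and only warns on a mismatch. The spatial index also works on unprojected data (EPSG:4326), but anything measured in distance units, such as a simplify tolerance in meters or the max_distance of a nearest join, needs projected coordinates (for example UTM).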
points = points.to_crs(epsg=32618)
polygons = polygons.to_crs(epsg=32618)
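A small helper can make the reprojection step harder to forget. This is a sketch under assumptions: `to_common_crs` is a hypothetical name, the sample coordinates are invented, and EPSG:32618 (UTM zone 18N) only suits data in that zone.

```python
import geopandas as gpd
from shapely.geometry import Point, box


def to_common_crs(left, right, epsg):
    """Reproject both frames to the same caller-chosen projected CRS."""
    return left.to_crs(epsg=epsg), right.to_crs(epsg=epsg)


# Sample data in longitude/latitude (EPSG:4326).
points = gpd.GeoDataFrame(geometry=[Point(-74.0, 40.7)], crs="EPSG:4326")
polygons = gpd.GeoDataFrame(
    geometry=[box(-74.1, 40.6, -73.9, 40.8)], crs="EPSG:4326"
)

points_utm, polygons_utm = to_common_crs(points, polygons, 32618)

# Both frames now share one CRS, so the join compares like with like.
joined = gpd.sjoin(points_utm, polygons_utm, predicate="within")
print(points_utm.crs, len(joined))
```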
Index creation
GeoPandas builds the spatial index lazily on first access. For repeated joins with the same right-side data, the index is reused automatically:
# Force index creation
polygons.sindex # builds STRtree
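The lazy build can be observed with the `has_sindex` property, and the index can also be queried directly for bounding-box candidates. The two boxes and the query point below are made-up data for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point, box

polygons = gpd.GeoDataFrame(geometry=[box(0, 0, 1, 1), box(2, 2, 3, 3)])

before = polygons.has_sindex   # nothing has asked for the index yet
tree = polygons.sindex         # first access builds the index
after = polygons.has_sindex    # the built index is now cached

# Query the index directly: returns integer positions of candidate rows
# whose bounding boxes intersect the query geometry.
candidates = tree.query(Point(0.5, 0.5))

print(before, after, list(candidates))
```

Building the index up front mainly helps when timing a join, since it separates the one-off build cost from the per-join query cost.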
Nearest join
When you need “find the closest feature” instead of “find features that overlap”:
# Nearest restaurant to each park
nearest = gpd.sjoin_nearest(
    parks, restaurants, distance_col="dist_m", max_distance=1000
)
sjoin_nearest also uses the spatial index, returning the closest feature within the optional max_distance threshold.
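A tiny worked example shows how `max_distance` and `distance_col` behave. The park and restaurant names and coordinates are invented, and the data is assumed to be in a metric projected CRS so that distances come out in meters.

```python
import geopandas as gpd
from shapely.geometry import Point

parks = gpd.GeoDataFrame(
    {"park": ["p1", "p2"]},
    geometry=[Point(0, 0), Point(5000, 5000)],
    crs="EPSG:32618",
)
restaurants = gpd.GeoDataFrame(
    {"restaurant": ["r1", "r2"]},
    geometry=[Point(300, 400), Point(0, 900)],
    crs="EPSG:32618",
)

# p1 has two restaurants in range and gets the closer one (r1).
# p2 has no restaurant within 1000 units, so the inner join drops it.
nearest = gpd.sjoin_nearest(
    parks, restaurants, how="inner", distance_col="dist_m", max_distance=1000
)

print(nearest[["park", "restaurant", "dist_m"]])
```

Passing how="left" instead would keep p2 in the result with missing values in the restaurant columns.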
Common misconception
Many people blame GeoPandas for slow spatial joins when the real problem is geometry complexity. A million simple points joining against simplified polygons takes seconds. The same join against unsimplified coastline polygons with millions of vertices can take minutes. Simplify first, join second.
The one thing to remember: two factors dominate spatial join performance. The spatial index eliminates more than 99% of pair comparisons, and geometry simplification reduces the cost of the comparisons that remain.