GeoPandas Spatial Data — Deep Dive
GeoPandas makes spatial analysis accessible, but scaling it to millions of geometries, integrating with databases, and building reproducible pipelines requires deeper knowledge of its internals, performance levers, and ecosystem integrations.
Under the hood
Geometry engine: Shapely 2.0
Since GeoPandas 0.14, the geometry column stores a Shapely 2.0 array. Spatial operations dispatch to vectorized GEOS functions through shapely.lib, avoiding Python-level loops. This means gdf.intersects(polygon) runs the loop in C, not Python.
Spatial index
GeoPandas builds an R-tree spatial index (via pygeos/shapely STRtree) lazily the first time you call gdf.sindex. The index accelerates bounding-box queries from O(n) to O(log n).
# Force index creation
gdf.sindex
# Query: which geometries might intersect this bounding box?
candidates = list(gdf.sindex.intersection(query_geom.bounds))
# Refine with exact test
matches = gdf.iloc[candidates][gdf.iloc[candidates].intersects(query_geom)]
sjoin uses the spatial index automatically. For custom operations, accessing sindex directly gives you control over the two-phase filter (bounding box → exact geometry).
Performance optimization
Profile first
import time
start = time.perf_counter()
result = gpd.sjoin(points_gdf, polygons_gdf, predicate="within")
elapsed = time.perf_counter() - start
print(f"sjoin: {elapsed:.2f}s for {len(points_gdf)} points × {len(polygons_gdf)} polygons")
Common bottlenecks and fixes
| Bottleneck | Symptom | Fix |
|---|---|---|
| No spatial index | sjoin takes minutes | Ensure sindex is built (automatic in sjoin) |
| Complex polygons | Operations slow per-geometry | Simplify with gdf.simplify(tolerance) |
| Large GeoJSON I/O | Read/write takes 30+ seconds | Use GeoParquet or GeoPackage |
| Python-level loops | Custom operations are slow | Use vectorized Shapely 2.0 functions |
| Memory overflow | DataFrame exceeds RAM | Use Dask-GeoPandas or process in chunks |
GeoParquet for fast I/O
# Write — 10× faster to read back than GeoJSON
gdf.to_parquet("buildings.parquet")
# Read — supports column pruning and row group filtering
gdf = gpd.read_parquet("buildings.parquet", columns=["name", "geometry"])
GeoParquet stores geometry as WKB in a Parquet column with standardized metadata. It supports predicate pushdown in DuckDB and Spark, enabling SQL queries on spatial files without loading them fully into memory.
Simplification before visualization
When rendering 100K+ polygons on a map, simplify first:
# Reduce vertex count by ~90% with minimal visual change
gdf["geometry"] = gdf.simplify(tolerance=0.0001, preserve_topology=True)
Choose tolerance based on your CRS: 0.0001 degrees ≈ 11 meters at the equator.
Dask-GeoPandas for parallel processing
For datasets that exceed memory or benefit from parallel computation:
import dask_geopandas
ddf = dask_geopandas.read_parquet("buildings.parquet", npartitions=8)
# Spatial operations work lazily
buffered = ddf.buffer(100)
result = buffered.compute() # triggers execution across partitions
Spatial partitioning
For efficient spatial joins on large datasets, partition by Hilbert curve:
ddf = dask_geopandas.from_geopandas(gdf, npartitions=16)
ddf = ddf.spatial_shuffle(by="hilbert") # co-locate nearby geometries
# sjoin is now partition-local where possible
result = dask_geopandas.sjoin(ddf_points, ddf_polygons, predicate="within")
Hilbert partitioning ensures that spatially nearby geometries end up in the same partition, minimizing cross-partition joins.
Database integration
PostGIS read/write
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@localhost/geodb")
# Read with spatial filter (pushed to PostGIS)
gdf = gpd.read_postgis(
"SELECT * FROM buildings WHERE ST_Intersects(geom, ST_MakeEnvelope(13.3, 52.5, 13.4, 52.6, 4326))",
engine, geom_col="geom",
)
# Write
gdf.to_postgis("buildings_clean", engine, if_exists="replace", index=True)
DuckDB spatial extension
import duckdb
conn = duckdb.connect()
conn.install_extension("spatial")
conn.load_extension("spatial")
# Query GeoParquet directly with SQL
result = conn.execute("""
SELECT name, ST_Area(ST_Transform(geometry, 'EPSG:4326', 'EPSG:32633')) as area_m2
FROM read_parquet('buildings.parquet')
WHERE ST_Within(geometry, ST_GeomFromText('POLYGON((13.3 52.5, 13.4 52.5, 13.4 52.6, 13.3 52.6, 13.3 52.5))'))
""").fetchdf()
DuckDB processes GeoParquet files out-of-core, handling datasets larger than RAM.
Advanced spatial analysis
Overlay operations
Combine two polygon layers with a spatial predicate:
# Find the intersection of land use zones and flood areas
flood_risk = gpd.overlay(landuse_gdf, flood_gdf, how="intersection")
flood_risk["affected_area_km2"] = flood_risk.to_crs(epsg=32633).area / 1e6
Overlay modes: intersection, union, difference, symmetric_difference, identity.
Nearest join
Find the nearest geometry in another GeoDataFrame:
# For each school, find the nearest hospital
result = gpd.sjoin_nearest(
schools_gdf, hospitals_gdf,
how="left", distance_col="dist_to_hospital",
max_distance=10000, # 10km max (requires projected CRS)
)
Dissolve (spatial groupby)
Merge geometries by a grouping column:
# Merge neighborhoods into districts
districts = neighborhoods_gdf.dissolve(by="district_name", aggfunc="sum")
# Each district is now a single (multi)polygon with summed attributes
Spatial autocorrelation
Detect clustering patterns with PySAL integration:
from libpysal.weights import Queen
from esda.moran import Moran
weights = Queen.from_dataframe(gdf)
moran = Moran(gdf["crime_rate"], weights)
print(f"Moran's I: {moran.I:.3f}, p-value: {moran.p_sim:.4f}")
# Moran's I > 0 with low p-value indicates spatial clustering
Pipeline patterns
Reproducible spatial ETL
def spatial_etl(
buildings_path: str,
zones_path: str,
output_path: str,
target_crs: str = "EPSG:32633",
):
# Extract
buildings = gpd.read_parquet(buildings_path)
zones = gpd.read_file(zones_path)
# Transform
buildings = buildings.to_crs(target_crs)
zones = zones.to_crs(target_crs)
# Validate
invalid = ~buildings.is_valid
if invalid.any():
buildings.loc[invalid, "geometry"] = buildings.loc[invalid].make_valid()
# Spatial join
result = gpd.sjoin(buildings, zones, predicate="within")
result["area_m2"] = result.area
# Load
result.to_parquet(output_path)
return {"rows": len(result), "invalid_fixed": int(invalid.sum())}
Testing spatial pipelines
import pytest
from shapely.geometry import Point, Polygon
def test_spatial_join_assigns_zone():
points = gpd.GeoDataFrame(
{"name": ["A"]},
geometry=[Point(1, 1)],
crs="EPSG:4326",
)
zones = gpd.GeoDataFrame(
{"zone": ["Z1"]},
geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])],
crs="EPSG:4326",
)
result = gpd.sjoin(points, zones, predicate="within")
assert result.iloc[0]["zone"] == "Z1"
Common pitfalls
- Mixing CRS in spatial joins. GeoPandas raises a warning but proceeds — results will be wrong. Always ensure both GeoDataFrames share the same CRS before joining.
- Using
apply()for spatial operations. Apply runs a Python function per row. Use vectorized methods (gdf.intersects(),gdf.distance()) instead. - Ignoring invalid geometries.
sjoinsilently drops rows with invalid geometry. Validate withgdf.is_validand fix withgdf.make_valid()first. - Reading huge shapefiles. Shapefiles lack spatial filtering support. Convert to GeoParquet once, then query efficiently.
The one thing to remember: GeoPandas scales from notebook exploration to production pipelines when you use GeoParquet for I/O, spatial indexes for joins, vectorized operations instead of loops, and Dask-GeoPandas when data outgrows memory.
See Also
- Python Adaptive Learning Systems How Python builds learning apps that adjust to each student like a personal tutor who knows exactly what you need next.
- Python Airflow Learn Airflow as a timetable manager that makes sure data tasks run in the right order every day.
- Python Altair Learn Altair through the idea of drawing charts by describing rules, not by hand-placing every visual element.
- Python Automated Grading How Python grades homework and exams automatically, from simple answer keys to understanding written essays.
- Python Batch Vs Stream Processing Batch processing is like doing laundry once a week; stream processing is like a self-cleaning shirt that cleans itself constantly.