GeoPandas Spatial Data — Deep Dive

Scale GeoPandas workflows with spatial indexing, GeoParquet, Dask-GeoPandas parallelism, and production-grade spatial analysis patterns.

GeoPandas makes spatial analysis accessible, but scaling it to millions of geometries, integrating with databases, and building reproducible pipelines requires deeper knowledge of its internals, performance levers, and ecosystem integrations.

Under the hood

Geometry engine: Shapely 2.0

Since GeoPandas 0.14, the geometry column stores a Shapely 2.0 array. Spatial operations dispatch to vectorized GEOS functions through shapely.lib, avoiding Python-level loops. This means gdf.intersects(polygon) runs the loop in C, not Python.

Spatial index

GeoPandas builds an R-tree spatial index (via pygeos/shapely STRtree) lazily the first time you call gdf.sindex. The index accelerates bounding-box queries from O(n) to O(log n).

# Force index creation
gdf.sindex

# Query: which geometries might intersect this bounding box?
candidates = list(gdf.sindex.intersection(query_geom.bounds))
# Refine with exact test
matches = gdf.iloc[candidates][gdf.iloc[candidates].intersects(query_geom)]

sjoin uses the spatial index automatically. For custom operations, accessing sindex directly gives you control over the two-phase filter (bounding box → exact geometry).

Performance optimization

Profile first

import time

start = time.perf_counter()
result = gpd.sjoin(points_gdf, polygons_gdf, predicate="within")
elapsed = time.perf_counter() - start
print(f"sjoin: {elapsed:.2f}s for {len(points_gdf)} points × {len(polygons_gdf)} polygons")

Common bottlenecks and fixes

Bottleneck	Symptom	Fix
No spatial index	sjoin takes minutes	Ensure sindex is built (automatic in sjoin)
Complex polygons	Operations slow per-geometry	Simplify with `gdf.simplify(tolerance)`
Large GeoJSON I/O	Read/write takes 30+ seconds	Use GeoParquet or GeoPackage
Python-level loops	Custom operations are slow	Use vectorized Shapely 2.0 functions
Memory overflow	DataFrame exceeds RAM	Use Dask-GeoPandas or process in chunks

GeoParquet for fast I/O

# Write — 10× faster to read back than GeoJSON
gdf.to_parquet("buildings.parquet")

# Read — supports column pruning and row group filtering
gdf = gpd.read_parquet("buildings.parquet", columns=["name", "geometry"])

GeoParquet stores geometry as WKB in a Parquet column with standardized metadata. It supports predicate pushdown in DuckDB and Spark, enabling SQL queries on spatial files without loading them fully into memory.

Simplification before visualization

When rendering 100K+ polygons on a map, simplify first:

# Reduce vertex count by ~90% with minimal visual change
gdf["geometry"] = gdf.simplify(tolerance=0.0001, preserve_topology=True)

Choose tolerance based on your CRS: 0.0001 degrees ≈ 11 meters at the equator.

Dask-GeoPandas for parallel processing

For datasets that exceed memory or benefit from parallel computation:

import dask_geopandas

ddf = dask_geopandas.read_parquet("buildings.parquet", npartitions=8)

# Spatial operations work lazily
buffered = ddf.buffer(100)
result = buffered.compute()  # triggers execution across partitions

Spatial partitioning

For efficient spatial joins on large datasets, partition by Hilbert curve:

ddf = dask_geopandas.from_geopandas(gdf, npartitions=16)
ddf = ddf.spatial_shuffle(by="hilbert")  # co-locate nearby geometries

# sjoin is now partition-local where possible
result = dask_geopandas.sjoin(ddf_points, ddf_polygons, predicate="within")

Hilbert partitioning ensures that spatially nearby geometries end up in the same partition, minimizing cross-partition joins.

Database integration

PostGIS read/write

from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/geodb")

# Read with spatial filter (pushed to PostGIS)
gdf = gpd.read_postgis(
    "SELECT * FROM buildings WHERE ST_Intersects(geom, ST_MakeEnvelope(13.3, 52.5, 13.4, 52.6, 4326))",
    engine, geom_col="geom",
)

# Write
gdf.to_postgis("buildings_clean", engine, if_exists="replace", index=True)

DuckDB spatial extension

import duckdb

conn = duckdb.connect()
conn.install_extension("spatial")
conn.load_extension("spatial")

# Query GeoParquet directly with SQL
result = conn.execute("""
    SELECT name, ST_Area(ST_Transform(geometry, 'EPSG:4326', 'EPSG:32633')) as area_m2
    FROM read_parquet('buildings.parquet')
    WHERE ST_Within(geometry, ST_GeomFromText('POLYGON((13.3 52.5, 13.4 52.5, 13.4 52.6, 13.3 52.6, 13.3 52.5))'))
""").fetchdf()

DuckDB processes GeoParquet files out-of-core, handling datasets larger than RAM.

Advanced spatial analysis

Overlay operations

Combine two polygon layers with a spatial predicate:

# Find the intersection of land use zones and flood areas
flood_risk = gpd.overlay(landuse_gdf, flood_gdf, how="intersection")
flood_risk["affected_area_km2"] = flood_risk.to_crs(epsg=32633).area / 1e6

Overlay modes: intersection, union, difference, symmetric_difference, identity.

Nearest join

Find the nearest geometry in another GeoDataFrame:

# For each school, find the nearest hospital
result = gpd.sjoin_nearest(
    schools_gdf, hospitals_gdf,
    how="left", distance_col="dist_to_hospital",
    max_distance=10000,  # 10km max (requires projected CRS)
)

Dissolve (spatial groupby)

Merge geometries by a grouping column:

# Merge neighborhoods into districts
districts = neighborhoods_gdf.dissolve(by="district_name", aggfunc="sum")
# Each district is now a single (multi)polygon with summed attributes

Spatial autocorrelation

Detect clustering patterns with PySAL integration:

from libpysal.weights import Queen
from esda.moran import Moran

weights = Queen.from_dataframe(gdf)
moran = Moran(gdf["crime_rate"], weights)
print(f"Moran's I: {moran.I:.3f}, p-value: {moran.p_sim:.4f}")
# Moran's I > 0 with low p-value indicates spatial clustering

Pipeline patterns

Reproducible spatial ETL

def spatial_etl(
    buildings_path: str,
    zones_path: str,
    output_path: str,
    target_crs: str = "EPSG:32633",
):
    # Extract
    buildings = gpd.read_parquet(buildings_path)
    zones = gpd.read_file(zones_path)

    # Transform
    buildings = buildings.to_crs(target_crs)
    zones = zones.to_crs(target_crs)

    # Validate
    invalid = ~buildings.is_valid
    if invalid.any():
        buildings.loc[invalid, "geometry"] = buildings.loc[invalid].make_valid()

    # Spatial join
    result = gpd.sjoin(buildings, zones, predicate="within")
    result["area_m2"] = result.area

    # Load
    result.to_parquet(output_path)
    return {"rows": len(result), "invalid_fixed": int(invalid.sum())}

Testing spatial pipelines

import pytest
from shapely.geometry import Point, Polygon

def test_spatial_join_assigns_zone():
    points = gpd.GeoDataFrame(
        {"name": ["A"]},
        geometry=[Point(1, 1)],
        crs="EPSG:4326",
    )
    zones = gpd.GeoDataFrame(
        {"zone": ["Z1"]},
        geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])],
        crs="EPSG:4326",
    )
    result = gpd.sjoin(points, zones, predicate="within")
    assert result.iloc[0]["zone"] == "Z1"

Common pitfalls

Mixing CRS in spatial joins. GeoPandas raises a warning but proceeds — results will be wrong. Always ensure both GeoDataFrames share the same CRS before joining.
Using apply() for spatial operations. Apply runs a Python function per row. Use vectorized methods (gdf.intersects(), gdf.distance()) instead.
Ignoring invalid geometries. sjoin silently drops rows with invalid geometry. Validate with gdf.is_valid and fix with gdf.make_valid() first.
Reading huge shapefiles. Shapefiles lack spatial filtering support. Convert to GeoParquet once, then query efficiently.

The one thing to remember: GeoPandas scales from notebook exploration to production pipelines when you use GeoParquet for I/O, spatial indexes for joins, vectorized operations instead of loops, and Dask-GeoPandas when data outgrows memory.

pythongeopandasgeospatialdata-analysis