Shapely Geometry — Deep Dive
Shapely 2.0 rewrote the library’s internals around vectorized GEOS operations exposed through NumPy arrays, replacing the slower object-by-object approach of Shapely 1.x. This guide covers the new performance model, spatial indexing, coordinate transformations, and patterns for processing millions of geometries efficiently.
Shapely 2.0 architecture
From objects to arrays
Shapely 1.x wrapped each GEOS geometry in a Python object with method-based operations. Shapely 2.0 stores geometries as lightweight wrappers and provides module-level functions that accept arrays:
import shapely
from shapely import Point, Polygon
import numpy as np
# Create 100,000 random points
coords = np.random.uniform(0, 100, (100_000, 2))
points = shapely.points(coords) # returns an array of Point geometries
# Vectorized operation: which points are within a polygon?
zone = Polygon([(10, 10), (50, 10), (50, 50), (10, 50)])
inside = shapely.within(points, zone) # boolean array, computed in C
print(inside.sum()) # fast — no Python loop
The vectorized path is 10–50× faster than iterating with Python for loops because the loop runs entirely in compiled C code.
Key vectorized functions
shapely.area(polygons) # array of areas
shapely.distance(geom_a, geom_b) # pairwise distances
shapely.intersection(a, b) # pairwise intersections
shapely.buffer(geoms, distance) # array of buffered geometries
shapely.unary_union(geoms) # merge all into one geometry
shapely.contains(a, b) # pairwise containment test
Spatial indexing with STRtree
Checking every pair of geometries is O(n²). Spatial indexes reduce this to O(n log n) by pre-sorting geometries by their bounding boxes.
Building and querying an STRtree
from shapely import STRtree
tree = STRtree(points) # build index from array of geometries
# Find all points within 5.0 units of a query point
query_point = Point(25, 25)
buffer = query_point.buffer(5.0)
indices = tree.query(buffer) # candidate indices (bounding box filter)
# Refine with exact geometry test
exact = indices[shapely.within(points[indices], buffer)]
tree.query() returns indices of geometries whose bounding boxes intersect the query geometry. This is a fast filter — always follow with an exact predicate for precision.
Bulk nearest neighbor
# For each point in set A, find the nearest point in set B
tree_b = STRtree(points_b)
indices, distances = tree_b.query_nearest(points_a, return_distance=True)
This runs in O(n log n) and is the foundation for spatial joins in GeoPandas.
Prepared geometries
When you test the same geometry against many others, preparing it pre-computes internal indexes:
from shapely import prepare
zone = Polygon([(0, 0), (100, 0), (100, 100), (0, 100)])
prepare(zone) # builds internal spatial index
# Subsequent operations with this geometry are faster
results = shapely.contains(zone, points) # uses prepared index
Preparation helps when one geometry is tested repeatedly — a city boundary checked against millions of GPS traces, for example.
Coordinate transformations
Shapely 2.0 provides shapely.transform() for applying coordinate transformations:
import shapely
from pyproj import Transformer
# WGS-84 (lat/lng) to UTM zone 33N (meters)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32633", always_xy=True)
def project(coords):
return transformer.transform(coords[:, 0], coords[:, 1])
# Transform all geometries
projected = shapely.transform(geometries, project)
# Now areas are in square meters
areas_m2 = shapely.area(projected)
Why projection matters
The area of a 1° × 1° square at the equator is ~12,300 km². The same square at 60°N is ~6,150 km². Without projection, area calculations are wrong by up to 2×.
Advanced operations
Voronoi diagrams
Partition space so each region contains all points closest to one input point:
from shapely.ops import voronoi_diagram
points_geom = shapely.MultiPoint(coords)
regions = voronoi_diagram(points_geom)
for polygon in regions.geoms:
print(polygon.area)
Use cases: service area analysis, cell tower coverage estimation, and market territory definition.
Polygon simplification
Reduce vertex count while preserving shape:
# Douglas-Peucker simplification
simplified = shapely.simplify(complex_polygon, tolerance=0.001)
# Preserve topology (prevents self-intersections)
simplified = shapely.simplify(complex_polygon, tolerance=0.001, preserve_topology=True)
Simplification is critical before serializing to GeoJSON for web maps. A detailed coastline might have 500K vertices; simplified to 5K vertices, it renders 100× faster with minimal visual difference.
Polygon validity and repair
Real-world data often contains invalid geometries (self-intersections, duplicate points):
from shapely.validation import explain_validity, make_valid
poly = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)]) # bowtie — self-intersecting
print(shapely.is_valid(poly)) # False
print(explain_validity(poly)) # "Self-intersection[1 1]"
fixed = make_valid(poly) # splits into valid geometries
print(shapely.is_valid(fixed)) # True
Always validate geometries loaded from external sources before performing operations.
Performance patterns
Avoid Python loops
# Slow: Python loop
areas = [geom.area for geom in polygons]
# Fast: vectorized
areas = shapely.area(polygons)
The vectorized version is typically 20–50× faster for arrays of 10K+ geometries.
Batch union with unary_union
Merging geometries one by one is O(n²). Use unary_union which employs a cascaded union strategy:
# Slow: iterative
result = polygons[0]
for p in polygons[1:]:
result = result.union(p)
# Fast: cascaded
result = shapely.unary_union(polygons)
Memory-efficient WKB serialization
Store geometries compactly using Well-Known Binary (WKB):
wkb_bytes = shapely.to_wkb(geometries) # array of bytes objects
restored = shapely.from_wkb(wkb_bytes) # back to geometry array
WKB is the format used by PostGIS and most spatial databases. It is more compact than WKT (text) and faster to parse.
Integration with the Python geospatial stack
| Library | Role | Shapely interaction |
|---|---|---|
| GeoPandas | DataFrames with geometry columns | Built on Shapely geometries |
| Fiona | Read/write shapefiles, GeoJSON | Returns Shapely-compatible geometry dicts |
| pyproj | Coordinate transformations | Transform Shapely coords between CRS |
| Rasterio | Raster data (satellite imagery) | Use Shapely geometries for masking/clipping |
| Folium | Interactive maps | Serialize Shapely to GeoJSON for display |
Common pitfalls
- Mixing lon/lat and lat/lon. Shapely uses (x, y) = (longitude, latitude). Many APIs return (latitude, longitude). Swapping them produces geometries in the wrong hemisphere.
- Measuring in degrees. Shapely does not know about the Earth’s curvature. Always project before measuring.
- Invalid geometries from unions. Floating-point precision can create slivers and self-intersections. Use
make_valid()after complex operations. - Forgetting STRtree for large datasets. A spatial join without an index on 100K geometries takes minutes; with an index, it takes seconds.
The one thing to remember: Shapely 2.0’s vectorized operations and STRtree indexing make million-geometry workflows practical in Python — but only if you avoid Python loops and always project coordinates before measuring.
See Also
- Python Adaptive Learning Systems How Python builds learning apps that adjust to each student like a personal tutor who knows exactly what you need next.
- Python Airflow Learn Airflow as a timetable manager that makes sure data tasks run in the right order every day.
- Python Altair Learn Altair through the idea of drawing charts by describing rules, not by hand-placing every visual element.
- Python Automated Grading How Python grades homework and exams automatically, from simple answer keys to understanding written essays.
- Python Batch Vs Stream Processing Batch processing is like doing laundry once a week; stream processing is like a self-cleaning shirt that cleans itself constantly.