GeoPy Geocoding — Deep Dive
Production geocoding involves more than calling an API. You need caching to avoid redundant lookups, fallback chains when providers fail, accuracy validation, and a processing architecture that handles millions of addresses without exhausting rate limits or budgets.
Provider architecture
Fallback chains
No single provider handles every address perfectly. Build a chain that tries cheaper or free providers first and falls back to paid ones:
from geopy.geocoders import Nominatim, GoogleV3, MapBox
from geopy.exc import GeocoderTimedOut, GeocoderServiceError

# Ordered from free to paid; earlier providers are tried first
providers = [
    Nominatim(user_agent="my-production-app", timeout=5),
    MapBox(api_key="pk.xxx", timeout=5),
    GoogleV3(api_key="AIza...", timeout=5),
]

def geocode_with_fallback(address: str):
    for provider in providers:
        try:
            result = provider.geocode(address, exactly_one=True)
            if result:
                return {
                    "lat": result.latitude,
                    "lng": result.longitude,
                    "provider": type(provider).__name__,
                    "raw": result.raw,
                }
        except (GeocoderTimedOut, GeocoderServiceError):
            # Provider failed or timed out: move on to the next one
            continue
    return None
Provider-specific tuning
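Nominatim often returns better results for structured queries. In GeoPy, pass the address components as a dictionary in the query argument:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my-production-app")
result = geolocator.geocode(
    query={
        "street": "221B Baker Street",
        "city": "London",
        "country": "UK",
    },
    addressdetails=True,
)
# With addressdetails=True, the parsed components are in result.raw["address"],
# e.g. result.raw["address"]["postcode"] → "NW1 6XE"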
Google Maps supports components for disambiguation:
from geopy.geocoders import GoogleV3

google = GoogleV3(api_key="AIza...")
result = google.geocode(
    "Springfield",
    components={"administrative_area": "IL", "country": "US"},
)
Caching layer
Geocoding the same address twice wastes API calls and money. Implement a cache:
SQLite cache
import sqlite3
import json
from hashlib import sha256

class GeoCache:
    def __init__(self, db_path="geocache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS cache (
                address_hash TEXT PRIMARY KEY,
                address TEXT,
                result TEXT,
                provider TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)

    def get(self, address: str):
        # Key on a hash of the lowercased, trimmed address
        h = sha256(address.lower().strip().encode()).hexdigest()
        row = self.conn.execute(
            "SELECT result FROM cache WHERE address_hash = ?", (h,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, address: str, result: dict, provider: str):
        h = sha256(address.lower().strip().encode()).hexdigest()
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)",
            (h, address, json.dumps(result), provider),
        )
        self.conn.commit()
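One way to put a cache in front of the provider chain is a cache-aside wrapper: check the cache, call the provider only on a miss, and store successful results. The sketch below is illustrative rather than definitive. DictCache, geocode_cached, and fake_lookup are hypothetical names; the in-memory dictionary stands in for a persistent store, and fake_lookup stands in for a real provider call.

```python
from typing import Callable, Optional


class DictCache:
    """In-memory stand-in for a persistent cache (hypothetical helper)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(address: str) -> str:
        # Case and surrounding whitespace should not create separate entries.
        return address.lower().strip()

    def get(self, address: str) -> Optional[dict]:
        return self._store.get(self._key(address))

    def put(self, address: str, result: dict) -> None:
        self._store[self._key(address)] = result


def geocode_cached(address: str, cache: DictCache,
                   lookup: Callable[[str], Optional[dict]]) -> Optional[dict]:
    """Return a cached result if present; otherwise call lookup and store it."""
    hit = cache.get(address)
    if hit is not None:
        return hit
    result = lookup(address)
    if result is not None:  # failures are not cached, so they can be retried
        cache.put(address, result)
    return result


calls = []


def fake_lookup(address: str) -> Optional[dict]:
    # Placeholder for a real provider call; records each invocation.
    calls.append(address)
    return {"lat": 51.5, "lng": -0.1}


cache = DictCache()
first = geocode_cached("10 Downing St", cache, fake_lookup)
second = geocode_cached("  10 DOWNING ST ", cache, fake_lookup)
print(len(calls))  # the second call is served from the cache
```

Not caching failures is a design choice: a timeout today should not block a successful lookup tomorrow. If a provider reliably returns no match for certain inputs, caching those misses with a short expiry may be worth considering.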
Cache hit rates
In practice, business datasets have high address duplication. A delivery company with 100K daily orders might have only 15K unique addresses. Caching gives you an 85% hit rate, reducing API costs proportionally.
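The arithmetic behind that figure can be checked against your own data. This is a minimal sketch assuming a cold cache, where the first occurrence of each unique address is a miss and every repeat is a hit; cache_hit_rate is a hypothetical helper and the address list is synthetic.

```python
def cache_hit_rate(addresses: list[str]) -> float:
    """Fraction of lookups served from cache, assuming a cold start."""
    if not addresses:
        return 0.0
    # Each unique address costs exactly one provider call.
    unique = len({a.lower().strip() for a in addresses})
    return 1 - unique / len(addresses)


# The delivery example: 100K orders spread over 15K unique addresses.
orders = [f"{i % 15_000} Example Road" for i in range(100_000)]
rate = cache_hit_rate(orders)
print(f"{rate:.0%}")  # 85%
```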
Address normalization
Before geocoding, normalize addresses to improve cache hits and accuracy:
import re

def normalize_address(address: str) -> str:
    address = address.upper().strip()
    # Standardize common abbreviations.
    # Caveat: ST and DR are ambiguous (they can also mean SAINT and DOCTOR).
    replacements = {
        r"\bST\b": "STREET", r"\bAVE\b": "AVENUE",
        r"\bBLVD\b": "BOULEVARD", r"\bDR\b": "DRIVE",
        r"\bRD\b": "ROAD",
        # "#" is not a word character, so r"\b#\b" never matches "#4" or "APT #4".
        # Handle "APT", "APT.", "APT #", and a bare "#" in one pattern instead.
        r"\bAPT\b\.?\s*#?\s*|#\s*": "APARTMENT ",
    }
    for pattern, replacement in replacements.items():
        address = re.sub(pattern, replacement, address)
    # Collapse extra whitespace
    address = re.sub(r"\s+", " ", address).strip()
    return address
For production systems, consider the usaddress library, which parses US addresses into tagged components (street number, street name, city, state, ZIP).
Accuracy validation
Geocoding results vary in precision. A result might be accurate to the building, the street, or just the city. Check the result metadata:
Google precision levels
result = google.geocode("123 Main St, Anytown, US")
precision = result.raw["geometry"]["location_type"]
# ROOFTOP — exact building
# RANGE_INTERPOLATED — estimated between two points
# GEOMETRIC_CENTER — center of a region
# APPROXIMATE — city or area level
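A precision level is only useful if something acts on it. The sketch below turns the four location types listed above into an accept/reject check. The function name is_precise_enough and the default threshold are illustrative assumptions; the right threshold depends on the use case (a delivery route needs more precision than a regional sales report).

```python
# Location types ordered from most to least precise.
GOOGLE_PRECISION_ORDER = [
    "ROOFTOP",
    "RANGE_INTERPOLATED",
    "GEOMETRIC_CENTER",
    "APPROXIMATE",
]


def is_precise_enough(raw: dict, minimum: str = "RANGE_INTERPOLATED") -> bool:
    """Check a Google-style raw result against a minimum precision level."""
    location_type = raw.get("geometry", {}).get("location_type")
    if location_type not in GOOGLE_PRECISION_ORDER:
        return False  # unknown or missing precision: treat as not good enough
    return (GOOGLE_PRECISION_ORDER.index(location_type)
            <= GOOGLE_PRECISION_ORDER.index(minimum))


sample = {"geometry": {"location_type": "GEOMETRIC_CENTER"}}
print(is_precise_enough(sample))                              # False
print(is_precise_enough(sample, minimum="GEOMETRIC_CENTER"))  # True
```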
Nominatim confidence
result = nominatim.geocode("123 Main St", addressdetails=True)
importance = result.raw.get("importance", 0)
place_rank = result.raw.get("place_rank", 0)
# place_rank 28-30 = address/POI level, 26-27 = street level, 13-16 = city level
Validation rules
def validate_result(result: dict, expected_country: str = None) -> bool:
    if not result:
        return False
    lat, lng = result["lat"], result["lng"]
    # Basic sanity: coordinates within valid range
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        return False
    # Reject "null island" (0, 0), which some services return for failures
    if abs(lat) < 0.1 and abs(lng) < 0.1:
        return False
    # Country validation, only when the result carries a "country" field
    # (the dict built by geocode_with_fallback does not add one by itself)
    country = result.get("country")
    if expected_country and country is not None and country != expected_country:
        return False
    return True
High-throughput batch processing
Async geocoding
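For large batches, use GeoPy's async adapter (AioHTTPAdapter, built on aiohttp) to overlap requests while respecting the provider's rate limit. A per-task sleep is not enough: with five tasks in flight, five requests can still start in the same second. GeoPy's AsyncRateLimiter enforces a minimum delay across all calls:
import asyncio
from geopy.adapters import AioHTTPAdapter
from geopy.extra.rate_limiter import AsyncRateLimiter
from geopy.geocoders import Nominatim

async def batch_geocode(addresses: list[str], max_concurrent: int = 5):
    semaphore = asyncio.Semaphore(max_concurrent)
    async with Nominatim(
        user_agent="batch-app",
        adapter_factory=AioHTTPAdapter,
    ) as geolocator:
        # Global limit: at least 1 second between consecutive requests
        geocode = AsyncRateLimiter(geolocator.geocode, min_delay_seconds=1.0)

        async def geocode_one(addr):
            # The semaphore caps how many lookups are pending at once
            async with semaphore:
                return await geocode(addr)

        tasks = [geocode_one(addr) for addr in addresses]
        return await asyncio.gather(*tasks, return_exceptions=True)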
Worker pool pattern
For millions of addresses, distribute work across multiple provider accounts:
- Partition addresses into chunks.
- Assign each chunk to a worker with its own API key and rate limit.
- Workers write results to a shared database.
- A supervisor monitors progress and retries failures.
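With 5 Google API keys at 50 req/sec each, you process 250 addresses per second, or 900K per hour. This assumes each key has its own quota; keys that belong to the same project or account typically share a single quota, and provider terms may restrict splitting work across accounts.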
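The four steps above can be sketched with a thread pool. This is an illustration under stated assumptions, not a production design: chunked, run_worker, run_pool, and fake_lookup are hypothetical names, an in-memory dictionary stands in for the shared database, and fake_lookup stands in for a real provider call made with a given key.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def chunked(items: list, n_chunks: int) -> list[list]:
    """Split items into n_chunks round-robin partitions."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def run_worker(api_key: str, addresses: list[str],
               lookup: Callable[[str, str], Optional[dict]],
               max_per_sec: float) -> tuple[dict, list]:
    """Geocode one chunk with one key, pacing calls to max_per_sec."""
    results, failures = {}, []
    min_interval = 1.0 / max_per_sec
    for address in addresses:
        started = time.monotonic()
        try:
            result = lookup(api_key, address)
        except Exception:
            result = None
        if result is None:
            failures.append(address)  # left for the supervisor to retry
        else:
            results[address] = result
        # Sleep off whatever remains of this call's time slot.
        remaining = min_interval - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return results, failures


def run_pool(api_keys, addresses, lookup, max_per_sec=50.0):
    """Supervisor: one worker per key, merged results plus a retry list."""
    chunks = chunked(addresses, len(api_keys))
    merged, to_retry = {}, []
    with ThreadPoolExecutor(max_workers=len(api_keys)) as pool:
        futures = [pool.submit(run_worker, key, chunk, lookup, max_per_sec)
                   for key, chunk in zip(api_keys, chunks)]
        for future in futures:
            results, failures = future.result()
            merged.update(results)
            to_retry.extend(failures)
    return merged, to_retry


def fake_lookup(api_key: str, address: str) -> Optional[dict]:
    # Placeholder for a real provider call; fails for one address.
    return None if address == "bad" else {"lat": 0.5, "lng": 0.5, "key": api_key}


done, retry = run_pool(["key-a", "key-b"], ["a", "b", "bad", "c"],
                       fake_lookup, max_per_sec=1000)
print(len(done), retry)
```

Threads suit this workload because each worker spends most of its time waiting on the network. For runs that span hours, the supervisor would also need to persist progress so that a crash does not restart the whole batch.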
Distance and routing
Geodesic vs. great-circle
GeoPy’s geodesic() computes distance on the WGS-84 ellipsoid using Karney’s algorithm (via geographiclib). It is accurate to round-off and always converges. The older Vincenty implementation, which could fail to converge for nearly antipodal points, was removed in GeoPy 2.0. The great_circle() function uses a spherical Earth model: less accurate (up to about 0.5% error) but cheaper to compute.
from geopy.distance import geodesic, great_circle
a = (40.7128, -74.0060) # New York
b = (34.0522, -118.2437) # Los Angeles
print(geodesic(a, b).km)      # roughly 3944 km
print(great_circle(a, b).km)  # roughly 3936 km (about 0.2% shorter)
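To see what the spherical calculation involves, the haversine formula can be written out by hand. This is a standalone sketch, not GeoPy's implementation; haversine_km is a hypothetical helper, and the mean Earth radius constant is an assumption of the spherical model.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(a: tuple, b: tuple) -> float:
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    # Haversine of the central angle between the two points
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


new_york = (40.7128, -74.0060)
los_angeles = (34.0522, -118.2437)
distance = haversine_km(new_york, los_angeles)
print(round(distance))
```

The result lands within a few kilometres of the great-circle figure and about 0.2% below the ellipsoidal one, which is the gap the spherical model introduces on this route.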
Bounding box queries
To find all points within a radius, compute a bounding box first (fast filter) then refine with geodesic distance (accurate filter):
from geopy.distance import geodesic
from geopy import Point

def bounding_box(center: tuple, radius_km: float):
    origin = Point(*center)
    reach = geodesic(kilometers=radius_km)
    # Walk radius_km from the center along each compass bearing
    north = reach.destination(origin, 0)
    south = reach.destination(origin, 180)
    east = reach.destination(origin, 90)
    west = reach.destination(origin, 270)
    return {
        "min_lat": south.latitude, "max_lat": north.latitude,
        "min_lng": west.longitude, "max_lng": east.longitude,
    }
Use the bounding box as a SQL WHERE clause, then compute exact distances only for the filtered candidates.
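The two-stage filter might look like the sketch below. Several things here are assumptions for illustration: the places table and its columns, the points_within helper, the hand-written box values (roughly 10 km either side of the center), and the use of a pure-Python haversine distance in place of geodesic() so that the example runs without network access or extra packages.

```python
import sqlite3
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lng1, lat2, lng2):
    """Spherical distance, standing in here for the geodesic refinement step."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0088 * asin(sqrt(h))


def points_within(conn, center, radius_km, box):
    """Coarse filter in SQL on the bounding box, then exact check per candidate."""
    rows = conn.execute(
        "SELECT name, lat, lng FROM places "
        "WHERE lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?",
        (box["min_lat"], box["max_lat"], box["min_lng"], box["max_lng"]),
    ).fetchall()
    return [name for name, lat, lng in rows
            if haversine_km(center[0], center[1], lat, lng) <= radius_km]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT, lat REAL, lng REAL)")
conn.executemany("INSERT INTO places VALUES (?, ?, ?)", [
    ("near", 40.72, -74.00),    # about 1 km from the center
    ("corner", 40.80, -73.89),  # inside the box, outside the circle
    ("far", 34.05, -118.24),    # outside the box entirely
])

center = (40.7128, -74.0060)
box = {"min_lat": 40.62, "max_lat": 40.81, "min_lng": -74.13, "max_lng": -73.88}
nearby = points_within(conn, center, 10.0, box)
print(nearby)  # ['near']
```

The "corner" row shows why the second stage matters: a square box always includes points in its corners that lie beyond the radius. Note that a simple BETWEEN on longitude breaks for boxes that cross the ±180° meridian, and boxes near the poles need separate handling.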
Cost optimization
| Strategy | Impact |
|---|---|
| SQLite/Redis cache | 70–90% fewer API calls |
| Address normalization | 10–20% more cache hits |
| Fallback chain (free → paid) | 40–60% cost reduction |
| Batch processing during off-peak | Some providers offer lower rates |
| Self-hosted Nominatim | Zero per-request cost after setup |
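For high-volume production (>1M lookups/month), self-hosting a Nominatim instance removes per-request fees, leaving hardware and maintenance as the remaining costs. Size the machine generously: plan for at least 64GB of RAM (recent Nominatim documentation recommends 128GB for a full-planet import), and note that while the planet OSM download is under 100GB, the imported database needs on the order of 1TB of disk.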
The one thing to remember: Production geocoding is a data pipeline problem — caching, normalization, fallback chains, and validation matter more than which API you call.