GeoPy Geocoding — Deep Dive

Build production geocoding pipelines in Python with GeoPy — covering caching, fallback providers, accuracy validation, and high-throughput batch processing.

Production geocoding involves more than calling an API. You need caching to avoid redundant lookups, fallback chains when providers fail, accuracy validation, and a processing architecture that handles millions of addresses without exhausting rate limits or budgets.

Provider architecture

Fallback chains

No single provider handles every address perfectly. Build a chain that tries cheaper or free providers first and falls back to paid ones:

from geopy.geocoders import Nominatim, GoogleV3, MapBox
from geopy.exc import GeocoderTimedOut, GeocoderServiceError

providers = [
    Nominatim(user_agent="my-production-app", timeout=5),
    MapBox(api_key="pk.xxx", timeout=5),
    GoogleV3(api_key="AIza...", timeout=5),
]

def geocode_with_fallback(address: str):
    for provider in providers:
        try:
            result = provider.geocode(address, exactly_one=True)
            if result:
                return {
                    "lat": result.latitude,
                    "lng": result.longitude,
                    "provider": type(provider).__name__,
                    "raw": result.raw,
                }
        except (GeocoderTimedOut, GeocoderServiceError):
            continue
    return None

Provider-specific tuning

Nominatim returns better results when you add structured query parameters:

result = geolocator.geocode(
    query=None,
    street="221B Baker Street",
    city="London",
    country="UK",
    addressdetails=True,
)
# result.raw["address"]["postcode"] → "NW1 6XE"

Google Maps supports components for disambiguation:

from geopy.geocoders import GoogleV3

google = GoogleV3(api_key="AIza...")
result = google.geocode(
    "Springfield",
    components={"administrative_area": "IL", "country": "US"},
)

Caching layer

Geocoding the same address twice wastes API calls and money. Implement a cache:

SQLite cache

import sqlite3
import json
from hashlib import sha256

class GeoCache:
    def __init__(self, db_path="geocache.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS cache (
                address_hash TEXT PRIMARY KEY,
                address TEXT,
                result TEXT,
                provider TEXT,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)

    def get(self, address: str):
        h = sha256(address.lower().strip().encode()).hexdigest()
        row = self.conn.execute(
            "SELECT result FROM cache WHERE address_hash = ?", (h,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, address: str, result: dict, provider: str):
        h = sha256(address.lower().strip().encode()).hexdigest()
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)",
            (h, address, json.dumps(result), provider),
        )
        self.conn.commit()

Cache hit rates

In practice, business datasets have high address duplication. A delivery company with 100K daily orders might have only 15K unique addresses. Caching gives you an 85% hit rate, reducing API costs proportionally.

Address normalization

Before geocoding, normalize addresses to improve cache hits and accuracy:

import re

def normalize_address(address: str) -> str:
    address = address.upper().strip()
    # Standardize common abbreviations
    replacements = {
        r"\bST\b": "STREET", r"\bAVE\b": "AVENUE",
        r"\bBLVD\b": "BOULEVARD", r"\bDR\b": "DRIVE",
        r"\bRD\b": "ROAD", r"\bAPT\b": "APARTMENT",
        r"\b#\b": "APARTMENT",
    }
    for pattern, replacement in replacements.items():
        address = re.sub(pattern, replacement, address)
    # Remove extra whitespace
    address = re.sub(r"\s+", " ", address)
    return address

For production systems, consider the usaddress library which parses US addresses into tagged components (street number, street name, city, state, ZIP).

Accuracy validation

Geocoding results vary in precision. A result might be accurate to the building, the street, or just the city. Check the result metadata:

Google precision levels

result = google.geocode("123 Main St, Anytown, US")
precision = result.raw["geometry"]["location_type"]
# ROOFTOP — exact building
# RANGE_INTERPOLATED — estimated between two points
# GEOMETRIC_CENTER — center of a region
# APPROXIMATE — city or area level

Nominatim confidence

result = nominatim.geocode("123 Main St", addressdetails=True)
importance = result.raw.get("importance", 0)
place_rank = result.raw.get("place_rank", 0)
# place_rank 26-30 = building level, 16-18 = city level

Validation rules

def validate_result(result: dict, expected_country: str = None) -> bool:
    if not result:
        return False
    lat, lng = result["lat"], result["lng"]
    # Basic sanity: coordinates within valid range
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        return False
    # Not the null island (0, 0) which many services return for failures
    if abs(lat) < 0.1 and abs(lng) < 0.1:
        return False
    # Country validation if available
    if expected_country and result.get("country") != expected_country:
        return False
    return True

High-throughput batch processing

Async geocoding

For large batches, use asyncio with aiohttp to parallelize requests (respecting rate limits):

import asyncio
from geopy.adapters import AioHTTPAdapter
from geopy.geocoders import Nominatim

async def batch_geocode(addresses: list[str], max_concurrent: int = 5):
    semaphore = asyncio.Semaphore(max_concurrent)

    async with Nominatim(
        user_agent="batch-app",
        adapter_factory=AioHTTPAdapter,
    ) as geolocator:
        async def geocode_one(addr):
            async with semaphore:
                await asyncio.sleep(1.0)  # rate limit
                return await geolocator.geocode(addr)

        tasks = [geocode_one(addr) for addr in addresses]
        return await asyncio.gather(*tasks, return_exceptions=True)

Worker pool pattern

For millions of addresses, distribute work across multiple provider accounts:

Partition addresses into chunks.
Assign each chunk to a worker with its own API key and rate limit.
Workers write results to a shared database.
A supervisor monitors progress and retries failures.

With 5 Google API keys at 50 req/sec each, you process 250 addresses per second — 900K per hour.

Distance and routing

Geodesic vs. great-circle

GeoPy’s geodesic() uses the Vincenty formula on the WGS-84 ellipsoid. It is accurate to within 0.5mm but can fail to converge for nearly antipodal points. The great_circle() function uses the Haversine formula on a sphere — less accurate (up to 0.5% error) but always converges.

from geopy.distance import geodesic, great_circle

a = (40.7128, -74.0060)  # New York
b = (34.0522, -118.2437) # Los Angeles

print(geodesic(a, b).km)       # 3944.42
print(great_circle(a, b).km)   # 3936.39  (0.2% difference)

Bounding box queries

To find all points within a radius, compute a bounding box first (fast filter) then refine with geodesic distance (accurate filter):

from geopy.distance import geodesic
from geopy import Point

def bounding_box(center: tuple, radius_km: float):
    north = geodesic(kilometers=radius_km).destination(Point(*center), 0)
    south = geodesic(kilometers=radius_km).destination(Point(*center), 180)
    east = geodesic(kilometers=radius_km).destination(Point(*center), 90)
    west = geodesic(kilometers=radius_km).destination(Point(*center), 270)
    return {
        "min_lat": south.latitude, "max_lat": north.latitude,
        "min_lng": west.longitude, "max_lng": east.longitude,
    }

Use the bounding box as a SQL WHERE clause, then compute exact distances only for the filtered candidates.

Cost optimization

Strategy	Impact
SQLite/Redis cache	70–90% fewer API calls
Address normalization	10–20% more cache hits
Fallback chain (free → paid)	40–60% cost reduction
Batch processing during off-peak	Some providers offer lower rates
Self-hosted Nominatim	Zero per-request cost after setup

For high-volume production (>1M lookups/month), self-hosting a Nominatim instance on a machine with 64GB RAM and the full planet OSM data (~100GB) eliminates per-request costs entirely.

The one thing to remember: Production geocoding is a data pipeline problem — caching, normalization, fallback chains, and validation matter more than which API you call.

pythongeopygeocodinggeospatial