Python Weather Data Analysis — Deep Dive

Master advanced Python techniques for processing gridded meteorological datasets, computing derived diagnostics, and building weather-driven analytical pipelines.

Technical foundation

Weather data analysis in Python spans data engineering (handling terabyte-scale gridded datasets), meteorological physics (deriving meaningful diagnostics from raw variables), and geospatial computation (coordinate systems, projections, spatial statistics). This deep dive covers the techniques used by operational meteorologists, climate researchers, and weather-sensitive industries.

Loading and exploring ERA5 reanalysis data

ERA5 from ECMWF is the most widely used reanalysis dataset. It provides hourly global coverage at 0.25° resolution (~31 km) from 1940 to present.

import xarray as xr
import numpy as np

# Load a downloaded ERA5 NetCDF file
ds = xr.open_dataset("era5_2024_surface.nc")
print(ds)
# Typical variables: t2m (2m temperature), u10/v10 (10m wind),
# tp (total precipitation), sp (surface pressure), ssrd (solar radiation)

# Select a region (Europe) and time period
europe = ds.sel(
    latitude=slice(72, 35),   # Note: ERA5 latitude is descending
    longitude=slice(-12, 40),
    time=slice("2024-06-01", "2024-08-31"),
)

# Convert temperature from Kelvin to Celsius
europe["t2m_c"] = europe["t2m"] - 273.15

Handling large datasets with Dask

ERA5 at full resolution is hundreds of terabytes. Dask enables out-of-core processing:

# Lazy loading — data isn't read until computation is triggered
ds = xr.open_mfdataset(
    "era5_*.nc",
    chunks={"time": 24, "latitude": 100, "longitude": 100},
    engine="netcdf4",
)

# This builds a computation graph but doesn't execute yet
monthly_mean = ds["t2m"].resample(time="1ME").mean()

# Trigger computation and load result into memory
result = monthly_mean.compute()

Chunk sizes significantly affect performance. For time-series analysis at fixed locations, chunk along spatial dimensions. For spatial maps at fixed times, chunk along time.

GRIB format processing

Operational forecast models output GRIB2 format. The cfgrib engine integrates with xarray:

# Read GFS forecast GRIB2 file
ds = xr.open_dataset(
    "gfs_forecast.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "isobaricInhPa"}},
)

# Select 500 hPa geopotential height (key upper-air diagnostic)
z500 = ds["z"].sel(isobaricInhPa=500)

GRIB files often contain multiple “messages” with different variable types and levels. The filter_by_keys parameter prevents loading the entire file when you need specific variables.

Meteorological calculations with MetPy

MetPy provides unit-aware meteorological functions:

import metpy.calc as mpcalc
from metpy.units import units

# Wind calculations
u_wind = europe["u10"].values * units("m/s")
v_wind = europe["v10"].values * units("m/s")

wind_speed = mpcalc.wind_speed(u_wind, v_wind)
wind_direction = mpcalc.wind_direction(u_wind, v_wind)

# Heat index (apparent temperature)
temperature = europe["t2m_c"].values * units.degC
dewpoint = europe["d2m_c"].values * units.degC
relative_humidity = mpcalc.relative_humidity_from_dewpoint(temperature, dewpoint)
heat_index = mpcalc.heat_index(temperature, relative_humidity)

# Potential temperature (useful for air mass identification)
pressure = europe["sp"].values * units.Pa
theta = mpcalc.potential_temperature(pressure, temperature)

Stability indices for severe weather

# Calculate CAPE (Convective Available Potential Energy)
# Requires a vertical profile (sounding)
pressure_levels = np.array([1000, 925, 850, 700, 500, 300, 200]) * units.hPa
temperature_profile = np.array([25, 20, 15, 5, -15, -40, -55]) * units.degC
dewpoint_profile = np.array([20, 15, 10, -5, -30, -55, -70]) * units.degC

cape, cin = mpcalc.cape_cin(
    pressure_levels, temperature_profile, dewpoint_profile
)
print(f"CAPE: {cape:.0f}, CIN: {cin:.0f}")
# CAPE > 1000 J/kg suggests potential for severe thunderstorms

Climatology and anomaly computation

def compute_climatology_and_anomaly(
    ds: xr.Dataset,
    variable: str,
    reference_period: tuple[str, str] = ("1991-01-01", "2020-12-31"),
) -> tuple[xr.DataArray, xr.DataArray]:
    """Compute 30-year daily climatology and anomalies."""
    ref = ds[variable].sel(time=slice(*reference_period))

    # Daily climatology: average for each day-of-year across reference period
    climatology = ref.groupby("time.dayofyear").mean("time")

    # Anomaly: deviation from climatology
    anomaly = ds[variable].groupby("time.dayofyear") - climatology

    return climatology, anomaly

clim, anom = compute_climatology_and_anomaly(ds, "t2m")

# Find the hottest anomalies (heatwave detection)
extreme_heat = anom.where(anom > anom.quantile(0.95), drop=True)

Area-weighted spatial averaging

Grid cells near the poles cover less area than those near the equator. Proper spatial averaging must account for this:

def area_weighted_mean(da: xr.DataArray) -> xr.DataArray:
    """Compute area-weighted spatial mean accounting for latitude."""
    weights = np.cos(np.deg2rad(da.latitude))
    weights.name = "weights"
    return da.weighted(weights).mean(["latitude", "longitude"])

# Global mean temperature time series (correctly weighted)
global_temp = area_weighted_mean(ds["t2m"] - 273.15)

Without area weighting, polar regions (which have many grid cells due to convergence of meridians) are overrepresented, biasing the mean cold.

Extreme value analysis

from scipy.stats import genextreme

def return_period_estimate(
    annual_maxima: np.ndarray,
    return_periods: list[float] = [10, 25, 50, 100],
) -> dict:
    """Fit GEV distribution and estimate return-period values."""
    shape, loc, scale = genextreme.fit(annual_maxima)

    results = {}
    for rp in return_periods:
        # Exceedance probability for return period
        p = 1 - 1 / rp
        value = genextreme.ppf(p, shape, loc=loc, scale=scale)
        results[f"{rp}_year"] = value

    results["gev_params"] = {"shape": shape, "location": loc, "scale": scale}
    return results

# Example: 100-year wind speed estimate
annual_max_wind = ds["wind_speed"].resample(time="1YE").max().values
estimates = return_period_estimate(annual_max_wind)
print(f"100-year wind speed: {estimates['100_year']:.1f} m/s")

Geospatial visualization with Cartopy

import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature

def plot_temperature_map(da: xr.DataArray, title: str):
    """Plot a temperature field on a map projection."""
    fig, ax = plt.subplots(
        figsize=(12, 8),
        subplot_kw={"projection": ccrs.PlateCarree()},
    )

    # Plot filled contours
    im = da.plot.contourf(
        ax=ax,
        transform=ccrs.PlateCarree(),
        levels=20,
        cmap="RdYlBu_r",
        add_colorbar=True,
        cbar_kwargs={"label": "Temperature (°C)", "shrink": 0.8},
    )

    # Add map features
    ax.coastlines(linewidth=0.8)
    ax.add_feature(cfeature.BORDERS, linewidth=0.5, linestyle="--")
    ax.add_feature(cfeature.RIVERS, linewidth=0.3, color="blue")
    ax.gridlines(draw_labels=True, linewidth=0.3, alpha=0.5)

    ax.set_title(title, fontsize=14)
    plt.tight_layout()
    return fig

# Plot summer mean temperature anomaly
summer_anom = anom.sel(time=slice("2024-06-01", "2024-08-31")).mean("time")
plot_temperature_map(summer_anom, "Summer 2024 Temperature Anomaly (°C)")

Building weather-driven analytical pipelines

Production systems combine weather data with business data:

class WeatherEnrichmentPipeline:
    """Enrich business datasets with weather observations."""

    def __init__(self, weather_ds: xr.Dataset):
        self.weather = weather_ds

    def enrich_point_data(
        self, df: pd.DataFrame, lat_col: str, lon_col: str,
        time_col: str, variables: list[str],
    ) -> pd.DataFrame:
        """Add weather variables to a dataframe with locations and timestamps."""
        for var in variables:
            values = []
            for _, row in df.iterrows():
                val = self.weather[var].sel(
                    latitude=row[lat_col],
                    longitude=row[lon_col],
                    time=row[time_col],
                    method="nearest",
                ).values.item()
                values.append(val)
            df[f"weather_{var}"] = values
        return df

    def aggregate_for_region(
        self, bounds: dict, time_range: tuple[str, str],
        variable: str, stat: str = "mean",
    ) -> pd.Series:
        """Compute regional weather statistics over time."""
        subset = self.weather[variable].sel(
            latitude=slice(bounds["lat_max"], bounds["lat_min"]),
            longitude=slice(bounds["lon_min"], bounds["lon_max"]),
            time=slice(*time_range),
        )
        weights = np.cos(np.deg2rad(subset.latitude))
        weighted = subset.weighted(weights)
        spatial_agg = getattr(weighted, stat)(["latitude", "longitude"])
        return spatial_agg.to_series()

For large-scale enrichment, the row-by-row approach above is too slow. Use xarray’s vectorized .sel() with arrays of coordinates, or pre-extract a station-based subset.

Tradeoffs

Decision	Option A	Option B
Data source	Station observations (point-based, ground truth)	Reanalysis/gridded (spatially complete, model-dependent)
Resolution	0.25° ERA5 (global, coarser)	High-res regional models (1–5 km, limited coverage)
Processing	In-memory (fast, limited by RAM)	Dask/out-of-core (slower, handles terabytes)
Format	NetCDF (self-describing, easy)	Zarr (cloud-native, optimized for parallel access)
Temporal scope	Short-term forecast (1–10 days, highest accuracy)	Climate projections (decades, large uncertainty)

One thing to remember: Weather data analysis is a multi-dimensional data engineering problem — the meteorological calculations are well-understood, but efficiently handling the volume, formats, and coordinate systems of global observational data is where Python’s xarray ecosystem provides irreplaceable value.

pythonweatherdata-scienceclimate