Python Weather Data Analysis — Deep Dive
Technical foundation
Weather data analysis in Python spans data engineering (handling terabyte-scale gridded datasets), meteorological physics (deriving meaningful diagnostics from raw variables), and geospatial computation (coordinate systems, projections, spatial statistics). This deep dive covers the techniques used by operational meteorologists, climate researchers, and weather-sensitive industries.
Loading and exploring ERA5 reanalysis data
ERA5 from ECMWF is the most widely used reanalysis dataset. It provides hourly global coverage at 0.25° resolution (~31 km) from 1940 to present.
import xarray as xr
import numpy as np
# Load a downloaded ERA5 NetCDF file
ds = xr.open_dataset("era5_2024_surface.nc")
print(ds)
# Typical variables: t2m (2m temperature), u10/v10 (10m wind),
# tp (total precipitation), sp (surface pressure), ssrd (solar radiation)
# Select a region (Europe) and time period
europe = ds.sel(
latitude=slice(72, 35), # Note: ERA5 latitude is descending
longitude=slice(-12, 40),
time=slice("2024-06-01", "2024-08-31"),
)
# Convert temperature from Kelvin to Celsius
europe["t2m_c"] = europe["t2m"] - 273.15
Handling large datasets with Dask
ERA5 at full resolution is hundreds of terabytes. Dask enables out-of-core processing:
# Lazy loading — data isn't read until computation is triggered
ds = xr.open_mfdataset(
"era5_*.nc",
chunks={"time": 24, "latitude": 100, "longitude": 100},
engine="netcdf4",
)
# This builds a computation graph but doesn't execute yet
monthly_mean = ds["t2m"].resample(time="1ME").mean()
# Trigger computation and load result into memory
result = monthly_mean.compute()
Chunk sizes significantly affect performance. For time-series analysis at fixed locations, chunk along spatial dimensions. For spatial maps at fixed times, chunk along time.
GRIB format processing
Operational forecast models output GRIB2 format. The cfgrib engine integrates with xarray:
# Read GFS forecast GRIB2 file
ds = xr.open_dataset(
"gfs_forecast.grib2",
engine="cfgrib",
backend_kwargs={"filter_by_keys": {"typeOfLevel": "isobaricInhPa"}},
)
# Select 500 hPa geopotential height (key upper-air diagnostic)
z500 = ds["z"].sel(isobaricInhPa=500)
GRIB files often contain multiple “messages” with different variable types and levels. The filter_by_keys parameter prevents loading the entire file when you need specific variables.
Meteorological calculations with MetPy
MetPy provides unit-aware meteorological functions:
import metpy.calc as mpcalc
from metpy.units import units
# Wind calculations
u_wind = europe["u10"].values * units("m/s")
v_wind = europe["v10"].values * units("m/s")
wind_speed = mpcalc.wind_speed(u_wind, v_wind)
wind_direction = mpcalc.wind_direction(u_wind, v_wind)
# Heat index (apparent temperature)
temperature = europe["t2m_c"].values * units.degC
dewpoint = europe["d2m_c"].values * units.degC
relative_humidity = mpcalc.relative_humidity_from_dewpoint(temperature, dewpoint)
heat_index = mpcalc.heat_index(temperature, relative_humidity)
# Potential temperature (useful for air mass identification)
pressure = europe["sp"].values * units.Pa
theta = mpcalc.potential_temperature(pressure, temperature)
Stability indices for severe weather
# Calculate CAPE (Convective Available Potential Energy)
# Requires a vertical profile (sounding)
pressure_levels = np.array([1000, 925, 850, 700, 500, 300, 200]) * units.hPa
temperature_profile = np.array([25, 20, 15, 5, -15, -40, -55]) * units.degC
dewpoint_profile = np.array([20, 15, 10, -5, -30, -55, -70]) * units.degC
cape, cin = mpcalc.cape_cin(
pressure_levels, temperature_profile, dewpoint_profile
)
print(f"CAPE: {cape:.0f}, CIN: {cin:.0f}")
# CAPE > 1000 J/kg suggests potential for severe thunderstorms
Climatology and anomaly computation
def compute_climatology_and_anomaly(
ds: xr.Dataset,
variable: str,
reference_period: tuple[str, str] = ("1991-01-01", "2020-12-31"),
) -> tuple[xr.DataArray, xr.DataArray]:
"""Compute 30-year daily climatology and anomalies."""
ref = ds[variable].sel(time=slice(*reference_period))
# Daily climatology: average for each day-of-year across reference period
climatology = ref.groupby("time.dayofyear").mean("time")
# Anomaly: deviation from climatology
anomaly = ds[variable].groupby("time.dayofyear") - climatology
return climatology, anomaly
clim, anom = compute_climatology_and_anomaly(ds, "t2m")
# Find the hottest anomalies (heatwave detection)
extreme_heat = anom.where(anom > anom.quantile(0.95), drop=True)
Area-weighted spatial averaging
Grid cells near the poles cover less area than those near the equator. Proper spatial averaging must account for this:
def area_weighted_mean(da: xr.DataArray) -> xr.DataArray:
"""Compute area-weighted spatial mean accounting for latitude."""
weights = np.cos(np.deg2rad(da.latitude))
weights.name = "weights"
return da.weighted(weights).mean(["latitude", "longitude"])
# Global mean temperature time series (correctly weighted)
global_temp = area_weighted_mean(ds["t2m"] - 273.15)
Without area weighting, polar regions (which have many grid cells due to convergence of meridians) are overrepresented, biasing the mean cold.
Extreme value analysis
from scipy.stats import genextreme
def return_period_estimate(
annual_maxima: np.ndarray,
return_periods: list[float] = [10, 25, 50, 100],
) -> dict:
"""Fit GEV distribution and estimate return-period values."""
shape, loc, scale = genextreme.fit(annual_maxima)
results = {}
for rp in return_periods:
# Exceedance probability for return period
p = 1 - 1 / rp
value = genextreme.ppf(p, shape, loc=loc, scale=scale)
results[f"{rp}_year"] = value
results["gev_params"] = {"shape": shape, "location": loc, "scale": scale}
return results
# Example: 100-year wind speed estimate
annual_max_wind = ds["wind_speed"].resample(time="1YE").max().values
estimates = return_period_estimate(annual_max_wind)
print(f"100-year wind speed: {estimates['100_year']:.1f} m/s")
Geospatial visualization with Cartopy
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
def plot_temperature_map(da: xr.DataArray, title: str):
"""Plot a temperature field on a map projection."""
fig, ax = plt.subplots(
figsize=(12, 8),
subplot_kw={"projection": ccrs.PlateCarree()},
)
# Plot filled contours
im = da.plot.contourf(
ax=ax,
transform=ccrs.PlateCarree(),
levels=20,
cmap="RdYlBu_r",
add_colorbar=True,
cbar_kwargs={"label": "Temperature (°C)", "shrink": 0.8},
)
# Add map features
ax.coastlines(linewidth=0.8)
ax.add_feature(cfeature.BORDERS, linewidth=0.5, linestyle="--")
ax.add_feature(cfeature.RIVERS, linewidth=0.3, color="blue")
ax.gridlines(draw_labels=True, linewidth=0.3, alpha=0.5)
ax.set_title(title, fontsize=14)
plt.tight_layout()
return fig
# Plot summer mean temperature anomaly
summer_anom = anom.sel(time=slice("2024-06-01", "2024-08-31")).mean("time")
plot_temperature_map(summer_anom, "Summer 2024 Temperature Anomaly (°C)")
Building weather-driven analytical pipelines
Production systems combine weather data with business data:
class WeatherEnrichmentPipeline:
"""Enrich business datasets with weather observations."""
def __init__(self, weather_ds: xr.Dataset):
self.weather = weather_ds
def enrich_point_data(
self, df: pd.DataFrame, lat_col: str, lon_col: str,
time_col: str, variables: list[str],
) -> pd.DataFrame:
"""Add weather variables to a dataframe with locations and timestamps."""
for var in variables:
values = []
for _, row in df.iterrows():
val = self.weather[var].sel(
latitude=row[lat_col],
longitude=row[lon_col],
time=row[time_col],
method="nearest",
).values.item()
values.append(val)
df[f"weather_{var}"] = values
return df
def aggregate_for_region(
self, bounds: dict, time_range: tuple[str, str],
variable: str, stat: str = "mean",
) -> pd.Series:
"""Compute regional weather statistics over time."""
subset = self.weather[variable].sel(
latitude=slice(bounds["lat_max"], bounds["lat_min"]),
longitude=slice(bounds["lon_min"], bounds["lon_max"]),
time=slice(*time_range),
)
weights = np.cos(np.deg2rad(subset.latitude))
weighted = subset.weighted(weights)
spatial_agg = getattr(weighted, stat)(["latitude", "longitude"])
return spatial_agg.to_series()
For large-scale enrichment, the row-by-row approach above is too slow. Use xarray’s vectorized .sel() with arrays of coordinates, or pre-extract a station-based subset.
Tradeoffs
| Decision | Option A | Option B |
|---|---|---|
| Data source | Station observations (point-based, ground truth) | Reanalysis/gridded (spatially complete, model-dependent) |
| Resolution | 0.25° ERA5 (global, coarser) | High-res regional models (1–5 km, limited coverage) |
| Processing | In-memory (fast, limited by RAM) | Dask/out-of-core (slower, handles terabytes) |
| Format | NetCDF (self-describing, easy) | Zarr (cloud-native, optimized for parallel access) |
| Temporal scope | Short-term forecast (1–10 days, highest accuracy) | Climate projections (decades, large uncertainty) |
One thing to remember: Weather data analysis is a multi-dimensional data engineering problem — the meteorological calculations are well-understood, but efficiently handling the volume, formats, and coordinate systems of global observational data is where Python’s xarray ecosystem provides irreplaceable value.
See Also
- Python Building Energy Simulation Discover how Python helps architects and engineers predict a building's energy use before a single brick is laid.
- Python Carbon Footprint Tracking See how Python helps people and companies measure and reduce the pollution they create every day.
- Python Climate Model Visualization See how Python turns complex climate predictions into colorful maps and charts that help everyone understand our changing planet.
- Python Energy Consumption Modeling Understand how Python helps predict and manage energy use, explained with everyday examples anyone can follow.
- Python Smart Grid Simulation Find out how Python helps engineers test the power grid of the future without risking a single blackout.