Xarray for Multidimensional Data — Core Concepts
Why Xarray matters
NumPy arrays are powerful but anonymous — data[3, :, 5] tells you nothing about what dimension 0 or dimension 2 represents. Is that time? Latitude? Wavelength? You have to remember, and if you forget or get axes swapped, the math still runs but gives wrong answers silently.
Xarray solves this by attaching names and coordinates to every dimension. Instead of data[3, :, 5], you write data.sel(time="2026-01-15", longitude=-74). The code is self-documenting and harder to get wrong.
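As a minimal sketch of that contrast, the snippet below builds a small array with made-up values and reads the same slice twice, once by position and once by label. The variable names (raw, temps) and the coordinate values are illustrative assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 4 days x 3 latitudes x 5 longitudes.
raw = np.arange(60, dtype=float).reshape(4, 3, 5)

temps = xr.DataArray(
    raw,
    dims=("time", "latitude", "longitude"),
    coords={
        "time": pd.date_range("2026-01-14", periods=4),
        "latitude": [30.0, 40.0, 50.0],
        "longitude": [-76.0, -75.0, -74.0, -73.0, -72.0],
    },
)

# Positional: the reader must know that axis 0 is time and axis 2 is longitude.
by_position = raw[1, :, 2]

# Label-based: the same slice, with the intent visible in the code.
by_label = temps.sel(time="2026-01-15", longitude=-74)

print(np.array_equal(by_position, by_label.values))  # True
```

Both expressions return the three latitude values for 15 January at longitude -74, but only the second one says so.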
Xarray is a standard tool in climate science, oceanography, atmospheric research, and remote sensing. It is used at NOAA, NASA, the European Centre for Medium-Range Weather Forecasts (ECMWF), and the Copernicus Climate Change Service.
Core data structures
DataArray
A DataArray is a single labeled N-dimensional array. It has:
- values — the underlying NumPy (or Dask) array of numbers
- dims — named dimensions (e.g., “time”, “latitude”, “longitude”)
- coords — coordinate labels for each dimension (e.g., actual dates, lat/lon values)
- attrs — metadata (units, description, source)
Think of it as a NumPy array with a dictionary of axis labels attached.
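The four components can be seen directly on a small example. This is a sketch with placeholder values; the name "temperature", the units, and the coordinates are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

temperature = xr.DataArray(
    np.zeros((2, 3)),                     # values: placeholder numbers
    dims=("time", "latitude"),            # dims: one name per axis
    coords={                              # coords: labels along each dimension
        "time": pd.date_range("2026-01-01", periods=2),
        "latitude": [30.0, 40.0, 50.0],
    },
    attrs={"units": "degC", "description": "illustrative placeholder"},
    name="temperature",
)

print(temperature.values.shape)    # (2, 3)
print(temperature.dims)            # ('time', 'latitude')
print(temperature.attrs["units"])  # degC
```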
Dataset
A Dataset is a collection of DataArray objects that share dimensions. A climate dataset might contain temperature, humidity, and wind speed — all indexed by the same time, latitude, and longitude coordinates. It is analogous to a pandas DataFrame, but for N-dimensional data.
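A Dataset along those lines might be assembled as follows. The variable names, grid, and random values are assumptions made for the sketch; real data would normally come from a file.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
shape = (3, 2, 4)  # time x latitude x longitude
dims = ("time", "latitude", "longitude")

# Hypothetical climate-style dataset: two variables on one shared grid.
ds = xr.Dataset(
    data_vars={
        "temperature": (dims, rng.normal(15.0, 5.0, shape)),
        "humidity": (dims, rng.uniform(0.0, 100.0, shape)),
    },
    coords={
        "time": pd.date_range("2026-01-01", periods=3),
        "latitude": [30.0, 40.0],
        "longitude": [-80.0, -70.0, -60.0, -50.0],
    },
)

# Each variable is a DataArray that shares the Dataset's coordinates.
temperature = ds["temperature"]
print(type(temperature).__name__)  # DataArray
print(dict(ds.sizes))
```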
Key operations
Selection by label
Instead of remembering axis positions, select by coordinate value:
- .sel(): exact label matching. ds.sel(time="2026-03-15") gets data for a specific date.
- .isel(): integer position indexing. ds.isel(time=0) gets the first time step.
- Slicing: ds.sel(latitude=slice(30, 50)) selects a latitude range. Label slices include both endpoints, and the bounds must follow the order in which the coordinate is stored.
- Nearest: ds.sel(latitude=37.5, method="nearest") finds the closest coordinate.
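The four selection styles can be tried on a small, made-up dataset. This is a sketch: the variable name and the coordinate values are assumptions, and the latitude coordinate is stored in ascending order so that slice(30, 50) matches.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataset: 4 daily time steps x 6 latitudes.
ds = xr.Dataset(
    {"temperature": (("time", "latitude"), np.arange(24.0).reshape(4, 6))},
    coords={
        "time": pd.date_range("2026-03-14", periods=4),
        "latitude": [20.0, 30.0, 37.0, 40.0, 50.0, 60.0],
    },
)

one_day = ds.sel(time="2026-03-15")                # exact label
first_step = ds.isel(time=0)                       # integer position
band = ds.sel(latitude=slice(30, 50))              # label slice, endpoints included
closest = ds.sel(latitude=37.5, method="nearest")  # nearest label

print(band["latitude"].values)     # [30. 37. 40. 50.]
print(float(closest["latitude"]))  # 37.0
```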
Aggregation along dimensions
Reduce data by name: ds.mean(dim="time") averages over time, collapsing that dimension. You never need to remember whether time is axis 0, 1, or 2.
Available reductions: mean, sum, std, min, max, median, count, and more. All accept a dim parameter.
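A short sketch of reducing by name, using arbitrary values. It also reorders the axes to show that the named reduction gives the same answer regardless of where the time axis sits; the array and its names are assumptions for illustration.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "latitude", "longitude"),
)

# Average over time; the result keeps the remaining dimensions.
time_mean = da.mean(dim="time")
print(time_mean.dims)  # ('latitude', 'longitude')

# Several dimensions can be reduced at once.
spatial_max = da.max(dim=["latitude", "longitude"])
print(spatial_max.dims)  # ('time',)

# Moving the time axis does not change the call or the result.
shuffled = da.transpose("longitude", "time", "latitude")
same_mean = shuffled.mean(dim="time").transpose("latitude", "longitude")
print(np.allclose(time_mean.values, same_mean.values))  # True
```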
GroupBy
Group data by coordinate values and apply operations. ds.groupby("time.month").mean() computes the average for each calendar month, pooled across all years (a monthly climatology). It is a one-liner that would take loops and bookkeeping in raw NumPy.
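To make the grouping visible, the sketch below uses a contrived series in which each day's value equals its month number, so the expected per-month mean is easy to check. The dataset and its variable name are assumptions.

```python
import pandas as pd
import xarray as xr

# Two years of daily data; each value is simply the month number of that day.
time = pd.date_range("2024-01-01", "2025-12-31", freq="D")
values = time.month.to_numpy().astype(float)

ds = xr.Dataset({"temperature": ("time", values)}, coords={"time": time})

# One mean per calendar month, pooling the matching months of both years.
monthly = ds.groupby("time.month").mean()

print(monthly.sizes["month"])                       # 12
print(float(monthly["temperature"].sel(month=7)))   # 7.0
```

The time dimension is replaced by a new month dimension with values 1 through 12.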
Broadcasting and alignment
Xarray automatically aligns arrays by dimension name. If you add a 2D array (time × latitude) to a 3D array (time × latitude × longitude), Xarray broadcasts correctly because it matches by name, not by position.
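The 2D plus 3D case described above can be sketched as follows, next to the equivalent raw NumPy addition, which fails because NumPy matches axes by position. The arrays are filled with ones purely for illustration.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.ones((2, 3)), dims=("time", "latitude"))
b = xr.DataArray(np.ones((2, 3, 4)), dims=("time", "latitude", "longitude"))

# Matched by name: a is repeated along the missing longitude dimension.
total = a + b
print(total.dims)   # ('time', 'latitude', 'longitude')
print(total.shape)  # (2, 3, 4)

# Axis order does not matter either.
c = xr.DataArray(np.ones((4, 2)), dims=("longitude", "time"))
print((b + c).shape)  # (2, 3, 4)

# The same shapes in raw NumPy do not broadcast.
try:
    np.ones((2, 3)) + np.ones((2, 3, 4))
    numpy_broadcast_ok = True
except ValueError:
    numpy_broadcast_ok = False
print(numpy_broadcast_ok)  # False
```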
File format: NetCDF
Xarray’s native file format is NetCDF — the standard for scientific multidimensional data. NetCDF files store arrays alongside their dimension names, coordinates, and metadata in a single self-describing file. Xarray reads and writes NetCDF with open_dataset() and .to_netcdf().
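Through optional backends, Xarray also reads HDF5-based files, Zarr stores (cloud-optimized), and GRIB (weather forecasts, typically via the cfgrib package). FITS (astronomy) is not covered by Xarray's built-in backends and usually needs a third-party reader or a conversion step.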
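A round trip through NetCDF might look like the sketch below. It assumes a NetCDF backend (netCDF4, h5netcdf, or scipy) is installed alongside Xarray; the file name and dataset contents are made up.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical dataset with a unit attribute on its single variable.
ds = xr.Dataset(
    {
        "temperature": (
            ("time", "latitude"),
            np.arange(6.0).reshape(2, 3),
            {"units": "degC"},
        )
    },
    coords={
        "time": pd.date_range("2026-01-01", periods=2),
        "latitude": [30.0, 40.0, 50.0],
    },
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.nc"
    ds.to_netcdf(path)

    # open_dataset reads lazily, so load the data before the file goes away.
    with xr.open_dataset(path) as reopened:
        restored = reopened.load()

print(restored["temperature"].dims)            # ('time', 'latitude')
print(restored["temperature"].attrs["units"])  # degC
```

Dimension names, coordinates, and attributes come back from the file without any extra bookkeeping.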
A typical workflow
- Load: Open a NetCDF file or a collection of files (open_mfdataset for multiple files).
- Inspect: Print the dataset to see dimensions, coordinates, and variables.
- Select: Slice a region and time range of interest.
- Compute: Calculate statistics: means, anomalies, trends.
- Visualize: Plot with the built-in .plot() method (wraps matplotlib).
- Save: Write results to NetCDF or Zarr for sharing.
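The steps above can be sketched end to end. To stay self-contained, this example substitutes a synthetic dataset for the load step and leaves plotting and saving as comments; the variable name, grid, seasonal signal, and output file name are all assumptions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# 1. Load: a real analysis would call xr.open_dataset(...) here.
#    Synthetic stand-in: two years of daily temperature with a seasonal cycle.
time = pd.date_range("2024-01-01", "2025-12-31", freq="D")
lat = [20.0, 30.0, 40.0, 50.0, 60.0]
lon = [-80.0, -70.0, -60.0]
rng = np.random.default_rng(0)
seasonal = 10.0 * np.sin(2.0 * np.pi * time.dayofyear.to_numpy() / 365.25)
noise = rng.normal(0.0, 1.0, (time.size, len(lat), len(lon)))
ds = xr.Dataset(
    {"temperature": (("time", "latitude", "longitude"), 15.0 + seasonal[:, None, None] + noise)},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

# 2. Inspect: printing shows dimensions, coordinates, and variables.
print(ds)

# 3. Select: one year and a latitude band.
region = ds.sel(latitude=slice(30, 50), time=slice("2025-01-01", "2025-12-31"))

# 4. Compute: anomalies relative to each calendar month's mean.
climatology = region.groupby("time.month").mean()
anomalies = region.groupby("time.month") - climatology
regional_mean = anomalies["temperature"].mean(dim=["latitude", "longitude"])

# 5. Visualize (requires matplotlib): regional_mean.plot()
# 6. Save (requires a NetCDF backend): anomalies.to_netcdf("anomalies.nc")
```

Subtracting the climatology from the grouped data removes the seasonal cycle, so regional_mean is a daily series that fluctuates around zero.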
Common misconception
People sometimes try to use pandas for multidimensional data by reshaping it into a flat table with MultiIndex columns. This works for small data but becomes slow and confusing at scale. Xarray is purpose-built for data that is naturally N-dimensional — using it avoids the mental gymnastics of flattening and unflattening.
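The difference is easy to see by moving the same small table between the two libraries. This is a sketch with made-up values; it uses a MultiIndex on the rows of a Series, one common way of flattening gridded data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Flat representation: one row per (time, latitude) pair.
index = pd.MultiIndex.from_product(
    [pd.date_range("2026-01-01", periods=2), [30.0, 40.0, 50.0]],
    names=["time", "latitude"],
)
flat = pd.Series(np.arange(6.0), index=index, name="temperature")

# Unflatten into a labeled 2D array: each index level becomes a dimension.
cube = flat.to_xarray()
print(cube.dims)   # ('time', 'latitude')
print(cube.shape)  # (2, 3)

# The same question asked of both structures.
from_pandas = flat.xs(40.0, level="latitude")
from_xarray = cube.sel(latitude=40.0)
print(np.array_equal(from_pandas.to_numpy(), from_xarray.values))  # True

# And back to a flat Series when a table is what is needed.
back = cube.to_series()
```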
Where Xarray fits in the ecosystem
- NumPy — raw N-dimensional arrays, no labels
- pandas — labeled 1D (Series) and 2D (DataFrame) data
- Xarray — labeled N-dimensional data (extends the pandas concept to higher dimensions)
- Dask — parallel computing; Xarray integrates with Dask for datasets too large for memory
The one thing to remember: Xarray adds names and coordinates to NumPy arrays, turning error-prone axis-number indexing into readable, self-documenting label-based operations for multidimensional scientific data.