Python Soil Analysis — Deep Dive

Build Python pipelines for soil spectroscopy calibration, digital soil mapping with machine learning, and geostatistical spatial prediction of soil properties.

Architecture of a soil analysis pipeline

A complete soil analysis system covers three domains: spectral prediction (lab-scale), spatial mapping (landscape-scale), and temporal monitoring (multi-year tracking). Python handles all three with different library combinations.

Lab Samples → Spectroscopy → Calibration Model → Rapid Predictions
    + Covariates (DEM, Climate, Satellite) → Digital Soil Mapping
        → Field-level property maps → Management recommendations

Spectral preprocessing

Raw soil spectra require preprocessing to remove noise and enhance chemically relevant features:

import numpy as np
from scipy.signal import savgol_filter

def preprocess_spectra(
    spectra: np.ndarray,
    wavelengths: np.ndarray,
    method: str = "snv_sg1"
) -> np.ndarray:
    """Preprocess soil reflectance spectra.
    
    Args:
        spectra: shape (n_samples, n_wavelengths)
        wavelengths: shape (n_wavelengths,)
        method: preprocessing pipeline name
    """
    processed = spectra.copy()
    
    if "snv" in method:
        # Standard Normal Variate — corrects for scatter effects
        means = processed.mean(axis=1, keepdims=True)
        stds = processed.std(axis=1, keepdims=True)
        processed = (processed - means) / (stds + 1e-10)
    
    if "sg1" in method:
        # Savitzky-Golay 1st derivative — enhances absorption features
        processed = savgol_filter(
            processed, window_length=11, polyorder=2,
            deriv=1, axis=1
        )
    elif "sg2" in method:
        # 2nd derivative — sharper feature resolution
        processed = savgol_filter(
            processed, window_length=17, polyorder=2,
            deriv=2, axis=1
        )
    
    if "cr" in method:
        # Continuum removal — isolates individual absorption features
        processed = continuum_removal(processed, wavelengths)
    
    return processed

def continuum_removal(spectra: np.ndarray, wavelengths: np.ndarray) -> np.ndarray:
    """Remove spectral continuum using convex hull."""
    from scipy.spatial import ConvexHull
    
    result = np.zeros_like(spectra)
    for i in range(spectra.shape[0]):
        points = np.column_stack([wavelengths, spectra[i]])
        hull = ConvexHull(points)
        # Interpolate upper hull as continuum
        hull_vertices = sorted(
            [v for v in hull.vertices if spectra[i, v] >= np.median(spectra[i])]
        )
        continuum = np.interp(wavelengths, wavelengths[hull_vertices], spectra[i, hull_vertices])
        result[i] = spectra[i] / (continuum + 1e-10)
    
    return result

SNV + first derivative is the most common preprocessing chain for NIR soil spectra. The derivative removes baseline offsets, and SNV corrects for differences in particle size and surface roughness between samples.

Partial Least Squares calibration

PLS regression is the workhorse for soil spectral calibration because it handles the multicollinearity inherent in spectral data:

from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

def calibrate_pls(
    X_spectra: np.ndarray,
    y_property: np.ndarray,
    max_components: int = 20,
    cv_folds: int = 10
) -> dict:
    """Find optimal PLS components and return calibration metrics."""
    
    best_r2 = -np.inf
    best_n = 1
    
    kf = KFold(n_splits=cv_folds, shuffle=True, random_state=42)
    
    for n_comp in range(1, max_components + 1):
        pls = PLSRegression(n_components=n_comp)
        y_pred = cross_val_predict(pls, X_spectra, y_property, cv=kf)
        r2 = r2_score(y_property, y_pred)
        
        if r2 > best_r2:
            best_r2 = r2
            best_n = n_comp
    
    # Final model with optimal components
    final_pls = PLSRegression(n_components=best_n)
    y_cv = cross_val_predict(final_pls, X_spectra, y_property, cv=kf)
    final_pls.fit(X_spectra, y_property)
    
    rmse = np.sqrt(mean_squared_error(y_property, y_cv))
    rpd = y_property.std() / rmse  # Ratio of Performance to Deviation
    
    return {
        "model": final_pls,
        "n_components": best_n,
        "r2_cv": r2_score(y_property, y_cv),
        "rmse_cv": rmse,
        "rpd": rpd,  # >2.0 = good, >2.5 = excellent
        "y_predicted": y_cv
    }

RPD (Ratio of Performance to Deviation) is the standard quality metric in soil spectroscopy. RPD > 2.0 indicates useful predictions; RPD > 2.5 indicates excellent predictions suitable for quantitative mapping. Typical RPD values by property: organic carbon 2.5-4.0, clay content 2.0-3.5, pH 1.5-2.5, available phosphorus 1.2-2.0 (harder to predict spectrally).

Digital soil mapping with environmental covariates

Once point predictions are available, spatial mapping extrapolates across the landscape:

import rasterio
import geopandas as gpd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

def extract_covariates_at_points(
    sample_points: gpd.GeoDataFrame,
    covariate_paths: dict[str, str]
) -> gpd.GeoDataFrame:
    """Extract raster covariate values at sample locations."""
    for name, raster_path in covariate_paths.items():
        with rasterio.open(raster_path) as src:
            coords = [
                (geom.x, geom.y)
                for geom in sample_points.geometry
            ]
            values = [val[0] for val in src.sample(coords)]
            sample_points[name] = values
    return sample_points

def build_soil_map(
    samples: gpd.GeoDataFrame,
    target_property: str,
    covariate_names: list[str],
    covariate_paths: dict[str, str],
    output_path: str
):
    """Predict soil property across landscape from covariates."""
    X = samples[covariate_names].values
    y = samples[target_property].values
    
    # Remove samples with missing covariates
    valid = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    X, y = X[valid], y[valid]
    
    model = RandomForestRegressor(
        n_estimators=500, max_depth=15,
        min_samples_leaf=5, oob_score=True,
        random_state=42, n_jobs=-1
    )
    
    scores = cross_val_score(model, X, y, cv=10, scoring="r2")
    print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
    
    model.fit(X, y)
    print(f"OOB R²: {model.oob_score_:.3f}")
    
    # Predict across full raster extent
    with rasterio.open(covariate_paths[covariate_names[0]]) as template:
        profile = template.profile.copy()
        profile.update(dtype="float32", count=1)
        
        # Read all covariates as stacked array
        stack = []
        for name in covariate_names:
            with rasterio.open(covariate_paths[name]) as src:
                stack.append(src.read(1))
        
        covariate_stack = np.stack(stack, axis=-1)
        shape_2d = covariate_stack.shape[:2]
        flat = covariate_stack.reshape(-1, len(covariate_names))
        
        valid_pixels = ~np.isnan(flat).any(axis=1)
        predictions = np.full(flat.shape[0], np.nan, dtype=np.float32)
        predictions[valid_pixels] = model.predict(flat[valid_pixels])
        
        prediction_map = predictions.reshape(shape_2d)
    
    with rasterio.open(output_path, "w", **profile) as dst:
        dst.write(prediction_map, 1)
    
    return model

Common covariates include DEM-derived terrain attributes (slope, aspect, curvature, topographic wetness index), satellite indices (NDVI, bare soil index), climate layers (mean annual temperature, precipitation), and parent material maps.

Uncertainty quantification

Soil maps without uncertainty estimates are incomplete. Quantile regression forests provide pixel-level prediction intervals:

from sklearn.ensemble import RandomForestRegressor

def predict_with_uncertainty(
    model: RandomForestRegressor,
    X: np.ndarray,
    confidence: float = 0.90
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Get prediction intervals from individual tree predictions."""
    tree_predictions = np.array([
        tree.predict(X) for tree in model.estimators_
    ])
    
    lower_q = (1 - confidence) / 2 * 100
    upper_q = (1 + confidence) / 2 * 100
    
    median = np.percentile(tree_predictions, 50, axis=0)
    lower = np.percentile(tree_predictions, lower_q, axis=0)
    upper = np.percentile(tree_predictions, upper_q, axis=0)
    
    return median, lower, upper

Wide prediction intervals indicate areas where sampling density is low or covariate space is poorly represented — guiding where additional samples would be most valuable.

Geostatistical interpolation

When covariates are unavailable or the goal is purely spatial prediction, kriging provides optimal interpolation with uncertainty:

from pykrige.ok import OrdinaryKriging
import numpy as np

def krige_soil_property(
    x: np.ndarray, y: np.ndarray, values: np.ndarray,
    grid_x: np.ndarray, grid_y: np.ndarray,
    variogram_model: str = "spherical"
) -> tuple[np.ndarray, np.ndarray]:
    """Ordinary kriging with automatic variogram fitting."""
    ok = OrdinaryKriging(
        x, y, values,
        variogram_model=variogram_model,
        verbose=False,
        enable_plotting=False,
        nlags=20
    )
    
    prediction, variance = ok.execute("grid", grid_x, grid_y)
    return prediction, np.sqrt(variance)  # Return std dev

Tradeoffs

Spectroscopy vs. wet chemistry: Spectra predict organic carbon and texture well (RPD > 2.5) but struggle with trace elements and available nutrients (RPD < 1.5). For regulatory compliance, wet chemistry remains the standard.

Sampling density: Digital soil mapping accuracy improves logarithmically with sample count — doubling samples from 50 to 100 helps a lot; doubling from 500 to 1000 helps much less. The optimal density depends on landscape complexity, typically 1-5 samples per hectare for farm-scale mapping.

Temporal stability: pH, texture, and parent material change slowly (decades). Nitrogen and moisture change daily. Mapping strategies must match the temporal dynamics of the target property.

Global vs. local calibrations: Large spectral libraries (e.g., the OSSL with 180,000+ samples) enable global calibration models, but locally calibrated models almost always outperform them. Memory-based learning (selecting similar spectra from a global library for local prediction) offers a middle ground.

One thing to remember: Soil analysis in Python combines spectroscopy for rapid property prediction, machine learning for spatial mapping, and geostatistics for uncertainty quantification — the challenge is matching the right technique to each property’s predictability and the decision’s required accuracy.

pythonagriculturedata-scienceenvironmental-sciencegeospatial