Python Soil Analysis — Deep Dive
Architecture of a soil analysis pipeline
A complete soil analysis system covers three domains: spectral prediction (lab-scale), spatial mapping (landscape-scale), and temporal monitoring (multi-year tracking). Python handles all three with different library combinations.
Lab Samples → Spectroscopy → Calibration Model → Rapid Predictions
+ Covariates (DEM, Climate, Satellite) → Digital Soil Mapping
→ Field-level property maps → Management recommendations
Spectral preprocessing
Raw soil spectra require preprocessing to remove noise and enhance chemically relevant features:
import numpy as np
from scipy.signal import savgol_filter
def preprocess_spectra(
spectra: np.ndarray,
wavelengths: np.ndarray,
method: str = "snv_sg1"
) -> np.ndarray:
"""Preprocess soil reflectance spectra.
Args:
spectra: shape (n_samples, n_wavelengths)
wavelengths: shape (n_wavelengths,)
method: preprocessing pipeline name
"""
processed = spectra.copy()
if "snv" in method:
# Standard Normal Variate — corrects for scatter effects
means = processed.mean(axis=1, keepdims=True)
stds = processed.std(axis=1, keepdims=True)
processed = (processed - means) / (stds + 1e-10)
if "sg1" in method:
# Savitzky-Golay 1st derivative — enhances absorption features
processed = savgol_filter(
processed, window_length=11, polyorder=2,
deriv=1, axis=1
)
elif "sg2" in method:
# 2nd derivative — sharper feature resolution
processed = savgol_filter(
processed, window_length=17, polyorder=2,
deriv=2, axis=1
)
if "cr" in method:
# Continuum removal — isolates individual absorption features
processed = continuum_removal(processed, wavelengths)
return processed
def continuum_removal(spectra: np.ndarray, wavelengths: np.ndarray) -> np.ndarray:
"""Remove spectral continuum using convex hull."""
from scipy.spatial import ConvexHull
result = np.zeros_like(spectra)
for i in range(spectra.shape[0]):
points = np.column_stack([wavelengths, spectra[i]])
hull = ConvexHull(points)
# Interpolate upper hull as continuum
hull_vertices = sorted(
[v for v in hull.vertices if spectra[i, v] >= np.median(spectra[i])]
)
continuum = np.interp(wavelengths, wavelengths[hull_vertices], spectra[i, hull_vertices])
result[i] = spectra[i] / (continuum + 1e-10)
return result
SNV + first derivative is the most common preprocessing chain for NIR soil spectra. The derivative removes baseline offsets, and SNV corrects for differences in particle size and surface roughness between samples.
Partial Least Squares calibration
PLS regression is the workhorse for soil spectral calibration because it handles the multicollinearity inherent in spectral data:
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
def calibrate_pls(
X_spectra: np.ndarray,
y_property: np.ndarray,
max_components: int = 20,
cv_folds: int = 10
) -> dict:
"""Find optimal PLS components and return calibration metrics."""
best_r2 = -np.inf
best_n = 1
kf = KFold(n_splits=cv_folds, shuffle=True, random_state=42)
for n_comp in range(1, max_components + 1):
pls = PLSRegression(n_components=n_comp)
y_pred = cross_val_predict(pls, X_spectra, y_property, cv=kf)
r2 = r2_score(y_property, y_pred)
if r2 > best_r2:
best_r2 = r2
best_n = n_comp
# Final model with optimal components
final_pls = PLSRegression(n_components=best_n)
y_cv = cross_val_predict(final_pls, X_spectra, y_property, cv=kf)
final_pls.fit(X_spectra, y_property)
rmse = np.sqrt(mean_squared_error(y_property, y_cv))
rpd = y_property.std() / rmse # Ratio of Performance to Deviation
return {
"model": final_pls,
"n_components": best_n,
"r2_cv": r2_score(y_property, y_cv),
"rmse_cv": rmse,
"rpd": rpd, # >2.0 = good, >2.5 = excellent
"y_predicted": y_cv
}
RPD (Ratio of Performance to Deviation) is the standard quality metric in soil spectroscopy. RPD > 2.0 indicates useful predictions; RPD > 2.5 indicates excellent predictions suitable for quantitative mapping. Typical RPD values by property: organic carbon 2.5-4.0, clay content 2.0-3.5, pH 1.5-2.5, available phosphorus 1.2-2.0 (harder to predict spectrally).
Digital soil mapping with environmental covariates
Once point predictions are available, spatial mapping extrapolates across the landscape:
import rasterio
import geopandas as gpd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import numpy as np
def extract_covariates_at_points(
sample_points: gpd.GeoDataFrame,
covariate_paths: dict[str, str]
) -> gpd.GeoDataFrame:
"""Extract raster covariate values at sample locations."""
for name, raster_path in covariate_paths.items():
with rasterio.open(raster_path) as src:
coords = [
(geom.x, geom.y)
for geom in sample_points.geometry
]
values = [val[0] for val in src.sample(coords)]
sample_points[name] = values
return sample_points
def build_soil_map(
samples: gpd.GeoDataFrame,
target_property: str,
covariate_names: list[str],
covariate_paths: dict[str, str],
output_path: str
):
"""Predict soil property across landscape from covariates."""
X = samples[covariate_names].values
y = samples[target_property].values
# Remove samples with missing covariates
valid = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
X, y = X[valid], y[valid]
model = RandomForestRegressor(
n_estimators=500, max_depth=15,
min_samples_leaf=5, oob_score=True,
random_state=42, n_jobs=-1
)
scores = cross_val_score(model, X, y, cv=10, scoring="r2")
print(f"CV R²: {scores.mean():.3f} ± {scores.std():.3f}")
model.fit(X, y)
print(f"OOB R²: {model.oob_score_:.3f}")
# Predict across full raster extent
with rasterio.open(covariate_paths[covariate_names[0]]) as template:
profile = template.profile.copy()
profile.update(dtype="float32", count=1)
# Read all covariates as stacked array
stack = []
for name in covariate_names:
with rasterio.open(covariate_paths[name]) as src:
stack.append(src.read(1))
covariate_stack = np.stack(stack, axis=-1)
shape_2d = covariate_stack.shape[:2]
flat = covariate_stack.reshape(-1, len(covariate_names))
valid_pixels = ~np.isnan(flat).any(axis=1)
predictions = np.full(flat.shape[0], np.nan, dtype=np.float32)
predictions[valid_pixels] = model.predict(flat[valid_pixels])
prediction_map = predictions.reshape(shape_2d)
with rasterio.open(output_path, "w", **profile) as dst:
dst.write(prediction_map, 1)
return model
Common covariates include DEM-derived terrain attributes (slope, aspect, curvature, topographic wetness index), satellite indices (NDVI, bare soil index), climate layers (mean annual temperature, precipitation), and parent material maps.
Uncertainty quantification
Soil maps without uncertainty estimates are incomplete. Quantile regression forests provide pixel-level prediction intervals:
from sklearn.ensemble import RandomForestRegressor
def predict_with_uncertainty(
model: RandomForestRegressor,
X: np.ndarray,
confidence: float = 0.90
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""Get prediction intervals from individual tree predictions."""
tree_predictions = np.array([
tree.predict(X) for tree in model.estimators_
])
lower_q = (1 - confidence) / 2 * 100
upper_q = (1 + confidence) / 2 * 100
median = np.percentile(tree_predictions, 50, axis=0)
lower = np.percentile(tree_predictions, lower_q, axis=0)
upper = np.percentile(tree_predictions, upper_q, axis=0)
return median, lower, upper
Wide prediction intervals indicate areas where sampling density is low or covariate space is poorly represented — guiding where additional samples would be most valuable.
Geostatistical interpolation
When covariates are unavailable or the goal is purely spatial prediction, kriging provides optimal interpolation with uncertainty:
from pykrige.ok import OrdinaryKriging
import numpy as np
def krige_soil_property(
x: np.ndarray, y: np.ndarray, values: np.ndarray,
grid_x: np.ndarray, grid_y: np.ndarray,
variogram_model: str = "spherical"
) -> tuple[np.ndarray, np.ndarray]:
"""Ordinary kriging with automatic variogram fitting."""
ok = OrdinaryKriging(
x, y, values,
variogram_model=variogram_model,
verbose=False,
enable_plotting=False,
nlags=20
)
prediction, variance = ok.execute("grid", grid_x, grid_y)
return prediction, np.sqrt(variance) # Return std dev
Tradeoffs
Spectroscopy vs. wet chemistry: Spectra predict organic carbon and texture well (RPD > 2.5) but struggle with trace elements and available nutrients (RPD < 1.5). For regulatory compliance, wet chemistry remains the standard.
Sampling density: Digital soil mapping accuracy improves logarithmically with sample count — doubling samples from 50 to 100 helps a lot; doubling from 500 to 1000 helps much less. The optimal density depends on landscape complexity, typically 1-5 samples per hectare for farm-scale mapping.
Temporal stability: pH, texture, and parent material change slowly (decades). Nitrogen and moisture change daily. Mapping strategies must match the temporal dynamics of the target property.
Global vs. local calibrations: Large spectral libraries (e.g., the OSSL with 180,000+ samples) enable global calibration models, but locally calibrated models almost always outperform them. Memory-based learning (selecting similar spectra from a global library for local prediction) offers a middle ground.
One thing to remember: Soil analysis in Python combines spectroscopy for rapid property prediction, machine learning for spatial mapping, and geostatistics for uncertainty quantification — the challenge is matching the right technique to each property’s predictability and the decision’s required accuracy.
See Also
- Python Biodiversity Tracking How Python helps scientists count and protect every kind of animal and plant on Earth — from whales to wildflowers.
- Python Crop Disease Detection How Python looks at photos of plants and figures out if they're sick — like a doctor for crops.
- Python Deforestation Detection How Python spots disappearing forests from space — catching illegal logging and land clearing as it happens.
- Python Drone Image Processing How Python turns hundreds of overlapping drone photos into detailed maps and 3D models of the ground below.
- Python Ocean Data Analysis How Python explores the world's oceans through data — tracking currents, temperatures, and marine life without getting wet.