Image Classification Workflows in Python GIS

Image classification converts raw pixel reflectance values into thematic land cover maps. Within the broader discipline of Remote Sensing & Raster Analysis, this workflow serves as the primary bridge between multispectral satellite observations and actionable geospatial intelligence. Whether mapping deforestation, tracking urban expansion, or monitoring crop health, classification relies on algorithms that interpret spectral signatures and spatial patterns. Supervised methods require human-labeled training samples, while unsupervised techniques cluster pixels based on statistical similarity. The reliability of your final map hinges on data quality, feature selection, and algorithmic rigor.

flowchart LR
    A["Data prep<br/>& alignment"] --> B["Feature engineering<br/>(indices, texture)"]
    B --> C["Train model<br/>& predict"]
    C --> D["Accuracy<br/>assessment"]
    D --> E["Post-process<br/>& vectorize"]

Step 1: Data Preparation and Alignment

A robust classification pipeline begins with spatially and spectrally consistent raster data. Satellite imagery often arrives in separate band files with varying projections, resolutions, and cloud contamination. Before feeding pixels into a model, you must align all inputs to a common coordinate reference system (CRS), resample mismatched grids, and mask invalid values. Proper preprocessing eliminates geometric artifacts that could mislead machine learning algorithms. For comprehensive techniques on ingesting, masking, and harmonizing satellite products, consult the guide to Reading and Processing Satellite Imagery.

import rasterio
import numpy as np

# Stack multiple bands into a single 3D array (bands, rows, cols)
band_paths = ["b2.tif", "b3.tif", "b4.tif", "b8.tif"]
with rasterio.open(band_paths[0]) as src:
    profile = src.profile.copy()
    height, width = src.height, src.width

# Read and stack bands efficiently
stacked = np.stack([rasterio.open(p).read(1) for p in band_paths])
# Replace cloud/shadow values (e.g., 0) with NaN to exclude from training
stacked[stacked == 0] = np.nan

Step 2: Feature Engineering

Raw spectral bands capture reflectance, but derived features often reveal ecological and physical patterns more clearly. Calculating indices like the Normalized Difference Vegetation Index (NDVI) or Normalized Difference Water Index (NDWI) enhances class separability by emphasizing specific biophysical properties. You can also compute textural metrics (e.g., local variance or entropy) to capture spatial heterogeneity. These operations transform your dataset into a multidimensional feature space optimized for classification. When combining original bands with calculated indices, you are effectively performing targeted Raster Algebra and Calculations to maximize signal-to-noise ratios before modeling.

# Assuming stacked is (bands, rows, cols) with NIR at index 3 and Red at index 2
nir = stacked[3].astype(np.float32)
red = stacked[2].astype(np.float32)
ndvi = (nir - red) / (nir + red + 1e-10)  # small epsilon prevents division by zero

# Combine original bands + NDVI into final feature array
features = np.vstack([stacked, ndvi[np.newaxis, ...]])

Step 3: Model Training and Prediction

With features prepared, the workflow transitions to machine learning. Python’s ecosystem allows seamless integration between geospatial arrays and statistical libraries. The core challenge is reshaping 3D raster data into a 2D (n_pixels, n_features) matrix, extracting labeled training samples, and training a classifier. Random Forests are widely adopted in geospatial workflows due to their robustness to noise, non-parametric nature, and built-in feature importance metrics. A production-ready implementation covering stratified sampling, hyperparameter tuning, and memory-efficient chunked prediction is available in Training a random forest model for land cover classification.

from sklearn.ensemble import RandomForestClassifier

# Flatten features to (n_pixels, n_features)
n_bands, h, w = features.shape
feature_matrix = features.reshape(n_bands, -1).T

# Mask out invalid pixels
valid_mask = ~np.isnan(feature_matrix).any(axis=1)
X_valid = feature_matrix[valid_mask]

# Train with placeholder labels (replace with actual ground truth coordinates)
y_train = np.random.randint(0, 4, size=(X_valid.shape[0],))
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_valid, y_train)

# Predict and reshape back to raster dimensions
predictions = np.full((h, w), -1, dtype=np.int32)
predictions[valid_mask.reshape(h, w)] = clf.predict(X_valid)

Step 4: Accuracy Assessment and Validation

Classification outputs require rigorous validation before deployment. Relying solely on training accuracy leads to overfitting; independent validation samples must be withheld during model fitting. Standard metrics include Overall Accuracy, the F1-score for imbalanced classes, and Cohen’s Kappa to account for agreement by chance. Spatial autocorrelation can artificially inflate accuracy metrics if training and validation points are clustered geographically. Implementing spatial cross-validation or block-based sampling ensures your model generalizes across diverse landscapes. Refer to the official scikit-learn Model Evaluation Guide for metric implementations and robust cross-validation strategies.

Step 5: Post-Processing and Output Generation

Raw pixel-wise classifications often contain salt-and-pepper noise due to spectral variability within homogeneous areas. Post-processing techniques like majority filtering, morphological smoothing, or conditional random fields clean boundaries and improve map coherence. Once refined, the classified raster is typically converted to vector polygons for integration with GIS software, database storage, or cartographic styling. Converting contiguous pixel blocks into topologically clean polygons reduces file size and enables attribute-driven symbology. For a step-by-step workflow on polygonizing classified outputs, see Vectorizing raster outputs for cartographic styling.

from scipy.ndimage import uniform_filter

# Simple majority filter using a 3x3 moving window
smoothed = uniform_filter(predictions.astype(np.float32), size=3)
smoothed = np.round(smoothed).astype(np.int32)

# Save result with original geospatial metadata (see rasterio docs for write modes)
with rasterio.open("classified_map.tif", "w", **profile) as dst:
    dst.write(smoothed, 1)

Conclusion

Image classification in Python GIS is a structured pipeline that balances spatial data engineering with machine learning principles. By aligning inputs, engineering meaningful features, validating rigorously, and post-processing outputs, you transform raw satellite pixels into reliable thematic layers. As cloud-native formats and scalable compute environments mature, these workflows will increasingly support near-real-time environmental monitoring and automated decision-making systems.

Image Classification Workflows in Python GIS

Step 1: Data Preparation and Alignment

Step 2: Feature Engineering

Step 3: Model Training and Prediction

Step 4: Accuracy Assessment and Validation

Step 5: Post-Processing and Output Generation

Conclusion

Guides in this topic

Training a Random Forest Model for Land Cover Classification in Python

Vectorizing Raster Outputs for Cartographic Styling

Image Classification Workflows in Python GIS

Step 1: Data Preparation and Alignment #

Step 2: Feature Engineering #

Step 3: Model Training and Prediction #

Step 4: Accuracy Assessment and Validation #

Step 5: Post-Processing and Output Generation #

Conclusion #

Guides in this topic

Training a Random Forest Model for Land Cover Classification in Python

Vectorizing Raster Outputs for Cartographic Styling

Step 1: Data Preparation and Alignment

Step 2: Feature Engineering

Step 3: Model Training and Prediction

Step 4: Accuracy Assessment and Validation

Step 5: Post-Processing and Output Generation

Conclusion