Training a Random Forest Model for Land Cover Classification in Python
Land cover classification converts raw satellite imagery into actionable thematic maps by assigning every pixel to a discrete surface category, such as forest, cropland, open water, or impervious urban surfaces. Among supervised machine learning algorithms, the Random Forest classifier has emerged as the industry standard for geospatial analysis. As an ensemble method, it aggregates predictions from hundreds of individual decision trees to reduce variance, resist overfitting, and naturally handle the high-dimensional, non-linear relationships inherent in multispectral reflectance data. For practitioners navigating the broader landscape of Remote Sensing & Raster Analysis, translating spectral signatures into reliable spatial categories is a foundational challenge that requires careful data alignment and reproducible code.
This guide provides a complete, production-ready workflow for training and applying a Random Forest model to multispectral imagery. The pipeline aligns with standardized Image Classification Workflows, moving systematically from raster ingestion and sample extraction to model training, spatial prediction, and validated export.
flowchart LR
A["Load & reshape<br/>to (pixels, bands)"] --> B["Extract samples<br/>from polygons"]
B --> C["Train Random Forest<br/>& validate"]
C --> D["Predict &<br/>rebuild grid"]
D --> E["Export GeoTIFF<br/>& assess accuracy"]
Environment Setup and Data Requirements
Before executing the pipeline, ensure your Python environment contains the core geospatial and machine learning dependencies. Because geospatial libraries rely on compiled C libraries like GDAL, using a package manager such as conda or mamba is strongly recommended to avoid dependency conflicts.
conda create -n gis-rf python=3.11
conda activate gis-rf
conda install -c conda-forge rasterio geopandas numpy pandas scikit-learn
Required packages:
rasterio: Reads and writes georeferenced raster formats (GeoTIFF) while preserving coordinate reference systems (CRS) and affine transformations.geopandas: Handles vector training data (Shapefiles/GeoJSON) and manages spatial joins and projections.numpy&pandas: Perform high-performance array manipulation and tabular data structuring.scikit-learn: Provides theRandomForestClassifierimplementation and evaluation metrics.
You will also need two input files: a multi-band satellite image (e.g., Sentinel-2 or Landsat 8/9) and a vector file containing labeled polygons or points representing known land cover classes.
Step 1: Load and Reshape the Satellite Imagery
Machine learning algorithms expect tabular input: rows representing individual observations and columns representing features. Satellite imagery is stored as a three-dimensional array (bands, rows, columns). We must flatten this spatial grid into a two-dimensional (pixels, bands) matrix while tracking which pixels contain valid data.
import rasterio
import numpy as np
# Open the multi-band raster
with rasterio.open('satellite_image.tif') as src:
# Read all bands into memory
image_data = src.read()
# Store metadata for later export
meta = src.meta.copy()
nodata_val = src.nodata if src.nodata is not None else 0
# Reshape from (bands, height, width) to (pixels, bands)
n_bands, height, width = image_data.shape
pixel_matrix = image_data.reshape(n_bands, -1).T
# Create a boolean mask for valid pixels (exclude nodata across all bands)
valid_mask = np.all(pixel_matrix != nodata_val, axis=1)
valid_pixels = pixel_matrix[valid_mask]
Key concept: The .T (transpose) operation converts the band-major array into a pixel-major array. The valid_mask ensures that background or cloud-masked pixels are excluded from both training and prediction, preventing the model from learning artifacts.
Step 2: Extract Training Samples from Vector Polygons
Supervised classification requires ground truth. Training polygons typically contain a class_id or label attribute. We will extract the spectral values of every pixel that intersects each polygon, pairing those values with the polygon’s class label.
import geopandas as gpd
from rasterio.mask import mask
# Load training polygons
training_gdf = gpd.read_file('training_polygons.shp')
# Align coordinate reference systems
if training_gdf.crs != src.crs:
training_gdf = training_gdf.to_crs(src.crs)
train_features = []
train_labels = []
for _, row in training_gdf.iterrows():
# Extract raster values within the polygon boundary
out_image, _ = mask(src, [row.geometry], crop=False, nodata=np.nan)
# Reshape extracted patch to (pixels, bands)
patch_pixels = out_image.reshape(n_bands, -1).T
# Filter out NaN values introduced outside the polygon geometry
valid_patch = patch_pixels[~np.isnan(patch_pixels).any(axis=1)]
# Append to training lists
train_features.append(valid_patch)
train_labels.extend([row['class_id']] * len(valid_patch))
# Stack into final training arrays
X_train = np.vstack(train_features)
y_train = np.array(train_labels)
Production note: If your training dataset contains millions of pixels, consider downsampling each polygon to a fixed count (e.g., np.random.choice(len(valid_patch), size=500, replace=False)) to balance class representation and reduce memory overhead during model fitting.
Step 3: Train the Random Forest Classifier
With a clean tabular dataset, we can instantiate and train the classifier. Random Forests do not require feature scaling, which simplifies preprocessing. We will split the data into training and validation subsets to evaluate generalization performance.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Split data: 80% training, 20% validation
X_tr, X_val, y_tr, y_val = train_test_split(
X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)
# Initialize and train the model
# n_jobs=-1 utilizes all available CPU cores for parallel tree construction
rf_model = RandomForestClassifier(
n_estimators=150,
max_depth=None,
min_samples_split=5,
random_state=42,
n_jobs=-1
)
rf_model.fit(X_tr, y_tr)
# Evaluate on validation set
y_pred_val = rf_model.predict(X_val)
print(classification_report(y_val, y_pred_val))
The classification_report outputs precision, recall, and F1-score per class, highlighting spectral confusion (e.g., distinguishing bare soil from dry grassland). For deeper algorithmic tuning, consult the official scikit-learn Random Forest documentation.
Step 4: Predict and Reconstruct the Spatial Grid
Once validated, the model predicts land cover for the entire valid pixel matrix. We then map these predictions back onto the original spatial dimensions, preserving the nodata mask.
# Predict on all valid pixels
full_predictions = rf_model.predict(valid_pixels)
# Initialize output array with nodata values
classified_grid = np.full((height, width), nodata_val, dtype=np.uint8)
# Inject predictions back into their original spatial positions
classified_grid.reshape(-1)[valid_mask] = full_predictions
Memory consideration: For rasters exceeding 10,000 × 10,000 pixels, loading the entire array into RAM may cause MemoryError. In production environments, implement block-wise processing using src.block_shapes or leverage cloud-native formats like Zarr/Cloud-Optimized GeoTIFF for out-of-core computation.
Step 5: Export and Validate the Thematic Map
The final step writes the classified array to a new GeoTIFF, ensuring geospatial metadata matches the input imagery.
# Update metadata for single-band categorical output
meta.update({
'count': 1,
'dtype': 'uint8',
'nodata': nodata_val,
'compress': 'lzw' # Lossless compression reduces file size
})
with rasterio.open('land_cover_rf.tif', 'w', **meta) as dst:
dst.write(classified_grid, 1)
Validation best practices:
- Confusion Matrix: Compare a holdout set of independent ground truth points against the exported raster using
sklearn.metrics.confusion_matrix. - Spatial Autocorrelation: Verify that classification errors are not clustered, which may indicate systematic sensor artifacts or inadequate training sample distribution.
- Visual Inspection: Overlay the output with high-resolution basemaps or RGB composites to catch edge-case misclassifications.
Conclusion
Training a Random Forest model for land cover classification bridges statistical learning and spatial analysis. By rigorously handling coordinate alignment, nodata masking, and array reshaping, you create a reproducible pipeline that scales from local study areas to regional monitoring programs. As you advance, consider integrating temporal feature stacks, leveraging cloud-native spatial data lakes for distributed training, or transitioning to deep learning architectures for complex urban texture analysis. The foundation established here remains directly applicable across the broader geospatial machine learning ecosystem.