Measuring IoU and F1 Scores for Map Predictions

When training machine learning models to generate land cover classifications, building footprints, or agricultural field boundaries, raw pixel accuracy frequently masks critical spatial errors. A naive model that labels every pixel as background in a scene where the background covers 95% of the area scores 95% accuracy while mapping none of the target features. In geospatial workflows, boundary misalignment, fragmented geometries, and severe class imbalance demand evaluation metrics that explicitly measure spatial overlap rather than simple pixel counts.

Intersection over Union (IoU) and the F1 score provide a rigorous, spatially aware framework for quantifying how well a model captures both the geographic location and the exact extent of target features. This guide explains the mathematical foundations, delivers a production-ready Python implementation, and outlines best practices for integrating these metrics into pipelines for Evaluating Geospatial AI Performance.

Why Pixel Accuracy Fails in Spatial Contexts

Traditional classification metrics assume independent and identically distributed (i.i.d.) samples. Raster data violates this assumption due to spatial autocorrelation: neighboring pixels share spectral, topographic, and contextual similarities. When models are evaluated on highly correlated pixels, accuracy metrics become inflated and fail to penalize boundary shifts or small geometric errors.

Furthermore, geospatial datasets are inherently imbalanced. Roads, wetlands, or infrastructure corridors often occupy less than 2% of a scene. In such cases, precision and recall become far more informative than overall accuracy, and combining them into a single spatially grounded metric prevents misleading performance claims.
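The imbalance problem can be reproduced in a few lines. The sketch below is illustrative only: the 100 x 100 scene, the 2% road coverage, and the all-background "model" are assumptions chosen to mirror the figure quoted above.

```python
import numpy as np

# Hypothetical 100 x 100 scene: a road occupies 2 of 100 rows (2% of pixels).
truth = np.zeros((100, 100), dtype=bool)
truth[49:51, :] = True

# A degenerate model that predicts background everywhere.
pred = np.zeros_like(truth)

accuracy = float((truth == pred).mean())

tp = int(np.sum(truth & pred))    # road pixels correctly predicted
fn = int(np.sum(truth & ~pred))   # road pixels missed
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.98, recall=0.00
```

Overall accuracy rewards the 9,800 background pixels, while recall on the road class shows that none of the 200 target pixels were found.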

The Mathematical Foundations: IoU and F1

Intersection over Union (Jaccard Index)

IoU measures the ratio of overlapping area to the combined area of the predicted and ground truth regions for a specific class:

$$\text{IoU} = \frac{\text{Area}(\text{Intersection})}{\text{Area}(\text{Union})}$$

Values range from 0 (zero spatial overlap) to 1 (perfect geometric match). IoU is highly sensitive to boundary misalignment, making it the industry standard for evaluating segmentation outputs from convolutional neural networks, random forest classifiers, and object-based image analysis pipelines.
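To make the boundary sensitivity concrete, here is a minimal numpy sketch of binary IoU. The function name `binary_iou` and the toy footprint are assumptions for illustration, and returning 0.0 when both masks are empty is one possible convention rather than a standard.

```python
import numpy as np

def binary_iou(truth, pred):
    """IoU of two boolean masks with identical shape."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    intersection = np.logical_and(truth, pred).sum()
    union = np.logical_or(truth, pred).sum()
    # Both masks empty: overlap is undefined; 0.0 is returned by convention here.
    return float(intersection / union) if union > 0 else 0.0

# A 4 x 4 building footprint and the same footprint predicted one pixel to the right.
truth = np.zeros((8, 8), dtype=bool)
truth[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool)
pred[2:6, 3:7] = True

print(binary_iou(truth, pred))  # 12 shared pixels / 20 in the union = 0.6
```

A one-pixel shift leaves pixel accuracy at 56/64 = 0.875 for this tile, yet IoU drops to 0.6, which is the penalty on misalignment described above.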

F1 Score (Harmonic Mean of Precision and Recall)

The F1 score balances two complementary metrics into a single value:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Precision answers: How many predicted pixels actually belong to the target class? Recall answers: How many true target pixels were successfully identified?

The harmonic mean heavily penalizes extreme imbalances between precision and recall. F1 becomes indispensable when mapping rare or fragmented features, ensuring that a model cannot achieve high scores by simply over-predicting or under-predicting.
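The effect of the harmonic mean is easiest to see with numbers. The sketch below uses hypothetical confusion counts for an over-predicting wetland model; `overlap_scores` is an illustrative helper, not part of any library. For a single class, IoU can be written as TP / (TP + FP + FN), which gives the identity F1 = 2·IoU / (1 + IoU).

```python
def overlap_scores(tp, fp, fn):
    """Precision, recall, F1, and IoU from binary confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return precision, recall, f1, iou

# Hypothetical counts: the model finds most wetland pixels (high recall)
# but also labels a lot of background as wetland (low precision).
precision, recall, f1, iou = overlap_scores(tp=90, fp=210, fn=10)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} iou={iou:.2f}")
# precision=0.30 recall=0.90 f1=0.45 iou=0.29
```

The arithmetic mean of precision and recall would be 0.60 here; the harmonic mean pulls F1 down to 0.45, and IoU is lower still because it counts each error against the union.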

Prerequisites for Spatial Metric Computation

Before computing metrics, your ground truth and prediction rasters must satisfy three strict alignment conditions:

  1. Identical Spatial Resolution: Pixel dimensions must match exactly (e.g., both 0.5m/pixel).
  2. Matching Coordinate Reference Systems (CRS): Both rasters must use the same projection or geographic coordinate system.
  3. Consistent Extent and Nodata Handling: Rasters must cover the exact same geographic bounds, and invalid pixels (clouds, sensor gaps, or padding) must be masked identically.

Failure to align rasters will produce artificially low IoU/F1 scores due to spatial offsets, not model deficiencies. Tools like rasterio.warp.reproject or gdal.Warp should be used for preprocessing.
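A quick pre-flight check can confirm the three conditions before any scoring. The sketch below is an assumption-laden illustration: `alignment_problems` is a hypothetical helper that only reads the `crs`, `transform`, `width`, and `height` attributes, which open rasterio datasets expose. It is exercised here with stand-in objects so that it runs without any files.

```python
import math
from types import SimpleNamespace

def alignment_problems(a, b, tol=1e-9):
    """Return a list of mismatches between two raster-like objects.

    `a` and `b` need `crs`, `transform`, `width`, and `height` attributes.
    """
    problems = []
    if a.crs != b.crs:
        problems.append("CRS differs")
    if (a.width, a.height) != (b.width, b.height):
        problems.append("grid size differs")
    # The affine transform encodes pixel size and the position of the top-left corner.
    coeffs_a, coeffs_b = tuple(a.transform), tuple(b.transform)
    if len(coeffs_a) != len(coeffs_b) or not all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(coeffs_a, coeffs_b)
    ):
        problems.append("transform differs (resolution or origin)")
    return problems

# Stand-in objects; transforms use the (a, b, c, d, e, f) affine coefficient order.
truth = SimpleNamespace(crs="EPSG:32633", width=512, height=512,
                        transform=(0.5, 0.0, 500000.0, 0.0, -0.5, 4649776.0))
shifted = SimpleNamespace(crs="EPSG:32633", width=512, height=512,
                          transform=(0.5, 0.0, 500000.25, 0.0, -0.5, 4649776.0))

print(alignment_problems(truth, truth))    # []
print(alignment_problems(truth, shifted))  # ['transform differs (resolution or origin)']
```

The second case is the one a shape comparison alone would miss: both grids are 512 x 512, but the prediction is offset by half a pixel. With real files, the same function could be called on two datasets opened with rasterio.open.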

Production-Ready Python Implementation

The following script uses rasterio for spatial I/O, numpy for array manipulation, and scikit-learn for statistical scoring. It scores binary rasters by default and multi-class rasters when average is set to 'macro', 'weighted', or 'micro', and it excludes nodata pixels from evaluation.

The metric computation pipeline is outlined below.

  1. Inputs: ground truth raster and prediction raster.
  2. Validate identical shape and CRS.
  3. Combine valid masks (exclude nodata).
  4. Flatten to 1D arrays.
  5. Compute IoU, F1, Precision, and Recall.

import numpy as np
import rasterio
from sklearn.metrics import jaccard_score, f1_score, precision_score, recall_score

def load_raster_and_mask(path):
    """
    Load a single-band raster and return the data array with a valid pixel mask.
    """
    with rasterio.open(path) as src:
        # Keep the native dtype: class labels are integers and need no float cast
        data = src.read(1)
        nodata = src.nodata

        # Create a boolean mask for valid pixels
        if nodata is None:
            valid_mask = np.ones(data.shape, dtype=bool)
        elif np.isnan(nodata):
            # NaN never equals itself, so `data != nodata` would mark every pixel valid
            valid_mask = ~np.isnan(data)
        else:
            valid_mask = data != nodata

        return data, valid_mask

def compute_map_metrics(gt_path, pred_path, average='binary', zero_division=0.0):
    """
    Calculate IoU and F1 scores for aligned geospatial rasters.

    Parameters:
    -----------
    gt_path : str
        Path to the ground truth raster.
    pred_path : str
        Path to the predicted raster.
    average : str, default='binary'
        Averaging method. 'binary' scores the positive class (label 1) of a
        two-class raster; use 'macro', 'weighted', or 'micro' for multi-class rasters.
    zero_division : float, default=0.0
        Value to return when a metric is undefined (zero denominator).

    Returns:
    --------
    dict : Dictionary containing IoU, F1, Precision, and Recall.
    """
    # Compare CRS metadata before reading any pixel data
    with rasterio.open(gt_path) as gt_src, rasterio.open(pred_path) as pred_src:
        if gt_src.crs != pred_src.crs:
            raise ValueError("Ground truth and prediction rasters must share the same CRS.")

    gt_data, gt_mask = load_raster_and_mask(gt_path)
    pred_data, pred_mask = load_raster_and_mask(pred_path)

    if gt_data.shape != pred_data.shape:
        raise ValueError("Ground truth and prediction rasters must have identical dimensions.")

    # Combine masks: only evaluate pixels that are valid in BOTH rasters
    combined_mask = gt_mask & pred_mask
    if not combined_mask.any():
        raise ValueError("No pixels are valid in both rasters.")

    # Boolean indexing returns aligned 1D label vectors
    gt_flat = gt_data[combined_mask]
    pred_flat = pred_data[combined_mask]

    # Compute spatial metrics
    metrics = {
        "IoU": jaccard_score(gt_flat, pred_flat, average=average, zero_division=zero_division),
        "F1": f1_score(gt_flat, pred_flat, average=average, zero_division=zero_division),
        "Precision": precision_score(gt_flat, pred_flat, average=average, zero_division=zero_division),
        "Recall": recall_score(gt_flat, pred_flat, average=average, zero_division=zero_division)
    }

    return metrics

# Example usage:
# results = compute_map_metrics("ground_truth.tif", "model_prediction.tif", average='binary')
# print(f"IoU: {results['IoU']:.4f} | F1: {results['F1']:.4f}")

Step-by-Step Code Breakdown

  1. Safe Raster Loading: The load_raster_and_mask function uses rasterio’s context manager to ensure files are properly closed. It extracts the nodata value from the raster metadata and creates a boolean mask to exclude invalid pixels from evaluation.
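  2. Dimension Validation: The script explicitly checks that both arrays share the same shape before scoring. A shape match alone does not prove alignment: rasters of equal size can still differ in geotransform or CRS and score misleadingly low, so the prerequisites listed earlier must hold as well.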
  3. Combined Masking: combined_mask = gt_mask & pred_mask ensures that only pixels valid in both rasters are evaluated. This is critical when ground truth and predictions have different cloud cover or sensor gaps.
  4. Array Flattening: sklearn metrics expect 1D arrays. Indexing with the boolean mask flattens the spatial grid into a vector of class labels, preserving the spatial correspondence between ground truth and predictions.
  5. Metric Computation: The jaccard_score function directly computes IoU. Setting zero_division=0.0 makes a metric return 0 without raising a warning when its denominator is zero, for example when a class is entirely absent in both arrays, a common scenario in highly imbalanced geospatial datasets.
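Steps 3 to 5 can be traced on small in-memory arrays. The following numpy-only sketch is not the script above: the nodata label 255, the 3 x 3 tiles, and the three-class setup are assumptions, and the per-class IoU is derived from a confusion matrix rather than from scikit-learn. The unweighted mean of the per-class values is what a macro average reports.

```python
import numpy as np

NODATA = 255  # hypothetical nodata label for this example

truth = np.array([[0, 1, 1],
                  [2, 2, NODATA],
                  [0, 0, 1]], dtype=np.uint8)
pred = np.array([[0, 1, 0],
                 [2, 1, 2],
                 [NODATA, 0, 1]], dtype=np.uint8)

# Step 3: keep only pixels that are valid in both rasters.
combined_mask = (truth != NODATA) & (pred != NODATA)

# Step 4: boolean indexing returns aligned 1D label vectors.
truth_flat = truth[combined_mask]
pred_flat = pred[combined_mask]

# Step 5: per-class IoU from a confusion matrix (rows = truth, columns = prediction).
n_classes = 3
confusion = np.bincount(
    n_classes * truth_flat.astype(np.int64) + pred_flat.astype(np.int64),
    minlength=n_classes ** 2,
).reshape(n_classes, n_classes)

tp = np.diag(confusion)
fp = confusion.sum(axis=0) - tp
fn = confusion.sum(axis=1) - tp
union = tp + fp + fn
per_class_iou = np.where(union > 0, tp / np.maximum(union, 1), 0.0)

print(truth_flat.size)       # 7 of the 9 pixels survive the combined mask
print(per_class_iou)         # IoU for classes 0, 1, 2
print(per_class_iou.mean())  # unweighted (macro) mean over classes
```

Reporting the per-class values alongside the average is often more informative, since a strong background class can hide a weak rare class.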

For detailed configuration options, refer to the official scikit-learn model evaluation documentation and the rasterio API reference.

Interpreting Results and GIS Best Practices

Thresholds and Real-World Expectations

  • IoU > 0.70: Excellent spatial agreement. Suitable for high-precision applications like cadastral mapping or infrastructure planning.
  • IoU 0.50–0.70: Acceptable for regional land cover mapping or preliminary environmental assessments.
  • IoU < 0.50: Indicates significant boundary drift or class confusion. Requires model retraining, improved training data, or post-processing.

Handling Fragmented Features

When mapping discontinuous features like wetlands or urban green spaces, IoU naturally penalizes small disconnected predictions. To improve scores without compromising spatial accuracy, integrate morphological operations (e.g., opening/closing) during Feature Engineering for Spatial Models to remove salt-and-pepper noise before evaluation.
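As an illustration of morphological opening, here is a numpy-only sketch with a fixed 3 x 3 square window. The helper names are hypothetical; in practice a library routine such as scipy.ndimage.binary_opening provides the same operation with configurable structuring elements.

```python
import numpy as np

def _neighborhoods(mask):
    """Stack the nine 3 x 3-shifted copies of a boolean mask (False beyond the edges)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    return np.stack([padded[i:i + h, j:j + w] for i in range(3) for j in range(3)])

def erode(mask):
    # A pixel survives only if its whole 3 x 3 neighborhood is set.
    return _neighborhoods(mask).all(axis=0)

def dilate(mask):
    # A pixel is set if any pixel in its 3 x 3 neighborhood is set.
    return _neighborhoods(mask).any(axis=0)

def opening(mask):
    """Erosion followed by dilation: removes specks smaller than the 3 x 3 window."""
    return dilate(erode(mask))

pred = np.zeros((7, 7), dtype=bool)
pred[1:4, 1:4] = True   # a 3 x 3 patch that should survive
pred[5, 5] = True       # an isolated false-positive pixel

cleaned = opening(pred)
print(pred.sum(), cleaned.sum())  # 10 9
```

Opening also deletes genuine features narrower than the window, such as one-pixel-wide roads, so the window size should be chosen against the smallest feature that must be preserved.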

From Evaluation to Deployment

In production environments, IoU and F1 should be monitored continuously. Drift in these metrics often signals changes in sensor characteristics, seasonal vegetation shifts, or urban expansion. Integrating automated scoring into Model Deployment for GIS Applications ensures that retraining triggers are tied to spatial performance degradation rather than arbitrary time intervals.
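One way to tie a retraining trigger to spatial performance is to compare recent IoU against a baseline period. This is a minimal sketch: the function name, the window sizes, and the 0.05 drop threshold are all illustrative assumptions, not recommendations.

```python
from statistics import mean

def needs_retraining(iou_history, baseline_window=5, recent_window=3, max_drop=0.05):
    """Flag retraining when recent mean IoU falls below the baseline mean by more than max_drop."""
    if len(iou_history) < baseline_window + recent_window:
        return False  # not enough evaluations yet to compare
    baseline = mean(iou_history[:baseline_window])
    recent = mean(iou_history[-recent_window:])
    return (baseline - recent) > max_drop

# Hypothetical IoU values, one per evaluation batch.
stable = [0.74, 0.75, 0.73, 0.74, 0.75, 0.74, 0.73, 0.74]
drifting = [0.74, 0.75, 0.73, 0.74, 0.75, 0.70, 0.66, 0.63]

print(needs_retraining(stable))    # False
print(needs_retraining(drifting))  # True
```

Averaging over a short window keeps a single cloudy or atypical scene from triggering a retrain on its own.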

Optimizing Model Architectures

During training, standard cross-entropy loss scores each pixel independently and often fails to optimize spatial overlap. Dice Loss and Tversky Loss are differentiable overlap measures, so switching to them brings the optimization objective much closer to IoU maximization. This is a cornerstone of Advanced Geospatial AI Optimization, particularly for Deep Learning for Object Detection and semantic segmentation tasks where boundary precision dictates downstream utility.
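The relationship between these losses and IoU can be sketched in numpy. This is an illustration of the formulas only, assuming a binary target and per-pixel probabilities; an actual training loop would express the same arithmetic in an autodiff framework. The Tversky index is TP / (TP + alpha·FP + beta·FN), which reduces to the Dice coefficient at alpha = beta = 0.5 and to IoU at alpha = beta = 1.

```python
import numpy as np

def tversky_index(probs, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Soft Tversky index between predicted probabilities and a binary target.

    alpha weights false positives and beta weights false negatives.
    """
    probs = np.asarray(probs, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    tp = np.sum(probs * target)
    fp = np.sum(probs * (1.0 - target))
    fn = np.sum((1.0 - probs) * target)
    # eps keeps the ratio defined when the target and prediction are both empty
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(probs, target, alpha=0.5, beta=0.5):
    return 1.0 - tversky_index(probs, target, alpha, beta)

target = np.array([[0, 1, 1],
                   [0, 1, 0]])
probs = np.array([[0.1, 0.9, 0.6],
                  [0.2, 0.8, 0.3]])

dice_loss = tversky_loss(probs, target)                       # alpha = beta = 0.5
miss_averse = tversky_loss(probs, target, alpha=0.3, beta=0.7)  # penalize misses more
print(round(dice_loss, 4), round(miss_averse, 4))
```

Raising beta above alpha penalizes false negatives more heavily, which can help when the target class is rare and missed features cost more than false alarms.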

Conclusion

Measuring IoU and F1 scores for map predictions transforms subjective visual inspection into quantifiable, reproducible spatial evaluation. By properly aligning rasters, masking invalid pixels, and leveraging robust statistical libraries, GIS practitioners can accurately diagnose model behavior, optimize training pipelines, and confidently deploy geospatial AI into operational workflows. As spatial datasets grow in complexity and resolution, these metrics will remain the foundational standard for validating that machine learning outputs truly reflect the physical world.