Preparing Training Data for Semantic Segmentation: A Python GIS Workflow
Preparing training data for semantic segmentation is the most critical bottleneck in modern geospatial machine learning. Unlike traditional image classification, which assigns a single label to an entire scene, semantic segmentation requires pixel-level ground truth. Every input tile must have a corresponding label mask that matches its exact spatial extent, resolution, and coordinate reference system (CRS). When executed correctly, this workflow transforms raw satellite imagery and GIS vector annotations into a clean, model-ready dataset that directly dictates the accuracy, convergence speed, and real-world reliability of your architecture.
This guide walks through a production-grade Python GIS pipeline for preparing segmentation datasets, emphasizing spatial integrity, memory efficiency, and statistical validation.
The data preparation pipeline proceeds through the stages shown below.
flowchart LR
A["Imagery + vector<br/>annotations"] --> B["Align CRS,<br/>resolution, extent"]
B --> C["Rasterize polygons<br/>to label masks"]
C --> D["Tile image + mask<br/>(overlap, drop empty)"]
D --> E["Spatial blocking<br/>train / val / test"]
E --> F["Quality control<br/>(imbalance, mask integrity)"]
Step 1: Spatial Alignment and Coordinate System Harmonization
Geospatial datasets rarely arrive pre-aligned. Before any machine learning pipeline can begin, you must verify that your imagery and annotation layers share the same CRS, pixel size, and bounding box. Misaligned rasters introduce geometric distortions that corrupt label masks and degrade model convergence.
Always work in a projected coordinate system (e.g., EPSG:32633 for UTM zones) rather than geographic coordinates (EPSG:4326). Projected systems preserve metric distances, which is essential for calculating accurate patch sizes, buffer zones, and spatial statistics. Geographic coordinates distort area and distance as you move away from the equator, introducing silent errors into your training loop.
Use rasterio to inspect metadata and align layers programmatically. The official rasterio documentation provides comprehensive guidance on CRS transformations and affine matrix handling.
import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling
def align_raster_to_reference(input_path, reference_path, output_path):
with rasterio.open(reference_path) as ref:
dst_crs = ref.crs
dst_transform = ref.transform
dst_shape = (ref.count, ref.height, ref.width)
dst_res = ref.res
with rasterio.open(input_path) as src:
# Calculate transformation to match reference
transform, width, height = calculate_default_transform(
src.crs, dst_crs, src.width, src.height, *src.bounds, resolution=dst_res
)
kwargs = src.meta.copy()
kwargs.update({
'crs': dst_crs,
'transform': transform,
'width': width,
'height': height,
'dtype': src.dtypes[0]
})
with rasterio.open(output_path, 'w', **kwargs) as dst:
for i in range(1, src.count + 1):
reproject(
source=rasterio.band(src, i),
destination=rasterio.band(dst, i),
src_transform=src.transform,
src_crs=src.crs,
dst_transform=transform,
dst_crs=dst_crs,
resampling=Resampling.bilinear
)
Step 2: Converting Vector Annotations to Dense Raster Masks
Most GIS practitioners collect training labels as polygons or points in shapefiles or GeoJSON. Deep learning frameworks, however, expect dense raster masks where each pixel value corresponds to a specific class ID. The rasterio.features.rasterize function efficiently burns vector geometries into a NumPy array aligned with your imagery.
This step bridges traditional GIS workflows with Feature Engineering for Spatial Models, ensuring that categorical boundaries translate cleanly into tensor-compatible formats.
import geopandas as gpd
import numpy as np
from rasterio.features import rasterize
def create_label_mask(shapefile_path: str, reference_raster_path: str, output_mask_path: str) -> None:
with rasterio.open(reference_raster_path) as src:
transform = src.transform
out_shape = (src.height, src.width)
target_crs = src.crs
gdf = gpd.read_file(shapefile_path)
# Harmonize CRS if mismatched
if gdf.crs != target_crs:
gdf = gdf.to_crs(target_crs)
# Prepare (geometry, class_id) tuples
# Ensure class_id is integer and non-negative
shapes = ((geom, int(class_id)) for geom, class_id in zip(gdf.geometry, gdf['class_id']))
# Rasterize with all_touched=True to capture thin features
mask = rasterize(
shapes=shapes,
out_shape=out_shape,
transform=transform,
fill=0, # Background class
dtype=np.uint8,
all_touched=True # Ensures pixels intersecting polygon edges are included
)
with rasterio.open(
output_mask_path, 'w',
driver='GTiff',
height=out_shape[0], width=out_shape[1],
count=1, dtype=np.uint8,
transform=transform,
crs=target_crs
) as dst:
dst.write(mask, 1)
Key parameters explained:
fill=0: Assigns a default background class to pixels outside any polygon.all_touched=True: Forces rasterization to mark any pixel that intersects a polygon boundary, preventing thin linear features (roads, rivers) from disappearing during downsampling.dtype=np.uint8: Limits memory overhead while supporting up to 255 distinct classes.
Step 3: Memory-Efficient Tiling and Patch Extraction
Satellite imagery and aerial orthomosaics routinely exceed GPU memory limits. Tiling splits large rasters into manageable patches (e.g., 256×256 or 512×512 pixels) while preserving spatial context. Unlike bounding-box extraction used in Deep Learning for Object Detection, segmentation tiling requires strict pixel-to-pixel alignment between image and mask arrays.
A sliding-window approach with configurable overlap prevents edge artifacts during inference. Overlapping tiles ensure that features cut by tile boundaries are fully visible in at least one patch, improving boundary prediction accuracy.
import os
from pathlib import Path
import rasterio
import numpy as np
def generate_tiles(image_path: str, mask_path: str, output_dir: str,
tile_size: int = 256, overlap: int = 32) -> None:
Path(output_dir).mkdir(parents=True, exist_ok=True)
img_dir = os.path.join(output_dir, "images")
msk_dir = os.path.join(output_dir, "masks")
Path(img_dir).mkdir(exist_ok=True)
Path(msk_dir).mkdir(exist_ok=True)
with rasterio.open(image_path) as img_src, rasterio.open(mask_path) as msk_src:
height, width = img_src.height, img_src.width
step = tile_size - overlap
tile_idx = 0
for y in range(0, height, step):
for x in range(0, width, step):
# Handle edge cases
y_end = min(y + tile_size, height)
x_end = min(x + tile_size, width)
# Read windows
img_window = img_src.read(window=((y, y_end), (x, x_end)))
msk_window = msk_src.read(window=((y, y_end), (x, x_end)))
# Skip tiles that are entirely background (class 0)
if np.all(msk_window == 0):
continue
img_tile_path = os.path.join(img_dir, f"tile_{tile_idx}.tif")
msk_tile_path = os.path.join(msk_dir, f"tile_{tile_idx}.tif")
# Compute the per-tile transform for the clipped window
window = rasterio.windows.Window(x, y, x_end - x, y_end - y)
tile_transform = img_src.window_transform(window)
# Override size and transform so metadata matches the tile arrays
img_meta = img_src.meta.copy()
img_meta.update(height=y_end - y, width=x_end - x, transform=tile_transform)
msk_meta = msk_src.meta.copy()
msk_meta.update(height=y_end - y, width=x_end - x, transform=tile_transform)
with rasterio.open(img_tile_path, 'w', **img_meta) as dst:
dst.write(img_window)
with rasterio.open(msk_tile_path, 'w', **msk_meta) as dst:
dst.write(msk_window)
tile_idx += 1
Step 4: Spatial Validation and Statistical Quality Control
Random train/validation/test splits fail in geospatial contexts due to Spatial Autocorrelation and Statistics. Nearby pixels share similar spectral and contextual properties, meaning a random split leaks information from training into validation sets, inflating performance metrics artificially.
Implement spatial blocking (e.g., grid-based or watershed-based partitioning) to ensure geographic separation between splits. Additionally, validate your dataset for:
- Class imbalance: Calculate pixel-level frequency distributions. Apply class weighting or focal loss during training if background dominates.
- Mask integrity: Verify that masks contain no
NaNvalues, unexpected class IDs, or misaligned dimensions. - Boundary consistency: Ensure polygon edges align with image features at the target resolution.
Tools like scikit-learn’s GroupKFold or specialized spatial cross-validation libraries help enforce geographic independence during evaluation.
Step 5: Production Pipeline Integration
A well-prepared dataset is only as valuable as the pipeline that consumes it. Integrate your tiling and validation steps into a reproducible workflow using configuration files (YAML/JSON) and logging. Cache preprocessed tiles to avoid redundant I/O during training epochs.
As you scale toward Model Deployment for GIS Applications, consider lazy loading strategies and memory-mapped arrays to handle datasets exceeding RAM capacity. Monitor training curves closely when Evaluating Geospatial AI Performance, paying special attention to boundary IoU (Intersection over Union) and per-class recall. Finally, apply Advanced Geospatial AI Optimization techniques such as mixed-precision training, gradient accumulation, and data augmentation (e.g., random rotations, elastic deformations) to maximize model robustness without inflating dataset size.
Conclusion
Preparing training data for semantic segmentation demands rigorous spatial alignment, precise vector-to-raster conversion, and statistically sound partitioning. By treating data preparation as a deterministic engineering process rather than an ad-hoc preprocessing step, you establish a reliable foundation for high-performance geospatial models. The resulting pipeline scales cleanly from prototype to production, ensuring your architecture learns meaningful spatial patterns rather than artifact-driven noise.