Implementing Spatial Data Versioning with Git

Traditional version control systems like Git are optimized for line-based text, but spatial datasets introduce binary structures and floating-point coordinate drift that break standard workflows. Shapefiles, GeoTIFFs, and GeoPackages generate unreadable diffs and rapidly inflate repository size. To version spatial data reliably, you must convert geometries into deterministic, text-based formats, neutralize coordinate noise, and establish a programmatic diffing pipeline.

flowchart LR
    A[Spatial dataset] --> B["Normalize: snap precision + sort + export GeoJSON"]
    B --> C[Configure Git: .gitattributes + LFS]
    C --> D[Commit to repository]
    D --> E["Programmatic spatial diff by id"]
    E --> F["Report added / removed / modified"]

Step 1: Normalize Coordinates and Export to Text

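GIS software stores coordinates as 64-bit floating-point numbers. Exporting through different tools or reprojecting introduces microscopic rounding differences (on the order of 0.0000000001), and because each affected coordinate changes its text representation, Git reports every affected feature as modified, often the entire file. The fix is to snap vertices to a fixed precision grid and enforce a deterministic row and column order before export. The grid size is expressed in the units of the dataset's coordinate reference system: with coordinates in degrees, a precision of 6 decimal places corresponds to roughly 0.1 m at the equator.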
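To see why snapping helps, here is a minimal pure-Python sketch of the idea. It uses the built-in round() as a stand-in for a precision grid (it is not the Shapely routine), and the coordinates and the snap helper are made up for illustration.

```python
def snap(coords, precision=6):
    """Round every (x, y) pair to a fixed number of decimal places."""
    return [(round(x, precision), round(y, precision)) for x, y in coords]


# Two exports of the same vertex that differ only by floating-point drift
export_a = [(12.4924001, 41.8902001)]
export_b = [(12.4924001 + 1e-10, 41.8902001 - 1e-10)]

print(export_a == export_b)              # the raw values differ
print(snap(export_a) == snap(export_b))  # the snapped values match
```

Snapping does not remove drift in every case: a coordinate that sits almost exactly on a rounding boundary can still land in different grid cells between exports, so pick a precision coarser than the noise you expect.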

import geopandas as gpd
import shapely
from pathlib import Path

def prepare_for_git(input_path: str, output_path: str, precision: int = 6) -> None:
    """
    Normalizes spatial data for Git by rounding coordinates,
    sorting deterministically, and exporting to plain-text GeoJSON.
    """
    gdf = gpd.read_file(input_path)
    geom_col = gdf.geometry.name
    attr_cols = sorted([c for c in gdf.columns if c != geom_col])

    # Snap vertices to a fixed grid to eliminate floating-point drift
    # Requires Shapely 2.0+: https://shapely.readthedocs.io/en/stable/reference/shapely.set_precision.html
    grid_size = 10 ** -precision
    gdf.geometry = gdf.geometry.apply(lambda geom: shapely.set_precision(geom, grid_size))

    # Enforce deterministic output order
    gdf = gdf.sort_values(by=attr_cols).reset_index(drop=True)
    gdf = gdf[attr_cols + [geom_col]]

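    # Export as human-readable, diff-friendly GeoJSON,
    # creating the target folder first if it does not exist
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    gdf.to_file(output_path, driver="GeoJSON")
    print(f"Normalized dataset exported to {output_path}")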

# Usage: prepare_for_git("input/parcels.gpkg", "version/parcels_v2.geojson", precision=5)
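One way to confirm that normalization is deterministic is to export the same input twice and compare the resulting files byte for byte. The sketch below uses only the standard library; the helper names file_digest and is_byte_identical are hypothetical and not part of GeoPandas or Git.

```python
import hashlib
from pathlib import Path


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_byte_identical(path_a: str, path_b: str) -> bool:
    """True when both files have exactly the same content."""
    return file_digest(path_a) == file_digest(path_b)
```

If two runs of prepare_for_git() on the same input yield different digests, something in the pipeline (row order, precision, or library version) is still non-deterministic.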

Step 2: Configure Git for Spatial Workflows

Once exported, GeoJSON files behave like standard code files. However, spatial repositories require explicit Git configuration to prevent line-ending corruption and manage large geometry files. Add a .gitattributes file to your repository root:

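# Diffable GeoJSON: normalize to LF and use a custom diff driver
*.geojson text diff=geojson eol=lf
# Binary source formats: store through Git LFS
*.gpkg filter=lfs diff=lfs merge=lfs -text
*.tif filter=lfs diff=lfs merge=lfs -text

The *.geojson line enforces Unix line endings and names a custom diff driver (geojson). The driver only takes effect once it is defined in your Git configuration; until then Git falls back to its normal text diff. The LFS lines route binary formats through Git LFS, which commits small pointer files to the repository and keeps the actual content in separate LFS storage. This limits repository bloat while preserving full version history. Do not apply both rule sets to the same pattern: when several lines match a path, a later line overrides an earlier one attribute by attribute, so an LFS rule placed after the text rule would mark GeoJSON as -text and replace readable diffs with pointer diffs. When integrating spatial data into broader pipelines, align these practices with your organization’s Enterprise GIS Architecture standards to ensure consistent data governance across teams.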
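The diff=geojson attribute only names a driver; Git needs a matching diff.geojson.textconv entry in its configuration before it changes how diffs are displayed. As a minimal sketch, the script below (a hypothetical geojson_textconv.py, not part of any library) prints one sorted line per feature so that git diff shows changes feature by feature. It assumes each feature carries an id either in its properties or at feature level, and it uses only the standard library.

```python
import json
import sys


def summarize_geojson(text: str) -> str:
    """Return one sorted line per feature: id, geometry type, coordinates, properties."""
    data = json.loads(text)
    lines = []
    for feature in data.get("features", []):
        props = feature.get("properties") or {}
        geom = feature.get("geometry") or {}
        # Assumption: the identifier lives in properties["id"] or in the feature-level "id"
        fid = props.get("id", feature.get("id"))
        lines.append(
            f"id={fid} type={geom.get('type')} "
            f"coords={json.dumps(geom.get('coordinates'))} "
            f"props={json.dumps(props, sort_keys=True)}"
        )
    return "\n".join(sorted(lines))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Git appends the path of the file to convert as the first argument
    with open(sys.argv[1], encoding="utf-8") as fh:
        print(summarize_geojson(fh.read()))
```

To register it, something like git config diff.geojson.textconv "python geojson_textconv.py" should work, with the script path adjusted to wherever you keep it. The converted text is used for display only; the committed file is unchanged.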

Step 3: Programmatic Spatial Diffing

Line-by-line text diffs are ineffective for spatial data. Coordinate arrays shift even when geometries remain visually identical. A reliable diff requires comparing features by a stable identifier and evaluating geometric equality directly.

import geopandas as gpd
import shapely

def diff_spatial_versions(old_path: str, new_path: str) -> list[dict]:
    """
    Compares two normalized GeoJSON files and returns a list of changes.
    Requires a consistent 'id' column in both datasets.
    """
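    old_gdf = gpd.read_file(old_path)
    new_gdf = gpd.read_file(new_path)

    # Index geometries by id once so each lookup avoids a full table scan
    old_geoms = old_gdf.set_index("id").geometry
    new_geoms = new_gdf.set_index("id").geometry
    for name, geoms in (("old", old_geoms), ("new", new_geoms)):
        if not geoms.index.is_unique:
            raise ValueError(f"Duplicate 'id' values in the {name} dataset")

    old_ids = set(old_geoms.index)
    new_ids = set(new_geoms.index)

    changes = []

    # Track added and removed features (sorted by string form for reproducible output)
    for fid in sorted(new_ids - old_ids, key=str):
        changes.append({"id": fid, "status": "added"})
    for fid in sorted(old_ids - new_ids, key=str):
        changes.append({"id": fid, "status": "removed"})

    # Check modified geometries. shapely.equals tests topological equality,
    # so a different vertex order or ring start point alone is not a change.
    for fid in sorted(old_ids & new_ids, key=str):
        if not shapely.equals(old_geoms.loc[fid], new_geoms.loc[fid]):
            changes.append({"id": fid, "status": "modified"})

    return changes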

# Example: print(diff_spatial_versions("v1/parcels.geojson", "v2/parcels.geojson"))

This approach bypasses coordinate array noise and reports only meaningful structural changes. Note that it compares geometries only: an attribute edit on a feature whose geometry is unchanged is not reported. Understanding these versioning mechanics is a core component of mastering Fundamentals of Python GIS, particularly when automating data validation or deployment workflows.
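For automated validation, the change list can be condensed into a one-line report. The sketch below assumes the list-of-dicts shape returned by diff_spatial_versions() ("id" and "status" keys); summarize_changes is a hypothetical helper name.

```python
from collections import Counter


def summarize_changes(changes: list[dict]) -> str:
    """Count changes per status and format them in a fixed order."""
    counts = Counter(change["status"] for change in changes)
    return ", ".join(
        f"{counts.get(status, 0)} {status}"
        for status in ("added", "removed", "modified")
    )


example = [
    {"id": 7, "status": "added"},
    {"id": 3, "status": "modified"},
    {"id": 4, "status": "modified"},
]
print(summarize_changes(example))  # 1 added, 0 removed, 2 modified
```

A CI job could print this summary in a pull-request comment or fail the build when unexpected removals appear.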

Debugging and Fast Resolution Checklist

| Symptom | Root Cause | Resolution |
| --- | --- | --- |
| git diff shows entire file changed | Floating-point drift or unsorted rows | Re-run prepare_for_git() with consistent precision and verify sort_values() uses stable columns. |
| shapely.set_precision error | Outdated Shapely version | Upgrade, quoting the requirement so the shell does not treat > as a redirect: pip install --upgrade "shapely>=2.0.0" |
| GeoJSON shows spurious line-ending changes in Git | Mixed line endings (CRLF/LF) | Add *.geojson eol=lf to .gitattributes and run git add --renormalize . |
| Diff reports false modifications | Missing or inconsistent id column | Ensure both datasets share a primary key before running diff_spatial_versions(). |
| Repository size spikes after commit | Large geometries tracked natively | Track oversized files with Git LFS, for example git lfs track "*.geojson", then recommit. LFS-tracked files are stored as pointers, so they no longer produce readable text diffs in Git. |

For persistent coordinate validation issues, cross-reference your output with the official GeoJSON Specification to ensure RFC 7946 compliance before merging into shared branches.
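RFC 7946 expects coordinates in WGS 84 longitude and latitude, so one quick sanity check is to flag features whose positions fall outside the valid degree ranges, which usually means the data is still in a projected CRS. The sketch below uses only the standard library and is far from a full compliance check; find_out_of_range and iter_positions are hypothetical helpers, and GeometryCollection geometries are skipped for brevity.

```python
import json


def iter_positions(coords):
    """Yield [lon, lat, ...] positions from arbitrarily nested coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for part in coords:
            yield from iter_positions(part)


def find_out_of_range(geojson_text: str) -> list[tuple]:
    """Return (feature index, lon, lat) for the first bad position in each feature."""
    data = json.loads(geojson_text)
    problems = []
    for index, feature in enumerate(data.get("features", [])):
        geom = feature.get("geometry")
        if not geom or "coordinates" not in geom:
            continue  # null geometry or GeometryCollection: not checked here
        for lon, lat, *_ in iter_positions(geom["coordinates"]):
            if not (-180 <= lon <= 180 and -90 <= lat <= 90):
                problems.append((index, lon, lat))
                break
    return problems
```

An empty result only means the values are plausible degrees; it does not prove the axis order or the CRS is correct.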