Reading and Writing GeoJSON Files Efficiently in Python

GeoJSON has become a cornerstone of modern geospatial workflows due to its human-readable structure, native compatibility with web mapping libraries, and straightforward JSON serialization. For practitioners navigating the Fundamentals of Python GIS, mastering input/output operations for this format is essential. Unlike legacy binary formats, GeoJSON stores geometry and attributes as nested dictionaries. This design offers remarkable flexibility but introduces significant memory overhead when handling datasets exceeding a few hundred megabytes. Understanding how Vector Data Formats manage serialization and memory allocation enables developers to select the appropriate parsing strategy for their pipeline.

Environment Preparation

Before implementing I/O routines, ensure your Python environment is isolated and properly configured. Virtual environments prevent dependency conflicts and guarantee reproducible execution across development and production servers. Install the required packages using pip or conda:

pip install geopandas pandas ijson shapely

When configuring geospatial environments, prioritize geopandas>=0.13.0 and shapely>=2.0.0. These versions leverage modern GEOS bindings, improved coordinate handling, and vectorized operations that dramatically accelerate spatial computations.
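
A quick way to confirm that an environment meets these minimums is sketched below. The helper names (`parse_version`, `meets_minimum`, `check_environment`) are hypothetical, and the comparison only understands plain dotted versions; anything after the leading digits of each component, such as a pre-release suffix, is ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the recommendation above.
MINIMUMS = {"geopandas": "0.13.0", "shapely": "2.0.0"}


def parse_version(text: str) -> tuple:
    """Turn '2.0.1' into (2, 0, 1), padding missing components with zeros."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(installed: str, required: str) -> bool:
    """Return True when the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def check_environment(minimums: dict = MINIMUMS) -> dict:
    """Map each package name to True/False; missing packages count as False."""
    report = {}
    for package, required in minimums.items():
        try:
            report[package] = meets_minimum(version(package), required)
        except PackageNotFoundError:
            report[package] = False
    return report


if __name__ == "__main__":
    print(check_environment())
```

Running the check at the start of a pipeline makes a version mismatch visible before any data is read.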

Coordinate Reference System Constraints

The GeoJSON specification (RFC 7946) strictly mandates WGS84 (EPSG:4326) longitude-latitude coordinates. Any coordinate transformation must occur before writing or immediately after reading. While GeoPandas handles projection metadata automatically, manual JSON parsing requires explicit validation to prevent spatial misalignment or silent coordinate inversion. Always verify the crs attribute early in your workflow to avoid downstream topology errors and ensure interoperability with web mapping frameworks.
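
When features are parsed as plain dictionaries, a bounds check is one inexpensive form of that explicit validation. The sketch below is an assumption about how such a check might look, not part of any library: `iter_positions` and `check_lonlat` are hypothetical helpers, GeometryCollection members are not traversed, and a swapped pair whose values both fall within ±90 cannot be detected this way.

```python
def iter_positions(coords):
    """Yield every [x, y, ...] position from a nested GeoJSON coordinate array."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords
    else:
        for part in coords:
            yield from iter_positions(part)


def check_lonlat(geometry: dict) -> list:
    """Return human-readable problems found in a GeoJSON geometry dict."""
    problems = []
    for x, y, *_ in iter_positions(geometry.get("coordinates", [])):
        if -180 <= x <= 180 and -90 <= y <= 90:
            continue  # plausible longitude-latitude pair
        if -90 <= x <= 90 and -180 <= y <= 180:
            # Valid only if read as (lat, lon): likely an axis-order mix-up.
            problems.append(f"({x}, {y}) may be in latitude-longitude order")
        else:
            # Far outside degree ranges: probably projected coordinates.
            problems.append(f"({x}, {y}) is outside WGS84 bounds")
    return problems


print(check_lonlat({"type": "Point", "coordinates": [13.4, 52.5]}))
print(check_lonlat({"type": "Point", "coordinates": [52.5, 120.0]}))
```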

Efficient Reading Strategies

Standard geopandas.read_file() loads the entire dataset into memory. This works well for files under roughly 50 MB, but much larger exports can exhaust available RAM and raise MemoryError.

flowchart TD
    A[GeoJSON input] --> B{File size?}
    B -->|small to medium| C["read_file() -> GeoDataFrame"]
    B -->|large / exceeds RAM| D["Stream with ijson"]
    D --> E["Batch into chunks via from_features()"]
    C --> F[Validate CRS = EPSG:4326]
    E --> F
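
The size decision in the flowchart can be made programmatically. This is a minimal sketch: the function name `choose_read_strategy` is hypothetical, and the default threshold simply mirrors the 50 MB rule of thumb, which should be tuned to the memory actually available.

```python
import os

# Assumption: mirrors the 50 MB rule of thumb; adjust for your hardware.
SIZE_THRESHOLD_BYTES = 50 * 1024 * 1024


def choose_read_strategy(filepath: str, threshold: int = SIZE_THRESHOLD_BYTES) -> str:
    """Return 'standard' for files below the threshold, else 'streaming'."""
    size = os.path.getsize(filepath)
    return "standard" if size < threshold else "streaming"
```

The returned label can then be used to dispatch to a full read or to a streaming reader.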

Standard Approach (Small to Medium Files)

For typical project files, a direct read with automatic CRS validation is both readable and performant:

import geopandas as gpd

def read_geojson_standard(filepath: str) -> gpd.GeoDataFrame:
    """Reads a GeoJSON file into a GeoDataFrame with CRS validation."""
    gdf = gpd.read_file(filepath)
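    if gdf.crs is None:
        # No CRS metadata: assume the RFC 7946 default.
        gdf = gdf.set_crs(epsg=4326)
    elif gdf.crs.to_epsg() != 4326:
        # Legacy files may declare another CRS; normalize to WGS84.
        gdf = gdf.to_crs(epsg=4326)
    return gdf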

Streaming Approach (Large Files)

For datasets that exceed available RAM, use ijson to parse the file incrementally. This approach extracts features one by one and batches them into manageable chunks, each of which is converted to a GeoDataFrame. Because only one chunk is held at a time, peak memory usage is bounded by the chunk size rather than by the file size.

import ijson
import geopandas as gpd

def stream_geojson(filepath: str, chunk_size: int = 10000):
    """Yields GeoDataFrames in chunks from a large GeoJSON file."""
    chunk = []
    # ijson parses bytes, so open the file in binary mode.
    with open(filepath, "rb") as f:
        # use_float=True (ijson 3.1+) yields floats instead of Decimal objects,
        # which shapely and pandas would otherwise have to cope with.
        for feature in ijson.items(f, "features.item", use_float=True):
            chunk.append(feature)
            if len(chunk) >= chunk_size:
                yield gpd.GeoDataFrame.from_features(chunk, crs="EPSG:4326")
                chunk = []  # start a fresh list for the next batch
    if chunk:
        # Emit the final, possibly shorter, batch.
        yield gpd.GeoDataFrame.from_features(chunk, crs="EPSG:4326")
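
The batching logic is independent of ijson and can be illustrated on its own. The sketch below uses only the standard library; `iter_chunks` is a hypothetical helper name, and it accepts any iterable, so it would work equally well on a stream of parsed features.

```python
from itertools import islice


def iter_chunks(iterable, chunk_size: int):
    """Yield lists of at most chunk_size items, holding one list at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(iterable)
    while True:
        # islice pulls only the next chunk_size items from the source.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


for batch in iter_chunks(range(7), 3):
    print(batch)
```

Each batch is a new list, so a consumer can keep a reference to it without the next iteration overwriting its contents.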

Memory-Aware Writing Techniques

Writing GeoJSON efficiently follows the same memory-management principles as reading. Exporting a 500 MB GeoDataFrame in a single operation can temporarily double RAM consumption during serialization.
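
Serialization overhead is easy to measure before committing to a strategy. The following sketch uses the standard library's tracemalloc on synthetic point features; the helper `peak_bytes_during` and the sample data are illustrative assumptions, and figures for a real GeoDataFrame export will differ.

```python
import json
import tracemalloc


def peak_bytes_during(func, *args):
    """Run func(*args) and return (result, peak bytes allocated meanwhile)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Synthetic data: 10,000 point features built before tracing starts.
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [i * 0.001, i * 0.001]},
            "properties": {"id": i},
        }
        for i in range(10_000)
    ],
}

text, peak = peak_bytes_during(json.dumps, collection)
print(f"output: {len(text)} characters, peak extra allocation: {peak} bytes")
```

Only allocations made while serializing are counted, which isolates the temporary cost of building the output from the cost of holding the data itself.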

Direct Export

For standard workflows, GeoPandas provides a reliable one-line export that handles geometry serialization and attribute formatting automatically:

import geopandas as gpd

def write_geojson_standard(gdf: gpd.GeoDataFrame, filepath: str) -> None:
    """Exports a GeoDataFrame to a GeoJSON file."""
    if gdf.crs is not None:
        # RFC 7946 expects WGS84, so reproject before writing.
        gdf = gdf.to_crs(epsg=4326)
    gdf.to_file(filepath, driver="GeoJSON")

Chunked Output for Enterprise Workflows

When generating GeoJSON from databases or streaming APIs, write features incrementally to avoid holding the entire dataset in memory. The following pattern opens a file handle once, writes the JSON structure manually, and flushes data in controlled batches:

import json
from shapely.geometry import mapping

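def write_geojson_chunked(feature_generator, filepath: str, flush_every: int = 1000) -> None:
    """Streams (geometry, properties) pairs to a GeoJSON file without loading all into memory."""
    with open(filepath, "w", encoding="utf-8") as f:
        f.write('{"type": "FeatureCollection", "features": [')
        for count, (geometry, properties) in enumerate(feature_generator):
            if count:
                f.write(",")  # separator before every feature except the first
            # mapping() converts a shapely geometry into a GeoJSON-style dict.
            feature = {"type": "Feature", "geometry": mapping(geometry), "properties": properties}
            f.write(json.dumps(feature))
            if (count + 1) % flush_every == 0:
                f.flush()  # push buffered output to disk in controlled batches
        f.write("]}")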
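
When the source is a database cursor or an API response rather than shapely geometries, features can be assembled as plain dictionaries. The sketch below is a standard-library-only variant under stated assumptions: rows arrive as hypothetical (lon, lat, attributes) tuples describing points, and both `rows_to_features` and `write_feature_dicts` are illustrative names.

```python
import json


def rows_to_features(rows):
    """Lazily convert (lon, lat, attributes) rows into GeoJSON Feature dicts."""
    for lon, lat, attributes in rows:
        yield {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": dict(attributes),
        }


def write_feature_dicts(features, filepath: str) -> None:
    """Write Feature dicts one at a time into a single FeatureCollection."""
    with open(filepath, "w", encoding="utf-8") as f:
        f.write('{"type": "FeatureCollection", "features": [')
        for index, feature in enumerate(features):
            if index:
                f.write(",")  # separator before every feature except the first
            json.dump(feature, f)
        f.write("]}")
```

Because both functions consume their input lazily, only one row is held in memory at a time, and the result can be checked afterwards by loading it with json.load.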

Performance Optimization Checklist

Reading and writing GeoJSON files efficiently requires more than just swapping libraries. Apply these production-tested practices to stabilize your geospatial pipelines:

  • Drop Unnecessary Columns: Filter attributes before serialization. GeoJSON embeds every column in each feature, so redundant text fields bloat file size and parsing time.
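  • Simplify Geometries: Assign the result of gdf.simplify(tolerance=0.001) back to the geometry column before export to reduce coordinate density. The tolerance is expressed in CRS units, so in EPSG:4326 a value of 0.001 is about 111 m at the equator; choose it to match the most detailed zoom level you serve.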
  • Validate JSON Structure Early: Run a quick schema check with jsonschema or pydantic to catch malformed coordinates before they trigger silent failures in downstream consumers.
  • Prefer Binary Formats for Internal Storage: If your workflow involves repeated transformations, convert GeoJSON to Parquet or GeoParquet after ingestion. Reserve GeoJSON strictly for web delivery and human-readable exchange.
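
The first checklist item can be applied to raw feature dictionaries as well as to GeoDataFrames. A minimal sketch, assuming features are plain dicts; `prune_properties` is a hypothetical helper and the sample feature is made up for illustration. With a GeoDataFrame the equivalent is selecting the columns to keep, for example gdf[["name", "geometry"]].

```python
import json


def prune_properties(feature: dict, keep: set) -> dict:
    """Return a copy of a Feature dict with only the listed property keys."""
    properties = feature.get("properties") or {}
    slim = {key: value for key, value in properties.items() if key in keep}
    return {**feature, "properties": slim}


# Illustrative feature with one verbose attribute.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [13.4, 52.5]},
    "properties": {"name": "Depot", "notes": "free text " * 20},
}

slim = prune_properties(feature, keep={"name"})
print(len(json.dumps(feature)), "->", len(json.dumps(slim)))
```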

By aligning your I/O routines with dataset scale and memory constraints, you can maintain responsive Python applications while preserving the interoperability that makes GeoJSON indispensable in modern spatial architecture.