Batch Geocoding Addresses with Geopy and OpenStreetMap

Converting human-readable street addresses into machine-readable geographic coordinates is a foundational operation in spatial computing. When working with dozens of locations, manual lookup is manageable. When processing hundreds or thousands of records, automation becomes mandatory. Batch geocoding solves this by programmatically querying a geospatial service, parsing the response, and appending latitude and longitude values to your dataset. This guide demonstrates how to implement a reliable batch geocoding workflow in Python using geopy and OpenStreetMap’s Nominatim API.

As part of broader Spatial Data Processing & Analysis pipelines, automated geocoding bridges the gap between tabular business records and geographic visualization. The process relies on matching address components against a spatial reference database, returning coordinates in a standard decimal degree format suitable for mapping and spatial joins.

Understanding the Geocoding Workflow

Geocoding translates textual addresses into spatial coordinates. OpenStreetMap provides a free, community-maintained search engine called Nominatim. Because the public instance runs on donated infrastructure and is shared by many users, it enforces a strict usage policy. You must identify your application with a custom user_agent, stay at or below one request per second, and avoid bulk or high-frequency scraping. Respecting these constraints keeps your script from being throttled or blocked. For a comprehensive breakdown of coordinate transformation techniques and API selection criteria, refer to our guide on Geocoding and Reverse Geocoding.

Before executing any code, prepare your environment. Install the required Python packages:

pip install geopy pandas

Step-by-Step Implementation

The batch process follows a predictable sequence: load structured data, initialize the geocoder with proper headers, iterate through each address while enforcing rate limits, handle missing or malformed responses, and export the enriched dataset.

flowchart TD
    A["Load CSV<br/>into DataFrame"] --> B["Init Nominatim<br/>(user_agent)"]
    B --> C{"More addresses?"}
    C -->|no| H["Export enriched CSV"]
    C -->|yes| D["geocode(address)"]
    D --> E{"Match found?"}
    E -->|yes| F["Append lat/lon + Success"]
    E -->|no / error| G["Append None + status"]
    F --> I["sleep(1) rate limit"]
    G --> I
    I --> C

1. Prepare Your Input Data

Store your addresses in a CSV file with a dedicated column for location strings. Clean formatting significantly improves match accuracy. Remove unnecessary punctuation, standardize abbreviations (e.g., St. to Street), and ensure each row contains a complete address. Pandas handles CSV ingestion efficiently, but data hygiene upfront prevents silent failures during the API query phase.
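
The cleanup described above can be sketched as a small helper. This is a minimal illustration, not a complete address parser: the abbreviation map and the function name normalize_address are hypothetical, and the rule of only expanding an abbreviation when it ends an address component (so that "St. Louis" is left alone) is an assumption that may not suit every dataset.

```python
import re

# Hypothetical abbreviation map for illustration; extend it for your own data.
ABBREVIATIONS = {"st": "Street", "ave": "Avenue", "rd": "Road", "blvd": "Boulevard"}

# Match an abbreviation only when it ends a component, i.e. it is followed
# by a comma or the end of the string. This avoids rewriting "St. Louis".
_SUFFIX = re.compile(
    r"\b(" + "|".join(ABBREVIATIONS) + r")\.?(?=\s*(?:,|$))",
    re.IGNORECASE,
)


def normalize_address(raw: str) -> str:
    """Collapse whitespace, expand trailing abbreviations, and tidy commas."""
    text = re.sub(r"\s+", " ", raw).strip()
    text = _SUFFIX.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)
    text = re.sub(r"\s+,", ",", text)      # no space before a comma
    text = re.sub(r",(?=\S)", ", ", text)  # exactly one space after a comma
    return text.strip(" ,")
```

With pandas, the helper could be applied to the address column before geocoding, for example df["address"] = df["address"].astype(str).map(normalize_address), where "address" stands for whatever your column is called.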

2. Configure the Nominatim Geocoder

Initialize geopy's Nominatim class with a descriptive user_agent string that names your application and, ideally, includes a contact email. This satisfies OpenStreetMap’s transparency requirements and helps maintainers identify legitimate usage patterns. The user_agent is passed directly to the Nominatim constructor, and geopy automatically handles HTTP request formatting, URL encoding, and JSON parsing behind the scenes. Detailed configuration options are documented in the official Geopy documentation.
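
As a rough sketch of this configuration step, the helper below assembles an identifying string and passes it to geopy. The helper names, the "name/version (email)" layout, and the 10-second timeout are illustrative assumptions rather than requirements of geopy or Nominatim; the only geopy calls relied on are the Nominatim constructor's user_agent and timeout keyword arguments.

```python
def build_user_agent(app_name: str, version: str, contact_email: str) -> str:
    """Return an identifying string such as 'my_gis_app/1.0 (ops@example.com)'."""
    return f"{app_name}/{version} ({contact_email})"


def make_geolocator(user_agent: str, timeout: float = 10):
    """Create a Nominatim geocoder configured with the given user agent.

    geopy is imported lazily so build_user_agent stays dependency-free.
    The timeout value is an assumption; tune it to your network conditions.
    """
    from geopy.geocoders import Nominatim

    return Nominatim(user_agent=user_agent, timeout=timeout)
```

A call such as make_geolocator(build_user_agent("my_gis_app", "1.0", "ops@example.com")) would then return a geocoder ready for the batch loop; the application name and email here are placeholders.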

3. Build the Batch Function

Create a loop that queries each address sequentially. Apply a one-second delay between requests to comply with API rate limiting guidelines. Extract the latitude and longitude safely, accounting for cases where the service returns no match. Using try...except blocks prevents the entire script from crashing on a single malformed address, while time.sleep() enforces the required request interval. Always review the Nominatim Usage Policy before deploying scripts at scale.
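
One way to keep the delay and the error handling out of the main loop is to wrap the lookup function. The sketch below is a generic illustration and not geopy's own mechanism: throttled and safe_lookup are hypothetical names, and geocode is assumed to be any callable (such as a geolocator's geocode method) that returns either None or an object with latitude and longitude attributes. Unlike an unconditional time.sleep(1), the wrapper only waits for whatever part of the interval has not already elapsed.

```python
import time
from typing import Any, Callable


def throttled(
    func: Callable[..., Any],
    min_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap func so consecutive calls start at least min_interval seconds apart."""
    last_call = None

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        return func(*args, **kwargs)

    return wrapper


def safe_lookup(geocode: Callable[[str], Any], address: str):
    """Return (latitude, longitude, status) without letting one failure stop the batch."""
    try:
        location = geocode(address)
    except Exception as exc:  # broad for the sketch; narrow to geopy.exc errors in real use
        return None, None, f"Error: {exc}"
    if location is None:
        return None, None, "No Match"
    return location.latitude, location.longitude, "Success"
```

In a real script the two would be combined as safe_lookup(throttled(geolocator.geocode), address). geopy also ships a RateLimiter helper in geopy.extra.rate_limiter that serves a similar purpose and may be preferable to a hand-written wrapper.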

Complete Runnable Script

The following code demonstrates a complete batch geocoding routine. It reads a CSV, processes each address, handles errors gracefully, and saves the results with new coordinate columns.

import pandas as pd
import time
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut, GeocoderServiceError

def batch_geocode(input_csv: str, output_csv: str, address_column: str = "address"):
    """
    Reads a CSV, geocodes each address using Nominatim, and exports the enriched dataset.
    """
    # Load data into a pandas DataFrame (a two-dimensional tabular data structure)
    df = pd.read_csv(input_csv)

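    # Initialize the Nominatim geocoder with a custom user_agent string.
    # Replace the placeholder with your actual application name and contact email.
    # The explicit timeout gives slow responses more time than geopy's short default.
    geolocator = Nominatim(user_agent="my_gis_app_v1 (you@example.com)", timeout=10)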

    latitudes = []
    longitudes = []
    statuses = []

    # Iterate through each value in the address column
    for raw_address in df[address_column]:
        # Skip empty or null values before converting to text
        # (str() would otherwise turn a missing value into the string "nan")
        if pd.isna(raw_address) or not str(raw_address).strip():
            latitudes.append(None)
            longitudes.append(None)
            statuses.append("Empty Address")
            continue

        address = str(raw_address).strip()

        try:
            # Query the geocoding service
            location = geolocator.geocode(address)

            if location:
                latitudes.append(location.latitude)
                longitudes.append(location.longitude)
                statuses.append("Success")
            else:
                latitudes.append(None)
                longitudes.append(None)
                statuses.append("No Match")

        except (GeocoderTimedOut, GeocoderServiceError) as e:
            # Catch network or API errors without halting execution
            latitudes.append(None)
            longitudes.append(None)
            statuses.append(f"Error: {e}")

        # Enforce rate limit: 1 request per second
        time.sleep(1)

    # Append results as new columns
    df["latitude"] = latitudes
    df["longitude"] = longitudes
    df["geocode_status"] = statuses

    # Export the enriched dataset
    df.to_csv(output_csv, index=False)
    print(f"Processed {len(df)} records. Results saved to {output_csv}")

# Example usage:
# batch_geocode("input_addresses.csv", "output_geocoded.csv", address_column="full_address")

Key Implementation Notes

  • Rate Limiting: The time.sleep(1) call keeps the script within the public Nominatim limit of one request per second. Removing it risks HTTP 429 (Too Many Requests) responses or a temporary block.
  • Error Handling: The script catches GeocoderTimedOut and GeocoderServiceError, ensuring transient network issues don’t terminate the batch job.
  • Output Validation: The geocode_status column allows you to filter successful matches from failures before loading coordinates into GIS software like QGIS or ArcGIS.
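
As a rough illustration of the output validation point, the sketch below splits result rows into matched and failed groups. It assumes the column names written by the script (latitude, longitude, geocode_status); the function name split_results and the sample rows are made up for the example, and the coordinates are placeholder values rather than real geocoder output. The range check on latitude and longitude is an extra safeguard that the script itself does not perform.

```python
import csv
import io


def split_results(rows):
    """Partition result rows into (matched, failed) using status and coordinate range."""
    matched, failed = [], []
    for row in rows:
        try:
            lat, lon = float(row["latitude"]), float(row["longitude"])
            ok = (
                row["geocode_status"] == "Success"
                and -90 <= lat <= 90
                and -180 <= lon <= 180
            )
        except (TypeError, ValueError):
            # Missing or non-numeric coordinates count as failures
            ok = False
        (matched if ok else failed).append(row)
    return matched, failed


# Illustrative sample standing in for the exported CSV (values are placeholders)
sample = io.StringIO(
    "address,latitude,longitude,geocode_status\n"
    "1 Example Road,51.5,-0.12,Success\n"
    "Nowhere Lane,,,No Match\n"
)
matched, failed = split_results(csv.DictReader(sample))
```

If the results are already in a DataFrame, the status filter alone can be written as df[df["geocode_status"] == "Success"].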

For larger datasets exceeding 10,000 records, consider migrating to a paid geocoding provider or hosting a self-managed Nominatim instance to bypass public API restrictions.
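
To see why dataset size matters at one request per second, a back-of-the-envelope estimate helps. The sketch below is only a lower bound: it ignores network latency and retries, and count_requests assumes you look up each distinct address once, which the script above does not do by itself. Both function names are hypothetical.

```python
from datetime import timedelta


def count_requests(addresses):
    """Count distinct non-empty addresses, assuming duplicates are looked up only once."""
    return len({a.strip() for a in addresses if a and a.strip()})


def estimated_runtime(n_requests, seconds_per_request=1.0):
    """Lower-bound wall-clock time for a rate-limited batch."""
    return timedelta(seconds=n_requests * seconds_per_request)


# 10,000 requests at one per second take at least 2 hours 46 minutes 40 seconds
print(estimated_runtime(10_000))  # 2:46:40
```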