Optimizing Large Spatial Queries with PyGeos
Processing millions of geographic features frequently bottlenecks traditional Python GIS workflows. When spatial operations like point-in-polygon tests, spatial joins, or proximity searches scale beyond tens of thousands of records, standard iterative approaches become computationally prohibitive. PyGeos addresses this challenge by bridging Python with the GEOS C library through fully vectorized array operations. By combining spatial indexing with batch processing, you can reduce query execution times from hours to seconds. This guide walks through the exact steps required to accelerate large-scale spatial queries while maintaining predictable memory usage. Effective Spatial Data Processing & Analysis relies heavily on understanding how underlying geometry engines handle coordinate math and topological relationships. Traditional GIS libraries often wrap C-level functions in Python loops, forcing the interpreter to repeatedly marshal and unmarshal geometry objects. PyGeos eliminates this overhead by storing geometries in contiguous NumPy arrays and applying GEOS predicates across entire batches simultaneously.
The Architecture of Vectorized Spatial Indexing
Before implementing code, it is important to understand why raw iteration fails at scale. Every time Python calls a geometry method inside a for loop, it triggers a context switch between the Python runtime and the compiled GEOS backend. This overhead compounds exponentially as dataset size grows. PyGeos solves this problem through two complementary mechanisms: array-based vectorization and bounding-box filtering. The library first evaluates coarse spatial relationships using minimum bounding rectangles, which are computationally cheap. Only candidate pairs that pass this initial filter undergo precise topological evaluation. This two-phase architecture is the foundation of modern Spatial Indexing for Performance.
flowchart LR
A["Geometry arrays<br/>(NumPy)"] --> B["Build STRtree"]
B --> C["query_bulk<br/>bounding-box candidates"]
C --> D["Vectorized predicate<br/>pygeos.contains()"]
D --> E["Exact matches"]
The primary indexing structure used is the STRtree (Sort-Tile-Recursive tree). Unlike dynamic R-trees that rebalance on every insert, the STRtree is optimized for static datasets. It packs geometries into a balanced hierarchy during construction, enabling rapid bulk queries with minimal memory fragmentation. By leveraging NumPy’s contiguous memory model, PyGeos pushes the heavy computational lifting directly to the GEOS Geometry Engine, bypassing Python’s interpreter overhead entirely.
Step 1: Environment Setup and Data Preparation
Install the required packages. PyGeos depends on NumPy for array handling and GEOS for geometric computations.
pip install pygeos numpy
Load your spatial data into NumPy-compatible geometry arrays. PyGeos does not operate directly on GeoDataFrame rows. Instead, extract coordinate arrays and convert them to PyGeos geometry objects. The following example generates synthetic data to demonstrate the workflow, but the same pattern applies to real-world GeoJSON or Shapefile imports.
import numpy as np
import pygeos
# Generate 1,000,000 query points
x = np.random.uniform(0, 10000, 1_000_000)
y = np.random.uniform(0, 10000, 1_000_000)
points = pygeos.points(x, y)
# Generate 50,000 rectangular polygons (bounding boxes)
xmin = np.random.uniform(0, 9000, 50_000)
ymin = np.random.uniform(0, 9000, 50_000)
xmax = xmin + 1000
ymax = ymin + 1000
polygons = pygeos.box(xmin, ymin, xmax, ymax)
Step 2: Constructing the STRtree Index
The STRtree is built once and queried repeatedly. It requires an array of geometries and returns an index object optimized for bulk operations. Construction time is linear relative to dataset size, but query time drops to logarithmic complexity.
# Build the spatial index on the polygon array
tree = pygeos.STRtree(polygons)
# Query the index using the points array
# Returns a 2xN array: [point_indices, polygon_indices]
candidates = tree.query_bulk(points)
The query_bulk method performs the initial bounding-box intersection. It returns a two-dimensional array where the first row contains indices pointing to the input points array, and the second row contains indices pointing to the indexed polygons array. This candidate list is intentionally oversized to guarantee zero false negatives, but it will contain false positives that require precise filtering.
Step 3: Executing Batched Spatial Predicates
Index queries return candidates, not exact matches. You must apply precise geometric predicates (e.g., contains, intersects) to filter false positives. PyGeos applies these across the entire candidate array without Python loops.
# Extract the corresponding geometries for vectorized evaluation
# Row 0 indexes the query points; row 1 indexes the tree polygons
candidate_points = points[candidates[0]]
candidate_polys = polygons[candidates[1]]
# Apply the precise 'contains' predicate in one vectorized call
exact_matches = pygeos.contains(candidate_polys, candidate_points)
# Filter indices to keep only true spatial relationships
valid_point_idx = candidates[0][exact_matches]
valid_poly_idx = candidates[1][exact_matches]
print(f"Found {len(valid_point_idx)} exact spatial matches.")
This step demonstrates the core performance advantage. Instead of calling .contains() 50 million times in a loop, PyGeos evaluates the entire candidate matrix in a single C-level operation. The boolean mask exact_matches is applied directly to the index arrays, preserving the original positional mapping for downstream joins or aggregations.
Step 4: Memory Management and Production Scaling
Vectorization trades CPU cycles for RAM. Loading 10 million geometries into memory simultaneously can exhaust system resources. Implement chunking to process data in predictable blocks.
CHUNK_SIZE = 500_000
valid_matches = []
for i in range(0, len(points), CHUNK_SIZE):
chunk = points[i:i + CHUNK_SIZE]
chunk_candidates = tree.query_bulk(chunk)
# Row 0 indexes the chunk's points; shift it back to the global scope
chunk_candidates[0] += i
chunk_pts = points[chunk_candidates[0]]
chunk_polys = polygons[chunk_candidates[1]]
mask = pygeos.contains(chunk_polys, chunk_pts)
valid_matches.append(chunk_candidates[:, mask])
# Concatenate results across all chunks
final_results = np.hstack(valid_matches)
Chunking prevents MemoryError exceptions by keeping active arrays within CPU cache boundaries. After processing each block, Python’s garbage collector automatically reclaims the temporary arrays. For production pipelines, consider using mmap for read-only geometry arrays or migrating to Shapely 2.0+, which natively integrates PyGeos’ vectorized backend while maintaining broader ecosystem compatibility.
Conclusion
Optimizing large spatial queries requires shifting from iterative Python logic to compiled, batch-oriented workflows. PyGeos achieves this by storing geometries in contiguous arrays, leveraging the STRtree for rapid candidate filtering, and applying topological predicates in bulk. By combining index construction, vectorized filtering, and strategic chunking, you can transform multi-hour spatial operations into sub-second tasks. As datasets continue to grow, adopting array-native geometry processing remains the most reliable path to scalable, production-ready GIS applications.