Working with Zarr and Cloud Optimized GeoTIFFs in Python
Modern geospatial workflows have largely abandoned the practice of downloading multi-gigabyte raster files to local disks. Instead, analysts now stream data directly from cloud object storage, dramatically reducing latency, storage costs, and infrastructure overhead. This paradigm shift relies heavily on two complementary standards: Cloud Optimized GeoTIFFs (COGs) and Zarr. While both formats are engineered to minimize network traffic, they optimize for fundamentally different computational patterns. Learning how to read, convert, and compute across these formats is a core competency for anyone practicing scalable Remote Sensing & Raster Analysis.
Core Concepts: COG vs. Zarr
A Cloud Optimized GeoTIFF is a standard TIFF file that has been internally reorganized. It uses spatial tiling and a pyramid of lower-resolution overviews to structure pixel data. This layout enables clients to issue HTTP range requests, fetching only the exact byte ranges required for a specific bounding box or zoom level. You can render or subset a massive satellite image without ever transferring the full file to your machine. For a deeper technical breakdown of the specification, consult the official Cloud Optimized GeoTIFF Specification.
Zarr takes a fundamentally different approach. Instead of a single binary container, it stores N-dimensional arrays as a collection of compressed, independent chunk files organized in a directory hierarchy. Because each chunk is a discrete object, cloud storage systems can serve them in parallel, making Zarr exceptionally well-suited for distributed array mathematics and large-scale data lakes. Choosing between tile-based streaming (COG) and chunk-parallel computation (Zarr) is a foundational architectural decision covered in depth under Cloud-Native Geospatial Formats, but implementing either requires a solid Python foundation.
flowchart LR
A["Read COG<br/>(lazy, rioxarray)"] --> B["Rechunk &<br/>write to Zarr"]
B --> C["Parallel raster algebra<br/>(Dask, lazy)"]
C --> D["Query subset<br/>(.sel) & compute"]
Environment Setup
To bridge raster I/O, chunked arrays, and cloud storage protocols, you will need a specific stack of open-source libraries. Install them in an isolated virtual environment:
pip install xarray rioxarray zarr rasterio fsspec dask
rioxarray extends the popular xarray library with geospatial coordinate handling and raster drivers. zarr provides the chunked storage backend, while fsspec abstracts away the differences between local filesystems and cloud providers like AWS S3 or Google Cloud Storage. dask handles lazy evaluation and parallel task scheduling behind the scenes, allowing computations to scale across multiple CPU cores or distributed clusters.
Step 1: Reading a COG with Lazy Loading
The first step in any cloud-native workflow is streaming remote data without triggering a full download. The rioxarray.open_rasterio() function automatically translates spatial requests into HTTP range queries. By specifying chunks="auto", you instruct the library to wrap the data in Dask arrays, enabling lazy evaluation. This means Python reads only the metadata and tile structure initially; actual pixel values are fetched from the network only when you explicitly request them or trigger a computation.
import rioxarray
import xarray as xr
# Public Landsat 8 Band 4 COG (Red band)
cog_url = (
"https://storage.googleapis.com/gcp-public-data-landsat/"
"LC08/01/015/032/LC08_L1TP_015032_20230515_20230525_02_T1/"
"LC08_L1TP_015032_20230515_20230525_02_T1_B4.TIF"
)
# Open with automatic chunking for lazy evaluation.
# Name the DataArray so it can later be written to a Zarr store
# (an unnamed DataArray cannot be serialized with to_zarr()).
ds = rioxarray.open_rasterio(cog_url, chunks="auto").rename("red")
print(ds)
The output displays an xarray.DataArray backed by Dask. Dimensions correspond to spatial coordinates (x, y), and the data remains virtual until you call .compute() or .values. At this stage, Python has consumed negligible memory and bandwidth.
Step 2: Converting a COG to Zarr
While COGs are ideal for visualization and quick subsetting, converting them to Zarr unlocks high-performance parallel analytics. The conversion process involves reading the COG chunks and writing them to a Zarr store, typically on cloud storage or a high-throughput local drive. You must explicitly define a chunk size that aligns with your analytical workload to avoid inefficient partial reads.
import zarr
# Rechunk to optimize for array operations (e.g., 512x512 tiles)
ds_chunked = ds.chunk({"band": 1, "y": 512, "x": 512})
# Write to a local Zarr directory
zarr_path = "./output_landsat_red.zarr"
ds_chunked.to_zarr(zarr_path, mode="w")
print(f"Successfully converted COG to Zarr at {zarr_path}")
During this step, xarray serializes the array into compressed chunks. The mode="w" flag ensures a fresh store is created. For production pipelines, you would typically point zarr_path to a cloud URI like s3://my-bucket/data.zarr, leveraging fsspec to handle authentication transparently.
Step 3: Parallel Processing & Raster Algebra
Once your data resides in Zarr, you can leverage Dask’s distributed computing engine to perform heavy raster algebra without loading the entire dataset into RAM. Operations like normalization, index calculation, or masking are applied lazily across chunks. The official xarray Documentation provides extensive guidance on integrating Dask with geospatial workflows.
# Load the Zarr store
zarr_ds = xr.open_zarr(zarr_path)
# Example: Normalize reflectance values (Landsat 8 L1TP Digital Numbers)
# Scale factor is typically 0.0001 for surface reflectance, but L1TP uses raw DN
# We'll apply a simple scaling for demonstration
normalized = (zarr_ds - 10000) / 10000.0
# Trigger computation and save result
result_path = "./normalized_landsat.zarr"
normalized.to_zarr(result_path, mode="w")
Because the computation is lazy, Dask builds a task graph first. When .to_zarr() executes, it schedules chunk-level operations in parallel, maximizing CPU and I/O throughput while keeping memory footprint predictable.
Step 4: Querying & Subsetting Zarr Data
Zarr stores preserve geospatial metadata, allowing you to query them using standard coordinate indexing. This is particularly useful for region-of-interest analysis or extracting time-series points without scanning irrelevant pixels.
# Re-open the normalized dataset
processed_ds = xr.open_zarr(result_path)
# Spatial subset using bounding box coordinates
subset = processed_ds.sel(
x=slice(400000, 450000),
y=slice(4800000, 4750000)
)
# Compute only the requested region
local_data = subset.compute()
print(local_data)
The .sel() method translates coordinate bounds into chunk indices, ensuring only the necessary files are fetched from storage. This selective I/O is a primary advantage of chunked array formats over traditional monolithic rasters.
Production Best Practices
Deploying these formats at scale requires attention to storage configuration and chunk alignment. Always match your Zarr chunk size to your most common analytical window size to avoid partial chunk reads, which degrade performance. Use compression codecs like zstd or blosc to reduce storage costs without sacrificing read speed. When working with cloud providers, configure fsspec with appropriate caching layers to prevent redundant API calls. Finally, validate coordinate reference systems (CRS) early in your pipeline to prevent misalignment during spatial joins or raster stacking.
Conclusion
Migrating from local raster files to cloud-native formats fundamentally changes how geospatial data is stored, accessed, and processed. COGs provide seamless, tile-based streaming for visualization and quick subsetting, while Zarr delivers chunk-level parallelism for heavy analytical workloads. By combining rioxarray, xarray, and dask, Python developers can build efficient, scalable pipelines that handle petabytes of earth observation data. Mastering these tools positions you at the forefront of modern spatial engineering and cloud-native data architecture.