Mapping high-voltage transmission lines from OpenStreetMap

ValueError: cannot convert float NaN to integer is the error most analysts hit when extracting high-voltage (HV) corridors from OpenStreetMap. It breaks the attribute-filtering stage, the step that should isolate power=line features tagged at 110 kV or above before anything is projected or buffered. The same extract often fails two more ways downstream: a MemoryError when a regional .osm.pbf is read whole, and a CRSError (or silent metric distortion) when geometries are buffered in EPSG:4326 or naively pushed to EPSG:3857. All three are ingestion defects, and all three poison the asset layer that the transmission line and substation mapping workflow hands to every downstream calculation. This page resolves each failure with a parser, a projection guard, and a memory-bounded loader, then gates the result with an audit function suitable for a CI/CD pipeline.

Root-cause analysis

The breakdown is not a single bug — it is three compounding causes that each surface at a different stage of the extract:

Tag fragmentation. OSM contributors encode voltage as voltage, voltage:primary, or voltage:secondary, and the value itself is unstable: a bare 110000, a unit-suffixed 110 kV, or a semicolon-delimited multi-circuit string such as 110000;380000. A naive .astype(int) throws on the suffix, the delimiter, and the missing values — the ValueError above.
CRS drift. OSM data arrives in geographic EPSG:4326 (WGS84), where the unit is the degree. Measuring a setback or buffering a right-of-way in that frame is wrong by a latitude-dependent factor, and forcing the data into Web Mercator (EPSG:3857) trades one distortion for another: its linear scale grows as 1/cos(latitude), so distances and buffers measured in its units are inflated at mid-to-high latitudes.
Memory overhead. geopandas.read_file() loads every geometry into RAM at once. Dense, high-vertex transmission corridors exhaust the heap during GEOS topology validation, triggering the MemoryError long before any analysis runs.
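
To put a number on the CRS drift described above, the following is a minimal sketch that assumes the spherical Web Mercator scale factor 1/cos(latitude) and a round figure of 111,320 m per degree of longitude at the equator. The function names are illustrative and are not part of any library.

```python
import math


def web_mercator_scale(lat_deg: float) -> float:
    """Linear scale factor of spherical Web Mercator at a latitude (1.0 at the equator)."""
    return 1.0 / math.cos(math.radians(lat_deg))


def ground_distance_m(map_units: float, lat_deg: float) -> float:
    """Approximate ground distance covered by a length measured in EPSG:3857 units."""
    return map_units / web_mercator_scale(lat_deg)


def degree_of_longitude_m(lat_deg: float, equator_m: float = 111_320.0) -> float:
    """Approximate ground length of one degree of longitude (equator_m is a rounded assumption)."""
    return equator_m * math.cos(math.radians(lat_deg))


if __name__ == "__main__":
    # How much ground a nominal "500 unit" buffer really covers at several latitudes
    for lat in (0, 35, 50, 60):
        print(lat, round(ground_distance_m(500, lat), 1), round(degree_of_longitude_m(lat)))
```

At 60° latitude the scale factor is 2, so a buffer of 500 Web Mercator units reaches only about 250 m on the ground. That is why the fix further down projects to a local UTM zone before measuring.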

The failing pipeline below reproduces the first cause from a small simulated extract: the .astype(int) cast aborts on the None and the 110;380000 string. Even if the cast survived, the layer would still be in EPSG:4326, so any buffer applied to it directly (or after a naive move to EPSG:3857) would be metrically wrong.

python

import geopandas as gpd
from shapely.geometry import LineString

# Simulated raw OSM extract — bare value, multi-circuit string, and a missing tag
transmission_gdf = gpd.GeoDataFrame({
    "power": ["line", "line", "line"],
    "voltage": ["110000", "110;380000", None],
    "geometry": [LineString([(0, 0), (1, 1)]),
                 LineString([(1, 1), (2, 2)]),
                 LineString([(2, 2), (3, 3)])],
}, crs="EPSG:4326")

# Fails: the ';'-delimited string and the missing value both reach .astype(int)
transmission_gdf["voltage_int"] = transmission_gdf["voltage"].str.replace("kV", "").astype(int)
hv_lines = transmission_gdf[transmission_gdf["voltage_int"] >= 110000]

Pre-flight validation

Before touching the main extract, profile the voltage column so the root cause surfaces as a report rather than a traceback. This compact check counts how many records carry suffixes, multi-circuit delimiters, or missing values, and confirms the source CRS is the expected geographic frame — letting a CI job fail fast with a readable reason.

python

import pandas as pd

def preflight_osm_voltage(transmission_gdf: gpd.GeoDataFrame) -> dict:
    """Surface tag fragmentation and CRS drift before the main extract runs."""
    raw = transmission_gdf["voltage"].astype("string")
    report = {
        "total_features": len(transmission_gdf),
        "missing_voltage": int(raw.isna().sum()),
        "multi_circuit": int(raw.str.contains(";", na=False).sum()),
        "unit_suffixed": int(raw.str.contains(r"[a-zA-Z]", na=False).sum()),
        "source_crs": str(transmission_gdf.crs),
        "is_geographic": bool(transmission_gdf.crs and transmission_gdf.crs.is_geographic),
    }
    # A non-zero value in any of the three tag counts means a bare .astype(int) will throw
    if report["missing_voltage"] or report["multi_circuit"] or report["unit_suffixed"]:
        report["recommended_action"] = "Route through parse_voltage_max before filtering."
    return report
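
One way a CI job might turn that report into a readable pass/fail decision is sketched below. The split between blocking errors and warnings is an assumption made for illustration (the source does not define one), and `triage_preflight` is a hypothetical helper that only reads the keys produced by preflight_osm_voltage.

```python
from typing import Dict, List, Tuple


def triage_preflight(report: Dict[str, object]) -> Tuple[List[str], List[str]]:
    """Split a pre-flight report into blocking errors and non-blocking warnings.

    Assumed policy: an empty extract or a non-geographic source CRS blocks the
    build; dirty voltage tags only warn, because the parser can absorb them.
    """
    errors: List[str] = []
    warnings: List[str] = []

    if not report.get("total_features"):
        errors.append("extract contains no features")
    if not report.get("is_geographic", False):
        errors.append(f"unexpected source CRS: {report.get('source_crs')}")

    for key in ("missing_voltage", "multi_circuit", "unit_suffixed"):
        count = report.get(key, 0)
        if count:
            warnings.append(f"{key}: {count} record(s) need parse_voltage_max")

    return errors, warnings


if __name__ == "__main__":
    sample = {
        "total_features": 3,
        "missing_voltage": 1,
        "multi_circuit": 1,
        "unit_suffixed": 0,
        "source_crs": "EPSG:4326",
        "is_geographic": True,
    }
    errors, warnings = triage_preflight(sample)
    for line in warnings:
        print("WARN", line)
    # A non-empty error list is the signal for the CI job to stop
    raise SystemExit(1 if errors else 0)
```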

Fix implementation

The corrected extract replaces the brittle cast with a deterministic parser, then enforces a projected metric CRS before any buffer is computed. Each parameter choice is justified for grid GIS work rather than left to a library default.

Voltage normalization with fallback routing

The parser splits multi-circuit strings, reads the numeric value of each circuit, and keeps the maximum voltage per feature, the value that determines whether a corridor is bulk transmission. Missing tags are not dropped (that loses live assets); they are filled with a conservative 110 kV fallback and flagged for manual review, the same quarantine-not-discard discipline used in network attribute validation. Because the fallback equals the 110 kV filter threshold, every flagged feature stays in the HV layer until it has been reviewed.

python

import re
import numpy as np

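def parse_voltage_max(series: pd.Series) -> pd.Series:
    """Extract maximum voltage (volts) from OSM strings, tolerating kV/V suffixes and ';'."""
    # One circuit per token: a number, optionally followed by a V or kV unit
    token = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kv|v)?\s*$", re.IGNORECASE)

    def _resolve_max(val) -> float:
        if pd.isna(val):
            return np.nan
        volts = []
        for part in str(val).split(";"):
            match = token.match(part)
            if match:
                # Scale kV-suffixed values to volts; bare numbers are already volts
                scale = 1000.0 if (match.group(2) or "").lower() == "kv" else 1.0
                volts.append(float(match.group(1)) * scale)
        # Unparseable strings fall through to NaN so they are flagged, not guessed
        return max(volts) if volts else np.nan

    return series.apply(_resolve_max).astype(float)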

# Normalize, then route missing values to a flagged 110 kV fallback
transmission_gdf["voltage_v"] = parse_voltage_max(transmission_gdf["voltage"])
transmission_gdf["voltage_final"] = transmission_gdf["voltage_v"].fillna(110_000)
transmission_gdf["audit_flag"] = transmission_gdf["voltage_v"].isna()

For the authoritative tag semantics behind this parser, align the logic with the OpenStreetMap Key:voltage reference; the open-data sourcing context lives in open energy data portals.
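
The parser reads a single voltage column, while the root-cause analysis notes that contributors also use voltage:primary and voltage:secondary. A minimal sketch of one way to collapse those keys into one raw string before parsing follows. The priority order and the helper name `coalesce_voltage_tag` are assumptions for illustration, not OSM guidance.

```python
from typing import Mapping, Optional

# Assumed priority: the plain key first, then the primary and secondary variants
VOLTAGE_KEYS = ("voltage", "voltage:primary", "voltage:secondary")


def coalesce_voltage_tag(tags: Mapping[str, object]) -> Optional[str]:
    """Return the first non-empty voltage string found among the known keys."""
    for key in VOLTAGE_KEYS:
        value = tags.get(key)
        if value is None:
            continue
        if isinstance(value, float) and value != value:  # NaN from a tabular import
            continue
        text = str(value).strip()
        if text:
            return text
    return None  # left missing so the fallback routing can flag it


if __name__ == "__main__":
    print(coalesce_voltage_tag({"power": "line", "voltage": "110000;380000"}))
    print(coalesce_voltage_tag({"power": "line", "voltage:primary": " 220 kV "}))
    print(coalesce_voltage_tag({"power": "line"}))
```

The result can be written to the voltage column and passed through parse_voltage_max unchanged. A feature with no usable key stays missing and is picked up by the audit flag.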

CRS enforcement and metric buffer validation

A buffer is only meaningful in a projected frame whose unit is the meter. Rather than hard-code a zone, derive the locally correct UTM CRS from the data extent with estimate_utm_crs, repair any self-intersections the projection exposes, and assert the result is projected before measuring. This produces the metric geometry that downstream proximity distance calculations and grid capacity buffer analysis depend on.

python

def enforce_metric_crs(transmission_gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Project a geographic extract to its local UTM zone and repair invalid topology."""
    if transmission_gdf.crs is None:
        raise ValueError("Input has no defined CRS; assign EPSG:4326 before projecting.")

    if transmission_gdf.crs.is_geographic:
        target_crs = transmission_gdf.estimate_utm_crs(datum_name="WGS 84")
    else:
        target_crs = transmission_gdf.crs

    projected = transmission_gdf.to_crs(target_crs)

    # Repair invalid geometries with make_valid(); buffer(0) would turn a
    # LineString into an empty polygon (GeoSeries.make_valid needs geopandas >= 0.12)
    invalid = ~projected.geometry.is_valid
    projected.loc[invalid, "geometry"] = projected.loc[invalid, "geometry"].make_valid()
    return projected

hv_lines = transmission_gdf[transmission_gdf["voltage_final"] >= 110_000].copy()
hv_proj = enforce_metric_crs(hv_lines)

# Guard first: a buffer distance only means meters in a projected CRS
assert hv_proj.crs.is_projected, "CRS must be projected for metric buffers"
hv_proj["buffer_500m"] = hv_proj.geometry.buffer(500)  # 500 m right-of-way screen, in meters
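
For readers who want to sanity-check the zone that estimate_utm_crs picks, the standard UTM arithmetic can be sketched in a few lines. This is not geopandas's implementation. It assumes WGS 84 (EPSG 326xx north, 327xx south), takes a single representative point such as the extent centroid, and ignores the special zone boundaries around Norway and Svalbard. `utm_epsg_for_point` is a hypothetical helper.

```python
def utm_epsg_for_point(lon_deg: float, lat_deg: float) -> int:
    """Return the WGS 84 / UTM EPSG code for a longitude/latitude in degrees."""
    if not -180.0 <= lon_deg <= 180.0:
        raise ValueError("longitude must be within [-180, 180]")
    if not -80.0 <= lat_deg <= 84.0:
        raise ValueError("latitude is outside the UTM band (80 S to 84 N)")

    # Zones are 6 degrees wide and numbered 1..60 eastward from 180 W
    zone = min(int((lon_deg + 180.0) // 6) + 1, 60)
    base = 32600 if lat_deg >= 0 else 32700
    return base + zone


if __name__ == "__main__":
    print(utm_epsg_for_point(9.0, 50.0))     # central Germany
    print(utm_epsg_for_point(-74.0, 40.7))   # New York area
    print(utm_epsg_for_point(151.2, -33.9))  # Sydney area
```

A corridor that spans more than one zone has no single correct answer. In that case the projection is least distorted near the central meridian of the chosen zone, so long east-west extracts may be better split per zone.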

Fallback routing and performance tuning

Loading a multi-hundred-megabyte .osm.pbf whole is what triggers the MemoryError; stream it instead and reclaim memory between slices. The strategies below keep a regional extract inside a bounded RAM envelope and scale it for CI/CD or out-of-core runs:

Bounding-box streaming. Use fiona’s filter(bbox=...) to pull only features inside the study extent, never the whole continent-sized file.
Explicit garbage collection. Call gc.collect() after each chunk so GEOS-validated geometries are released before the next slice loads.
Spatial index before joins. Build sindex once so corridor-to-substation joins prune toward O(n log n) instead of a pairwise scan.
Out-of-core escalation. When a single host still can’t hold the extract, swap fiona for pyrosm or dask-geopandas to parallelize tile ingestion across workers.
Columnar persistence. Write each validated chunk to GeoParquet — it round-trips the projected CRS losslessly and skips unused attribute columns on re-read.

python

import gc
from itertools import islice

import fiona
from shapely.geometry import box

def load_osm_chunked(filepath: str, bbox: tuple, chunk_size: int = 50_000) -> gpd.GeoDataFrame:
    """Stream an OSM extract via bounding box to prevent RAM saturation."""
    filter_geom = box(*bbox)
    chunks = []

    with fiona.open(filepath, "r") as src:
        crs = src.crs
        # filter(bbox=...) lazily yields only features whose envelope meets the study extent
        features = src.filter(bbox=bbox)
        while True:
            # Pull one slice at a time instead of materializing the whole iterator
            batch = list(islice(features, chunk_size))
            if not batch:
                break
            gdf_chunk = gpd.GeoDataFrame.from_features(batch, crs=crs)
            # Exact intersection test drops features matched only by their envelope
            chunks.append(gdf_chunk[gdf_chunk.geometry.intersects(filter_geom)].copy())
            del batch, gdf_chunk
            gc.collect()  # release the raw feature records before the next slice loads

    if not chunks:
        return gpd.GeoDataFrame(geometry=[], crs=crs)
    return gpd.GeoDataFrame(pd.concat(chunks, ignore_index=True), crs=crs)
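
For the out-of-core escalation, each worker needs its own extent. The sketch below splits a study bounding box into a regular grid of tile bounding boxes that could each be passed to a loader such as load_osm_chunked. The grid scheme and the name `tile_bbox` are illustrative assumptions, not something pyrosm or dask-geopandas prescribes.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def tile_bbox(bbox: BBox, nx: int, ny: int) -> List[BBox]:
    """Split a bounding box into nx * ny equally sized tiles, column by column."""
    if nx < 1 or ny < 1:
        raise ValueError("nx and ny must be positive integers")

    xmin, ymin, xmax, ymax = bbox
    dx = (xmax - xmin) / nx
    dy = (ymax - ymin) / ny

    tiles: List[BBox] = []
    for i in range(nx):
        for j in range(ny):
            x0 = xmin + i * dx
            y0 = ymin + j * dy
            # Snap the last row and column to the outer edge to avoid float gaps
            x1 = xmax if i == nx - 1 else xmin + (i + 1) * dx
            y1 = ymax if j == ny - 1 else ymin + (j + 1) * dy
            tiles.append((x0, y0, x1, y1))
    return tiles


if __name__ == "__main__":
    for tile in tile_bbox((5.0, 47.0, 15.0, 55.0), nx=2, ny=2):
        print(tile)
```

A line that crosses a tile edge is returned by every tile it touches, so the merged result should be de-duplicated on a stable key (for example the OSM way id, if the driver exposes it) before any length or count is reported.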

Downstream validation

Before the extracted layer reaches an interconnection study or environmental screen, gate it with an audit that asserts geometry integrity, records how many voltage fallbacks were applied, and captures the CRS authority and extent. The 98% valid-geometry floor is a hard CI threshold — below it the build fails rather than shipping a corrupt asset layer, the same standard applied across spatial data quality validation.

python

def generate_audit_report(transmission_gdf: gpd.GeoDataFrame, output_path: str) -> dict:
    """Assert spatial integrity and export compliance metadata for a CI/CD gate."""
    is_valid = transmission_gdf.geometry.is_valid  # evaluate GEOS validity once
    valid = int(is_valid.sum())
    report = {
        "total_features": len(transmission_gdf),
        "valid_geometries": valid,
        # Counted at audit time: geometries still invalid after the upstream repair
        "invalid_remaining": int((~is_valid).sum()),
        "voltage_fallbacks": int(transmission_gdf["audit_flag"].sum()),
        "crs_authority": transmission_gdf.crs.to_authority(),
        "spatial_extent": transmission_gdf.total_bounds.tolist(),
    }

    if report["valid_geometries"] < len(transmission_gdf) * 0.98:
        raise RuntimeError("Topology validation failed: >2% invalid geometries detected.")

    pd.DataFrame([report]).to_csv(output_path, index=False)
    return report

audit = generate_audit_report(hv_proj, "hv_line_audit_trail.csv")
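
The 98% floor is simple arithmetic, but the empty-layer case deserves an explicit decision. In the sketch below a layer with zero features fails the gate, which is an assumption on my part (the comparison in generate_audit_report lets an empty layer through). `passes_validity_floor` is a hypothetical helper.

```python
def passes_validity_floor(valid: int, total: int, floor: float = 0.98) -> bool:
    """Return True when the share of valid geometries meets the floor."""
    if total <= 0:
        # Assumption: an empty asset layer is treated as a failed extract
        return False
    return valid / total >= floor


if __name__ == "__main__":
    for valid, total in ((1000, 1000), (980, 1000), (979, 1000), (0, 0)):
        print(valid, total, passes_validity_floor(valid, total))
```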

This audit trail is the lineage a FERC or NERC reviewer needs to re-run the extract and arrive at the same network backbone. By replacing the brittle integer cast with deterministic voltage parsing, projecting to a data-derived UTM zone before any measurement, and streaming the source in memory-bounded chunks, the three failures that break the attribute-filtering stage are eliminated — turning a fragile OpenStreetMap extract into a topologically sound, audit-ready foundation for regional grid proximity studies.

Related

Transmission Line & Substation Mapping — the parent workflow this OpenStreetMap extract feeds.
Network Attribute Validation — schema enforcement for the voltage, operator, and status fields parsed here.
Proximity Distance Calculations — the metric distance queries that consume the projected HV layer.
Coordinate Reference Systems for Energy Projects — UTM-zone selection behind the projection guard.
