Resampling Hourly Solar Data to Monthly Averages Without Drift or OOM Kills

Scenario / symptom: an hourly Global Horizontal Irradiance (GHI) stack resampled with ds["ghi"].resample(time="ME").mean() produces monthly averages that are 5–8% off the values an independent engineer computes from the same source — or the call never returns and the kernel is killed with Killed (signal 9) after RAM exhaustion. Both symptoms land in the temporal reduction stage of Temporal Data Aggregation, where a high-frequency time-series is collapsed to the monthly granularity that yield models and interconnection studies consume. Neither failure raises a clean exception: the numbers are simply wrong, or the process dies, and the cost surfaces at regulatory review or financial close.

High-frequency irradiance stacks are the computational backbone of long-term yield forecasting, but the move from hourly to monthly granularity routinely fractures downstream GIS pipelines. The fix is to normalise the time index to a confirmed timezone, force lazy out-of-core execution, and aggregate in energy units rather than instantaneous power — then gate the result before it can feed a capacity-factor model.

Root-cause analysis

The pipeline fractures at three independent intersection points. Each one passes silently in isolation, and they compound when a multi-year stack is reduced in a single call.

Timezone & calendar boundary drift. pandas and xarray resample to calendar-month boundaries ('MS'/'ME') against whatever time index they are handed. NSRDB, ERA5, and Solcast exports typically store hourly timestamps in UTC. Resampling those UTC timestamps as if they were local calendar months shifts every aggregation window by the local offset (5–8 hours for standard time in the contiguous United States), so the hours on either side of each month boundary are credited to the wrong month and the monthly values are biased up or down.
Unbounded memory allocation. An unchunked multi-year raster stack forces full in-memory materialization. A single 10-year hourly GHI stack at 1 km resolution easily exceeds 50 GB. Calling .resample() on it without explicit dask chunking loads the whole array eagerly and risks an OOM kill (or, on a dask cluster with poorly sized chunks, a stalled scheduler), especially when combined with spatial operations such as terrain masking or shadow casting.
Physics violation in aggregation. An arithmetic mean over a 24-hour window dilutes daylight hours with night-time zeros and misrepresents the capacity factor. Irradiance reduction must conserve energy: sum hourly power to monthly energy, then divide by the count of valid hours. For an average that preserves the energy balance over a month of $N$ valid hourly samples $G_{i}$ (in W·m⁻²):

$$\bar{G}_{\text{month}} = \frac{\sum_{i=1}^{N} G_{i}\,\Delta t}{N\,\Delta t} = \frac{1}{N}\sum_{i=1}^{N} G_{i}, \quad \text{with } N \text{ gated by a coverage threshold}$$

The subtlety is not the formula — it is that $N$ must count valid hours only, and months below a coverage threshold must be masked rather than reported.
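As a minimal sketch of that gating, assuming a pandas Series of hourly GHI with NaN marking missing samples: the synthetic values, the `gated_monthly_mean` helper name, and the 0.90 default are illustrative and not taken from a real dataset.

```python
import numpy as np
import pandas as pd


def gated_monthly_mean(ghi: pd.Series, coverage: float = 0.90) -> pd.Series:
    """Monthly mean of hourly GHI, masked where valid hours fall below `coverage`."""
    months = ghi.index.to_period("M")
    energy = ghi.groupby(months).sum()       # sum of G_i * 1 h, NaN skipped
    n_valid = ghi.groupby(months).count()    # N: non-NaN hours only
    # Expected hours follow each month's real length (28-31 days).
    days = pd.PeriodIndex(n_valid.index).days_in_month.to_numpy()
    expected = pd.Series(days * 24, index=n_valid.index)
    mean = energy / n_valid.where(n_valid > 0)
    return mean.where(n_valid >= coverage * expected)


# Synthetic two-month series: constant 200 W/m^2, with most of February missing.
idx = pd.date_range("2023-01-01", "2023-02-28 23:00", freq="h")
ghi = pd.Series(200.0, index=idx)
ghi.loc["2023-02-05":"2023-02-28"] = np.nan

result = gated_monthly_mean(ghi)
print(result)
```

January keeps its mean because all 744 hours are valid. February has only 96 valid hours out of 672, so it is masked to NaN instead of reporting a mean built from four days.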

Pre-flight validation

Before resampling, surface all three root causes in one cheap pass. This validator reads only metadata and the time coordinate — it never materializes the data array — so it is safe to run as the first step of a CI/CD job. It confirms the time index is timezone-explicit, estimates the in-memory footprint, and checks that the stack is chunked.

python

import xarray as xr
import numpy as np
import pandas as pd


def preflight_resample_check(path: str, var: str = "ghi", ram_budget_gb: float = 16.0) -> dict:
    """Surface tz-drift, OOM, and coverage risks before resampling an hourly GHI stack."""
    ds = xr.open_dataset(path, chunks={})  # open lazily, read metadata only
    da = ds[var]
    report = {"path": path, "ok": True, "warnings": []}

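    # 1. Timezone explicitness. xarray decodes times to tz-naive datetime64, so the
    # zone must be declared in metadata. The "time_zone" attribute name checked here
    # is an assumed project convention, not a CF standard; adapt it to your files.
    t = pd.DatetimeIndex(da["time"].values)
    declared_tz = da["time"].attrs.get("time_zone") or ds.attrs.get("time_zone")
    if declared_tz is None:
        report["warnings"].append(
            "time index is timezone-naive; declare source tz (assume UTC for NSRDB/ERA5) "
            "before resampling to a local calendar month")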
    report["inferred_freq"] = pd.infer_freq(t)
    if report["inferred_freq"] not in ("h", "H"):
        report["warnings"].append(f"non-hourly cadence {report['inferred_freq']!r}; coverage math assumes 1 h steps")

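    # 2. Memory footprint: float32 element count vs RAM budget.
    # chunks={} always returns a dask-backed array, so "unchunked" here means every
    # dimension sits in a single chunk (no on-disk chunking to inherit).
    est_gb = da.size * np.dtype("float32").itemsize / 1024**3
    report["estimated_gb"] = round(est_gb, 2)
    single_chunk = all(len(c) == 1 for c in (da.chunks or ()))
    if est_gb > ram_budget_gb and single_chunk:
        report["warnings"].append(
            f"unchunked {est_gb:.1f} GB array exceeds {ram_budget_gb} GB budget; "
            "open with chunks={'time': 720, 'y': 256, 'x': 256}")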

    # 3. Coverage — months that cannot reach the threshold should be flagged early
    months = t.to_period("M").nunique()
    report["n_months"] = int(months)

    report["ok"] = not report["warnings"]
    ds.close()
    return report


if __name__ == "__main__":
    print(preflight_resample_check("solar_irradiance_hourly.nc"))

A clean pre-flight pass guarantees the fix below runs deterministically. Treat any warning as a hard stop in automated pipelines — the same discipline applied in spatial data quality validation gates, where a bad layer is halted before it reaches a downstream model.
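One way to wire that hard stop into a CI job is to turn a non-empty warning list into a non-zero exit code. The sketch below assumes a report dictionary with the `ok` and `warnings` keys used by the validator; the `gate_on_report` name and the exit-code convention are illustrative.

```python
import sys


def gate_on_report(report: dict) -> int:
    """Return a process exit code: 0 when the report is clean, 1 otherwise."""
    warnings = report.get("warnings", [])
    for msg in warnings:
        print(f"PREFLIGHT WARNING: {msg}", file=sys.stderr)
    return 0 if report.get("ok", False) and not warnings else 1


# Hypothetical reports standing in for preflight_resample_check() output.
clean = {"path": "a.nc", "ok": True, "warnings": []}
dirty = {"path": "b.nc", "ok": False, "warnings": ["time index is timezone-naive"]}

clean_code = gate_on_report(clean)
dirty_code = gate_on_report(dirty)
print(clean_code, dirty_code)
```

In a pipeline step this would typically end with `sys.exit(gate_on_report(preflight_resample_check(path)))`, so the runner marks the job as failed before any resampling starts.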

Fix implementation

The corrected pipeline uses xarray and dask with explicit timezone normalization, lazy evaluation, and energy-conserving aggregation. Every parameter is chosen for energy-GIS use: float32 keeps the stack within a workstation RAM budget, time=720 sizes each time chunk at roughly one 30-day month so a monthly reduction touches only a few chunks, and a 90% coverage mask suppresses months that would otherwise report a biased mean.

python

import xarray as xr
import pandas as pd

# Configuration
INPUT_NC = "solar_irradiance_hourly.nc"
OUTPUT_NC = "solar_irradiance_monthly.nc"
TARGET_TZ = "America/New_York"
CHUNKS = {"time": 720, "y": 256, "x": 256}
COVERAGE_FRACTION = 0.90

# 1. Lazy load with explicit chunking to prevent OOM
ds = xr.open_dataset(INPUT_NC, chunks=CHUNKS)

# 2. Timezone normalization (UTC -> project local).
# xarray requires timezone-naive datetimes; convert via pandas, then strip tz info
# so the resample window lands on local calendar months without drift.
local_times = (
    pd.DatetimeIndex(ds["time"].values)
    .tz_localize("UTC")
    .tz_convert(TARGET_TZ)
    .tz_localize(None)  # local time is now encoded in the values
)
ds = ds.assign_coords(time=local_times)

# 3. Physics-compliant aggregation (W/m^2 -> Wh/m^2 -> monthly mean W/m^2).
# Sum hourly power to monthly energy, then divide by the count of valid hours
# to recover average irradiance while preserving the energy balance.
monthly_energy = ds["ghi"].resample(time="ME").sum(skipna=True)   # Wh/m^2 per month
monthly_hours = ds["ghi"].resample(time="ME").count()             # valid hours per month
ghi_monthly_avg = (
    monthly_energy / monthly_hours.where(monthly_hours > 0)
).rename("ghi_monthly_avg").astype("float32")

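# 4. Coverage mask: suppress months with < 90% temporal coverage.
# Expected hours follow each month's real length (28-31 days), not a flat 720 h,
# so February is not over-penalised and 31-day months are not under-checked.
expected_hours = monthly_hours["time"].dt.days_in_month * 24
ghi_monthly_avg = ghi_monthly_avg.where(monthly_hours >= COVERAGE_FRACTION * expected_hours)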

# 5. CF-compliant metadata + deterministic compressed write
ghi_monthly_avg.attrs.update({
    "units": "W m-2",
    "standard_name": "surface_downwelling_shortwave_flux_in_air",
    "cell_methods": "time: mean",
    "processing_tz": TARGET_TZ,
    "aggregation_method": "energy_sum_then_divide",
    "coverage_threshold": COVERAGE_FRACTION,
})
ghi_monthly_avg.to_netcdf(
    OUTPUT_NC,
    encoding={"ghi_monthly_avg": {"zlib": True, "complevel": 4, "_FillValue": -9999.0}},
    engine="netcdf4",
)

The energy-summation approach also absorbs daylight-saving and leap-year irregularities: a 23-hour or 25-hour DST day changes the valid-hour count, not the per-hour energy, so the divisor self-corrects. For strict audit runs, set skipna=False so any gap propagates to NaN and the affected month is reported as missing rather than averaged over the surviving hours. Confirm the source crs_wkt/spatial_ref survives the round-trip; if it is dropped, reassign it explicitly before the write, for example with ds.rio.write_crs("EPSG:4326", inplace=True) (this needs import rioxarray, and the EPSG code must match the source grid), following the same coordinate reference system discipline the upstream stages enforce.
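To see what the timezone normalization in step 2 changes at a month boundary, the following sketch counts how many hourly UTC timestamps in one February land in a different calendar month once converted to local time. It uses pandas only; the 2023 date range is an arbitrary illustration and the zone mirrors TARGET_TZ.

```python
import pandas as pd

# Every hourly UTC timestamp in February 2023 (28 days x 24 h = 672 stamps).
utc = pd.date_range("2023-02-01 00:00", "2023-02-28 23:00", freq="h", tz="UTC")
local = utc.tz_convert("America/New_York")  # UTC-5 throughout February

# Hours whose calendar month differs between the UTC and local views.
shifted = int((utc.month != local.month).sum())
print(f"{shifted} of {len(utc)} hourly samples change month after conversion")
```

The first five UTC hours of 1 February belong to 31 January in local time, and the same number of hours from 1 March UTC belong to local February, so a UTC-based monthly window and a local one cover different sets of samples.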

Fallback routing & performance tuning

For continental-scale stacks or CI/CD runners with constrained RAM, layer in these controls before reaching for a bigger machine:

Chunk to the calendar, not the disk. Keep time=720 so each monthly window touches only two or three neighbouring chunks and the reducer avoids wide cross-chunk shuffles; keep spatial chunks square (256×256) for cache locality during raster reads.
Enforce lazy evaluation. Never call .compute() or .load() before .resample(). Chain operations lazily and trigger execution only at .to_netcdf() or .to_zarr() so dask streams the stack instead of materializing it.
Tune the scheduler to the host. On a workstation use dask.config.set(scheduler="threads"): NumPy releases the GIL during array math, so threads parallelise well without the serialisation cost of separate processes. On a multi-node Dask cluster route through dask.distributed with an explicit --memory-limit 4GB per worker so a runaway chunk is spilled to disk rather than OOM-killed.
Spill to Zarr under pressure. If memory still spikes, write monthly intermediates with ghi_monthly_avg.to_zarr("monthly_intermediate.zarr", mode="w") — Zarr’s chunked store sidesteps NetCDF locking and enables parallel downstream reads.
Fall back, never extrapolate. If a target month lacks hourly data, drop to daily aggregates with a logged warning. Never synthesise a monthly mean from fewer than 10 days; mask and document it instead so a reviewer sees a gap rather than a fabricated value.
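The per-chunk footprint behind those settings can be checked with simple arithmetic. This sketch assumes float32 samples and the chunk shape used in the fix (720 × 256 × 256); the `chunk_mib` helper is illustrative.

```python
import math


def chunk_mib(chunks: dict, itemsize: int = 4) -> float:
    """Size of one chunk in MiB for the given per-dimension chunk lengths."""
    n_elements = math.prod(chunks.values())
    return n_elements * itemsize / 1024**2


CHUNKS = {"time": 720, "y": 256, "x": 256}
size = chunk_mib(CHUNKS)
print(f"one float32 chunk = {size:.0f} MiB")
```

At 180 MiB per chunk, a worker with a 4 GB limit can hold several chunks plus intermediates. Halving the spatial chunk edge to 128 cuts the footprint to a quarter if memory is still tight.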

Downstream validation

Gate the output before it reaches a yield model. This audit asserts temporal monotonicity, CRS persistence, dtype, and physical bounds (0 ≤ GHI ≤ 1400 W·m⁻²), and is cheap enough to run as a CI/CD step on every produced artifact.

python

import xarray as xr
import numpy as np


def audit_monthly_ghi(path: str, var: str = "ghi_monthly_avg") -> None:
    """Fail fast if a resampled monthly GHI artifact is not pipeline-ready."""
    ds = xr.open_dataset(path)
    da = ds[var]

    t = ds["time"].to_index()
    assert t.is_monotonic_increasing, "time axis is not monotonic increasing"
    assert da.dtype == np.float32, f"expected float32, got {da.dtype}"

    # Prefer an explicit crs_wkt attribute; fall back to rioxarray only if it is loaded.
    crs = ds.attrs.get("crs_wkt")
    if crs is None and hasattr(ds, "rio"):
        crs = ds.rio.crs
    assert crs is not None, "CRS metadata missing; downstream merges will misalign"

    valid = da.where(np.isfinite(da))
    vmax = float(valid.max()) if valid.count() else 0.0
    vmin = float(valid.min()) if valid.count() else 0.0
    assert 0.0 <= vmin and vmax <= 1400.0, f"GHI out of physical bounds: [{vmin}, {vmax}]"
    assert da.attrs.get("cell_methods") == "time: mean", "missing CF cell_methods tag"

    ds.close()
    print(f"PASS audit: {path} ({da.sizes.get('time', 0)} months, max {vmax:.0f} W/m^2)")


if __name__ == "__main__":
    audit_monthly_ghi("solar_irradiance_monthly.nc")

Pairing this audit with a SHA-256 checksum of the NetCDF file in your asset registry gives you the deterministic, reproducible provenance that regulatory and project-finance due diligence require — the resampled monthly averages then integrate cleanly into capacity-factor estimation and grid-planning pipelines.
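A checksum of that kind can be produced with the standard library alone. The sketch below streams the file in blocks so a large NetCDF artifact never has to fit in memory; the `sha256_of` helper name and the 1 MiB block size are illustrative choices, and how the digest is stored in an asset registry is left open.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: str, block_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demonstration on a throwaway file standing in for the NetCDF artifact.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    checksum = sha256_of(str(sample))
print(checksum)
```

Recording the digest next to the audit result lets a reviewer confirm that the file they received is byte-for-byte the one that passed the gate.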

Related

Temporal Data Aggregation: the parent workflow defining the full reduction contract for solar and wind time-series.
Stacking NASA POWER and PVGIS rasters in rasterio: resolve the spatial-alignment errors that precede temporal reduction.
Calculating wind shear coefficients with Python: the wind-side counterpart to physics-correct solar aggregation.
Coordinate Reference Systems for Energy Projects: the CRS foundation every resampled artifact must carry forward.
