Extend xarray method coverage on Variable / LinearExpression #703

@FBumann

Summary

Variable, LinearExpression/QuadraticExpression and Constraint each expose a hand-picked subset of xarray.Dataset methods (via varwrap / exprwrap / conwrap). The three lists have drifted apart, several common arithmetic methods are missing entirely, and a few existing ones diverge from xarray — especially around datetime indexes.

This issue collects the analysis and a design sketch for closing the gaps. All of the proposed methods preserve linearity (none multiplies two variables), so they are mostly bookkeeping.


A. Cross-class inconsistencies (existing methods)

The same wrapper is exposed on some classes but not others, with no clear reason:

| method | Variable | Expression | Constraint |
|---|---|---|---|
| compute | yes | no | no |
| chunk (dask) | no | yes | yes |
| reindex / reindex_like | no | yes | yes |
| astype | no | yes | no |
| drop / drop_vars | no | yes | no |
| reset_index | no | yes | no |
| rename_dims | no | yes | yes |
| shift fill_value default | yes (variables.py:1268) | no (expressions.py:1474) | no |

The two that actually bite users:

  • Variable.reindex / reindex_like missing — you can reindex an expression or constraint onto a master index but not a variable. Natural operation for datetime work (aligning a variable to a full snapshot index).
  • Expression.shift / Constraint.shift have no fill_value while Variable.shift defaults to the linopy fill value. Shifted slots get coeffs=NaN, vars=-1; it mostly survives downstream fillna(0), but it is the root of the diff issue below.
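The fill behaviour described in the second bullet can be pictured on plain arrays. This is a minimal sketch, assuming a toy layout of one term per row; `shift_rows` is a hypothetical helper and not part of linopy.

```python
import numpy as np

# Toy stand-in for one expression along "time": 4 rows, 1 term per row.
coeffs = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (time, term)
vars_ = np.array([[10], [11], [12], [13]])       # integer variable labels


def shift_rows(arr, n, fill):
    """Move rows down by n (n > 0) and fill the vacated leading rows."""
    out = np.full_like(arr, fill)
    out[n:] = arr[:-n]
    return out


# The vacated slot carries the "missing" sentinels named in the text.
shifted_coeffs = shift_rows(coeffs, 1, np.nan)
shifted_vars = shift_rows(vars_, 1, -1)

# A downstream fillna(0)-style cleanup turns the NaN coefficient into 0.
cleaned_coeffs = np.nan_to_num(shifted_coeffs, nan=0.0)
```

Row 0 of the shifted arrays is the slot that has no source row; every later row holds the terms of the row before it.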

B. Existing methods lacking features vs. xarray (datetime focus)

  1. diff (expressions.py:1238) — implemented as self - self.shift({dim: n}). Two divergences from xarray.diff:
    • Length: xarray trims the dimension to N-n; linopy keeps N, so the first n rows are spurious (self[0] - empty, a leading garbage term still carrying the original coordinate label).
    • No label argument — xarray has label="upper"|"lower"; linopy is hard-wired to "upper".
  2. groupby (expressions.py:1258) — typed only for DataArray/Series/DataFrame. The string-accessor form (groupby("time.month"), "time.season") only reaches the slow fallback; the fast reindex summation (expressions.py:229) requires Series/DataFrame/DataArray. groupby_bins not exposed. LinearExpressionGroupby only implements .sum() / .map() / .roll() — no .mean() / .first() / .last() / .count().
  3. rolling (expressions.py:1289): LinearExpressionRolling only implements .sum() (expressions.py:307). No .mean(), no .construct(). Integer-window only.
  4. No frequency-aware shift: neither linopy nor xarray supports shift(time="1D"); pandas does. See the datetime-shift section below.

sel/isel delegate cleanly, so datetime partial-string indexing (sel(time="2030"), method="nearest") already works.
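The two diff divergences from item 1 can be illustrated on plain numbers. This is a hedged sketch for a first-order difference only: `diff_keep_length` and `diff_trimmed` are hypothetical names, and a numeric NaN stands in for the spurious leading row that linopy would produce.

```python
import numpy as np

time = np.array([0, 1, 2, 3])
values = np.array([5.0, 7.0, 12.0, 20.0])


def diff_keep_length(v):
    """Shift-and-subtract: keeps all N rows, the first one is not a real difference."""
    shifted = np.full_like(v, np.nan)
    shifted[1:] = v[:-1]
    return v - shifted


def diff_trimmed(v, coords, label="upper"):
    """Trimmed difference: N - 1 rows, labelled by the upper or lower coordinate."""
    out = v[1:] - v[:-1]
    out_coords = coords[1:] if label == "upper" else coords[:-1]
    return out_coords, out


kept = diff_keep_length(values)
upper_coords, upper = diff_trimmed(values, time, label="upper")
lower_coords, lower = diff_trimmed(values, time, label="lower")
```

Both label choices return the same differences; only the coordinate attached to each difference changes.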

C. Missing methods, ranked by usefulness

Only linearity-preserving methods are candidates. Excluded as non-linear/meaningless: max, min, median, std, var, prod, cumprod, quantile, rank, idxmax/idxmin, argmax/argmin, clip.

Tier 1 — high value

  • mean — linear (sum/n), ubiquitous, missing on both Variable and LinearExpression.
  • resample — datetime aggregation (hourly to daily). Core energy-modeling need. Absent.
  • reindex / reindex_like on Variable — fixes the asymmetry in A.
  • weighted: expr.weighted(w).sum(); maps onto PyPSA snapshot_weightings.

Tier 2 — useful

  • coarsen — positional block aggregation (time=24).
  • rolling().mean() and groupby().mean() — extend the existing reducers.
  • transpose — missing on both; needs care to keep _term/_factor last.
  • compute on Expression, chunk on Variable — close the dask asymmetry from A.
  • dropna — drop missing coordinate slices.

Tier 3 — nice-to-have / niche

  • astype / reset_index / rename_dims on Variable (consistency).
  • sortby, squeeze, head/tail/thin, pad.
  • interp / interp_like (genuinely linear but niche).
  • groupby_bins.

Implementation design notes (Tier 1 + Tier 2)

Why none of this is hard

A linopy linear expression is just a list of terms plus a constant, e.g. 3*x[a] + 2*x[b] + 5, stored as three aligned arrays: coeffs, vars (integer labels), const.

Every method below does one of three harmless things to that list:

  1. Regroup terms — collect terms from several cells into one. (sum, resample, coarsen, groupby)
  2. Copy terms — the same term appears in several output cells. (rolling)
  3. Move or drop whole cells — pick, fill, or discard cells without touching the terms inside. (reindex, shift, dropna, transpose)

and some also rescale — multiply every coefficient and the constant by a number/array. (mean, weighted, the .mean() variants)

None of this multiplies two variables together, so the result is always still a valid linear (or quadratic) expression. linopy already owns every building block: regrouping is the term-stacking trick behind .sum(); copying is the window trick behind .rolling().sum(); moving cells is ordinary .sel()/.reindex(); rescaling is ordinary * and /.
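The regroup / copy / rescale operations can be checked numerically on a toy version of the three arrays. This is a sketch under stated assumptions: the array names mirror the text, `evaluate` is a hypothetical helper, and none of it is linopy's actual implementation.

```python
import numpy as np

# Toy expression over 3 time steps, one term per row: 3*x[t] + 5.
coeffs = np.array([[3.0], [3.0], [3.0]])  # shape (time, term)
vars_ = np.array([[0], [1], [2]])         # integer labels, -1 means "no term"
const = np.array([5.0, 5.0, 5.0])


def evaluate(coeffs, vars_, const, x):
    """Numeric value of each cell for variable values x; -1 slots count as 0."""
    present = vars_ >= 0
    contrib = np.where(present, coeffs * x[np.where(present, vars_, 0)], 0.0)
    return contrib.sum(axis=-1) + const


# 1. Regroup: sum over time = stack every row's terms into one cell.
sum_coeffs, sum_vars, sum_const = coeffs.ravel(), vars_.ravel(), const.sum()

# 2. Copy: rolling window of 2, row t holds the terms of rows t-1 and t.
prev_coeffs = np.vstack([np.full((1, 1), np.nan), coeffs[:-1]])
prev_vars = np.vstack([np.full((1, 1), -1), vars_[:-1]])
roll_coeffs = np.hstack([prev_coeffs, coeffs])
roll_vars = np.hstack([prev_vars, vars_])
roll_const = const + np.concatenate([[0.0], const[:-1]])

# Rescale: mean over time = the summed cell divided by the number of rows.
mean_coeffs, mean_const = sum_coeffs / len(const), sum_const / len(const)

x = np.array([1.0, 2.0, 4.0])  # arbitrary values for x[0], x[1], x[2]
row_values = evaluate(coeffs, vars_, const, x)
```

Evaluating the regrouped, copied and rescaled cells at any `x` gives the same numbers as summing, rolling or averaging `row_values` directly, which is the sense in which the operations are bookkeeping.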

Tier 1

  • mean = sum divided by how many things were summed: for a time axis of length 3, gen.mean("time") == gen.sum("time") / 3. Divide by the count of non-missing entries (matches xarray skipna=True and linopy's own sum, which already skips -1 variables). An all-missing slice gives 0/0 = NaN, as in xarray.
  • resample aggregates a datetime axis by calendar period. Each period's expression = sum of that period's terms = a groupby keyed by "which period". Ask pandas for the period label of each timestamp, reuse the existing groupby(...).sum() fast path; .mean() divides by the count. Decide: keep empty periods as 0 (xarray parity) or drop them. Forward closed/label/origin.
  • reindex / reindex_like (Variable) — new slots get the missing-sentinel (labels=-1, bounds NaN), same fill Expression/Constraint already use. One line each: reindex = varwrap(Dataset.reindex, fill_value=FILL_VALUE).
  • weighted: gen.weighted(w).sum("time") == (gen * w).sum("time"). Pure sugar over * and sum; .mean() also divides by w.sum(). Just a small wrapper object.
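The resample and weighted bullets can be sketched with terms kept as plain (coefficient, variable name) pairs. This is an illustrative toy, assuming one term per timestamp; `resample_terms` and `weighted_sum` are hypothetical helpers, and only the period labelling comes from pandas.

```python
import pandas as pd

# Irregular timestamps spread over two days.
time = pd.to_datetime(
    [
        "2030-01-01 00:00",
        "2030-01-01 12:00",
        "2030-01-02 00:00",
        "2030-01-02 06:00",
        "2030-01-02 18:00",
    ]
)
# One term per timestamp: (coefficient, variable name).
terms = [(1.0, "x0"), (1.0, "x1"), (1.0, "x2"), (1.0, "x3"), (1.0, "x4")]


def resample_terms(time, terms, freq="D", how="sum"):
    """Group terms by the period label pandas assigns to each timestamp."""
    groups = {}
    for label, term in zip(time.floor(freq), terms):
        groups.setdefault(label, []).append(term)
    if how == "mean":
        # mean = sum rescaled by the number of entries in the period
        groups = {
            label: [(c / len(group), name) for c, name in group]
            for label, group in groups.items()
        }
    return groups


def weighted_sum(terms, weights):
    """weighted(w).sum() as sugar: scale each coefficient, then collect the terms."""
    return [(c * w, name) for (c, name), w in zip(terms, weights)]


daily_sum = resample_terms(time, terms, how="sum")
daily_mean = resample_terms(time, terms, how="mean")
weighted = weighted_sum(terms[:2], [3.0, 0.5])
```

This toy only creates periods that contain at least one timestamp, so it sidesteps the keep-or-drop decision for empty periods; keeping them would need an extra reindex onto the full period range.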

Tier 2

  • coarsen: resample's positional cousin (group every N rows). Reshape into blocks via xarray's coarsen().construct(), then sum each block; .mean() divides by block size. Handle boundary="trim"|"pad".
  • rolling().mean() / groupby().mean() — rolling/groupby sum already exist; .mean() divides by the window size / group size (or the valid count at window edges).
  • transpose — pure axis reorder, no arithmetic. Must reorder only real dims and keep the internal _term/_factor last. Plain wrap for Variable.
  • compute / chunk — no math, dask plumbing; add only to make the three classes consistent.
  • dropna — drops "missing" coordinate slices. "Missing" in linopy is not raw NaN: for Variable it is labels == -1; for an expression it is isnull() (all terms empty AND const NaN). Build the drop-mask from isnull().
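The coarsen bullet amounts to a reshape of the term arrays. Below is a minimal sketch on plain NumPy arrays, assuming the toy (time, term) layout; `coarsen_sum` is a hypothetical helper, and the real implementation would go through xarray's coarsen().construct() as described above.

```python
import numpy as np

coeffs = np.arange(1.0, 8.0).reshape(7, 1)  # 7 rows, 1 term per row
vars_ = np.arange(7).reshape(7, 1)          # integer variable labels


def coarsen_sum(coeffs, vars_, window, boundary="trim"):
    """Sum every `window` rows by stacking their terms into one output row."""
    n_rows, n_terms = coeffs.shape
    remainder = n_rows % window
    if remainder and boundary == "trim":
        # drop the incomplete trailing block
        coeffs, vars_ = coeffs[: n_rows - remainder], vars_[: n_rows - remainder]
    elif remainder and boundary == "pad":
        # complete the trailing block with "missing" slots
        pad = window - remainder
        coeffs = np.vstack([coeffs, np.full((pad, n_terms), np.nan)])
        vars_ = np.vstack([vars_, np.full((pad, n_terms), -1)])
    elif remainder:
        raise ValueError("length is not a multiple of window")
    n_blocks = coeffs.shape[0] // window
    return (
        coeffs.reshape(n_blocks, window * n_terms),
        vars_.reshape(n_blocks, window * n_terms),
    )


trim_coeffs, trim_vars = coarsen_sum(coeffs, vars_, window=3, boundary="trim")
pad_coeffs, pad_vars = coarsen_sum(coeffs, vars_, window=3, boundary="pad")

# .mean() on full blocks: rescale every coefficient by the block size.
trim_mean_coeffs = trim_coeffs / 3
```

With padding, the last block contains missing slots, so a mean over it raises the same "block size or valid count" question as the window edges of rolling().mean().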

Datetime-aware shift

Today shift is integer/positional only — shift(time=1) moves everything one row. Fine for a regular hourly index, used for storage balance soc[t] == soc[t-1] + charge[t]. But on irregular snapshots (variable time resolution, clustered representative periods) "one row" is not "one hour". There are three distinct operators:

  1. Integer shift — what exists today.
  2. Time shift — move by an actual duration, shift(time="1h"): for each timestamp find the row exactly one hour earlier; no such row -> missing. The correct operator for irregular grids and what storage/ramping constraints want. Basically a reindex onto "time minus offset" labels, so nearly free once Variable has reindex. Collapses to integer shift on a regular index.
  3. Index shift — keep the data, relabel the time axis by an offset (pandas shift(freq=...)).

pandas handles month-end/DST for offsets like "1ME". Adding datetime shift is also a good moment to fix the missing fill_value on Expression.shift / Constraint.shift (A above).
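The three operators can be compared on an irregular index using a pandas Series of placeholder labels. This is a sketch only: `time_shift` is a hypothetical helper showing the "reindex onto time minus offset" idea, not a proposed linopy signature.

```python
import pandas as pd

# Irregular snapshots: the 02:00 row is missing.
time = pd.to_datetime(
    ["2030-01-01 00:00", "2030-01-01 01:00", "2030-01-01 03:00", "2030-01-01 04:00"]
)
soc = pd.Series(["soc0", "soc1", "soc3", "soc4"], index=time)

# 1. Integer shift: every row takes the previous row, whatever its timestamp.
integer_shift = soc.shift(1)


def time_shift(series, offset):
    """2. Time shift: each row takes the row exactly `offset` earlier, else missing."""
    lookup = series.index - pd.Timedelta(offset)
    shifted = series.reindex(lookup)  # labels without a match become NaN
    shifted.index = series.index      # restore the original time labels
    return shifted


by_duration = time_shift(soc, "1h")

# 3. Index shift: data stays in place, the time labels move by the offset.
index_shift = soc.shift(freq=pd.Timedelta("1h"))
```

At 03:00 the integer shift returns soc1, which is two hours old, while the time shift returns a missing value because there is no 02:00 row.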


Open questions

  • mean: divide by the non-missing count, or by the raw length?
  • resample / coarsen: keep empty periods, or drop them?
  • rolling().mean() at the edges: divide by the window size, or by the valid count?
  • shift(time="1h"): time shift or index shift (or expose both)?
  • Add the new methods to QuadraticExpression and Constraint too, or just Variable / LinearExpression?
  • Should this ship as one PR per tier, or per method?
