
⭐ added various functions to the binning module #567

Merged
KevinShuman merged 96 commits into Cloud-Drift:main from philippemiron:binning-plus
Aug 27, 2025
Conversation

@philippemiron
Contributor

@philippemiron philippemiron commented Jun 28, 2025

Created this since I have worked on extending the binning module (closes #565).

  • extend to more statistics than just the current mean (see twodstats.m)
  • automatic handling of datetime in coordinates/variables
  • add new tests for the other functions
  • think about how we can abstract the "histogram" part so we could use other methods like polynomial fitting (another PR)
  • fix anonymous function names and name collisions

@philippemiron philippemiron requested a review from Copilot August 4, 2025 12:54

This comment was marked as outdated.

@KevinShuman
Collaborator

What is the expected output for the test above? Running it, I am getting the following output/error:

https://www.aoml.noaa.gov/phod/float_traj/files/allFloats_12122017.mat: 100%|██████████| 27.8k/27.8k [00:06<00:00, 4.58MB/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 12
      9 variables = [ds_float.ve, ds_float.vn]
     11 # Compute 3D binned averages
---> 12 ds = binned_statistics(
     13     coords=coords,
     14     data=variables,
     15     bins=[75, 30, 10],
     16     dim_names=("lon", "lat", "time"),
     17     output_names=["ve", "vn"],
     18     statistics=["count", "sum", "mean", "median", "std", "min", "max"],
     19 )
     21 ds["vn_median"].isel(time=9).plot(x="lon", y="lat", cmap="viridis")

File /vol/clouddrift/clouddrift/binning.py:523, in binned_statistics(coords, data, bins, bins_range, dim_names, output_names, statistics)
    520 D, N = coords.shape
    522 # validate coordinates are finite
--> 523 if any(~np.isfinite(c).all() for c in coords):
    524     raise ValueError("Coordinates must be finite values.")
    526 # V, VN = number of variables and number of data points per variable

File /vol/clouddrift/clouddrift/binning.py:523, in <genexpr>(.0)
    520 D, N = coords.shape
    522 # validate coordinates are finite
--> 523 if any(~np.isfinite(c).all() for c in coords):
    524     raise ValueError("Coordinates must be finite values.")
    526 # V, VN = number of variables and number of data points per variable

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The following tests are also failing because of this:

FAILED tests/binning_test.py::binning_tests::test_statistics_datetime_coords - TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according ...
FAILED tests/binning_test.py::binning_tests::test_statistics_datetime_data - TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according ...
FAILED tests/binning_test.py::binning_tests::test_statistics_datetime_data_sum - TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according ...

@philippemiron
Contributor Author

Sorry, I didn't realize this. Will try fixing right now.

@philippemiron philippemiron requested a review from Copilot August 9, 2025 00:25
Copilot AI left a comment

Pull Request Overview

This PR extends the binning module functionality to support various statistical computations beyond just counting and mean calculations. The changes enable automatic handling of datetime coordinates/variables and add comprehensive testing for the new features.

  • Extended statistics support to include count, sum, mean, median, std, min, max, and custom functions
  • Added automatic datetime handling in coordinates and data variables
  • Enhanced naming conventions and collision avoidance for output variables

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
tests/binning_test.py Adds comprehensive tests for new statistics functions, datetime handling, and error cases
clouddrift/binning.py Implements new statistics functions, datetime conversion utilities, and enhanced parameter validation
.github/workflows/ci.yml Updates CI workflow conditions for coverage reporting

Member

@selipot selipot left a comment


I put comments inline

- A list of ints or arrays: one per dimension, specifying either bin count or bin edges,
- None: defaults to 10 bins per dimension.
bins_range : list of tuples, optional
Outer bin limits for each dimension.
Member


Is it possible to provide a range for only a subset of the dimensions or do we have to provide the ranges for all the dimensions when this optional argument is given?

Contributor Author

@philippemiron philippemiron Aug 13, 2025


You can provide a range for just a subset of the variables. I actually need to add tests for this. If you pass [[-90, 90], None], it should apply the range only to the first variable and take the min/max for the second one.
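The fallback rule described here can be sketched with a hypothetical helper (resolve_ranges is not part of clouddrift; it only illustrates how a None entry could default to the coordinate's min/max):

```python
import numpy as np

# Hypothetical helper illustrating the rule above: a None entry in
# bins_range falls back to that coordinate's min/max.
def resolve_ranges(coords, bins_range):
    resolved = []
    for c, r in zip(coords, bins_range):
        if r is None:
            resolved.append((float(np.min(c)), float(np.max(c))))
        else:
            resolved.append(tuple(r))
    return resolved

lat = np.array([-45.0, 0.0, 60.0])
depth = np.array([5.0, 50.0, 500.0])
ranges = resolve_ranges([lat, depth], [[-90, 90], None])
# ranges -> [(-90, 90), (5.0, 500.0)]
```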

- a tuple of (output_name, callable) for multivariate statistics. 'output_name' is used to identify the resulting variable.
In this case, the callable will receive the list of arrays provided in `data`. For example, to calculate kinetic energy,
you can pass `data = [u, v]` and `statistics=("ke", lambda data: np.sqrt(np.mean(data[0] ** 2 + data[1] ** 2)))`.
- a list containing any combination of the above, e.g., ['mean', np.nanmax, ('ke', lambda data: np.sqrt(np.mean(data[0] ** 2 + data[1] ** 2)))].
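As a toy check of the multivariate form above, the callable receives the full list of data arrays (here data = [u, v]); the values are made up for illustration:

```python
import numpy as np

# Toy check of the multivariate statistic from the docstring: the
# callable receives the list of data arrays, here data = [u, v].
u = np.array([3.0, 3.0])
v = np.array([4.0, 4.0])

ke = lambda data: np.sqrt(np.mean(data[0] ** 2 + data[1] ** 2))
# ke([u, v]) -> 5.0  (mean of 9 + 16 per point is 25; sqrt is 5)
```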
Member


Here you suggest np.nanmax because the standard 'max' statistic will return np.nan if the data contain np.nan values?

Contributor Author


To be fair, I didn't think about this. I just included a random numpy function as an example.
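For reference, the NaN behavior behind the question above:

```python
import numpy as np

# The distinction raised above: np.max propagates NaN, while
# np.nanmax ignores NaN values.
x = np.array([1.0, np.nan, 3.0])

plain_max = np.max(x)    # nan: NaN propagates through the reduction
nan_max = np.nanmax(x)   # 3.0: NaN values are ignored
```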

+ statistics_func
)

if statistics and not data.size:
Member


Mypy seems to complain about the .size attribute of data?

Contributor Author


does it? I don't see this in the tests below?

@selipot
Member

selipot commented Aug 13, 2025

If I do the following:

coords = [ds.lon, ds.lat]
variables = [ds.ve, ds.vn, ds.temp]

ds_stats = binned_statistics(
    coords=coords,
    data=variables,
    bins=[360,180],
    bins_range=[(-180, 180), (-90, 90)],
    dim_names=["lon", "lat"],
    output_names=["ve", "vn", "temp"],
    statistics=['mean',
        ('q25', lambda x: np.percentile(x, q=2.5)),
        ('q975', lambda x: np.percentile(x, q=97.5)),
        ('skew', lambda x: skew(x, axis=None, nan_policy='omit')),
    ],
)

only the mean statistic gets applied to each of the variables. I was expecting the custom callables q25, q975, and skew to be applied to each. I'm not sure to which variable these were applied?

[Screenshot 2025-08-13 at 10 29 10 AM]

@philippemiron
Contributor Author

philippemiron commented Aug 13, 2025

Hi @selipot,

In the end I don't think anything is wrong, just maybe confusing. If you use the tuple in statistics, this is for the multivariate statistics. In that situation, the x will be a list containing all the data variables. So you would have to do something like this:

coords = [ds.lon, ds.lat]
variables = [ds.ve, ds.vn, ds.temp]

ds_stats = binned_statistics(
    coords=coords,
    data=variables,
    bins=[360,180],
    bins_range=[(-180, 180), (-90, 90)],
    dim_names=["lon", "lat"],
    output_names=["ve", "vn", "temp"],
    statistics=['mean',
        ('ve_q25', lambda x: np.percentile(x[0], q=2.5)),
        ('ve_q975', lambda x: np.percentile(x[0], q=97.5)),
        ('ve_skew', lambda x: skew(x[0], axis=None, nan_policy='omit')),
        ('vn_q25', lambda x: np.percentile(x[1], q=2.5)),
        ('vn_q975', lambda x: np.percentile(x[1], q=97.5)),
        ('vn_skew', lambda x: skew(x[1], axis=None, nan_policy='omit')),
        ('temp_q25', lambda x: np.percentile(x[2], q=2.5)),
        ('temp_q975', lambda x: np.percentile(x[2], q=97.5)),
        ('temp_skew', lambda x: skew(x[2], axis=None, nan_policy='omit')),
    ],
)

but here if you want to apply the same function to all variables, you just pass a Callable instead of a tuple(name, Callable).

coords = [ds.lon, ds.lat]
variables = [ds.ve, ds.vn, ds.temp]

ds_stats = binned_statistics(
    coords=coords,
    data=variables,
    bins=[360,180],
    bins_range=[(-180, 180), (-90, 90)],
    dim_names=["lon", "lat"],
    output_names=["ve", "vn", "temp"],
    statistics=['mean',
        lambda x: np.percentile(x[0], q=2.5),
        lambda x: np.percentile(x[0], q=97.5),
        lambda x: skew(x[0], axis=None, nan_policy='omit')
    ],
)

because it is an anonymous function, the variable names will be ve_stat_0, vn_stat_0, ve_stat_1, etc.

Let me know if you have suggestions on how to make this clearer.

@philippemiron
Contributor Author

  • add tests for bins_range applied to only a subset of variables
  • uniform docstring formatting

@selipot
Member

selipot commented Aug 13, 2025

> (quoting @philippemiron's reply above in full)

Ok, in the second option you propose, I do not think the Callable should be lambda x: np.percentile(x[0], q=2.5) but should be lambda x: np.percentile(x, q=2.5). Is this correct? Otherwise you are passing only the first element to the computation?

I don't want to ask tooooo much but I think the output I originally expected would be nicer :) or makes more sense, i.e. get ve_q25, vn_q25, etc.

@philippemiron
Contributor Author

philippemiron commented Aug 13, 2025

Yes, it's the first element of the data = [ds.ve, ds.vn, ds.temp] variables, which would be the array ds.ve in your case.

The thing is the syntax with a tuple is to do things like this: ('ke', lambda data: np.sqrt(np.mean(data[0] ** 2 + data[1] ** 2))), where we might need to use multiple variables. In that case, I required passing a "new" variable name because otherwise it would be hard to figure out automatically what is the output.

In your case, if you want to apply a function to all variables, you can pass lambda x: np.percentile(x[0], q=2.5), but because a lambda's __name__ is just "<lambda>", I decided to set the output_names automatically to stat. With a regular function, e.g. np.mean -> "mean", or a partial -> the wrapped function's name as below, the names of the output variables are set automatically.

coords = [ds.lon.values, ds.lat.values]
variables = [ds.ve.values, ds.vn.values]

def top_five_percent(x):
    return np.percentile(x, q=95)

# Compute 3D binned averages
ds_binned = binned_statistics(
    coords=coords,
    data=variables,
    bins=[180, 90],
    dim_names=("lon", "lat"),
    output_names=["ve", "vn"],
    statistics=["mean", partial(top_five_percent)],
)

variables here are set to ve_top_five_percent, vn_top_five_percent.
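The naming rule described in this exchange can be sketched as follows; stat_label is a hypothetical illustration, not the actual clouddrift internals:

```python
import numpy as np
from functools import partial

def top_five_percent(x):
    return np.percentile(x, q=95)

# Hypothetical sketch of the naming rule described above: unwrap
# functools.partial, use the callable's __name__, and fall back to
# "stat" for lambdas (whose __name__ is "<lambda>").
def stat_label(func):
    if isinstance(func, partial):
        func = func.func
    name = getattr(func, "__name__", "stat")
    return "stat" if name == "<lambda>" else name

labels = [
    stat_label(np.mean),
    stat_label(partial(top_five_percent)),
    stat_label(lambda x: x),
]
# labels -> ["mean", "top_five_percent", "stat"]
```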

@philippemiron
Contributor Author

this is good to go @KevinShuman

@KevinShuman KevinShuman merged commit 8303c08 into Cloud-Drift:main Aug 27, 2025
5 of 14 checks passed

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

⭐ binning improvements

4 participants