
⭐ added various functions to the binning module #567

Merged
KevinShuman merged 96 commits into Cloud-Drift:main from philippemiron:binning-plus
Aug 27, 2025
Conversation

@philippemiron
Contributor

@philippemiron philippemiron commented Jun 28, 2025

Created this since I have worked on extending the binning module (closes #565).

  • extend to more statistics than just the current mean (see twodstats.m)
  • automatic handling of datetime in coordinates/variables
  • add new tests for the other functions
  • think about how we can abstract the "histogram" part so we could use other methods like polynomial fitting (another PR)
  • fix anonymous function names and name collisions

@philippemiron philippemiron requested a review from Copilot August 4, 2025 12:54

This comment was marked as outdated.

@KevinShuman
Collaborator

What is the expected output for the test above? Running it, I am getting the following output/error:

https://www.aoml.noaa.gov/phod/float_traj/files/allFloats_12122017.mat: 100%|██████████| 27.8k/27.8k [00:06<00:00, 4.58MB/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 12
      9 variables = [ds_float.ve, ds_float.vn]
     11 # Compute 3D binned averages
---> 12 ds = binned_statistics(
     13     coords=coords,
     14     data=variables,
     15     bins=[75, 30, 10],
     16     dim_names=("lon", "lat", "time"),
     17     output_names=["ve", "vn"],
     18     statistics=["count", "sum", "mean", "median", "std", "min", "max"],
     19 )
     21 ds["vn_median"].isel(time=9).plot(x="lon", y="lat", cmap="viridis")

File /vol/clouddrift/clouddrift/binning.py:523, in binned_statistics(coords, data, bins, bins_range, dim_names, output_names, statistics)
    520 D, N = coords.shape
    522 # validate coordinates are finite
--> 523 if any(~np.isfinite(c).all() for c in coords):
    524     raise ValueError("Coordinates must be finite values.")
    526 # V, VN = number of variables and number of data points per variable

File /vol/clouddrift/clouddrift/binning.py:523, in <genexpr>(.0)
    520 D, N = coords.shape
    522 # validate coordinates are finite
--> 523 if any(~np.isfinite(c).all() for c in coords):
    524     raise ValueError("Coordinates must be finite values.")
    526 # V, VN = number of variables and number of data points per variable

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The following tests are also failing because of this:

FAILED tests/binning_test.py::binning_tests::test_statistics_datetime_coords - TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according ...
FAILED tests/binning_test.py::binning_tests::test_statistics_datetime_data - TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according ...
FAILED tests/binning_test.py::binning_tests::test_statistics_datetime_data_sum - TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according ...

@philippemiron
Contributor Author

Sorry, I didn't realize this. Will try fixing right now.

@philippemiron philippemiron requested a review from Copilot August 9, 2025 00:25
Copilot AI left a comment

Pull Request Overview

This PR extends the binning module functionality to support various statistical computations beyond just counting and mean calculations. The changes enable automatic handling of datetime coordinates/variables and add comprehensive testing for the new features.

  • Extended statistics support to include count, sum, mean, median, std, min, max, and custom functions
  • Added automatic datetime handling in coordinates and data variables
  • Enhanced naming conventions and collision avoidance for output variables

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
tests/binning_test.py Adds comprehensive tests for new statistics functions, datetime handling, and error cases
clouddrift/binning.py Implements new statistics functions, datetime conversion utilities, and enhanced parameter validation
.github/workflows/ci.yml Updates CI workflow conditions for coverage reporting

Member

@selipot selipot left a comment


I put comments inline

- A list of ints or arrays: one per dimension, specifying either bin count or bin edges,
- None: defaults to 10 bins per dimension.
bins_range : list of tuples, optional
Outer bin limits for each dimension.
Member


Is it possible to provide a range for only a subset of the dimensions or do we have to provide the ranges for all the dimensions when this optional argument is given?

Contributor Author

@philippemiron philippemiron Aug 13, 2025


You can provide a range for just a subset of the variables. I actually need to add tests for this. If you pass [[-90, 90], None], it should apply the range only to the first variable and take the min/max for the second one.
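The fallback rule described here can be sketched with a hypothetical helper (resolve_ranges is not part of clouddrift; it only illustrates how a None entry could default to the coordinate's min/max):

```python
import numpy as np

# Hypothetical helper illustrating the rule above: a None entry in
# bins_range falls back to that coordinate's min/max.
def resolve_ranges(coords, bins_range):
    resolved = []
    for c, r in zip(coords, bins_range):
        if r is None:
            resolved.append((float(np.min(c)), float(np.max(c))))
        else:
            resolved.append(tuple(r))
    return resolved

lat = np.array([-45.0, 0.0, 60.0])
depth = np.array([5.0, 50.0, 500.0])
ranges = resolve_ranges([lat, depth], [[-90, 90], None])
# ranges -> [(-90, 90), (5.0, 500.0)]
```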

- a tuple of (output_name, callable) for multivariate statistics. 'output_name' is used to identify the resulting variable.
In this case, the callable will receive the list of arrays provided in `data`. For example, to calculate kinetic energy,
you can pass `data = [u, v]` and `statistics=("ke", lambda data: np.sqrt(np.mean(data[0] ** 2 + data[1] ** 2)))`.
- a list containing any combination of the above, e.g., ['mean', np.nanmax, ('ke', lambda data: np.sqrt(np.mean(data[0] ** 2 + data[1] ** 2)))].
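As a toy check of the multivariate form above, the callable receives the full list of data arrays (here data = [u, v]); the values are made up for illustration:

```python
import numpy as np

# Toy check of the multivariate statistic from the docstring: the
# callable receives the list of data arrays, here data = [u, v].
u = np.array([3.0, 3.0])
v = np.array([4.0, 4.0])

ke = lambda data: np.sqrt(np.mean(data[0] ** 2 + data[1] ** 2))
# ke([u, v]) -> 5.0  (mean of 9 + 16 per point is 25; sqrt is 5)
```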
Member


Here you suggest np.nanmax because the standard 'max' statistic will return np.nan if the data contain np.nan values?

Contributor Author


To be fair, I didn't think about this. I just included a random numpy function as an example.
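For reference, the NaN behavior behind the question above:

```python
import numpy as np

# The distinction raised above: np.max propagates NaN, while
# np.nanmax ignores NaN values.
x = np.array([1.0, np.nan, 3.0])

plain_max = np.max(x)    # nan: NaN propagates through the reduction
nan_max = np.nanmax(x)   # 3.0: NaN values are ignored
```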

+ statistics_func
)

if statistics and not data.size:
Member


Mypy seems to complain about the .size attribute of data?

Contributor Author


does it? I don't see this in the tests below?

@selipot
Member

selipot commented Aug 13, 2025

If I do the following:

coords = [ds.lon, ds.lat]
variables = [ds.ve, ds.vn, ds.temp]

ds_stats = binned_statistics(
    coords=coords,
    data=variables,
    bins=[360,180],
    bins_range=[(-180, 180), (-90, 90)],
    dim_names=["lon", "lat"],
    output_names=["ve", "vn", "temp"],
    statistics=['mean',
        ('q25', lambda x: np.percentile(x, q=2.5)),
        ('q975', lambda x: np.percentile(x, q=97.5)),
        ('skew', lambda x: skew(x, axis=None, nan_policy='omit')),
    ],
)

only the mean statistic gets applied to each of the variables. I was expecting the custom callables q25, q975, and skew to be applied to each. I'm not sure to which variable these were applied?

[Screenshot 2025-08-13 at 10 29 10 AM]

@philippemiron
Contributor Author

philippemiron commented Aug 13, 2025

Hi @selipot,

In the end I don't think anything is wrong, just maybe confusing. If you use the tuple in statistics, this is for the multivariate statistics. In that situation, the x will be a list containing all the data variables. So you would have to do something like this:

coords = [ds.lon, ds.lat]
variables = [ds.ve, ds.vn, ds.temp]

ds_stats = binned_statistics(
    coords=coords,
    data=variables,
    bins=[360,180],
    bins_range=[(-180, 180), (-90, 90)],
    dim_names=["lon", "lat"],
    output_names=["ve", "vn", "temp"],
    statistics=['mean',
        ('ve_q25', lambda x: np.percentile(x[0], q=2.5)),
        ('ve_q975', lambda x: np.percentile(x[0], q=97.5)),
        ('ve_skew', lambda x: skew(x[0], axis=None, nan_policy='omit')),
        ('vn_q25', lambda x: np.percentile(x[1], q=2.5)),
        ('vn_q975', lambda x: np.percentile(x[1], q=97.5)),
        ('vn_skew', lambda x: skew(x[1], axis=None, nan_policy='omit')),
        ('temp_q25', lambda x: np.percentile(x[2], q=2.5)),
        ('temp_q975', lambda x: np.percentile(x[2], q=97.5)),
        ('temp_skew', lambda x: skew(x[2], axis=None, nan_policy='omit')),
    ],
)

but here if you want to apply the same function to all variables, you just pass a Callable instead of a tuple(name, Callable).

coords = [ds.lon, ds.lat]
variables = [ds.ve, ds.vn, ds.temp]

ds_stats = binned_statistics(
    coords=coords,
    data=variables,
    bins=[360,180],
    bins_range=[(-180, 180), (-90, 90)],
    dim_names=["lon", "lat"],
    output_names=["ve", "vn", "temp"],
    statistics=['mean',
        lambda x: np.percentile(x[0], q=2.5),
        lambda x: np.percentile(x[0], q=97.5),
        lambda x: skew(x[0], axis=None, nan_policy='omit')
    ],
)

because it is an anonymous function, the variable names will be ve_stat_0, vn_stat_0, ve_stat_1, etc.

Let me know if you have suggestions on how to make this clearer.

@philippemiron
Contributor Author

  • add tests for bins_range applied to only a subset of variables
  • uniform docstring formatting

@selipot
Member

selipot commented Aug 13, 2025

> (quoting @philippemiron's reply above in full)

Ok, in the second option you propose, I do not think the Callable should be lambda x: np.percentile(x[0], q=2.5) but should be lambda x: np.percentile(x, q=2.5). Is this correct? Otherwise you are passing only the first element to the computation?

I don't want to ask tooooo much but I think the output I originally expected would be nicer :) or makes more sense, i.e. get ve_q25, vn_q25, etc.

@philippemiron
Contributor Author

philippemiron commented Aug 13, 2025

Yes, it's the first element of the data = [ds.ve, ds.vn, ds.temp] variables, which would be the array ds.ve in your case.

The thing is the syntax with a tuple is to do things like this: ('ke', lambda data: np.sqrt(np.mean(data[0] ** 2 + data[1] ** 2))), where we might need to use multiple variables. In that case, I required passing a "new" variable name because otherwise it would be hard to figure out automatically what is the output.

In your case, if you want to apply a function to all variables, you can pass lambda x: np.percentile(x[0], q=2.5), but because a lambda's __name__ is just "<lambda>", I decided to set the output_names automatically to stat. With a regular function, e.g. np.mean -> "mean", or a partial -> the wrapped function's name as below, the names of the output variables are set automatically.

coords = [ds.lon.values, ds.lat.values]
variables = [ds.ve.values, ds.vn.values]

def top_five_percent(x):
    return np.percentile(x, q=95)

# Compute 3D binned averages
ds_binned = binned_statistics(
    coords=coords,
    data=variables,
    bins=[180, 90],
    dim_names=("lon", "lat"),
    output_names=["ve", "vn"],
    statistics=["mean", partial(top_five_percent)],
)

variables here are set to ve_top_five_percent, vn_top_five_percent.
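The naming rule described in this exchange can be sketched as follows; stat_label is a hypothetical illustration, not the actual clouddrift internals:

```python
import numpy as np
from functools import partial

def top_five_percent(x):
    return np.percentile(x, q=95)

# Hypothetical sketch of the naming rule described above: unwrap
# functools.partial, use the callable's __name__, and fall back to
# "stat" for lambdas (whose __name__ is "<lambda>").
def stat_label(func):
    if isinstance(func, partial):
        func = func.func
    name = getattr(func, "__name__", "stat")
    return "stat" if name == "<lambda>" else name

labels = [
    stat_label(np.mean),
    stat_label(partial(top_five_percent)),
    stat_label(lambda x: x),
]
# labels -> ["mean", "top_five_percent", "stat"]
```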

@philippemiron
Contributor Author

this is good to go @KevinShuman

@KevinShuman KevinShuman merged commit 8303c08 into Cloud-Drift:main Aug 27, 2025
5 of 14 checks passed

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

⭐ binning improvements

4 participants