feat: add planetary-computer-multipart source, tests, and docs #610

Merged
floriankrb merged 6 commits into ecmwf:main from duncanmartyn:feat/planetary-computer-multipart
Apr 22, 2026

@duncanmartyn
Contributor

@duncanmartyn duncanmartyn commented Apr 14, 2026

Description

Adds a new source for multipart (multiple items and item assets) STAC collections on the open Microsoft Planetary Computer.

Design:

  • Date handling in execute: matching date(s) passed to the method to asset URIs is necessary to mitigate datetime not found warnings.
  • Source parent class rather than XarraySourceBase: requires multiple URIs and, possibly but unlikely, different storage options per asset. Also requires date matching to avoid warnings, which the current XarraySourceBase.execute does not support.
  • query.datetime config key as a repetition of the dates section: facilitates access to the dataset's datetime range in __init__, resulting in fewer STAC API queries. Querying in execute may result in as many API requests as there are timestamps in the dates.start to dates.end range given iterative invocation of the method.
  • Calendar handling: as with the rest of the project (from what I can tell), this source does not account for data not using the Gregorian calendar (e.g., CMIP6 models like UKESM1-0-LL using the 360-day calendar).
  • Parameterised test: enables testing of a collection with multiple timesteps per ABFS URI Zarr store and another with one timestep per HTTPS URL NetCDF file using identical test code.
  • Scoped to the open Planetary Computer: generalising to arbitrary STAC spec compliant catalogues would require parameterising the endpoint URL and authentication.
  • Placed in the same file as the original planetary-computer source due to conceptual similarity and shared dependencies - happy to move to its own file if preferred.
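
As an illustration of the date-matching point in the design list, here is a minimal sketch. It assumes each asset covers a known time range; the `AssetRef` record, the `match_dates_to_assets` helper, and the example URIs are hypothetical and are not taken from the code in this PR.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AssetRef:
    """Hypothetical record: one asset URI and the time range it covers."""

    uri: str
    start: datetime
    end: datetime


def match_dates_to_assets(dates, assets):
    """Group requested dates by the first asset whose range contains them.

    Dates without a matching asset are returned separately, so the caller
    can report them once instead of opening assets that do not hold them.
    """
    matched = {}
    unmatched = []
    for date in dates:
        for asset in assets:
            if asset.start <= date <= asset.end:
                matched.setdefault(asset.uri, []).append(date)
                break
        else:
            # No asset covers this date.
            unmatched.append(date)
    return matched, unmatched


# Hypothetical assets, one store per year.
assets = [
    AssetRef("abfs://example/2020.zarr", datetime(2020, 1, 1), datetime(2020, 12, 31, 23)),
    AssetRef("abfs://example/2021.zarr", datetime(2021, 1, 1), datetime(2021, 12, 31, 23)),
]
matched, unmatched = match_dates_to_assets(
    [datetime(2020, 6, 1), datetime(2021, 6, 1), datetime(2022, 6, 1)],
    assets,
)
```

With a mapping like `matched`, only the stores that actually contain a requested date need to be opened, which is the behaviour the first design bullet is after.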

Changes:

Two new dependencies are required by this change to handle remote NetCDF files with Xarray: h5netcdf and h5py.

What problem does this change solve?

The existing planetary-computer source pertains only to STAC collections for which there is a collection-level dataset asset under the zarr-abfs key corresponding to a single Zarr store containing all data. This source enables the use of collections in which data are in separate files or stores and referenced in distinct items and assets thereof.

What issue or task does this change relate to?

Additional notes

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.


📚 Documentation preview 📚: https://anemoi-datasets--610.org.readthedocs.build/en/610/

@github-project-automation github-project-automation Bot moved this to To be triaged in Anemoi-dev Apr 14, 2026
@github-actions github-actions Bot added contributor documentation Improvements or additions to documentation dependencies Pull requests that update a dependency file tests labels Apr 14, 2026
@duncanmartyn duncanmartyn force-pushed the feat/planetary-computer-multipart branch from f0e9d89 to ac78c5c Compare April 14, 2026 15:16
@floriankrb
Member

floriankrb commented Apr 15, 2026

I understand you need something more to have auto-discovery of the different parts of the data, so the current planetary-computer source is not enough. Thank you for sharing this code; it is likely to be helpful for others and should be merged.

It would be nice to keep only one source, though. Could you please refactor to include the multipart code in the same source and do the branching with an option?

planetary-computer:
  mode: direct
  ...

and

planetary-computer:
  mode: multipart
  ...

Feel free to use the appropriate vocabulary; perhaps target, multipart, and mode are not the right terms to use.

@duncanmartyn
Contributor Author

Hi Florian, thanks for taking a look! My thinking behind not adding it to the original source was that the multipart source doesn't (can't) use XarraySourceBase. I wanted to avoid a scenario where the use of the parent class varies from wholly used, for the single-part zarr-abfs collections, to not at all (beyond passing context), for the multipart collections that need a different execute method.

If I've got that wrong or you're happy with the above then I can definitely look into it. Thinking we could avoid a "mode" or similar config parameter by auto-detecting. That is, if the zarr-abfs key exists, use that, if not, assume multipart. Removes the burden of identifying the right mode from the user but being explicit might be preferable.
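
A rough sketch of the auto-detection idea, assuming the collection's assets are available as a mapping keyed by asset name. The `detect_mode` helper and the labels it returns are hypothetical names for illustration only; the `zarr-abfs` key is the one described above.

```python
def detect_mode(collection_assets):
    """Pick the access path from the collection-level assets.

    Returns "single" when a collection-level "zarr-abfs" asset exists,
    otherwise "multipart". Both labels are hypothetical.
    """
    return "single" if "zarr-abfs" in collection_assets else "multipart"


# Hypothetical asset mappings, shaped like the "assets" field of a STAC collection.
single = detect_mode({"zarr-abfs": {"href": "abfs://example/store.zarr"}})
multi = detect_mode({"thumbnail": {"href": "https://example.invalid/thumb.png"}})
```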

@floriankrb
Member

> That is, if the zarr-abfs key exists, use that, if not, assume multipart.

It looks like a good idea, yes.

Sorry for insisting on this, I understand that the code may become more complex, but having a simpler interface for the user is more important imho.

@duncanmartyn duncanmartyn force-pushed the feat/planetary-computer-multipart branch from 5f51db4 to 48e4362 Compare April 16, 2026 12:46
@duncanmartyn
Contributor Author

duncanmartyn commented Apr 16, 2026

Not a problem, makes sense. Pending CI checks, I've consolidated the sources and updated the docs and tests:

  • Couldn't get flavour and rules working with the alternatively named level dimension for the met-office-global-deterministic-pressure test. In the data, the level dimension is named "pressure", and unless I passed a "pressure" parameter in the config (in place of "levels"), it wouldn't work. If you know the fix it'd be great for me to remedy that.
  • Retained the original data_catalog_id arg rather than collection_id for backwards compatibility.
  • Changed filtering to explicit CQL2 in the config, with support for string or dict formatting. This gives more flexibility than the previous approach, which auto-built a CQL2 query limited to AND and "=" filters.
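
To illustrate the string-or-dict filter support, here is a minimal sketch of how a config value could be mapped to search arguments. The `normalise_cql2_filter` helper and the `variable` property are hypothetical; the returned keys mirror the `filter` and `filter_lang` arguments of a pystac-client search, which is an assumption about how the query is issued rather than a description of the code in this PR.

```python
import json


def normalise_cql2_filter(value):
    """Map a config value to hypothetical STAC search keyword arguments.

    A string is treated as CQL2 text and a dict as CQL2 JSON.
    """
    if value is None:
        return {}
    if isinstance(value, str):
        return {"filter": value, "filter_lang": "cql2-text"}
    if isinstance(value, dict):
        json.dumps(value)  # Fail early if the dict is not JSON-serialisable.
        return {"filter": value, "filter_lang": "cql2-json"}
    raise TypeError(f"Unsupported CQL2 filter type: {type(value).__name__}")


# An OR filter, which an auto-built AND/"=" query could not express.
text_kwargs = normalise_cql2_filter("variable = 't2m' OR variable = 'u10'")
json_kwargs = normalise_cql2_filter(
    {
        "op": "or",
        "args": [
            {"op": "=", "args": [{"property": "variable"}, "t2m"]},
            {"op": "=", "args": [{"property": "variable"}, "u10"]},
        ],
    }
)
```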

EDIT: in any case, the test I mentioned looks to make CI significantly slower. Happy to remove as it may be surplus anyway.
EDIT2: removed the test. Prior to this it passed as expected in the 3.11 and 3.12 checks but stalled during 3.13 for some reason.

@floriankrb floriankrb self-requested a review April 22, 2026 09:57
@github-project-automation github-project-automation Bot moved this from To be triaged to For merging in Anemoi-dev Apr 22, 2026
@floriankrb
Member

Approved.

This should be merged as soon as the branch is updated. @duncanmartyn, I will let you update or rebase.

As a follow-up, perhaps in another PR, I wonder whether this could be extended to, or share code with, a STAC source.

@duncanmartyn duncanmartyn force-pushed the feat/planetary-computer-multipart branch from 95539b0 to 504a85c Compare April 22, 2026 10:26
@duncanmartyn
Contributor Author

@floriankrb thanks, rebased. Looks like it needs an ATS label if you're able to add that, please.

On the shared / extended source, agreed. It's mainly a question of the differences between static and dynamic catalogues, of which Planetary Computer is the latter, but I'll look into it!

@floriankrb floriankrb merged commit 42117db into ecmwf:main Apr 22, 2026
14 of 16 checks passed
@github-project-automation github-project-automation Bot moved this from For merging to Done in Anemoi-dev Apr 22, 2026
floriankrb pushed a commit that referenced this pull request Apr 22, 2026
🤖 Automated Release PR

This PR was created by `release-please` to prepare the next release.
Once merged:

1. A new version tag will be created
2. A GitHub release will be published
3. The changelog will be updated

Changes to be included in the next release:
---


## [0.5.36](0.5.35...0.5.36) (2026-04-22)


### Features

* Add CycleIntervalProvider and set_start_step_to_zero patch
([#564](#564))
([2c8824c](2c8824c))
* Add planetary-computer-multipart source, tests, and docs
([#610](#610))
([42117db](42117db))
* **create:** Add workaround for missing data at step zero
([#565](#565))
([9fd4733](9fd4733))
* Fetch files from ecfs if path starts with ec: or ectmp:
([#585](#585))
([9fb443a](9fb443a))
* Fix issue 569
([#574](#574))
([7f4e40a](7f4e40a))
* Fix typo with duplicates
([#580](#580))
([f33333e](f33333e))
* Make anemoi-datasets agnostic to Zarr version (Optional support Zarr3)
([#220](#220))
([ab8cd71](ab8cd71))
* Observations feature branch
([#480](#480))
([92d5ac9](92d5ac9))
* Open datasets analytics
([#576](#576))
([561dbd2](561dbd2))
* Remove https test
([#608](#608))
([048e419](048e419))


### Bug Fixes

* **create:** Repeated-dates
([#572](#572))
([b73d533](b73d533))
* Example accumulations section to user current accumulate API
([#601](#601))
([9434007](9434007))
* Fix corner cases
([#594](#594))
([bdd31ff](bdd31ff))
* Fix race condition during build
([#593](#593))
([66e2070](66e2070))
* Fix read ahead while building
([#611](#611))
([6d18e5e](6d18e5e))
* Fix weatherbench test
([#609](#609))
([f434a15](f434a15))
* **grib-index:** Support querying float values
([#520](#520))
([b089cd2](b089cd2))
* Improve MARS request handling for forecast datasets
([#562](#562))
([f9efe39](f9efe39))
* Make dataset naming function public
([#579](#579))
([b089bb0](b089bb0))
* Netcdf date/time metadata type should be int
([#555](#555))
([9937fbe](9937fbe))
* Propagate resolution metadata when using anemoi_dataset source
([#614](#614))
([784695c](784695c))
* Remove duplicate code
([#590](#590))
([8e54420](8e54420))
* Remove empty accumulators from accumulation computation
([#561](#561))
([3bc087d](3bc087d))
* Replace pydantic class Config with ConfigDict
([#592](#592))
([ce6b2ff](ce6b2ff))
* Rolling average regression
([#587](#587))
([04f5b0b](04f5b0b))


### Documentation

* Docs minor fixes update concat yaml
([#539](#539))
([dd73fda](dd73fda))

---
> [!IMPORTANT]
> Please do not change the PR title, manifest file, or any other
automatically generated content in this PR unless you understand the
implications. Changes here can break the release process.
> ⚠️ Merging this PR will:
> - Create a new release
> - Trigger deployment pipelines
> - Update package versions

 **Before merging:**
 - Ensure all tests pass
 - Review the changelog carefully
 - Get required approvals

[Release-please
documentation](https://github.com/googleapis/release-please)

Labels

ATS Approval not needed contributor dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation tests

Projects

Status: Done
