Add gdp_sensor module for processing Global Drifter Program sensor #564

Draft
selipot wants to merge 2 commits into Cloud-Drift:main from selipot:gdp-s-adapter

@selipot selipot commented Jun 4, 2025

This PR potentially adds an adapter for the GDP s files. It still has a number of issues. After downloading the s files locally from https://www.aoml.noaa.gov/ftp/pub/phod/pub/pazos/data/shane/sst, I am testing the code with:

from clouddrift.adapters.gdp import gdp_sensor
ra = gdp_sensor.to_raggedarray(tmp_path='/Users/selipot/Data/drifters/raw/',skip_download=True)

But I first had to make the following manual edits.
On line 3791269 of buoydata_1_5000_edited_sfiles.data, I deleted this record:

7720663   10 14.098 1996   1000.00   1000.00   1000.00    233.41    270.72**********

Lines 3808058 and 3808059 contain the malformed string 2.00-111706.63, which I deleted manually.

I also manually deleted line 3851218:

7720673   10 19.099 1996   1000.00   1000.00      2.00**********   9145.48    184.70
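
The manual edits above could in principle be replaced by an automated pre-scan. The following is a minimal sketch and not part of the PR: `find_suspect_lines` and both regular expressions are hypothetical. It assumes the s files are whitespace-separated numeric columns, that a run of asterisks marks a value that overflowed its fixed-width field, and that a digit directly followed by a minus sign marks two fields fused together. The first sample line is the record quoted above; the other two are made up for illustration.

```python
import re

# Assumption: a run of asterisks is a fixed-width overflow marker.
OVERFLOW = re.compile(r"\*{2,}")
# Assumption: a digit directly followed by "-digit" means two fused fields,
# as in the string "2.00-111706.63" quoted above.
FUSED = re.compile(r"\d-\d")


def find_suspect_lines(lines):
    """Return (1-based line number, reason) for each malformed-looking line."""
    suspects = []
    for lineno, line in enumerate(lines, start=1):
        if OVERFLOW.search(line):
            suspects.append((lineno, "overflow"))
        elif FUSED.search(line):
            suspects.append((lineno, "fused"))
    return suspects


sample = [
    # Record quoted in the comment above.
    "7720663   10 14.098 1996   1000.00   1000.00   1000.00    233.41    270.72**********",
    # Hypothetical well-formed record.
    "7720663   10 14.140 1996   1000.00   1000.00   1000.00    233.41    270.72     12.00",
    # Hypothetical record built around the fused string quoted above.
    "7720670   10 15.000 1996   1000.00   1000.00      2.00-111706.63    100.00    184.70",
]

suspects = find_suspect_lines(sample)
print(suspects)  # [(1, 'overflow'), (3, 'fused')]
```

To scan a real file, the same function could be given an open file handle, since iterating over a file yields its lines one at a time without loading the whole file into memory.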

I then run into the following error:

_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.13/concurrent/futures/process.py", line 254, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/Users/selipot/projects.git/clouddrift/clouddrift/adapters/gdp/gdp_sensor.py", line 371, in _process_chunk
    df_chunk = _apply_remove(
        preremove_df_chunk,
    ...<11 lines>...
        ],
    )
  File "/Users/selipot/projects.git/clouddrift/clouddrift/adapters/gdp/gdp_sensor.py", line 317, in _apply_remove
    mask = filter_(temp_df)
  File "/Users/selipot/projects.git/clouddrift/clouddrift/adapters/gdp/gdp_sensor.py", line 377, in <lambda>
    lambda df: (df["senObsYear"] > datetime.datetime.now().year)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.13/site-packages/pandas/core/ops/common.py", line 76, in new_method
    return method(self, other)
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.13/site-packages/pandas/core/arraylike.py", line 56, in __gt__
    return self._cmp_method(other, operator.gt)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.13/site-packages/pandas/core/series.py", line 6119, in _cmp_method
    res_values = ops.comparison_op(lvalues, rvalues, op)
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.13/site-packages/pandas/core/ops/array_ops.py", line 344, in comparison_op
    res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
  File "/opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.13/site-packages/pandas/core/ops/array_ops.py", line 129, in comp_method_OBJECT_ARRAY
    result = libops.scalar_compare(x.ravel(), y, op)
  File "ops.pyx", line 107, in pandas._libs.ops.scalar_compare
TypeError: '>' not supported between instances of 'str' and 'int'
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 ra = gdp_sensor.to_raggedarray(tmp_path='/Users/selipot/Data/drifters/raw/',skip_download=True)

File ~/projects.git/clouddrift/clouddrift/adapters/gdp/gdp_sensor.py:630, in to_raggedarray(tmp_path, skip_download, max, chunk_size, use_fill_values, max_chunks)
    627 gdp_metadata_df = get_gdp_metadata(tmp_path)
    629 # Run async process to parallelize data processing.
--> 630 drifter_datasets = asyncio.run(
    631     _parallel_get(
    632         [dst for (_, dst) in requests],
    633         gdp_metadata_df,
    634         chunk_size,
    635         tmp_path,
    636         use_fill_values,
    637         max_chunks,
    638     )
    639 )
    641 # Sort the drifters by their start date.
    642 deploy_date_id_map = {
    643     ds["id"].data[0]: ds["start_date"].data[0] for ds in drifter_datasets
    644 }

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.13/asyncio/runners.py:195, in run(main, debug, loop_factory)
    191     raise RuntimeError(
    192         "asyncio.run() cannot be called from a running event loop")
    194 with Runner(debug=debug, loop_factory=loop_factory) as runner:
--> 195     return runner.run(main)

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.13/asyncio/runners.py:118, in Runner.run(self, coro, context)
    116 self._interrupt_count = 0
    117 try:
--> 118     return self._loop.run_until_complete(task)
    119 except exceptions.CancelledError:
    120     if self._interrupt_count > 0:

File /opt/homebrew/Caskroom/mambaforge/base/envs/clouddrift/lib/python3.13/asyncio/base_events.py:719, in BaseEventLoop.run_until_complete(self, future)
    716 if not future.done():
    717     raise RuntimeError('Event loop stopped before Future completed.')
--> 719 return future.result()

File ~/projects.git/clouddrift/clouddrift/adapters/gdp/gdp_sensor.py:538, in _parallel_get(sources, gdp_metadata_df, chunk_size, tmp_path, use_fill_values, max_chunks)
    536     chunk = jobmap[ajob]
    537     _logger.warn(f"bad chunk detected, exception: {ajob.exception()}")
--> 538     raise exc
    540 job_drifter_ds_map: dict[int, xr.Dataset] = ajob.result()
    541 for id_ in job_drifter_ds_map.keys():

TypeError: '>' not supported between instances of 'str' and 'int'

@selipot selipot self-assigned this Jun 4, 2025
@selipot selipot added the enhancement New feature or request label Jun 4, 2025
selipot commented Jun 4, 2025

Not sure why the year is read as a string?
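
One possible explanation, which this thread does not confirm: the traceback goes through `comp_method_OBJECT_ARRAY`, so the `senObsYear` column has object dtype, and pandas keeps a column as strings when at least one value in it cannot be parsed as a number (for example an asterisk run like the ones in the deleted records). The sketch below is an assumption-laden illustration and not the adapter's code: the sample values are made up, and coercing with `pd.to_numeric(..., errors="coerce")` is only one way to make the year comparison well defined.

```python
import datetime

import pandas as pd

# Hypothetical chunk: one unparseable token forces the column to stay as strings.
df = pd.DataFrame(
    {"senObsYear": pd.Series(["1996", "1996", "**********", "2999"], dtype=object)}
)

# On this object column, `df["senObsYear"] > datetime.datetime.now().year`
# raises the TypeError shown in the traceback ('str' vs 'int').

# Coerce to numbers first; values that cannot be parsed become NaN.
year = pd.to_numeric(df["senObsYear"], errors="coerce")

# Flag rows whose year is unparseable or lies in the future.
bad_year = year.isna() | (year > datetime.datetime.now().year)

# Keep the remaining rows and store the year as an integer.
clean = df.loc[~bad_year].assign(senObsYear=year[~bad_year].astype("int64"))
print(clean)
```

Coercing before filtering means a single corrupt record drops out of the chunk as a flagged row instead of aborting the whole parallel job.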

@selipot selipot marked this pull request as draft June 5, 2025 17:02
codecov Bot commented Jun 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

selipot commented Jun 5, 2025

Thanks @KevinShuman! Your modifications allowed me to complete the process and create a ragged array from the s files. I will now spend some time checking whether the result makes sense.
