Commit f9eff5a

Begin reworking the Usage page in the docs (#213)
* Begin reworking the Usage page in the docs
* Check for count or rowsize in subset
* Subset to a smaller dataset in Usage
* Fix ordered list in docstring
* Switch to sphinx-book-theme
* Update copyright year
* Complete the mini-tutorial
1 parent 3898938 commit f9eff5a

File tree

4 files changed, +257 -24 lines changed

clouddrift/analysis.py

Lines changed: 17 additions & 8 deletions
@@ -696,12 +696,9 @@ def velocity_from_position(
 
     Difference scheme can take one of three values:
 
-    1. "forward" (default): finite difference is evaluated as
-       dx[i] = dx[i+1] - dx[i];
-    2. "backward": finite difference is evaluated as
-       dx[i] = dx[i] - dx[i-1];
-    3. "centered": finite difference is evaluated as
-       dx[i] = (dx[i+1] - dx[i-1]) / 2.
+    #. "forward" (default): finite difference is evaluated as ``dx[i] = dx[i+1] - dx[i]``;
+    #. "backward": finite difference is evaluated as ``dx[i] = dx[i] - dx[i-1]``;
+    #. "centered": finite difference is evaluated as ``dx[i] = (dx[i+1] - dx[i-1]) / 2``.
 
     Forward and backward schemes are effectively the same except that the
     position at which the velocity is evaluated is shifted one element down in
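The three schemes in the docstring above can be sketched with NumPy (a minimal illustration of the formulas only, not clouddrift's implementation, which also divides by the time step and handles spherical coordinates):

```python
import numpy as np

# Toy positions; a real velocity would also divide by the time differences.
x = np.array([0.0, 1.0, 3.0, 6.0])

# "forward":  dx[i] = x[i+1] - x[i], aligned with the left-hand point
forward = x[1:] - x[:-1]

# "backward": dx[i] = x[i] - x[i-1], same values, aligned with the right-hand point
backward = x[1:] - x[:-1]

# "centered": dx[i] = (x[i+1] - x[i-1]) / 2, defined only at interior points
centered = (x[2:] - x[:-2]) / 2

print(forward)   # [1. 2. 3.]
print(centered)  # [1.5 2.5]
```

Forward and backward differences produce the same numbers; only their alignment with the position array differs, which is exactly the point the docstring makes.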
@@ -977,6 +974,18 @@ def subset(ds: xr.Dataset, criteria: dict) -> xr.Dataset:
     ValueError
         If one of the variable in a criterion is not found in the Dataset
     """
+    # Normally we expect the ragged-array dataset to have a "count" variable.
+    # However, some datasets may have a "rowsize" variable instead, e.g. if they
+    # have not gotten up to speed with our new convention. We check for both.
+    if "count" in ds.variables:
+        count_var = "count"
+    elif "rowsize" in ds.variables:
+        count_var = "rowsize"
+    else:
+        raise ValueError(
+            "Ragged-array Dataset ds must have a 'count' or 'rowsize' variable."
+        )
+
     mask_traj = xr.DataArray(data=np.ones(ds.dims["traj"], dtype="bool"), dims=["traj"])
     mask_obs = xr.DataArray(data=np.ones(ds.dims["obs"], dtype="bool"), dims=["obs"])

@@ -990,7 +999,7 @@ def subset(ds: xr.Dataset, criteria: dict) -> xr.Dataset:
             raise ValueError(f"Unknown variable '{key}'.")
 
     # remove data when trajectories are filtered
-    traj_idx = np.insert(np.cumsum(ds["count"].values), 0, 0)
+    traj_idx = np.insert(np.cumsum(ds[count_var].values), 0, 0)
     for i in np.where(~mask_traj)[0]:
         mask_obs[slice(traj_idx[i], traj_idx[i + 1])] = False
 

@@ -1006,7 +1015,7 @@ def subset(ds: xr.Dataset, criteria: dict) -> xr.Dataset:
     # apply the filtering for both dimensions
     ds_sub = ds.isel({"traj": mask_traj, "obs": mask_obs})
     # update the count
-    ds_sub["count"].values = segment(
+    ds_sub[count_var].values = segment(
         ds_sub.ids, 0.5, count=segment(ds_sub.ids, -0.5)
     )
     return ds_sub
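The new fallback and the `traj_idx` bookkeeping it feeds can be illustrated in isolation (a hedged sketch using a plain dict in place of the xarray `Dataset`; the variable names mirror the diff):

```python
import numpy as np

# Toy stand-in for the Dataset's variables mapping: two trajectories
# of 3 and 2 observations ("count" holds the per-trajectory lengths).
variables = {"count": np.array([3, 2])}

# Same fallback logic as in subset(): prefer "count", accept legacy "rowsize".
if "count" in variables:
    count_var = "count"
elif "rowsize" in variables:
    count_var = "rowsize"
else:
    raise ValueError("ds must have a 'count' or 'rowsize' variable.")

# Prepending 0 to the cumulative sum yields each trajectory's start/end
# offsets within the flat "obs" axis.
traj_idx = np.insert(np.cumsum(variables[count_var]), 0, 0)
print(traj_idx)  # [0 3 5]: trajectory i spans obs[traj_idx[i]:traj_idx[i+1]]
```

This is why `subset` can mask whole trajectories out of the flat `obs` dimension: each slice `traj_idx[i]:traj_idx[i + 1]` delimits one contiguous trajectory.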

docs/conf.py

Lines changed: 2 additions & 4 deletions
@@ -19,7 +19,7 @@
 # -- Project information -----------------------------------------------------
 
 project = "CloudDrift"
-copyright = "2022, CloudDrift"
+copyright = "2022-2023, CloudDrift"
 author = "Philippe Miron"
 
 # -- General configuration ---------------------------------------------------
@@ -49,9 +49,7 @@
 
 # The theme to use for HTML and HTML Help pages. See the documentation for
 # a list of builtin themes.
-#
-html_theme = "pydata_sphinx_theme"  # alabaster, sphinx_rtd_theme
-# html_theme = "sphinx_rtd_theme"
+html_theme = "sphinx_book_theme"  # alabaster, sphinx_rtd_theme
 
 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,

docs/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 sphinx
-pydata_sphinx_theme
+sphinx-book-theme

docs/usage.rst

Lines changed: 237 additions & 11 deletions
@@ -3,7 +3,240 @@
 Usage
 =====
 
-CloudDrift provides an easy way to convert Lagrangian datasets into
+The CloudDrift library provides functions for:
+
+* Easy access to cloud-ready Lagrangian ragged-array datasets;
+* Common Lagrangian analysis tasks on ragged arrays;
+* Adapting custom Lagrangian datasets into ragged arrays.
+
+Let's start by importing the library and accessing a ready-to-use ragged-array
+dataset.
+
+Accessing ragged-array Lagrangian datasets
+------------------------------------------
+
+We recommend importing ``clouddrift`` using the ``cd`` shorthand for convenience:
+
+>>> import clouddrift as cd
+
+CloudDrift provides a set of Lagrangian datasets that are ready to use.
+They can be accessed via the ``datasets`` submodule.
+In this example, we will load NOAA's Global Drifter Program (GDP) hourly
+dataset, which is hosted in a public AWS bucket as a cloud-optimized Zarr
+dataset:
+
+>>> ds = cd.datasets.gdp1h()
+>>> ds
+<xarray.Dataset>
+Dimensions:                (traj: 17324, obs: 165754333)
+Coordinates:
+    ids                    (obs) int64 ...
+    lat                    (obs) float32 ...
+    lon                    (obs) float32 ...
+    time                   (obs) datetime64[ns] ...
+Dimensions without coordinates: traj, obs
+Data variables: (12/55)
+    BuoyTypeManufacturer   (traj) |S20 ...
+    BuoyTypeSensorArray    (traj) |S20 ...
+    CurrentProgram         (traj) float64 ...
+    DeployingCountry       (traj) |S20 ...
+    DeployingShip          (traj) |S20 ...
+    DeploymentComments     (traj) |S20 ...
+    ...                     ...
+    sst1                   (obs) float64 ...
+    sst2                   (obs) float64 ...
+    typebuoy               (traj) |S10 ...
+    typedeath              (traj) int8 ...
+    ve                     (obs) float32 ...
+    vn                     (obs) float32 ...
+Attributes: (12/16)
+    Conventions:       CF-1.6
+    acknowledgement:   Elipot, Shane; Sykulski, Adam; Lumpkin, Rick; Centurio...
+    contributor_name:  NOAA Global Drifter Program
+    contributor_role:  Data Acquisition Center
+    date_created:      2022-12-09T06:02:29.684949
+    doi:               10.25921/x46c-3620
+    ...                ...
+    processing_level:  Level 2 QC by GDP drifter DAC
+    publisher_email:   [email protected]
+    publisher_name:    GDP Drifter DAC
+    publisher_url:     https://www.aoml.noaa.gov/phod/gdp
+    summary:           Global Drifter Program hourly data
+    title:             Global Drifter Program hourly drifting buoy collection
+
+The ``gdp1h`` function returns the ragged-array dataset as an Xarray ``Dataset`` instance.
+While the dataset is quite large, around a dozen GB, it is not downloaded to your
+local machine. Instead, the dataset is accessed directly from the cloud, and only
+the data needed for the analysis is downloaded. This is possible thanks to
+the cloud-optimized Zarr format, which allows for efficient access to the data
+stored in the cloud.
+
+Let's look at some variables in this dataset:
+
+>>> ds.lon
+<xarray.DataArray 'lon' (obs: 165754333)>
+[165754333 values with dtype=float32]
+Coordinates:
+    ids      (obs) int64 ...
+    lat      (obs) float32 ...
+    lon      (obs) float32 ...
+    time     (obs) datetime64[ns] ...
+Dimensions without coordinates: obs
+Attributes:
+    long_name:  Longitude
+    units:      degrees_east
+
+You see that this array is very long--it has 165754333 elements.
+This is because in a ragged array, many varying-length arrays are laid out as a
+contiguous 1-dimensional array in memory.
+
+Let's look at the dataset dimensions:
+
+>>> ds.dims
+Frozen({'traj': 17324, 'obs': 165754333})
+
+The ``traj`` dimension has 17324 elements, which is the number of individual
+trajectories in the dataset.
+The sum of their lengths equals the length of the ``obs`` dimension.
+These dimensions, their lengths, and the ``count`` (or ``rowsize``)
+variable are used internally to make CloudDrift's analysis functions aware of
+the bounds of each contiguous array within the ragged-array data structure.
+
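The bookkeeping just described can be sketched with plain NumPy (toy lengths, not the GDP data): the cumulative sum of the per-trajectory counts gives the start and end of each contiguous trajectory within the flat ``obs`` axis.

```python
import numpy as np

# Three toy trajectories of unequal length, stored back to back ("obs" axis).
count = np.array([4, 2, 3])      # per-trajectory lengths (the "traj" axis)
obs = np.arange(count.sum())     # the flat ragged array: 0..8

# Prepending 0 to the cumulative sum gives each trajectory's boundaries.
bounds = np.insert(np.cumsum(count), 0, 0)   # [0 4 6 9]
trajectories = [obs[bounds[i]:bounds[i + 1]] for i in range(len(count))]

print([t.tolist() for t in trajectories])
# [[0, 1, 2, 3], [4, 5], [6, 7, 8]]
```

No data is copied when a ragged array is stored this way; a trajectory is simply a slice of the flat array.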
+Doing common analysis tasks on ragged arrays
+--------------------------------------------
+
+Now that we have a ragged-array dataset loaded as an Xarray ``Dataset`` instance,
+let's do some common analysis tasks on it.
+Our dataset is on a remote server and fairly large (a dozen GB or so), so let's
+first subset it to several trajectories so that we can more easily work with it.
+The variable ``ID`` is the unique identifier for each trajectory:
+
+>>> ds.ID[:10].values
+array([2578, 2582, 2583, 2592, 2612, 2613, 2622, 2623, 2931, 2932])
+
+>>> from clouddrift.analysis import subset
+
+``subset`` allows you to subset a ragged array by some criterion.
+In this case, we will subset it by the ``ID`` variable:
+
+>>> ds_sub = subset(ds, {"ID": list(ds.ID[:5])})
+>>> ds_sub
+<xarray.Dataset>
+Dimensions:                (traj: 5, obs: 13612)
+Coordinates:
+    ids                    (obs) int64 2578 2578 2578 2578 ... 2612 2612 2612
+    lat                    (obs) float32 ...
+    lon                    (obs) float32 ...
+    time                   (obs) datetime64[ns] ...
+Dimensions without coordinates: traj, obs
+Data variables: (12/55)
+    BuoyTypeManufacturer   (traj) |S20 ...
+    BuoyTypeSensorArray    (traj) |S20 ...
+    CurrentProgram         (traj) float64 ...
+    DeployingCountry       (traj) |S20 ...
+    DeployingShip          (traj) |S20 ...
+    DeploymentComments     (traj) |S20 ...
+    ...                     ...
+    sst1                   (obs) float64 ...
+    sst2                   (obs) float64 ...
+    typebuoy               (traj) |S10 ...
+    typedeath              (traj) int8 ...
+    ve                     (obs) float32 ...
+    vn                     (obs) float32 ...
+Attributes: (12/16)
+    Conventions:       CF-1.6
+    acknowledgement:   Elipot, Shane; Sykulski, Adam; Lumpkin, Rick; Centurio...
+    contributor_name:  NOAA Global Drifter Program
+    contributor_role:  Data Acquisition Center
+    date_created:      2022-12-09T06:02:29.684949
+    doi:               10.25921/x46c-3620
+    ...                ...
+    processing_level:  Level 2 QC by GDP drifter DAC
+    publisher_email:   [email protected]
+    publisher_name:    GDP Drifter DAC
+    publisher_url:     https://www.aoml.noaa.gov/phod/gdp
+    summary:           Global Drifter Program hourly data
+    title:             Global Drifter Program hourly drifting buoy collection
+
+You see that we now have a subset of the original dataset, with 5 trajectories
+and a total of 13612 observations.
+This subset is small enough to quickly and easily work with for demonstration
+purposes.
+Let's see how we can compute the mean and maximum velocities of each trajectory.
+To start, we'll need to obtain the velocities over all trajectory times.
+Although the GDP dataset already comes with velocity variables, we won't use
+them here so that we can learn how to compute them ourselves from positions.
+``clouddrift`` provides the ``velocity_from_position`` function that allows you
+to do just that.
+
+>>> from clouddrift.analysis import velocity_from_position
+
+At a minimum, ``velocity_from_position`` requires three input parameters:
+consecutive x- and y-coordinates and time, so we could do:
+
+>>> u, v = velocity_from_position(ds_sub.lon, ds_sub.lat, ds_sub.time)
+
+``velocity_from_position`` returns two arrays, ``u`` and ``v``, which are the
+zonal and meridional velocities, respectively.
+By default, it assumes that the coordinates are in degrees, and it handles the
+great circle path calculation and longitude wraparound under the hood.
+However, recall that ``ds_sub.lon``, ``ds_sub.lat``, and ``ds_sub.time`` are
+ragged arrays, so we need a different approach to calculate velocities while
+respecting the trajectory boundaries.
+For this, we can use the ``apply_ragged`` function, which applies a function
+to each trajectory in a ragged array, and returns the concatenated result.
+
+>>> from clouddrift.analysis import apply_ragged
+>>> u, v = apply_ragged(velocity_from_position, [ds_sub.lon, ds_sub.lat, ds_sub.time], ds_sub.rowsize)
+
+``u`` and ``v`` here are still ragged arrays, which means that the five
+contiguous trajectories are concatenated into 1-dimensional arrays.
+
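Conceptually, ``apply_ragged`` splits each flat input array at the trajectory boundaries given by ``rowsize``, applies the function per trajectory, and concatenates the results. A hypothetical minimal re-implementation of that idea (a sketch, not clouddrift's actual code):

```python
import numpy as np

def apply_ragged_sketch(func, arrays, rowsize):
    """Split each flat array at the trajectory boundaries, apply func per
    trajectory, and concatenate the results (toy stand-in for apply_ragged)."""
    bounds = np.insert(np.cumsum(rowsize), 0, 0)
    pieces = [
        func(*[a[bounds[i]:bounds[i + 1]] for a in arrays])
        for i in range(len(rowsize))
    ]
    return np.concatenate([np.atleast_1d(p) for p in pieces])

# Per-trajectory means of a ragged array with rows of lengths 2 and 3:
x = np.array([1.0, 3.0, 2.0, 4.0, 6.0])
print(apply_ragged_sketch(np.mean, [x], [2, 3]))  # [2. 4.]
```

Because the function runs on one trajectory at a time, a finite-difference scheme never mixes the end of one trajectory with the start of the next, which is exactly why it is the right tool for ragged velocities.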
+Now, let's compute the velocity magnitude in meters per second.
+The time in this dataset is loaded in nanoseconds by default:
+
+>>> ds_sub.time.values
+array(['2005-04-15T20:00:00.000000000', '2005-04-15T21:00:00.000000000',
+       '2005-04-15T22:00:00.000000000', ...,
+       '2005-10-02T03:00:00.000000000', '2005-10-02T04:00:00.000000000',
+       '2005-10-02T05:00:00.000000000'], dtype='datetime64[ns]')
+
+So, to obtain the velocity magnitude in meters per second, we'll need to
+multiply our velocities by ``1e9``.
+
+>>> import numpy as np
+>>> velocity_magnitude = np.sqrt(u**2 + v**2) * 1e9
+>>> velocity_magnitude
+array([0.28053388, 0.6164632 , 0.89032112, ..., 0.2790803 , 0.20095603,
+       0.20095603])
+
+>>> velocity_magnitude.mean(), velocity_magnitude.max()
+(0.22115242718877506, 1.6958275672626286)
+
+However, these aren't the results we are looking for! Recall that we have the
+velocity magnitude of five different trajectories concatenated into one array.
+This means that we need to use ``apply_ragged`` again to compute the mean and
+maximum values:
+
+>>> apply_ragged(np.mean, [velocity_magnitude], ds_sub.rowsize)
+array([0.32865148, 0.17752435, 0.1220523 , 0.13281067, 0.14041268])
+>>> apply_ragged(np.max, [velocity_magnitude], ds_sub.rowsize)
+array([1.69582757, 1.36804354, 0.97343434, 0.60353528, 1.05044213])
+
+And there you go! We used ``clouddrift`` to:
+
+#. Load a real-world Lagrangian dataset from the cloud;
+#. Subset the dataset by trajectory IDs;
+#. Compute the velocity vectors and their magnitudes for each trajectory;
+#. Compute the mean and maximum velocity magnitudes for each trajectory.
+
+``clouddrift`` offers many more functions for common Lagrangian analysis tasks.
+Please explore the `API <https://cloud-drift.github.io/clouddrift/api.html>`_
+to learn about other functions and how to use them.
+
+Adapting custom Lagrangian datasets into ragged arrays
+------------------------------------------------------
+
+CloudDrift provides an easy way to convert custom Lagrangian datasets into
 `contiguous ragged arrays <https://cfconventions.org/cf-conventions/cf-conventions.html#_contiguous_ragged_array_representation>`_.
 
 .. code-block:: python
@@ -26,14 +259,7 @@ CloudDrift provides an easy way to convert Lagrangian datasets into
 
 This snippet is specific to the hourly GDP dataset, however, you can use the
 ``RaggedArray`` class directly to convert other custom datasets into a ragged
-array structure that is analysis ready via Xarray or Awkward Array packages.
-We provide step-by-step guides to convert the individual trajectories from the
-Global Drifter Program (GDP) hourly and 6-hourly datasets, the drifters from the
-`CARTHE <http://carthe.org/>`_ experiment, and a typical output from a numerical
-Lagrangian experiment in our
-`repository of example Jupyter Notebooks <https://github.com/cloud-drift/clouddrift-examples>`_.
+array structure that is analysis ready via Xarray or Awkward Array packages.
+The functions to do that are defined in the ``clouddrift.adapters`` submodule.
 You can use these examples as a reference to ingest your own or other custom
-Lagrangian datasets into ``RaggedArray``.
-
-In the future, ``clouddrift`` will be including functions to perform typical
-oceanographic Lagrangian analyses.
+Lagrangian datasets into ``RaggedArray``.
