|
3 | 3 | Usage |
4 | 4 | ===== |
5 | 5 |
|
6 | | -Data format |
7 | | ------------ |
8 | | - |
9 | | -The first release of CloudDrift provides a relatively *easy* way to convert any Lagrangian dataset into an archive of `contiguous ragged arrays <https://cfconventions.org/cf-conventions/cf-conventions.html#_contiguous_ragged_array_representation>`_. We provide step-by-step guides to convert the individual trajectories from the Global Drifter Program (GDP) hourly and 6-hourly datasets, the drifters from the `CARTHE <http://carthe.org/>`_ experiment, and a typical output from a numerical Lagrangian experiment.
10 | | - |
11 | | -Below is a quick overview of how to transform an observational Lagrangian dataset stored in multiple files, or the output of a numerical Lagrangian simulation framework. Detailed examples are provided as Jupyter Notebooks which can be tested directly in a `Binder <https://mybinder.org/v2/gh/Cloud-Drift/clouddrift/main?labpath=examples>`_ executable environment.
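To make the target layout concrete, here is a minimal NumPy sketch (independent of CloudDrift's API, with illustrative values) of the contiguous ragged array representation: observations from all trajectories are concatenated back to back, with a per-trajectory ``rowsize`` recording how many observations belong to each.

```python
import numpy as np

# three trajectories with 3, 2, and 4 observations (illustrative values)
trajectories = [
    np.array([10.0, 10.1, 10.2]),
    np.array([-5.0, -5.2]),
    np.array([0.0, 0.1, 0.2, 0.3]),
]

# contiguous ragged array: all observations concatenated back to back
lon = np.concatenate(trajectories)

# one observation count per trajectory, as in the CF conventions
rowsize = np.array([len(t) for t in trajectories])

# recover the second trajectory by offsetting into the flat array
start = rowsize[:1].sum()
second = lon[start:start + rowsize[1]]
```

The `rowsize` counts are what allow each trajectory to be sliced back out of the flat arrays without any padding.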
12 | | - |
13 | | -Collection of files |
14 | | -~~~~~~~~~~~~~~~~~~~ |
15 | | - |
16 | | -First, to create a ragged array archive for a dataset in which each trajectory is stored in an individual file, e.g. the FTP distribution of the `GDP hourly dataset <https://www.aoml.noaa.gov/phod/gdp/hourly_data.php>`_, it is required to define a `preprocess` function that returns an `xarray.Dataset <https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html>`_ for a trajectory from its identification number.
17 | | - |
18 | | -.. code-block:: python |
19 | | -
|
20 | | - def preprocess(index: int) -> xr.Dataset: |
21 | | - """ |
22 | | - :param index: drifter's identification number |
23 | | - :return: xr.Dataset containing the data and attributes |
24 | | - """ |
25 | | - ds = xr.load_dataset(f'data/file_{index}.nc') |
26 | | -
|
27 | | - # perform required preprocessing steps |
28 | | - # e.g. change units, remove variables, fix attributes, etc. |
29 | | -
|
30 | | - return ds |
31 | | -
|
32 | | -This function will be called for each index of the dataset (`ids`) to construct the ragged array archive, as follows. The ragged array contains the required coordinate variables, as well as the specified metadata and data variables. Note that metadata variables contain one value per trajectory, while data variables contain `n` observations per trajectory.
33 | | - |
34 | | -.. code-block:: python |
35 | | -
|
36 | | - ids = [1,2,3] # trajectories to combine |
37 | | -
|
38 | | - # mandatory coordinates variables |
39 | | - coords = {'ids': 'ids', 'time': 'time', 'lon': 'longitude', 'lat': 'latitude'} |
40 | | -
|
41 | | - # list of metadata and data from files to include in archive |
42 | | - metadata = ['ID', 'rowsize'] |
43 | | - data = ['ve', 'vn'] |
44 | | -
|
45 | | - ra = RaggedArray.from_files(ids, preprocess, coords, metadata, data) |
46 | | -
|
47 | | -which can easily be exported to either a Parquet archive file,
48 | | - |
49 | | -.. code-block:: python |
50 | | -
|
51 | | - ra.to_parquet('data/archive.parquet') |
52 | | -
|
53 | | -or a NetCDF archive file. |
54 | | - |
55 | | -.. code-block:: python |
56 | | -
|
57 | | -    ra.to_netcdf('data/archive.nc')
58 | | -
|
59 | | -Lagrangian numerical output |
60 | | -~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
61 | | - |
62 | | -For a two-dimensional output (`lon`, `lat`, `time`) from a Lagrangian simulation framework (such as `OceanParcels <https://oceanparcels.org/>`_ or `OpenDrift <https://opendrift.github.io/>`_), the ragged array archive can be obtained by reshaping the variables to ragged arrays and populating dictionaries containing the coordinates, metadata, and data.
| 6 | +CloudDrift provides an easy way to convert Lagrangian datasets into |
| 7 | +`contiguous ragged arrays <https://cfconventions.org/cf-conventions/cf-conventions.html#_contiguous_ragged_array_representation>`_. |
63 | 8 |
|
64 | 9 | .. code-block:: python |
65 | 10 |
|
66 | | - # initialize dictionaries |
67 | | - coords = {} |
68 | | - metadata = {} |
69 | | -
|
70 | | -    # note that this example dataset contains no data other than time, lon, lat, and ids
71 | | -    # an empty dictionary "data" is initialized anyway
72 | | - data = {} |
| 11 | + # Import a GDP-hourly adapter function |
| 12 | + from clouddrift.adapters.gdp import to_raggedarray |
73 | 13 |
|
74 | | -Numerical outputs are usually stored as a 2D matrix (`trajectory`, `time`) filled with `nan` where there is no data. The first step is to identify the finite values and reshape the dataset. |
75 | | - |
76 | | -.. code-block:: python |
| 14 | + # Download 100 random GDP-hourly trajectories as a ragged array |
| 15 | + ra = to_raggedarray(n_random_id=100) |
77 | 16 |
|
78 | | - ds = xr.open_dataset(join(folder, file), decode_times=False) |
79 | | - finite_values = np.isfinite(ds['lon']) |
80 | | - idx_finite = np.where(finite_values) |
| 17 | + # Store to NetCDF and Parquet files |
| 18 | + ra.to_netcdf("gdp.nc") |
| 19 | + ra.to_parquet("gdp.parquet") |
81 | 20 |
|
82 | | - # dimension and id of each trajectory |
83 | | - rowsize = np.bincount(idx_finite[0]) |
84 | | - unique_id = np.unique(idx_finite[0]) |
| 21 | + # Convert to Xarray Dataset for analysis |
| 22 | + ds = ra.to_xarray() |
85 | 23 |
|
86 | | - # coordinate variables |
87 | | - coords["time"] = np.tile(ds.time.data, (ds.dims['traj'],1))[idx_finite] |
88 | | - coords["lon"] = ds.lon.data[idx_finite] |
89 | | - coords["lat"] = ds.lat.data[idx_finite] |
90 | | - coords["ids"] = np.repeat(unique_id, rowsize) |
91 | | -
|
92 | | -Once this is done, we can include extra metadata, such as the size of each trajectory (`rowsize`), and create the ragged array archive.
93 | | - |
94 | | -.. code-block:: python |
95 | | -
|
96 | | - # metadata |
97 | | - metadata["rowsize"] = rowsize |
98 | | - metadata["ID"] = unique_id |
99 | | -
|
100 | | - # create the ragged arrays |
101 | | - ra = RaggedArray(coords, metadata, data) |
102 | | - ra.to_parquet('data/archive.parquet') |
103 | | -
|
104 | | -Analysis |
105 | | --------- |
106 | | - |
107 | | -Once an archive of ragged arrays is created, CloudDrift provides a way to read the data back in and convert it to an `Awkward Array <https://awkward-array.org/quickstart.html>`_.
108 | | - |
109 | | -.. code-block:: python |
| 24 | + # Alternatively, convert to Awkward Array for analysis |
| 25 | + ds = ra.to_awkward() |
110 | 26 |
|
111 | | - ra = RaggedArray.from_parquet('data/archive.parquet') |
112 | | - ds = ra.to_awkward() |
| 27 | +This snippet is specific to the hourly GDP dataset; however, you can use the
| 28 | +``RaggedArray`` class directly to convert other custom datasets into a ragged
| 29 | +array structure that is analysis-ready via the Xarray or Awkward Array packages.
| 30 | +We provide step-by-step guides to convert the individual trajectories from the |
| 31 | +Global Drifter Program (GDP) hourly and 6-hourly datasets, the drifters from the |
| 32 | +`CARTHE <http://carthe.org/>`_ experiment, and a typical output from a numerical |
| 33 | +Lagrangian experiment in our |
| 34 | +`repository of example Jupyter Notebooks <https://github.com/cloud-drift/clouddrift-examples>`_. |
| 35 | +You can use these examples as a reference to ingest your own or other custom |
| 36 | +Lagrangian datasets into ``RaggedArray``. |
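As a sketch of that custom-ingestion path, the following builds the coordinate and metadata dictionaries from a toy NaN-padded (trajectory, time) matrix, mirroring the numerical-output recipe earlier in this page; the values are illustrative, and the final ``RaggedArray(coords, metadata, data)`` call (shown commented out) assumes the constructor used above and requires ``clouddrift`` to be installed.

```python
import numpy as np

# toy NaN-padded (trajectory, time) matrix: 2 trajectories, up to 3 times
lon2d = np.array([[0.0, 0.1, np.nan],
                  [5.0, 5.1, 5.2]])
time = np.array([0.0, 1.0, 2.0])

# locate the valid observations
finite = np.isfinite(lon2d)
idx = np.where(finite)

# size and id of each trajectory
rowsize = np.bincount(idx[0])
unique_id = np.unique(idx[0])

# flatten the valid observations into ragged (one-dimensional) arrays
coords = {
    "time": np.tile(time, (lon2d.shape[0], 1))[idx],
    "lon": lon2d[idx],
    "ids": np.repeat(unique_id, rowsize),
}
metadata = {"rowsize": rowsize, "ID": unique_id}
data = {}  # no extra per-observation variables in this toy example

# with clouddrift installed, the archive would then be created with:
# ra = RaggedArray(coords, metadata, data)
# ra.to_parquet("data/archive.parquet")
```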
113 | 37 |
|
114 | | -Over the next year, the CloudDrift project will be developing a cloud-ready analysis library to perform typical Lagrangian workflows. |
| 38 | +In the future, ``clouddrift`` will include functions to perform typical
| 39 | +oceanographic Lagrangian analyses.