|
3 | 3 | Usage |
4 | 4 | ===== |
5 | 5 |
|
6 | | -Data format |
7 | | ------------ |
8 | | - |
9 | | -The first release of CloudDrift provides a relatively *easy* way to convert any Lagrangian dataset into an archive of `contiguous ragged arrays <https://cfconventions.org/cf-conventions/cf-conventions.html#_contiguous_ragged_array_representation>`_. We provide step-by-step guides to convert the individual trajectories from the Global Drifter Program (GDP) hourly and 6-hourly datasets, the drifters from the `CARTHE <http://carthe.org/>`_ experiment, and a typical output from a numerical Lagrangian experiment.
10 | | - |
11 | | -Below is a quick overview of how to transform an observational Lagrangian dataset stored in multiple files, or the output of a numerical Lagrangian simulation framework. Detailed examples are provided as Jupyter Notebooks which can be tested directly in a `Binder <https://mybinder.org/v2/gh/Cloud-Drift/clouddrift/main?labpath=examples>`_ executable environment.
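To make the target layout concrete, here is a minimal NumPy sketch (independent of CloudDrift's API, with illustrative values) of the contiguous ragged array representation: observations from all trajectories are concatenated back to back, with a per-trajectory ``rowsize`` recording how many observations belong to each.

```python
import numpy as np

# three trajectories with 3, 2, and 4 observations (illustrative values)
trajectories = [
    np.array([10.0, 10.1, 10.2]),
    np.array([-5.0, -5.2]),
    np.array([0.0, 0.1, 0.2, 0.3]),
]

# contiguous ragged array: all observations concatenated back to back
lon = np.concatenate(trajectories)

# one observation count per trajectory, as in the CF conventions
rowsize = np.array([len(t) for t in trajectories])

# recover the second trajectory by offsetting into the flat array
start = rowsize[:1].sum()
second = lon[start:start + rowsize[1]]
```

The `rowsize` counts are what allow each trajectory to be sliced back out of the flat arrays without any padding.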
12 | | - |
13 | | -Collection of files |
14 | | -~~~~~~~~~~~~~~~~~~~ |
15 | | - |
16 | | -First, to create a ragged array archive for a dataset in which each trajectory is stored in an individual file, e.g. the FTP distribution of the `GDP hourly dataset <https://www.aoml.noaa.gov/phod/gdp/hourly_data.php>`_, it is required to define a `preprocess` function that returns an `xarray.Dataset <https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html>`_ for a trajectory from its identification number.
17 | | - |
18 | | -.. code-block:: python |
19 | | -
|
20 | | - def preprocess(index: int) -> xr.Dataset: |
21 | | - """ |
22 | | - :param index: drifter's identification number |
23 | | - :return: xr.Dataset containing the data and attributes |
24 | | - """ |
25 | | - ds = xr.load_dataset(f'data/file_{index}.nc') |
26 | | -
|
27 | | - # perform required preprocessing steps |
28 | | - # e.g. change units, remove variables, fix attributes, etc. |
29 | | -
|
30 | | - return ds |
31 | | -
|
32 | | -This function will be called for each index of the dataset (`ids`) to construct the ragged array archive, as follows. The ragged array contains the required coordinate variables, as well as the specified metadata and data variables. Note that metadata variables contain one value per trajectory, while data variables contain `n` observations per trajectory.
33 | | - |
34 | | -.. code-block:: python |
35 | | -
|
36 | | - ids = [1,2,3] # trajectories to combine |
37 | | -
|
38 | | - # mandatory coordinates variables |
39 | | - coords = {'ids': 'ids', 'time': 'time', 'lon': 'longitude', 'lat': 'latitude'} |
40 | | -
|
41 | | - # list of metadata and data from files to include in archive |
42 | | - metadata = ['ID', 'rowsize'] |
43 | | - data = ['ve', 'vn'] |
44 | | -
|
45 | | - ra = RaggedArray.from_files(ids, preprocess, coords, metadata, data) |
46 | | -
|
47 | | -which can easily be exported to either a Parquet archive file,
48 | | - |
49 | | -.. code-block:: python |
50 | | -
|
51 | | - ra.to_parquet('data/archive.parquet') |
52 | | -
|
53 | | -or a NetCDF archive file. |
54 | | - |
55 | | -.. code-block:: python |
56 | | -
|
57 | | -    ra.to_netcdf('data/archive.nc')
58 | | -
|
59 | | -Lagrangian numerical output |
60 | | -~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
61 | | - |
62 | | -For a two-dimensional output (`lon`, `lat`, `time`) from a Lagrangian simulation framework (such as `OceanParcels <https://oceanparcels.org/>`_ or `OpenDrift <https://opendrift.github.io/>`_), the ragged array archive can be obtained by reshaping the variables to ragged arrays and populating dictionaries containing the coordinates, metadata, and data.
| 6 | +CloudDrift provides an easy way to convert Lagrangian datasets into |
| 7 | +`contiguous ragged arrays <https://cfconventions.org/cf-conventions/cf-conventions.html#_contiguous_ragged_array_representation>`_. |
63 | 8 |
|
64 | 9 | .. code-block:: python |
65 | 10 |
|
66 | | - # initialize dictionaries |
67 | | - coords = {} |
68 | | - metadata = {} |
69 | | -
|
70 | | -    # note that this example dataset contains no data other than time, lon, lat, and ids
71 | | -    # an empty dictionary "data" is initialized anyway
72 | | - data = {} |
| 11 | + # Import a GDP-hourly adapter function |
| 12 | + from clouddrift.adapters.gdp import to_raggedarray |
73 | 13 |
|
74 | | -Numerical outputs are usually stored as a 2D matrix (`trajectory`, `time`) filled with `nan` where there is no data. The first step is to identify the finite values and reshape the dataset. |
75 | | - |
76 | | -.. code-block:: python |
| 14 | + # Download 100 random GDP-hourly trajectories as a ragged array |
| 15 | + ra = to_raggedarray(n_random_id=100) |
77 | 16 |
|
78 | | - ds = xr.open_dataset(join(folder, file), decode_times=False) |
79 | | - finite_values = np.isfinite(ds['lon']) |
80 | | - idx_finite = np.where(finite_values) |
| 17 | + # Store to NetCDF and Parquet files |
| 18 | + ra.to_netcdf("gdp.nc") |
| 19 | + ra.to_parquet("gdp.parquet") |
81 | 20 |
|
82 | | - # dimension and id of each trajectory |
83 | | - rowsize = np.bincount(idx_finite[0]) |
84 | | - unique_id = np.unique(idx_finite[0]) |
| 21 | + # Convert to Xarray Dataset for analysis |
| 22 | + ds = ra.to_xarray() |
85 | 23 |
|
86 | | - # coordinate variables |
87 | | - coords["time"] = np.tile(ds.time.data, (ds.dims['traj'],1))[idx_finite] |
88 | | - coords["lon"] = ds.lon.data[idx_finite] |
89 | | - coords["lat"] = ds.lat.data[idx_finite] |
90 | | - coords["ids"] = np.repeat(unique_id, rowsize) |
91 | | -
|
92 | | -Once this is done, we can include extra metadata, such as the size of each trajectory (`rowsize`), and create the ragged array archive.
93 | | - |
94 | | -.. code-block:: python |
95 | | -
|
96 | | - # metadata |
97 | | - metadata["rowsize"] = rowsize |
98 | | - metadata["ID"] = unique_id |
99 | | -
|
100 | | - # create the ragged arrays |
101 | | - ra = RaggedArray(coords, metadata, data) |
102 | | - ra.to_parquet('data/archive.parquet') |
103 | | -
|
104 | | -Analysis |
105 | | --------- |
106 | | - |
107 | | -Once an archive of ragged arrays is created, CloudDrift provides a way to read the data back in and convert it to an `Awkward Array <https://awkward-array.org/quickstart.html>`_.
108 | | - |
109 | | -.. code-block:: python |
| 24 | + # Alternatively, convert to Awkward Array for analysis |
| 25 | + ds = ra.to_awkward() |
110 | 26 |
|
111 | | - ra = RaggedArray.from_parquet('data/archive.parquet') |
112 | | - ds = ra.to_awkward() |
| 27 | +This snippet is specific to the hourly GDP dataset; however, you can use the
| 28 | +``RaggedArray`` class directly to convert other custom datasets into a ragged
| 29 | +array structure that is analysis-ready via the Xarray or Awkward Array packages.
| 30 | +We provide step-by-step guides to convert the individual trajectories from the |
| 31 | +Global Drifter Program (GDP) hourly and 6-hourly datasets, the drifters from the |
| 32 | +`CARTHE <http://carthe.org/>`_ experiment, and a typical output from a numerical |
| 33 | +Lagrangian experiment in our |
| 34 | +`repository of example Jupyter Notebooks <https://github.com/cloud-drift/clouddrift-examples>`_. |
| 35 | +You can use these examples as a reference to ingest your own or other custom |
| 36 | +Lagrangian datasets into ``RaggedArray``. |
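As a sketch of that custom-ingestion path, the following builds the coordinate and metadata dictionaries from a toy NaN-padded (trajectory, time) matrix, mirroring the numerical-output recipe earlier in this page; the values are illustrative, and the final ``RaggedArray(coords, metadata, data)`` call (shown commented out) assumes the constructor used above and requires ``clouddrift`` to be installed.

```python
import numpy as np

# toy NaN-padded (trajectory, time) matrix: 2 trajectories, up to 3 times
lon2d = np.array([[0.0, 0.1, np.nan],
                  [5.0, 5.1, 5.2]])
time = np.array([0.0, 1.0, 2.0])

# locate the valid observations
finite = np.isfinite(lon2d)
idx = np.where(finite)

# size and id of each trajectory
rowsize = np.bincount(idx[0])
unique_id = np.unique(idx[0])

# flatten the valid observations into ragged (one-dimensional) arrays
coords = {
    "time": np.tile(time, (lon2d.shape[0], 1))[idx],
    "lon": lon2d[idx],
    "ids": np.repeat(unique_id, rowsize),
}
metadata = {"rowsize": rowsize, "ID": unique_id}
data = {}  # no extra per-observation variables in this toy example

# with clouddrift installed, the archive would then be created with:
# ra = RaggedArray(coords, metadata, data)
# ra.to_parquet("data/archive.parquet")
```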
113 | 37 |
|
114 | | -Over the next year, the CloudDrift project will be developing a cloud-ready analysis library to perform typical Lagrangian workflows. |
| 38 | +In the future, ``clouddrift`` will include functions to perform typical
| 39 | +oceanographic Lagrangian analyses.