nb-wrangler is a command-line tool that streamlines the curation of JupyterLab notebooks and their runtime environments. It automates the process of building and testing container images from notebook requirements, ensuring that notebooks have the correct dependencies to run successfully.
Key features include:
- Environment Management: Bootstraps its own dedicated Conda environments, isolated from your system's Python.
- Dependency Resolution: Compiles `requirements.txt` files from multiple notebooks into a single, consistent set of versioned dependencies.
- Automated Testing: Tests notebooks and their package imports to verify the environment.
- Data Management: Manages the data required to run notebooks.
- Image Building: Integrates with a build system to automatically create container images.
- Local Installs: Works equally well for local installs, with no Docker overhead or learning curve.
The project uses micromamba for environment management and uv for fast pip
package installation.
Note that while nb-wrangler was conceived as a way to streamline notebook Docker image creation for JupyterHub, at its core nb-wrangler is merely defining:
- A set of notebooks particularly relevant to a science platform.
- Any supporting data required to run those notebooks.
- A Python environment capable of running the entire set of notebooks.
- Tests to help verify the system is built correctly and correctly runs those notebooks.
- Standard methods to locally install the notebooks, data, and Python environment and run the tests.
None of the above builds Docker images directly, but nb-wrangler does provide two ways to hand off the information to STScI's science-platform-images GitHub repository, which can autonomously or manually build an image from a wrangler spec.
Two other points are worthy of note:
- nb-wrangler supports easy installation of custom environments directly on the science platform, even when they were not pre-installed in the platform Docker image. This can be exploited to set up shared global or team installation areas as well as personalized environments.
- The same network distribution and installation protocols enable off-platform laptop users to easily set up the same environment locally.
nb-wrangler now supports pip installs into existing mamba environments:

pip install nb-wrangler

This is a relatively new feature, and because the wrangler itself works by creating multiple environments it is not always as reliable as the bootstrap method described below. If you do use this method, it's advisable to create a dedicated nbwrangler environment to install into.
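For example, a dedicated environment for the pip install might be created like this (the environment name and Python version shown are illustrative, not requirements):

```
# Create and activate a dedicated environment, then pip install the wrangler.
# The name "nbwrangler" and Python 3.11 are assumptions for illustration.
mamba create -n nbwrangler python=3.11
mamba activate nbwrangler
pip install nb-wrangler
```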
To get started, bootstrap nb-wrangler to create the necessary environments and directories (by default in $HOME/.nbw-live):
curl https://raw.githubusercontent.com/spacetelescope/nb-wrangler/refs/heads/main/nb-wrangler >nb-wrangler
chmod +x nb-wrangler
./nb-wrangler bootstrap

After bootstrapping, you can activate and/or reactivate the nbwrangler environment with:
source ./nb-wrangler environment

This command sets up the shell environment and activates the nbwrangler Python environment so that it (temporarily) replaces any other Python you had activated previously and is ready to start executing wrangler commands. Note that when the wrangler creates target environments from the spec, it installs them independently of this nbwrangler environment, which is intended only to support the tool itself.
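To confirm the activation took effect, something like the following should work (the resolved path is an assumption and will vary with your installation):

```
source ./nb-wrangler environment   # activate the nbwrangler tool environment
which python                       # should now resolve inside the nbwrangler environment
source ./nb-wrangler deactivate    # later: restore your previous shell state
```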
The data download mechanisms in nb-wrangler use wget under the hood to fetch data URLs. If your system does not already have it installed, you need to install it before you can curate or reinstall data. On Linux (including the science platforms) wget is almost certainly present, but if not it should be an easy package install. On macOS, wget is an easy install via brew or mamba. Before working on data tasks, it's worth verifying wget is available on your PATH.
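For example, one way to check for wget and install it if missing (which package manager applies is an assumption about your system):

```
# Check whether wget is on PATH and install it if missing.
command -v wget || brew install wget     # macOS, via Homebrew
command -v wget || mamba install wget    # any platform with mamba/conda-forge
```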
This is viable, but the exact environment settings and required workflows are still being formalized. If you're curious, contact [email protected] and we will work out platform- and image-appropriate instructions for doing in-situ development of platform environments using nb-wrangler.
The nb-wrangler workflow is divided into two main phases: curation and reinstallation.
Curation is the process of defining the notebooks, Python packages, and data required for a specific environment. This is done by creating an nbw-spec.yaml file that describes the desired environment. Typically, notebook repository maintainers perform these steps in addition to their fundamental roles of producing correct notebooks, pip requirements, and installable data.
The main curation workflows are:
- `--curate`: Compiles notebook requirements, creates the mamba environment, and installs pip dependencies.
- `--data-curate`: Gathers data requirements from notebook repositories and adds them to the spec.
- `--test`, `--test-imports`, `--test-notebooks`: Tests the notebook imports and notebooks themselves in the context of the environment and data installation.
Example:
# Curate the software environment
./nb-wrangler spec.yaml --curate
# Test environment basics rapidly
./nb-wrangler spec.yaml --test-imports
# Curate the data dependencies
./nb-wrangler spec.yaml --data-curate
# Run each notebook headless using papermill
./nb-wrangler spec.yaml --test-notebooks
# Activate your new "target" environment; <your-kernel-name> is the mamba environment you are curating
source ./nb-wrangler activate <your-kernel-name>
# Deactivate your current environment
source ./nb-wrangler deactivate

The curation process involves:
- Choosing notebooks: Selecting the notebooks to be included in the environment.
- Resolving dependencies: Identifying and resolving any conflicts between Python packages.
- Defining data sources: Specifying the data required by the notebooks.
- Testing: Building the environment and testing the notebooks to ensure they run correctly.
Reinstallation is the process of creating a new environment from a completed spec.yaml file. This is useful for reproducing an environment on a different machine or for a different user.
The main reinstallation workflows are:
- `--reinstall`: Recreates the software environment from a spec.
- `--data-reinstall`: Installs the data required by the notebooks.
- `--test`, `--test-imports`, `--test-notebooks`: Tests the notebook imports and notebooks themselves within the defined environment and data.
Example:
# Reinstall the software environment
./nb-wrangler spec.yaml --reinstall
# Reinstall the data
./nb-wrangler spec.yaml --data-reinstall
# Run both import and notebook tests
./nb-wrangler spec.yaml --test-all
# Activate your new "target" environment; <your-kernel-name> is the mamba environment you reinstalled
source ./nb-wrangler activate <your-kernel-name>
# Deactivate your current environment
source ./nb-wrangler deactivate

For both curation and reinstallation, there is the assumption that tests may fail and it may be necessary to circle back to earlier steps, make fixes, and iterate.
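A typical iteration might look like this, assuming a notebook test failed and its requirements.txt was edited to fix the problem:

```
# After fixing the failing notebook's requirements, re-curate and re-test
./nb-wrangler spec.yaml --curate
./nb-wrangler spec.yaml --test-notebooks
```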
For more information on notebooks and environment curation, see Managing Notebook Selection and Environment. For more information on supporting data, see Managing Notebook Reference Data.
To streamline development with custom branches without altering your core spec.yaml, nb-wrangler supports dev_overrides.
- The `dev_overrides` section in `spec.yaml` allows you to temporarily specify development branches for repositories.
- Use the `--dev` flag (or rely on implicit activation for curation workflows) to apply these overrides.
- Use `--finalize-dev-overrides` to remove the `dev_overrides` section when preparing for production.
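As a sketch of the round trip (combining `--dev` with `--curate` here is an assumption based on the flag descriptions above):

```
# Curate against the development branches listed in dev_overrides
./nb-wrangler spec.yaml --curate --dev

# Strip the dev_overrides section when the branches are merged and released
./nb-wrangler spec.yaml --finalize-dev-overrides
```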
For more details, see the Spec Format documentation.
nb-wrangler can also inject the package and test requirements from a spec into the classic Science Platform Images (SPI) repository layout. This is a transitional feature to support older build processes.
See the SPI Injection documentation for more details.
nb-wrangler provides a wide range of command-line options to customize its behavior. Here are some of the most common ones, grouped by function:
Workflows are commands that execute an ordered sequence of steps to accomplish some end-to-end task:
- `--curate`: Run the full curation workflow to define notebooks and Python environment.
- `--reinstall`: Reinstall an environment from a spec.
- `--reset-curation`: Delete installation artifacts like the environment, install caches, and spec updates.
- `--data-curate`: Curate data dependencies.
- `--data-reinstall`: Reinstall data dependencies.
- `--submit-for-build`: Submit a spec for an automated image build.
- `--inject-spi`: Inject a spec into the SPI repository.
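For example, handing a completed spec off for an automated image build is a single workflow command (shown following the same invocation pattern as the examples above):

```
# Submit the completed spec for an automated container image build
./nb-wrangler spec.yaml --submit-for-build
```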
- `--env-init`: Create and kernelize the target environment.
- `--env-delete`: Delete the target environment.
- `--env-archive-delete`: Delete the "code" portions of the pantry for the current spec (environment archives).
- `--env-pack`: Pack the target environment into an archive file.
- `--env-unpack`: Unpack an environment from an archive.
- `--env-register`: Register the environment as a Jupyter kernel.
- `--env-unregister`: Unregister the environment from Jupyter.
- `--env-compact`: Compact the wrangler installation by deleting package caches.
- `--env-archive-format`: Override format for environment pack/unpack.
- `--env-print-name`: Print the environment name for the spec.
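One plausible use of pack/unpack is moving a curated environment between machines; the sequence below is a sketch (where the archive lands and how it is transferred are managed by the pantry and assumed here):

```
# On the curating machine: archive the target environment
./nb-wrangler spec.yaml --env-pack

# On the destination machine: restore it and register the Jupyter kernel
./nb-wrangler spec.yaml --env-unpack
./nb-wrangler spec.yaml --env-register
```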
- `--packages-compile`: Compile package requirements.
- `--packages-install`: Install packages into the environment.
- `--packages-uninstall`: Uninstall packages from the environment.
- `--packages-omit-spi`: Don't include 'common' SPI packages.
- `--packages-diagnostics`: Show which requirements files are included and their required packages.
- `-t`, `--test-all`: Run all tests (`--test-imports` and `--test-notebooks`).
- `--test-imports`: Test package imports.
- `--test-notebooks [REGEX]`: Test notebook execution. Can optionally take a comma-separated list of regex patterns to select specific notebooks.
- `--test-notebooks-exclude [REGEX]`: Exclude notebooks from testing using a comma-separated list of regex patterns.
- `--jobs INT`: Number of parallel jobs for notebook testing.
- `--timeout INT`: Timeout in seconds for notebook tests.
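For instance, to run a filtered subset of notebooks in parallel (the regex and numbers are illustrative):

```
# Run only notebooks matching "fits", four at a time, 10-minute timeout each
./nb-wrangler spec.yaml --test-notebooks 'fits.*' --jobs 4 --timeout 600
```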
- `--data-collect`: Collect data archive and installation info and add to spec.
- `--data-list`: List data archives.
- `--data-download`: Download data archives to the pantry.
- `--data-update`: Update metadata for data archives (e.g., length and hash).
- `--data-validate`: Validate pantry archives against the spec.
- `--data-unpack`: Unpack data archives.
- `--data-pack`: Pack live data directories into archive files.
- `--data-reset-spec`: Clear the 'data' sub-section of the 'out' section of the spec.
- `--data-delete [archived|unpacked|both]`: Delete data archives and/or unpacked files.
- `--data-env-vars-mode [pantry|spec]`: Define where to locate unpacked data.
- `--data-print-exports`: Print shell exports for data environment variables.
- `--data-env-vars-no-auto-add`: Do not automatically add data environment variables to the runtime environment.
- `--data-select [REGEX]`: Regex to select specific data archives.
- `--data-no-validation`: Skip data validation.
- `--data-no-unpack-existing`: Skip unpack if the target directory exists.
- `--data-symlink-install-data`: Create symlinks from install locations to the pantry data directory.
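These options permit a finer-grained alternative to `--data-reinstall`; the sequence below is a sketch that steps through download, validation, and unpack for a subset of archives (the selection pattern is illustrative):

```
# Fetch, check, and unpack only the data archives matching "tess"
./nb-wrangler spec.yaml --data-download --data-select 'tess.*'
./nb-wrangler spec.yaml --data-validate --data-select 'tess.*'
./nb-wrangler spec.yaml --data-unpack   --data-select 'tess.*'
```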
- `--clone-repos`: Clone notebook repositories.
- `--repos-dir`: Directory for cloned repositories.
- `--delete-repos`: Delete cloned repositories.
- `--spec-reset`: Reset the spec file to its original state (preserves `out.data`).
- `--spec-add`: Add the spec to the pantry (a local collection of specs).
- `--spec-list`: List available specs in the pantry.
- `--spec-select [REGEX]`: Select a spec from the pantry by regex.
- `--spec-validate`: Validate the spec file.
- `--spec-update-hash`: Update spec SHA256 hash.
- `--spec-ignore-hash`: Do not add or verify the spec hash.
- `--spec-add-pip-hashes`: Record PyPI hashes for packages during compilation.
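For example (the exact invocation shapes beyond the flag names are assumptions, following the document's other examples):

```
# Validate a finished spec, then stash it in the local pantry
./nb-wrangler spec.yaml --spec-validate
./nb-wrangler spec.yaml --spec-add
```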
- `--verbose`: Enable DEBUG log output.
- `--debug`: Drop into debugger on exceptions.
- `--profile`: Run with cProfile and print stats.
- `--reset-log`: Delete the log file.
- `--log-times`: Configure timestamps in log messages.
- `--color`: Colorize log output.
For a full list of options, run `./nb-wrangler --help`.
nb-wrangler uses several input formats to define the environment:
- Notebook (`.ipynb`): Jupyter notebooks.
- Wrangler Spec (`spec.yaml`): The main YAML file that defines the notebook repositories and Python environment. See the spec format documentation for details on the new format, which uses a `repositories` dictionary and named `selected_notebooks` blocks.
- Notebook Repo: A Git repository containing Jupyter notebooks, e.g., TIKE Content, Roman Notebooks.
- Science Platform Images (SPI): The GitHub repository where code for the Docker images for the Science Platforms is kept; see Science Platform Images.
- Refdata Spec (`refdata_dependencies.yaml`): A YAML file in a notebook repository that specifies data dependencies. See the refdata dependencies documentation.
- Requirements (`requirements.txt`): A file in a notebook's directory specifying that notebook's Python package dependencies.
- Supporting Python (`.py`): Any supporting Python files included in a notebook's directory.
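Taken together, these inputs imply a repository layout roughly like the following (a hypothetical repository and file names, shown as `tree` output):

```
$ tree my-notebook-repo        # hypothetical repository layout
my-notebook-repo
├── notebook_a
│   ├── notebook_a.ipynb       # the notebook itself
│   ├── requirements.txt       # its pip dependencies
│   └── helpers.py             # supporting Python
└── refdata_dependencies.yaml  # data dependencies for the repo
```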
The goal of nb-wrangler is to combine these inputs, resolve any conflicts, and create a unified environment capable of running all specified notebooks.
Secondary goals include, but are not limited to:
- Collecting, freezing, distributing, and re-installing data associated with notebook repos.
- Initializing notebook and terminal environment variables as spec'ed, particularly regarding spec'ed/installed data which may be installed in a shared location.
- Building Docker images for curators or science platform admins or pipelines.
- Testing environments (importing all requested packages) and notebooks.
- Automating any/all of these tasks for notebook repos / curators and the science platforms.
