nb-wrangler is a command-line tool that streamlines the curation of JupyterLab notebooks and their runtime environments. It automates the process of building and testing container images from notebook requirements, ensuring that notebooks have the correct dependencies to run successfully.
Key features include:
- Environment Management: Bootstraps its own dedicated Conda environments, isolated from your system's Python.
- Dependency Resolution: Compiles `requirements.txt` files from multiple notebooks into a single, consistent set of versioned dependencies.
- Automated Testing: Tests notebooks and their package imports to verify the environment.
- Data Management: Manages the data required to run notebooks.
- Image Building: Integrates with a build system to automatically create container images.
- Local Installs: Works equally well for local installs, with no Docker overhead or learning curve.
The project uses micromamba for environment management and uv for fast pip
package installation.
Note that while nb-wrangler was conceived as a way to streamline notebook Docker image creation for JupyterHub, at its core nb-wrangler is merely defining:
- A set of notebooks particularly relevant to a science platform.
- Any supporting data required to run those notebooks.
- A Python environment capable of running the entire set of notebooks.
- Tests to help verify the system is built correctly and correctly runs those notebooks.
- Standard methods to locally install the notebooks, data, and Python environment and run the tests.
None of the above builds Docker images directly, but nb-wrangler does provide two ways to hand off the information to STScI's science-platform-images GitHub repository, which can autonomously or manually build an image from a wrangler spec.
Two other points are worthy of note:
- nb-wrangler supports easy installation of custom environments directly on the science platform, even when they were not pre-installed in the platform Docker image. This can be exploited to set up shared global or team installation areas as well as personalized environments.
- The same network distribution and installation protocols enable off-platform laptop users to easily set up the same environment locally.
nb-wrangler now supports pip installs into existing mamba environments:

pip install nb-wrangler

This is a relatively new feature, and because the wrangler itself works by creating multiple environments it is not always as reliable as the bootstrap method described below. If you do use this method, it's advisable to create a dedicated nbwrangler environment to install into.
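For example, a dedicated environment for the pip install might be created like this (the environment name and Python version shown are illustrative, not requirements):

```
# Create and activate a dedicated environment, then pip install the wrangler.
# The name "nbwrangler" and Python 3.11 are assumptions for illustration.
mamba create -n nbwrangler python=3.11
mamba activate nbwrangler
pip install nb-wrangler
```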
To get started, bootstrap nb-wrangler to create the necessary environments and directories (by default in $HOME/.nbw-live):
curl https://raw.githubusercontent.com/spacetelescope/nb-wrangler/refs/heads/main/nb-wrangler >nb-wrangler
chmod +x nb-wrangler
./nb-wrangler bootstrap

After bootstrapping, you can activate and/or reactivate the nbwrangler environment with:
source ./nb-wrangler environment

This command sets up the shell environment and activates the nbwrangler Python environment so that it (temporarily) replaces any other Python you had activated previously and is ready to start executing wrangler commands. Note that when the wrangler creates target environments from the spec, it installs them independently of this nbwrangler environment, which is intended only to support the tool itself.
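To confirm the activation took effect, something like the following should work (the resolved path is an assumption and will vary with your installation):

```
source ./nb-wrangler environment   # activate the nbwrangler tool environment
which python                       # should now resolve inside the nbwrangler environment
source ./nb-wrangler deactivate    # later: restore your previous shell state
```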
The data download mechanisms in nb-wrangler use wget under the hood to fetch data URLs. If your system does not already have it installed, you need to install it before you can curate or reinstall data. On Linux (including the science platforms) wget is almost certainly present, but if not it should be an easy package install. On macOS, wget is an easy install via brew or mamba. Before working on data tasks, it's worth verifying wget is available on your PATH.
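For example, one way to check for wget and install it if missing (which package manager applies is an assumption about your system):

```
# Check whether wget is on PATH and install it if missing.
command -v wget || brew install wget     # macOS, via Homebrew
command -v wget || mamba install wget    # any platform with mamba/conda-forge
```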
This is viable, but the exact environment settings and required workflows are still being formalized. If you're curious, contact [email protected] and we will work out platform- and image-appropriate instructions for doing in-situ development of platform environments using nb-wrangler.
The nb-wrangler workflow is divided into two main phases: curation and reinstallation.
Curation is the process of defining the notebooks, Python packages, and data required for a specific environment. This is done by creating an nbw-spec.yaml file that describes the desired environment. Typically, notebook repository maintainers perform these steps in addition to their fundamental roles of producing correct notebooks, pip requirements, and installable data.
The main curation workflows are:
- `--curate`: Compiles notebook requirements, creates the mamba environment, and installs pip dependencies.
- `--data-curate`: Gathers data requirements from notebook repositories and adds them to the spec.
- `--test`, `--test-imports`, `--test-notebooks`: Tests the notebook imports and notebooks themselves in the context of the environment and data installation.
Example:
# Curate the software environment
./nb-wrangler spec.yaml --curate
# Test environment basics rapidly
./nb-wrangler spec.yaml --test-imports
# Curate the data dependencies
./nb-wrangler spec.yaml --data-curate
# Run each notebook headless using papermill
./nb-wrangler spec.yaml --test-notebooks
# Activate your new "target" environment; <your-kernel-name> is the mamba environment you are curating
source ./nb-wrangler activate <your-kernel-name>
# Deactivate your current environment
source ./nb-wrangler deactivate

The curation process involves:
- Choosing notebooks: Selecting the notebooks to be included in the environment.
- Resolving dependencies: Identifying and resolving any conflicts between Python packages.
- Defining data sources: Specifying the data required by the notebooks.
- Testing: Building the environment and testing the notebooks to ensure they run correctly.
Reinstallation is the process of creating a new environment from a completed spec.yaml file. This is useful for reproducing an environment on a different machine or for a different user.
The main reinstallation workflows are:
- `--reinstall`: Recreates the software environment from a spec.
- `--data-reinstall`: Installs the data required by the notebooks.
- `--test`, `--test-imports`, `--test-notebooks`: Tests the notebook imports and notebooks themselves within the defined environment and data.
Example:
# Reinstall the software environment
./nb-wrangler spec.yaml --reinstall
# Reinstall the data
./nb-wrangler spec.yaml --data-reinstall
# Run both import and notebook tests
./nb-wrangler spec.yaml --test-all
# Activate your new "target" environment; <your-kernel-name> is the mamba environment you reinstalled
source ./nb-wrangler activate <your-kernel-name>
# Deactivate your current environment
source ./nb-wrangler deactivate

For both curation and reinstallation, there is the assumption that tests may fail and it may be necessary to circle back to earlier steps, make fixes, and iterate.
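A typical iteration might look like this, assuming a notebook test failed and its requirements.txt was edited to fix the problem:

```
# After fixing the failing notebook's requirements, re-curate and re-test
./nb-wrangler spec.yaml --curate
./nb-wrangler spec.yaml --test-notebooks
```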
For more information on notebooks and environment curation, see Managing Notebook Selection and Environment. For more information on supporting data, see Managing Notebook Reference Data.
To streamline development with custom branches without altering your core spec.yaml, nb-wrangler supports dev_overrides.
- The `dev_overrides` section in `spec.yaml` allows you to temporarily specify development branches for repositories.
- Use the `--dev` flag (or rely on implicit activation for curation workflows) to apply these overrides.
- Use `--finalize-dev-overrides` to remove the `dev_overrides` section when preparing for production.
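As a sketch of the round trip (combining `--dev` with `--curate` here is an assumption based on the flag descriptions above):

```
# Curate against the development branches listed in dev_overrides
./nb-wrangler spec.yaml --curate --dev

# Strip the dev_overrides section when the branches are merged and released
./nb-wrangler spec.yaml --finalize-dev-overrides
```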
For more details, see the Spec Format documentation.
nb-wrangler can also inject the package and test requirements from a spec into the classic Science Platform Images (SPI) repository layout. This is a transitional feature to support older build processes.
See the SPI Injection documentation for more details.
nb-wrangler provides a wide range of command-line options to customize its behavior. Here are some of the most common ones, grouped by function:
Workflows are commands that execute an ordered sequence of steps to accomplish some end-to-end task:
- `--curate`: Run the full curation workflow to define notebooks and Python environment.
- `--reinstall`: Reinstall an environment from a spec.
- `--reset-curation`: Delete installation artifacts like the environment, install caches, and spec updates.
- `--data-curate`: Curate data dependencies.
- `--data-reinstall`: Reinstall data dependencies.
- `--submit-for-build`: Submit a spec for an automated image build.
- `--inject-spi`: Inject a spec into the SPI repository.
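For example, handing a completed spec off for an automated image build is a single workflow command (shown following the same invocation pattern as the examples above):

```
# Submit the completed spec for an automated container image build
./nb-wrangler spec.yaml --submit-for-build
```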
- `--env-init`: Create and kernelize the target environment.
- `--env-delete`: Delete the target environment.
- `--env-archive-delete`: Delete the "code" portions of the pantry for the current spec (environment archives).
- `--env-pack`: Pack the target environment into an archive file.
- `--env-unpack`: Unpack an environment from an archive.
- `--env-register`: Register the environment as a Jupyter kernel.
- `--env-unregister`: Unregister the environment from Jupyter.
- `--env-compact`: Compact the wrangler installation by deleting package caches.
- `--env-archive-format`: Override format for environment pack/unpack.
- `--env-print-name`: Print the environment name for the spec.
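One plausible use of pack/unpack is moving a curated environment between machines; the sequence below is a sketch (where the archive lands and how it is transferred are managed by the pantry and assumed here):

```
# On the curating machine: archive the target environment
./nb-wrangler spec.yaml --env-pack

# On the destination machine: restore it and register the Jupyter kernel
./nb-wrangler spec.yaml --env-unpack
./nb-wrangler spec.yaml --env-register
```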
- `--packages-compile`: Compile package requirements.
- `--packages-install`: Install packages into the environment.
- `--packages-uninstall`: Uninstall packages from the environment.
- `--packages-omit-spi`: Don't include 'common' SPI packages.
- `--packages-diagnostics`: Show which requirements files are included and their required packages.
- `-t`, `--test-all`: Run all tests (`--test-imports` and `--test-notebooks`).
- `--test-imports`: Test package imports.
- `--test-notebooks [REGEX]`: Test notebook execution. Can optionally take a comma-separated list of regex patterns to select specific notebooks.
- `--test-notebooks-exclude [REGEX]`: Exclude notebooks from testing using a comma-separated list of regex patterns.
- `--jobs INT`: Number of parallel jobs for notebook testing.
- `--timeout INT`: Timeout in seconds for notebook tests.
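For instance, to run a filtered subset of notebooks in parallel (the regex and numbers are illustrative):

```
# Run only notebooks matching "fits", four at a time, 10-minute timeout each
./nb-wrangler spec.yaml --test-notebooks 'fits.*' --jobs 4 --timeout 600
```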
- `--data-collect`: Collect data archive and installation info and add to spec.
- `--data-list`: List data archives.
- `--data-download`: Download data archives to the pantry.
- `--data-update`: Update metadata for data archives (e.g., length and hash).
- `--data-validate`: Validate pantry archives against the spec.
- `--data-unpack`: Unpack data archives.
- `--data-pack`: Pack live data directories into archive files.
- `--data-reset-spec`: Clear the 'data' sub-section of the 'out' section of the spec.
- `--data-delete [archived|unpacked|both]`: Delete data archives and/or unpacked files.
- `--data-env-vars-mode [pantry|spec]`: Define where to locate unpacked data.
- `--data-print-exports`: Print shell exports for data environment variables.
- `--data-env-vars-no-auto-add`: Do not automatically add data environment variables to the runtime environment.
- `--data-select [REGEX]`: Regex to select specific data archives.
- `--data-no-validation`: Skip data validation.
- `--data-no-unpack-existing`: Skip unpack if the target directory exists.
- `--data-symlink-install-data`: Create symlinks from install locations to the pantry data directory.
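These options permit a finer-grained alternative to `--data-reinstall`; the sequence below is a sketch that steps through download, validation, and unpack for a subset of archives (the selection pattern is illustrative):

```
# Fetch, check, and unpack only the data archives matching "tess"
./nb-wrangler spec.yaml --data-download --data-select 'tess.*'
./nb-wrangler spec.yaml --data-validate --data-select 'tess.*'
./nb-wrangler spec.yaml --data-unpack   --data-select 'tess.*'
```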
- `--clone-repos`: Clone notebook repositories.
- `--repos-dir`: Directory for cloned repositories.
- `--delete-repos`: Delete cloned repositories.
- `--spec-reset`: Reset the spec file to its original state (preserves `out.data`).
- `--spec-add`: Add the spec to the pantry (a local collection of specs).
- `--spec-list`: List available specs in the pantry.
- `--spec-select [REGEX]`: Select a spec from the pantry by regex.
- `--spec-validate`: Validate the spec file.
- `--spec-update-hash`: Update spec SHA256 hash.
- `--spec-ignore-hash`: Do not add or verify the spec hash.
- `--spec-add-pip-hashes`: Record PyPI hashes for packages during compilation.
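For example (the exact invocation shapes beyond the flag names are assumptions, following the document's other examples):

```
# Validate a finished spec, then stash it in the local pantry
./nb-wrangler spec.yaml --spec-validate
./nb-wrangler spec.yaml --spec-add
```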
- `--verbose`: Enable DEBUG log output.
- `--debug`: Drop into debugger on exceptions.
- `--profile`: Run with cProfile and print stats.
- `--reset-log`: Delete the log file.
- `--log-times`: Configure timestamps in log messages.
- `--color`: Colorize log output.
For a full list of options, run `./nb-wrangler --help`.
nb-wrangler uses several input formats to define the environment:
- Notebook (`.ipynb`): Jupyter notebooks.
- Wrangler Spec (`spec.yaml`): The main YAML file that defines the notebook repositories and Python environment. See the spec format documentation for details on the new format, which uses a `repositories` dictionary and named `selected_notebooks` blocks.
- Notebook Repo: A Git repository containing Jupyter notebooks, e.g., TIKE Content, Roman Notebooks.
- Science Platform Images (SPI): The GitHub repository where code for the Docker images for the Science Platforms is kept; see Science Platform Images.
- Refdata Spec (`refdata_dependencies.yaml`): A YAML file in a notebook repository that specifies data dependencies. See the refdata dependencies documentation.
- Requirements (`requirements.txt`): A file in a notebook's directory specifying that notebook's Python package dependencies.
- Supporting Python (`.py`): Any supporting Python files included in a notebook's directory.
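Taken together, these inputs imply a repository layout roughly like the following (a hypothetical repository and file names, shown as `tree` output):

```
$ tree my-notebook-repo        # hypothetical repository layout
my-notebook-repo
├── notebook_a
│   ├── notebook_a.ipynb       # the notebook itself
│   ├── requirements.txt       # its pip dependencies
│   └── helpers.py             # supporting Python
└── refdata_dependencies.yaml  # data dependencies for the repo
```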
The goal of nb-wrangler is to combine these inputs, resolve any conflicts, and create a unified environment capable of running all specified notebooks.
Secondary goals include, but are not limited to:
- Collecting, freezing, distributing, and re-installing data associated with notebook repos.
- Initializing notebook and terminal environment variables as spec'ed, particularly regarding spec'ed/installed data which may be installed in a shared location.
- Building Docker images for curators or science platform admins or pipelines.
- Testing environments (importing all requested packages) and notebooks.
- Automating any/all of these tasks for notebook repos / curators and the science platforms.
