Ingestion workflow orchestrator

Description

This containerized application runs workflows that ingest dataset metadata into a Dataverse instance. The flows and tasks in these workflows are built with Prefect. If you run the container locally, you can monitor and run them from the Prefect Orion UI at http://localhost:4200.

Most flows start with an entry workflow, found in the entry_workflows directory. Here the metadata is first harvested using OAI-PMH and uploaded to S3 storage. The metadata is then fetched from that S3 storage, and a provenance object is created for the ingested metadata. A settings dictionary specific to the data provider is also constructed here with Dynaconf.

Next, the entry workflow runs a sub-flow for every dataset's metadata to handle the actual ingestion. These flows can be found in the dataset_workflows directory. A dataset workflow uses simple tasks that each make an API call to a service. These services often transform, improve or alter the metadata in some way.
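A task that calls one of these services can be pictured roughly as follows. This is a minimal sketch using only the Python standard library, not the project's actual code: the names build_service_request and call_service and the payload layout are assumptions made for illustration, and in the real workflows such a call would be wrapped in a Prefect task.

```python
import json
import urllib.request


def build_service_request(service_url: str, metadata: dict) -> urllib.request.Request:
    """Build a POST request that sends dataset metadata as JSON to a service.

    The payload layout ({"metadata": ...}) is hypothetical.
    """
    body = json.dumps({"metadata": metadata}).encode("utf-8")
    return urllib.request.Request(
        service_url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def call_service(service_url: str, metadata: dict, timeout: float = 30.0) -> dict:
    """Send the metadata to the service and return its decoded JSON response."""
    request = build_service_request(service_url, metadata)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes the payload easy to inspect in a unit test without a running service.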

In short, most ingestion workflows take the following steps:

Harvest metadata and upload it to S3 storage.
Fetch dataset metadata from S3 storage.
Create a version object for all services that will be used for ingestion.
For every dataset's metadata, run the dataset workflow.
Use tasks that make API calls to different services to transform the metadata.
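The steps above can be outlined in plain Python. This is only a sketch of the control flow: the function run_entry_workflow and its callable parameters are hypothetical stand-ins for the Prefect flows and tasks in the repository.

```python
from typing import Callable, Dict, Iterable, List


def run_entry_workflow(
    harvest: Callable[[], None],
    fetch_metadata: Callable[[], Iterable[str]],
    create_version: Callable[[], dict],
    dataset_workflow: Callable[[str, dict], bool],
    do_harvest: bool = True,
) -> Dict[str, List[str]]:
    """Run the ingestion steps for one data provider and report per-dataset results."""
    if do_harvest:
        harvest()  # step 1: harvest metadata and upload it to S3 storage
    metadata_files = list(fetch_metadata())  # step 2: fetch metadata from S3 storage
    version = create_version()  # step 3: version object of the services used
    results: Dict[str, List[str]] = {"completed": [], "failed": []}
    for metadata in metadata_files:
        # steps 4 and 5: one dataset workflow per metadata file
        succeeded = dataset_workflow(metadata, version)
        results["completed" if succeeded else "failed"].append(metadata)
    return results
```

Passing the steps in as callables keeps the outline testable with simple stubs; the real orchestration relies on Prefect to schedule and track each sub-flow.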

Services

This section lists the API services used in the workflows. These services can be combined in different ways in a workflow, depending on the metadata provided by the data provider.

Service Name	Description	Deployment URL	GitHub Repo
Dataverse Mapper	Maps any JSON to Dataverse's JSON format.	https://dataverse-mapper.labs.dansdemo.nl/docs	GitHub
Dans Transformer Service	Transforms from XML to JSON (or from/to other formats).	https://transformer.labs.dansdemo.nl/docs	GitHub
Metadata Refiner	Refines JSON metadata.	https://metadata-refiner.labs.dansdemo.nl/docs	GitHub
Metadata Enhancer	Enriches JSON metadata.	https://metadata-enhancer.labs.dansdemo.nl/docs	GitHub
Email Sanitizer	Removes all emails from the metadata.	https://emailsanitizer.labs.dansdemo.nl/docs	GitHub
Version Tracker	Stores JSON containing version information.	https://version-tracker.labs.dansdemo.nl/docs	GitHub
DOI Minter	Mints a DOI for a dataset. Should be used with CAUTION since if used with production settings this will mint a permanent DOI.	https://dataciteminter.labs.dansdemo.nl/docs	GitHub
OAI-PMH Harvester	Harvester service to harvest the metadata from data providers using OAI-PMH.		GitHub
OAI Enricher Service	Enrich Dataverse OAI-PMH responses with additional data.	https://oai-service.labs.dansdemo.nl/docs	GitHub

Development

Available Make commands

The following make commands are available for easy setup:

make build: Build and start the project.
make start: Start the project in non-detached mode.
make startbg: Start the project in detached mode (background).
make down: Bring down the running project.
make dev-build: Build and start the development setup.
make dev-down: Bring down the ingest services in development mode.
make deploy: Deploy all ingestion workflows to the Prefect server.
make ingest: Run a specific ingest flow in Prefect, with optional arguments for the target. You can also specify whether the metadata should be harvested; if not specified, the metadata will be harvested.

Project setup

Development setup

If you want to develop new flows for the Ingestion Orchestrator, you might want to run the services described above locally. There are two ways to do this.

Prefect stack only

For basic development without a full Dataverse portal:

cp dot_env_example .env
cp dot_env_development_example .env.development
make dev-build

This should set up the Prefect container and the services used during the ingestion workflows.

Full ODISSEI stack

For complete development with a local ODISSEI Dataverse portal:

cp dot_env_example .env
cp dot_env_development_example .env.development
make dev-full-build

If any issues arise while setting up the ODISSEI portal, run make clean-all and then run make dev-full-build again.

If the extraction of the ODISSEI API key fails, you can run make extract-dataverse-apikey manually.

Development Clean up.

Simply run the make command: make clean-all.

WARNING: This will delete volumes, generated files, and networks.

Staging setup

cp dot_env_example .env
cp scripts/configuration/secrets_example.toml scripts/configuration/.secrets.toml
Add the necessary API tokens and credentials to .secrets.toml
Set ENV_FOR_DYNACONF in .env to staging
make build

Running an ingestion via deploy

make deploy
Go to localhost:4200/deployments
Click the ellipsis icon of a workflow and select either custom run or quick run

If you select custom run, you can optionally fill in a target URL and key argument to specify a different target Dataverse. If you select quick run, it will use the target set in odissei_settings.toml and the key in .secrets.toml.

The Dataverse ingestion pipeline also has a required settings_dict_name argument. When Dataverse is both the source and the target, use one of the following values:

DANS datastation SSH, subset of only the social science datasets: 'DANS'

IISG's datasets: 'HSN'

Subverses of dataverse.nl: 'DELFT', 'AVANS', 'FONTYS', 'GRONINGEN', 'HANZE', 'HR', 'LEIDEN', 'MAASTRICHT', 'TILBURG', 'UMCU', 'UTRECHT', 'VU'

Setup scheduled deploys using .yaml files

The dataverse_deletion.yaml and dataverse_ingestion.yaml files contain the configuration for deploying the scheduled workflows. Deploying these YAML files sets up the scheduled workflows, which will then run automatically. Do not deploy them unless that is your intent. Deploy them using the following command:

docker exec prefect-worker prefect deploy --prefect-file deployment/dataverse_ingestion.yaml --all

Running an ingestion via the command line

make ingest data_provider=CBS TARGET_URL=https://portal.example.odissei.nl TARGET_KEY=abcde123-11aa-22bb-3c4d-098765432abc DO_HARVEST=False
A prompt will show confirming the target
Type yes to continue or anything else to abort.

The make ingest command lets you specify the URL and API key of a specific target Dataverse. If you do not provide them, it uses the target set in odissei_settings.toml and the key in .secrets.toml. You can also specify whether the pipeline should first harvest the metadata. Skipping the harvest is useful for quick development iterations after the metadata has already been harvested, or for rerunning a bucket that holds the metadata files of failed dataset workflows. To force a re-harvest of all datasets, use the FULL_HARVEST=True option. You can also override the default bucket name by specifying the target bucket.

This is the list of data providers that can be used in the make ingest command:

'TWENTE', 'DELFT', 'AVANS', 'FONTYS', 'GRONINGEN', 'HANZE', 'HR', 'LEIDEN', 'MAASTRICHT', 'TILBURG', 'TRIMBOS', 'UMCU', 'UTRECHT', 'VU', 'DANS', 'CBS', 'LISS', 'HSN', 'CID'
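If you script ingestions, a small helper can validate the data provider and assemble the make ingest arguments before running them. This is a sketch, not part of the repository: build_ingest_command is a hypothetical name, and only the variables shown in the command above (data_provider, TARGET_URL, TARGET_KEY, DO_HARVEST) are used.

```python
import shlex
from typing import List, Optional

# Data providers accepted by make ingest, copied from the list above.
DATA_PROVIDERS = (
    "TWENTE", "DELFT", "AVANS", "FONTYS", "GRONINGEN", "HANZE", "HR",
    "LEIDEN", "MAASTRICHT", "TILBURG", "TRIMBOS", "UMCU", "UTRECHT",
    "VU", "DANS", "CBS", "LISS", "HSN", "CID",
)


def build_ingest_command(
    data_provider: str,
    target_url: Optional[str] = None,
    target_key: Optional[str] = None,
    do_harvest: bool = True,
) -> List[str]:
    """Return the argument list for a make ingest call, rejecting unknown providers."""
    if data_provider not in DATA_PROVIDERS:
        raise ValueError(f"Unknown data provider: {data_provider!r}")
    command = ["make", "ingest", f"data_provider={data_provider}"]
    if target_url:
        command.append(f"TARGET_URL={target_url}")
    if target_key:
        command.append(f"TARGET_KEY={target_key}")
    if not do_harvest:
        # Harvesting is the default, so the flag is only added to switch it off.
        command.append("DO_HARVEST=False")
    return command


# Print the command in a form that can be pasted into a shell.
print(shlex.join(build_ingest_command("CBS", do_harvest=False)))
```

Note that make ingest still shows its confirmation prompt, so a script that runs this command needs to leave the terminal attached.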

Debugging, logging and failed workflows

Debugging services

To debug the services listed in the services table, use the development project setup. Then stop or remove the service you want to debug, either in your Docker interface or with docker-compose stop <container_name>, replacing <container_name> with the name of that service. Next, go to the GitHub repository listed in the table for the service, clone it and follow the instructions in its readme. Add the service to the ingest network with make network-add network_name=ingest container_name=<container_name>. Use a deployed flow or make ingest to test any changes made to the service.

Logging

A running flow produces logging information that can be viewed in the Prefect UI. If the flow is run from the command line, the logs are also shown in the terminal. To add logging, first call logger = get_run_logger() in the context of a running flow or task, then use logger.info() to log any information.
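Because get_run_logger() only works inside a running flow or task, a small wrapper can be convenient for code that is also called from tests or scripts. The helper below is a sketch under assumptions: get_logger is a hypothetical name, and it assumes Prefect 2 or later, where get_run_logger raises a RuntimeError subclass when there is no active run context.

```python
import logging
from typing import Union


def get_logger(fallback_name: str = "ingest") -> Union[logging.Logger, logging.LoggerAdapter]:
    """Return the Prefect run logger inside a flow or task, else a standard logger."""
    try:
        # Imported lazily so the helper also works where Prefect is not installed.
        from prefect import get_run_logger

        return get_run_logger()
    except (ImportError, RuntimeError):
        # Prefect is missing, or there is no active flow or task run context.
        return logging.getLogger(fallback_name)


logger = get_logger()
logger.info("Fetched %d metadata files", 3)
```

Inside a flow or task the message ends up in the Prefect UI; elsewhere it goes through the standard logging configuration.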

Failed workflows

If an ingestion pipeline workflow is run for a specific data provider, it creates a sub flow for each dataset metadata file retrieved from S3 storage. Each sub flow ingests a single metadata file.

If a sub flow fails, a bucket is created whose name is built from the data provider's name and the id of the parent workflow (the ingestion pipeline workflow). The metadata file that the sub flow was ingesting is stored in this bucket. Any sub flows that fail afterwards also store their metadata file in this bucket.

This is done for two reasons:

Isolation of the failed metadata files for easier investigation.
Possibility to rerun only the metadata files of the failed dataset sub flows.

The second point requires the user to change the data provider's bucket name. For this, you can use the option to override the default bucket name by specifying the target bucket for the ingest. It can also be changed via the settings, which can be found in scripts/configuration/odissei_settings.toml.
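The readme does not spell out the exact format of the failure bucket name, only that it combines the data provider's name with the parent workflow id, with the id at the end. The sketch below shows one way such a name could be derived so that it is a valid S3 bucket name; failed_bucket_name and the hyphen separator are assumptions, not the project's actual scheme.

```python
import re


def failed_bucket_name(data_provider: str, flow_run_id: str) -> str:
    """Build a bucket name from the provider name and the parent flow run id.

    S3 bucket names must be lowercase, so the result is lowercased and any
    character other than a-z, 0-9 or '-' is replaced by '-'.
    """
    name = f"{data_provider}-{flow_run_id}".lower()
    name = re.sub(r"[^a-z0-9-]+", "-", name).strip("-")
    # Bucket names are limited to 63 characters; a long provider name
    # would cut into the id, which this sketch does not try to prevent.
    return name[:63].rstrip("-")


print(failed_bucket_name("CBS", "0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0"))
# cbs-0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0
```

In practice, always take the bucket name from the logs rather than reconstructing it.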

When properly configured, every failed workflow will, besides creating a bucket, also send a notification to a Slack channel. Currently this is the 'prefect-notifications' channel in the odissei-ingest Slack workspace.
For more details on how notification handling is set up, see notifications.md.

Follow these steps to run the failed metadata ingest:

Find the bucket created for the failed metadata in the logs (its name ends with the workflow id).
Use that bucket name in the ingest command. Less convenient, but also possible, is to temporarily change <data provider>_BUCKET_NAME to that bucket name, where <data provider> is the data provider you ran the ingestion for.
Run make ingest TARGET_BUCKET=<bucket with failures> DO_HARVEST=False, so that you don't harvest the metadata from the data provider into the specified bucket.

Minio file storage

The metadata used by the workflows is stored in S3 buckets. The key, ID and URL of the S3 storage server should be set in .secrets.toml as MINIO_SECRET, MINIO_KEY and MINIO_SERVER_URL respectively.

For a specific data provider a BUCKET_NAME should be added for that provider. The bucket in s3 storage that contains the metadata for the provider should use the same name as the BUCKET_NAME for that provider.

Example in odissei_settings.toml:

HSN_BUCKET_NAME="hsn-metadata"

HSN={"ALIAS"="HSN_NL", "BUCKET_NAME"="@format {this.HSN_BUCKET_NAME}", "SOURCE_DATAVERSE_URL"="@format {this.IISG_URL}", "DESTINATION_DATAVERSE_URL"="@format {this.ODISSEI_URL}", "DESTINATION_DATAVERSE_API_KEY"="@format {this.ODISSEI_API_KEY}", "REFINER_ENDPOINT"="@format {this.HSN_REFINER_ENDPOINT}"}

In this example, HSN contains all settings specific to ingesting the HSN metadata. The BUCKET_NAME set in the HSN dictionary can be used generically in the code wherever a bucket name is needed. It is set to HSN_BUCKET_NAME, which declares the specific bucket name for HSN. The settings are explained further in the Settings files section.
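To make the substitution concrete, the snippet below mimics how an "@format {this.KEY}" value resolves against other settings. It is a simplified illustration in plain Python, not Dynaconf's implementation: resolve_format is a hypothetical helper, and only the {this.KEY} form on a flat dictionary is handled.

```python
import re

# Matches placeholders of the form {this.SOME_KEY}.
_PLACEHOLDER = re.compile(r"\{this\.([A-Za-z0-9_]+)\}")


def resolve_format(value, settings: dict):
    """Resolve a '@format {this.KEY}' string against a flat settings dict."""
    prefix = "@format "
    if not isinstance(value, str) or not value.startswith(prefix):
        return value  # plain values are returned unchanged
    template = value[len(prefix):]
    return _PLACEHOLDER.sub(lambda match: str(settings[match.group(1)]), template)


settings = {"HSN_BUCKET_NAME": "hsn-metadata"}
hsn = {"ALIAS": "HSN_NL", "BUCKET_NAME": "@format {this.HSN_BUCKET_NAME}"}

resolved = {key: resolve_format(value, settings) for key, value in hsn.items()}
print(resolved["BUCKET_NAME"])  # hsn-metadata
```

With Dynaconf itself no such helper is needed; the library performs the substitution when the setting is read.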

Dataverse

TODO: describe that we now use tagged images from IQSS dockerhub, and this is handled by odissei-data/odissei-dataverse-stack.

A local Dataverse instance makes it easy to deposit via the API.

https://github.com/IQSS/dataverse-docker

Only a Super User can deposit via the API.

Set the superuser boolean to true in the authenticateduser table. You are now a Super User.

More information on how to do this can be found in the documentation of the ODISSEI Dataverse stack.

If you use a containerized Dataverse instance it should live in the same network as the dev services.

Dynaconf

The Ingestion Workflow Orchestrator uses Dynaconf to manage its settings. This chapter gives a very short introduction to Dynaconf. For more information, read the Dynaconf documentation.

Use the .env file to set the environment to development, staging or production. Be careful: setting the environment to production means that all flows that use the DOI Minter will mint persistent DOIs.

ENV_FOR_DYNACONF=development

Settings files

The settings are split into multiple TOML files. This makes it easier to manage a large number of settings. You can specify which files are loaded in config.py. The files are loaded in order, and a later file overwrites settings from an earlier file when they share the same name.

settings.toml, contains the base settings
.secrets.toml, contains all secrets
_settings.toml, datastation specific settings

Each file is split into multiple sections: default, development, production. Default settings are always loaded and usually contain one or more dynamic parts using @format. Development and production contain the values that depend on the current environment.
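The layering of sections can be pictured as a simple dictionary merge. The sketch below is a plain-Python illustration of the idea rather than Dynaconf's own logic: effective_settings is a hypothetical helper, LOG_LEVEL is a made-up setting, and falling back to development when ENV_FOR_DYNACONF is unset is an assumption of this sketch.

```python
import os
from typing import Dict, Optional


def effective_settings(sections: Dict[str, dict], env: Optional[str] = None) -> dict:
    """Merge the [default] section with the section of the active environment."""
    active = (env or os.environ.get("ENV_FOR_DYNACONF", "development")).lower()
    merged = dict(sections.get("default", {}))  # defaults are always loaded
    merged.update(sections.get(active, {}))  # environment values win on conflicts
    return merged


sections = {
    "default": {"LOG_LEVEL": "INFO"},  # hypothetical setting for illustration
    "development": {"BUCKET_NAME": "path/to/local/dir"},
    "production": {"BUCKET_NAME": "path/to/s3/bucket"},
}

print(effective_settings(sections, "production"))
```

Switching the second argument to "development" yields the local directory instead, which is the effect ENV_FOR_DYNACONF has on the real settings.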

The example below shows how dynamic settings work. The value of BUCKET_NAME changes based on the current environment.

[default]
"BUCKET_NAME" = "@format {this.BUCKET_NAME}"

[development]
"BUCKET_NAME" = "path/to/local/dir"

[production]
"BUCKET_NAME" = "path/to/s3/bucket"

Dataset Workflow Examples

CBS Metadata Ingestion Workflow

The CBS Metadata Ingestion Workflow is responsible for ingesting metadata from the CBS (Central Bureau of Statistics) data provider into Dataverse. It processes the XML metadata, transforms it into JSON format, maps it to the required format for Dataverse, refines and enriches the metadata, mints a DOI, and finally imports the dataset into Dataverse. The workflow is implemented using Prefect, a workflow management library in Python.

Workflow Steps

Email Sanitizer: The XML metadata is passed through the Email Sanitizer service to remove any sensitive email information.
XML to JSON Transformation: The sanitized XML metadata is transformed into JSON format using the Dans Transformer Service.
Metadata Mapping: The JSON metadata is mapped to the required format for Dataverse using the Dataverse Mapper service.
Metadata Refinement: The mapped metadata is refined using the Metadata Refiner service. In CBS's case this means the Alternative Titles and Keywords are improved.
Workflow Versioning: The workflow versioning URL is added to the metadata using the Version Tracker service. This step ensures that the metadata includes information about the services that processed it.
DOI Minting: The metadata is passed to the DOI Minter service, which mints a DOI (Digital Object Identifier) for the dataset.
Metadata Enrichment: The metadata is enriched using two different endpoints of the Metadata Enhancer service. Each endpoint adds a specific enrichment to the metadata.
Dataverse Import: The enriched metadata, along with the DOI, is imported into Dataverse using the Dataverse Importer service.
Publication Date Update: The publication date is extracted from the metadata using a JMESPath query. If a valid publication date is found, it is passed to the Publication Date Updater service, which updates the publication date of the dataset in Dataverse.
Semantic Enrichment: The workflow performs semantic enrichment using the Semantic Enrichment service. The enrichment process adds additional information to the SOLR index using ELSST translations of the keywords.
Workflow Completion: If all the previous steps are completed successfully, the workflow is considered completed, indicating that the dataset has been ingested successfully, including the DOI.

Please note that each service mentioned in the workflow corresponds to the services listed in the table provided earlier.
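The step sequence above amounts to passing the metadata through a chain of service calls and stopping at the first failure. The sketch below shows that pattern in plain Python; run_dataset_workflow and the stub steps are hypothetical, and the real workflow implements each step as a Prefect task that calls the corresponding service.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[dict], dict]]


def run_dataset_workflow(metadata: dict, steps: List[Step]) -> Tuple[bool, dict, str]:
    """Apply each step in order and stop at the first step that raises.

    Returns (completed, last good metadata, error message).
    """
    current = metadata
    for name, step in steps:
        try:
            current = step(current)
        except Exception as error:
            return False, current, f"{name}: {error}"
    return True, current, ""


# Stub steps standing in for the service calls; the real ones make API requests.
steps: List[Step] = [
    ("email_sanitizer", lambda m: {k: v for k, v in m.items() if k != "email"}),
    ("doi_minter", lambda m: {**m, "doi": "doi:10.0000/EXAMPLE"}),
]

completed, result, error = run_dataset_workflow(
    {"title": "Example dataset", "email": "someone@example.org"}, steps
)
print(completed, result)
```

Returning the last good metadata alongside the failing step's name mirrors what the failure bucket is for: it gives you something to inspect and rerun.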
