This containerized application runs workflows that ingest dataset metadata into a Dataverse instance. The flows and tasks in these workflows are built with Prefect. If you run the container locally, they can be monitored and run from the Prefect Orion UI at http://localhost:4200.
Most flows start with an entry workflow, which can be found in the `entry_workflows` directory. Here the metadata is first harvested using OAI-PMH and uploaded to S3 storage. Afterwards, the metadata is fetched from that S3 storage and a provenance object is created for the ingested metadata. A settings dictionary that is specific to the data provider is also constructed here with Dynaconf.
Next, the entry workflow runs a sub-flow for every dataset's metadata to handle the actual ingestion. These flows can be found in the `dataset_workflows` directory. A dataset workflow uses simple tasks that each make an API call to a service. These services often transform, improve or alter the metadata in some way.
In short, most ingestion workflows take the following steps:
- Harvest metadata and upload it to S3 storage.
- Fetch dataset metadata from S3 storage.
- Create a version object for all services that will be used for ingestion.
- Run the dataset workflow for every dataset's metadata.
- Use tasks that make API calls to different services to transform the metadata.
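
The steps above can be sketched in plain Python. This is only an illustration of the control flow, not the project's Prefect code: `entry_workflow`, `fetch_metadata`, `create_version_object` and `dataset_workflow` are hypothetical names, and the real flows talk to S3 storage and the services.

```python
from typing import Callable, Dict, List


def entry_workflow(
    fetch_metadata: Callable[[], Dict[str, str]],
    create_version_object: Callable[[], dict],
    dataset_workflow: Callable[[str, dict], None],
) -> List[str]:
    """Run the dataset workflow once per metadata file.

    Returns the names of the metadata files whose sub-flow failed.
    """
    version = create_version_object()  # one version object per run
    failed: List[str] = []
    for name, metadata in fetch_metadata().items():
        try:
            dataset_workflow(metadata, version)
        except Exception:
            # A failing dataset does not stop the remaining datasets.
            failed.append(name)
    return failed
```

Passing the fetch and ingest steps in as callables keeps the sketch independent of S3 and of any specific service.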
In this section the different API services used in the workflows are shown. These services can be used in a workflow in different combinations, depending on the metadata provided by the data provider.
| Service Name | Description | Deployment URL | GitHub Repo |
|---|---|---|---|
| Dataverse Mapper | Maps any JSON to Dataverse's JSON format. | https://dataverse-mapper.labs.dansdemo.nl/docs | GitHub |
| Dans Transformer Service | Transforms from XML to JSON (or from/to other formats). | https://transformer.labs.dansdemo.nl/docs | GitHub |
| Metadata Refiner | Refines JSON metadata. | https://metadata-refiner.labs.dansdemo.nl/docs | GitHub |
| Metadata Enhancer | Enriches JSON metadata. | https://metadata-enhancer.labs.dansdemo.nl/docs | GitHub |
| Email Sanitizer | Removes all emails from the metadata. | https://emailsanitizer.labs.dansdemo.nl/docs | GitHub |
| Version Tracker | Stores JSON containing version information. | https://version-tracker.labs.dansdemo.nl/docs | GitHub |
| DOI Minter | Mints a DOI for a dataset. Should be used with CAUTION since if used with production settings this will mint a permanent DOI. | https://dataciteminter.labs.dansdemo.nl/docs | GitHub |
| OAI-PMH Harvester | Harvester service to harvest the metadata from data providers using OAI-PMH. | | GitHub |
| OAI Enricher Service | Enrich Dataverse OAI-PMH responses with additional data. | https://oai-service.labs.dansdemo.nl/docs | GitHub |
Here is a list of make commands that can be used for easy setup:
- `make build`: Build and start the project.
- `make start`: Start the project in non-detached mode.
- `make startbg`: Start the project in detached mode (background).
- `make down`: Down the running project.
- `make dev-build`: Build and start the development setup.
- `make dev-down`: Down the ingest services in development mode.
- `make deploy`: Deploy all ingestion workflows to the Prefect server.
- `make ingest`: Run a specific ingest flow in Prefect with optional arguments for the target. It is also possible to specify if the metadata should be harvested. If not specified, the metadata will be harvested.
If you want to develop new flows for the Ingestion Orchestrator you might want to run the services described above locally. There are two ways to do this.
For basic development without a full Dataverse portal:

    cp dot_env_example .env
    cp dot_env_development_example .env.development
    make dev-build

This should set up the Prefect container and the services used during the ingestion workflows.
For complete development with a local ODISSEI Dataverse portal:

    cp dot_env_example .env
    cp dot_env_development_example .env.development
    make dev-full-build

Should any issues arise while setting up the ODISSEI portal, run `make clean-all` and then run `make dev-full-build` again.
Should the extraction of the ODISSEI API key fail, you can call `make extract-dataverse-apikey` manually.
To clean up the project, simply run the make command: `make clean-all`.
WARNING: This will delete volumes, generated files, and networks.
- `cp dot_env_example .env`
- `cp scripts/configuration/secrets_example.toml scripts/configuration/.secrets.toml`
- Add the necessary API tokens and credentials to the `.secrets.toml`.
- Set `ENV_FOR_DYNACONF` in the `.env` to `staging`.
- `make build`
- `make deploy`
- Go to localhost:4200/deployments.
- Click the ellipsis icon of a workflow and select either custom run or quick run.
If you've selected custom run you can optionally fill in a target url and
key argument to specify a different target Dataverse.
If you select quick run it will use the target in the settings in
odissei_settings.toml and the key in .secrets.toml.
For the Dataverse ingestion pipeline, there is also a required argument for
the settings_dict_name. The options for ingesting with Dataverse as both the
source and target use the following input:
- DANS datastation SSH, subset of only the social science datasets: 'DANS'
- IISG's datasets: 'HSN'
- Subverses of dataverse.nl: 'DELFT', 'AVANS', 'FONTYS', 'GRONINGEN', 'HANZE', 'HR', 'LEIDEN', 'MAASTRICHT', 'TILBURG', 'UMCU', 'UTRECHT', 'VU'
The `dataverse_deletion.yaml` and `dataverse_ingestion.yaml` files contain the configuration for deploying the scheduled workflows. Deploying these YAML files sets up the scheduled workflows, which will then run automatically. Be careful with this setup if that is not your intent. Deploy these YAML files using the following command:
docker exec prefect-worker prefect deploy --prefect-file deployment/dataverse_ingestion.yaml --all
- `make ingest data_provider=CBS TARGET_URL=https://portal.example.odissei.nl TARGET_KEY=abcde123-11aa-22bb-3c4d-098765432abc DO_HARVEST=False`
- A prompt will show confirming the target.
- Type yes to continue or anything else to abort.
The make ingest command allows you to specify the url and API key of a
specific target Dataverse. If you do not provide them, it will use the target
in the settings in odissei_settings.toml and the key in .secrets.toml. It also
allows you to specify if the pipeline should first harvest the metadata.
Skipping the harvest is useful for quick development after the metadata was already harvested, or
to rerun a bucket with metadata files from failed dataset workflows.
Forcing a re-harvest of all datasets can be accomplished using the `FULL_HARVEST=True` option.
There is also an option to override the default bucket name by specifying the target bucket with `TARGET_BUCKET`.
This is the list of data providers that can be used in the make ingest command:
'TWENTE', 'DELFT', 'AVANS', 'FONTYS', 'GRONINGEN', 'HANZE', 'HR', 'LEIDEN', 'MAASTRICHT', 'TILBURG', 'TRIMBOS', 'UMCU', 'UTRECHT', 'VU', 'DANS', 'CBS', 'LISS', 'HSN', 'CID'
To debug the services listed in the services table, use the development project
setup. Afterwards, stop the service that you want to debug.
This can be done in your Docker interface or by running `docker-compose stop <container_name>`,
replacing `<container_name>` with the name of the service you want to stop.
Next, go to the GitHub repository specified in the table for the service.
Clone it and follow the instructions in the readme. Add the service to the
ingest network with `make network-add network_name=ingest container_name=<container_name>`.
Use a deployed flow or use `make ingest` to test any changes made to the service.
A running flow produces logging information that can be viewed in the Prefect UI. If the flow is run from the command line, the logs are also shown in the terminal.
If you want to add logging, first call `logger = get_run_logger()` in the context of a running flow or task, then use `logger.info()` to log any information.
If an ingestion pipeline workflow is run for a specific data provider, it will create a sub flow for each dataset metadata file retrieved from S3 storage. One sub flow ingests a single metadata file.
If a sub flow fails, a bucket is created using the data provider's name and the id of the parent workflow (the ingestion pipeline workflow). The metadata file that the sub flow was ingesting is stored in this bucket. Any sub flows that fail after that also store their metadata file in this bucket.
This is done for two reasons:
- Isolation of the failed metadata files for easier investigation.
- Possibility to rerun only the metadata files of the failed dataset sub flows.
The second point requires the user to change the data provider's bucket name.
For this, you can use the option to override the default bucket name by specifying the target bucket for the ingest.
It can also be changed via the settings, which can be found in scripts/configuration/odissei_settings.toml.
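
As a rough illustration of how such a bucket name could be derived, the sketch below joins the provider name and the parent workflow id. The helper `failed_bucket_name` and the exact naming scheme are assumptions, not the project's code. The only constraints applied are the general S3 bucket naming rules (lowercase letters, digits, dots and hyphens, 3 to 63 characters).

```python
import re


def failed_bucket_name(data_provider: str, flow_run_id: str) -> str:
    """Build a bucket name from the provider name and the parent workflow id.

    The naming scheme is an assumption made for this sketch; the workflow id
    is placed at the end of the name.
    """
    name = f"{data_provider}-{flow_run_id}".lower()
    # Replace characters that S3 bucket names do not allow.
    name = re.sub(r"[^a-z0-9.-]", "-", name)
    if not 3 <= len(name) <= 63:
        raise ValueError(f"invalid bucket name length: {len(name)}")
    return name


print(failed_bucket_name("CBS", "0f8fad5b-d9cb-469f-a165-70867728950e"))
# cbs-0f8fad5b-d9cb-469f-a165-70867728950e
```

A name built this way could then be used as the target bucket when rerunning the failed metadata files.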
When properly configured, every failed workflow will, besides creating a bucket, also send a notification to a Slack channel.
Currently this is the 'prefect-notifications' channel in the odissei-ingest Slack workspace.
For more details on how notification handling is set up, see notifications.md.
Follow these steps to run the failed metadata ingest:
- Find the bucket created for the failed metadata in the logs (its name ends with the workflow id).
- Use that bucket name in the ingest command. Less convenient, but possible, is to temporarily change the `<data provider>_BUCKET_NAME` setting to that bucket name, where `<data provider>` is the data provider you ran the ingestion for.
- Run `make ingest TARGET_BUCKET=<bucket with failures> DO_HARVEST=False`, so that you don't harvest the metadata from the data provider into the specified bucket.
The metadata that is used by the workflows is stored in s3 buckets. The key, id
and url of the server of the s3 storage should be set in the .secrets.toml as
MINIO_SECRET, MINIO_KEY and MINIO_SERVER_URL
respectively.
For a specific data provider a BUCKET_NAME should be added for that provider.
The bucket in s3 storage that contains the metadata for the provider should use
the same name as the BUCKET_NAME for that provider.
Example in odissei_settings.toml:
HSN_BUCKET_NAME="hsn-metadata"
HSN={"ALIAS"="HSN_NL", "BUCKET_NAME"="@format {this.HSN_BUCKET_NAME}", "SOURCE_DATAVERSE_URL"="@format {this.IISG_URL}", "DESTINATION_DATAVERSE_URL"="@format {this.ODISSEI_URL}", "DESTINATION_DATAVERSE_API_KEY"="@format {this.ODISSEI_API_KEY}", "REFINER_ENDPOINT"="@format {this.HSN_REFINER_ENDPOINT}"}
In this example, HSN contains all settings that are specific to ingesting the HSN metadata. The BUCKET_NAME set in the HSN dictionary can be used generically in the code wherever a bucket name is needed. It is set to HSN_BUCKET_NAME, which declares the specific name of the bucket for HSN. Further explanation of the settings can be found in the Settings files section.
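
To make the `@format {this.…}` reference concrete, the sketch below mimics the substitution in plain Python. Dynaconf performs this resolution itself. The `resolve` helper is hypothetical and is not part of the Dynaconf API; it only shows how the value of `BUCKET_NAME` ends up equal to `HSN_BUCKET_NAME`.

```python
import re

_THIS = re.compile(r"\{this\.([A-Za-z0-9_]+)\}")


def resolve(value: str, settings: dict) -> str:
    """Expand '@format {this.KEY}' by looking KEY up in the same settings."""
    prefix = "@format "
    if not value.startswith(prefix):
        return value
    template = value[len(prefix):]
    # Referenced values may themselves be '@format' strings, so resolve them too.
    return _THIS.sub(
        lambda m: resolve(str(settings[m.group(1)]), settings), template
    )


settings = {
    "HSN_BUCKET_NAME": "hsn-metadata",
    "HSN": {"ALIAS": "HSN_NL", "BUCKET_NAME": "@format {this.HSN_BUCKET_NAME}"},
}

print(resolve(settings["HSN"]["BUCKET_NAME"], settings))  # hsn-metadata
```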
TODO: describe that we now use tagged images from IQSS dockerhub, and this is handled by odissei-data/odissei-dataverse-stack.
A local Dataverse instance makes it easy to deposit via the API.
https://github.com/IQSS/dataverse-docker
Only a Super User can deposit via the API.
Set the superuser boolean to true in the authenticateduser table. You are
now a Super User.
More information on how to do this can be found in the documentation of the ODISSEI dataverse stack here.
If you use a containerized Dataverse instance it should live in the same network as the dev services.
The Ingestion Workflow Orchestrator uses Dynaconf to manage its settings. This chapter will give a very short introduction on Dynaconf. For more information read the docs.
Use the .env file to set the environment to either development, staging or
production. Be careful: setting the environment to production means that all
flows that use the DOI Minter will mint persistent DOIs.
ENV_FOR_DYNACONF=development
The settings are split into multiple toml files. This makes it easier to manage
a large amount of settings. You can specify which files are loaded in
config.py. The files are loaded in order and overwrite each other if they
share settings with the same name.
- settings.toml, contains the base settings
- .secrets.toml, contains all secrets
- _settings.toml, datastation specific settings
Each file is split into multiple sections: default, development, production.
Default settings are always loaded and usually contain one or more dynamic
parts using @format. Development and production contain the values that
depend on the current environment.
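
The layering described above can be pictured with a small merge function. This is a simplified model rather than Dynaconf's loader: `load_settings` is a hypothetical helper, each file is represented as a dictionary of sections, and the precedence between sections of different files is an assumption of this sketch (all default sections first, then the sections of the current environment).

```python
from typing import Dict, List

Sections = Dict[str, Dict[str, str]]  # section name -> settings


def load_settings(files: List[Sections], env: str) -> Dict[str, str]:
    """Merge settings files in order; later values overwrite earlier ones."""
    merged: Dict[str, str] = {}
    for section in ("default", env):  # default is always loaded first
        for sections in files:  # files are applied in the given order
            merged.update(sections.get(section, {}))
    return merged


base = {"default": {"A": "base", "B": "base"}, "development": {"B": "dev"}}
provider = {"default": {"A": "provider"}, "production": {"B": "prod"}}

print(load_settings([base, provider], "development"))
# {'A': 'provider', 'B': 'dev'}
```

For the exact loading rules, the Dynaconf documentation is the authoritative source.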
The example below shows how dynamic settings work. The metadata directory changes based on the current environment.
[default]
"BUCKET_NAME" = "@format {this.BUCKET_NAME}"
[development]
"BUCKET_NAME" = "path/to/local/dir"
[production]
"BUCKET_NAME" = "path/to/s3/bucket"

The CBS Metadata Ingestion Workflow is responsible for ingesting metadata from the CBS (Central Bureau of Statistics) data provider into Dataverse. It processes the XML metadata, transforms it into JSON format, maps it to the required format for Dataverse, refines and enriches the metadata, mints a DOI, and finally imports the dataset into Dataverse. The workflow is implemented using Prefect, a workflow management library in Python.
- Email Sanitizer: The XML metadata is passed through the Email Sanitizer service to remove any sensitive email information.
- XML to JSON Transformation: The sanitized XML metadata is transformed into JSON format using the Dans Transformer Service.
- Metadata Mapping: The JSON metadata is mapped to the required format for Dataverse using the Dataverse Mapper service.
- Metadata Refinement: The mapped metadata is refined using the Metadata Refiner service. In CBS's case this means the Alternative Titles and Keywords are improved.
- Workflow Versioning: The workflow versioning URL is added to the metadata using the Version Tracker service. This step ensures that the metadata includes information about the services that processed it.
- DOI Minting: The metadata is passed to the DOI Minter service, which mints a DOI (Digital Object Identifier) for the dataset.
- Metadata Enrichment: The metadata is enriched using two different endpoints of the Metadata Enhancer service. Each service adds specific enrichment to the metadata.
- Dataverse Import: The enriched metadata, along with the DOI, is imported into Dataverse using the Dataverse Importer service.
- Publication Date Update: The publication date is extracted from the metadata using a JMESPath query. If a valid publication date is found, it is passed to the Publication Date Updater service, which updates the publication date of the dataset in Dataverse.
- Semantic Enrichment: The workflow performs semantic enrichment using the Semantic Enrichment service. The enrichment process adds additional information to the SOLR index using ELSST translations of the keywords.
- Workflow Completion: If all the previous steps are completed successfully, the workflow is considered completed, indicating that the dataset has been ingested successfully, including the DOI.
Please note that each service mentioned in the workflow corresponds to the services listed in the table provided earlier.
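
The way the steps feed into each other can be sketched as a chain of functions, each taking the output of the previous one. This is a minimal illustration, not the actual CBS workflow: `run_dataset_workflow` and the stand-in steps are hypothetical, and the real tasks call the services over HTTP.

```python
from typing import Any, Callable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_dataset_workflow(metadata: Any, steps: List[Step]) -> Any:
    """Feed the output of each step into the next one, in order."""
    for name, step in steps:
        try:
            metadata = step(metadata)
        except Exception as exc:
            # Name the failing step so the log shows where ingestion stopped.
            raise RuntimeError(f"step {name!r} failed") from exc
    return metadata


# Stand-in steps with made-up behavior, only to show the chaining.
steps: List[Step] = [
    ("email_sanitizer", lambda xml: xml.replace("jane@example.org", "")),
    ("transformer", lambda xml: {"raw": xml}),
    ("mapper", lambda doc: {"datasetVersion": doc}),
]

result = run_dataset_workflow("<title>Survey</title>jane@example.org", steps)
print(result)  # {'datasetVersion': {'raw': '<title>Survey</title>'}}
```

Because a failing step raises immediately, later steps such as DOI minting are never reached for metadata that could not be processed.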