Datalens: Semantic Model and Knowledge Graph for ML Resources

Datalens provides a semantic model, knowledge graph construction pipeline, SPARQL competency questions, and interactive visualizations for exploring machine learning resources such as datasets and models.

The project focuses on describing resources published on platforms such as Hugging Face, aligning their metadata with controlled vocabularies, and enabling semantic exploration of tasks, modalities, licenses, provenance links, and popularity indicators.

Overview

Datalens provides an ontology and a thesaurus to:

Model machine learning datasets and models as semantic resources
Describe resources with tasks, subtasks, modalities, formats, libraries, licenses, languages, regions, and scholarly references
Capture provenance relationships, including training-data links and model derivations
Build an RDF knowledge graph from Hugging Face metadata
Support competency-question analysis through SPARQL queries and Venus visualizations

The Datalens ontology namespace is http://ns.inria.fr/datalens/ontology/.

The Datalens thesaurus namespace is http://ns.inria.fr/datalens/thesaurus/.

The Datalens knowledge graph is publicly available through a SPARQL endpoint at http://graph.i3s.fr/repositories/datalens.
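The endpoint can be queried from any SPARQL client. The following is a minimal sketch, assuming the endpoint accepts standard SPARQL 1.1 Protocol GET requests and returns JSON results; the helper names `build_request` and `run_select` are illustrative and not part of the repository.

```python
import json
import urllib.parse
import urllib.request

# Endpoint URL as given in this README.
ENDPOINT = "http://graph.i3s.fr/repositories/datalens"


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(query: str, endpoint: str = ENDPOINT, timeout: float = 30.0) -> list:
    """Send a SELECT query and return the list of result bindings.

    Assumption: the server answers with the standard SPARQL JSON results format.
    """
    with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]
```

Calling `run_select("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")` would then return up to five generic triples, which is a quick way to check that the endpoint is reachable before running the competency questions.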

RDF Data Modeling

The ontology directory contains the semantic model:

datalens_o.ttl: OWL ontology for machine learning resources, including datasets, models, distributions, annotations, libraries, tasks, modalities, and provenance relationships.
datalens_th.ttl: SKOS thesaurus with controlled vocabularies for tasks, subtasks, modalities, formats, size categories, libraries, and transformation types.

See ontology/README.md for a summary of the main classes and concept schemes.

Knowledge Graph Construction

The kg directory contains the pipeline used to build the Datalens knowledge graph from Hugging Face metadata.

The pipeline:

Fetches dataset and model metadata from the Hugging Face Hub API
Splits large JSON collections into batches
Normalizes tags and metadata fields
Aligns raw metadata values with Datalens thesaurus concepts
Generates stable identifiers for resources and related entities
Lifts processed JSON metadata to RDF/Turtle with xR2RML mappings
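The intermediate steps above (batching, tag normalization, alignment, and identifier generation) can be illustrated with a short sketch. This is not the pipeline's actual code: the function names, the `other` fallback key, the hash-based identifier scheme, and the shape of the concept index are all assumptions made for illustration. The only external fact relied on is that Hugging Face Hub tags commonly use a `prefix:value` form such as `license:mit`.

```python
import hashlib
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional


def batched(records: Iterable, size: int) -> Iterator[list]:
    """Split a large collection into lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk


def normalize_tags(tags: Iterable[str]) -> Dict[str, List[str]]:
    """Group 'prefix:value' tags by prefix, lowercased and trimmed.

    Tags without a prefix go under the hypothetical key 'other'.
    """
    grouped: Dict[str, List[str]] = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if not sep:
            key, value = "other", tag
        grouped.setdefault(key.strip().lower(), []).append(value.strip().lower())
    return grouped


def align(value: str, concept_index: Dict[str, str]) -> Optional[str]:
    """Look up a normalized value in a label-to-concept-IRI index.

    The index is assumed to be built beforehand from the thesaurus labels.
    """
    return concept_index.get(value.strip().lower())


def stable_id(kind: str, repo_id: str) -> str:
    """Derive a deterministic identifier from the resource kind and repo id.

    Hashing is one possible scheme; the real pipeline may use another.
    """
    return hashlib.sha1(f"{kind}:{repo_id}".encode("utf-8")).hexdigest()[:16]
```

Because `stable_id` depends only on its inputs, re-running the pipeline on the same metadata would yield the same identifiers, which is the property the "stable identifiers" step needs.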

See kg/README.md for requirements and execution details.

Expert Validation of the SKOS Thesaurus

To validate the ML task hierarchy, we conducted an expert evaluation with IT researchers and developers.

See the expert-validation directory for details on the protocol and the results.

Competency Questions and Visualizations

The sparql-examples directory contains SPARQL implementations of the competency questions used to inspect the knowledge graph.

The current competency questions cover:

Datasets supporting a given machine learning task
Datasets supporting a task for a specific modality under constraints
Provenance relationships between models, datasets, and derived resources
Popularity indicators for datasets and models

The vis directory contains a Vite-based dashboard that uses Venus elements to visualize those competency questions. Each visualization loads its query from the matching .rq file in sparql-examples, which keeps the SPARQL examples and the dashboard in sync.
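The same single-source idea can be reused outside the dashboard, for example in a script that runs every competency question. The sketch below is an assumption-laden illustration rather than repository code: `load_cq_queries` is a hypothetical helper, and it only assumes the `cq*.rq` naming shown in the directory structure.

```python
from pathlib import Path
from typing import Dict


def load_cq_queries(base_dir: str = "sparql-examples") -> Dict[str, str]:
    """Read every cq*.rq file and return a mapping from file stem to query text.

    Reading the queries from disk, rather than copying them into scripts,
    avoids keeping two diverging versions of the same query.
    """
    queries: Dict[str, str] = {}
    for path in sorted(Path(base_dir).glob("cq*.rq")):
        queries[path.stem] = path.read_text(encoding="utf-8")
    return queries
```

Run from the repository root, `load_cq_queries()` would return one entry per query file, keyed by names such as `cq1`.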

To run the visualization dashboard:

cd vis
npm install
npm run dev

An interactive dashboard presenting the CQ visualizations is available here.

Directory Structure

datalens/
├── ontology/              # OWL ontology and SKOS thesaurus
│   ├── datalens_o.ttl
│   ├── datalens_th.ttl
│   └── README.md
│
├── kg/                    # Knowledge graph construction pipeline
│   ├── fechting/          # Hugging Face metadata fetching and batching scripts
│   ├── processing/        # Metadata normalization and cleanup scripts
│   ├── lifting/           # xR2RML mappings and lifting orchestration
│   ├── requirements.txt
│   └── README.md
│
├── sparql-examples/       # Competency-question SPARQL queries
│   ├── cq1.rq
│   ├── cq2.rq
│   ├── cq3.rq
│   ├── cq4.rq
│   └── README.md
│
└── vis/                   # Venus dashboard for CQ visualizations
    ├── index.html
    ├── css/
    ├── js/
    └── package.json

License

This project is licensed under the Apache License 2.0. See the LICENSE file.
