Support for UK GEMINI 2.3 and MEDIN 3.1.2 metadata profiles #1224

@KoalaGeo

Summary

I'd like to propose adding support to pycsw for two UK metadata profiles that are widely used in UK government and research catalogues but not currently covered:

  • UK GEMINI 2.3 — the [Association for Geographic Information][agi]'s Geo-spatial Metadata Interoperability Initiative; the UK government's
    recommended standard for describing geographic data, used across data.gov.uk, Defra Data Services Platform, Natural England, BGS, EIDC, etc.
  • MEDIN 3.1.2 — the [Marine Environmental Data and Information Network][medin]'s discovery metadata standard, used by the MEDIN portal
    and the UK marine Data Archive Centres (BODC, BGS, UKHO, MEDIN, etc.). MEDIN is explicitly a marine profile of GEMINI 2.3.

Both are technically constraint-based profiles of ISO 19115/19139, which is already supported in pycsw via the apiso profile. Both are validated via
Schematron rather than by extending the XSD. Both are published under [CC-BY 4.0][ccby].

Before opening any code PRs I'd like to agree the shape with maintainers, because the contribution touches the profile-plugin registration logic.

Background: how GEMINI and MEDIN relate to ISO 19139

Both standards:

  • Use the existing ISO 19139 XML schema and namespace (http://www.isotc211.org/2005/gmd). MEDIN's [MedinMetadataProfile_v3.1.2.xsd][medin-xsd] is a one-line wrapper that just xs:includes the gmd application schema and uses targetNamespace="http://www.isotc211.org/2005/gmd" — i.e. it adds nothing to the schema. GEMINI ships no XSD at all.
  • Are rooted at <gmd:MD_Metadata>, i.e. they share apiso's typename.
  • Are enforced entirely via Schematron rules layered on top of ISO 19139:
    • [GEMINI 2.3 Schematron][gemini-sch] (CC-BY 4.0, by BGS under contract to AGI)
    • [MEDIN 3.1.2 Schematron][medin-sch] (CC-BY 4.0, by SeaZone Solutions for MEDIN)
  • Stack: MEDIN explicitly maps onto and tightens GEMINI (the MEDIN repo ships
    a MEDIN_3.1.2_GEMINI_2.3_INSPIRE_Mapping.xlsx), which in turn tightens
    INSPIRE / ISO 19139.
    Structurally this is similar to the optional INSPIRE extension already inside apiso, toggled by config['metadata']['inspire']['enabled'].
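To illustrate the points above, here is a minimal sketch (standard library only) showing that a record's profile cannot be told apart by root element or namespace, only by content such as gmd:metadataStandardName and gmd:metadataStandardVersion. The sample record and its values are illustrative assumptions, not copied from the AGI or MEDIN fixtures, and describe_record is a hypothetical helper, not a pycsw function. Real records may encode these elements differently, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Illustrative record; the name/version values are assumptions for the demo.
SAMPLE = """<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:metadataStandardName>
    <gco:CharacterString>UK GEMINI</gco:CharacterString>
  </gmd:metadataStandardName>
  <gmd:metadataStandardVersion>
    <gco:CharacterString>2.3</gco:CharacterString>
  </gmd:metadataStandardVersion>
</gmd:MD_Metadata>"""


def describe_record(xml_text):
    """Return (is_gmd_root, standard_name, standard_version) for a record."""
    root = ET.fromstring(xml_text)
    # The root element is the same for plain ISO 19139, GEMINI and MEDIN.
    is_gmd_root = root.tag == "{%s}MD_Metadata" % NS["gmd"]
    name = root.findtext(
        "gmd:metadataStandardName/gco:CharacterString", namespaces=NS)
    version = root.findtext(
        "gmd:metadataStandardVersion/gco:CharacterString", namespaces=NS)
    return is_gmd_root, name, version


print(describe_record(SAMPLE))
```

Whether the standard name is a reliable discriminator is exactly the kind of rule the Schematron files encode, so this is only a cheap pre-check, not a substitute for validation.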

The architectural question

Because GEMINI, MEDIN, and apiso all share the same XML namespace and the same typename (gmd:MD_Metadata), they cannot cleanly coexist as three independent Profile subclasses without revisiting the registration logic.

Profile.__init__ does:

model['typenames'][self.typename] = self.repository

which overwrites on each load — so whichever profile is listed last in server.profiles: wins for typename-based dispatch.
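The overwrite can be reproduced with a stripped-down stand-in. This is a sketch of the pattern only: the Profile class below is a hypothetical mock that models nothing but the registration line quoted above, and load_profiles is an illustrative helper rather than pycsw's actual loader.

```python
class Profile:
    """Mock profile plugin; only the typename registration is modelled."""

    def __init__(self, name, typename, repository, model):
        self.name = name
        self.typename = typename
        self.repository = repository
        # Same pattern as the line quoted above: keyed by typename only.
        model['typenames'][self.typename] = self.repository


def load_profiles(names):
    """Load profiles in configuration order and return the resulting model."""
    model = {'typenames': {}}
    for name in names:
        # All three profiles share the gmd:MD_Metadata typename.
        Profile(name, 'gmd:MD_Metadata', {'profile': name}, model)
    return model


model = load_profiles(['apiso', 'ukgemini', 'medin'])
print(model['typenames'])
# → {'gmd:MD_Metadata': {'profile': 'medin'}}
```

Only one entry survives, and it belongs to whichever profile was loaded last.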

CSW clients differentiate profiles by outputSchema, not typename. The practical question is therefore: what URI does each UK profile advertise as
its outputSchema?

  • ISO 19139 / apiso uses http://www.isotc211.org/2005/gmd.
  • GEMINI does not formally define one. There's prior art in some UK
    catalogues using https://www.agi.org.uk/gemini/2.3 or similar; happy to
    coordinate with AGI for a blessed URI if useful.
  • MEDIN similarly does not define one; https://medin.org.uk/discovery-metadata/3.1.2
    would be a candidate.
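As a sketch of what outputSchema-based dispatch could look like, assuming the candidate URIs listed above were adopted: the mapping and resolve_profile below are hypothetical, and the GEMINI and MEDIN URIs are placeholders until a convention is agreed.

```python
# Only the gmd URI is an established value; the other two are the
# candidate URIs floated in this issue and may change.
OUTPUT_SCHEMA_TO_PROFILE = {
    'http://www.isotc211.org/2005/gmd': 'apiso',
    'https://www.agi.org.uk/gemini/2.3': 'ukgemini',
    'https://medin.org.uk/discovery-metadata/3.1.2': 'medin',
}


def resolve_profile(output_schema):
    """Pick a profile by the requested outputSchema rather than typename."""
    try:
        return OUTPUT_SCHEMA_TO_PROFILE[output_schema]
    except KeyError:
        raise ValueError('unsupported outputSchema: %r' % output_schema)


print(resolve_profile('https://www.agi.org.uk/gemini/2.3'))
# → ukgemini
```

Because the key is the URI, all three profiles can keep the shared gmd:MD_Metadata typename without colliding in this lookup.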

Design options

I see three viable shapes:

A. Two independent sibling profiles (ukgemini, medin), each mirroring iso19115p3.py's structure (~600 lines each).

  • Pros: lowest risk to existing code; reviewable independently; matches the existing precedent of how iso19115p3 was added.
  • Cons: substantial duplication of apiso's queryables (90%+ of the gmd XPath mappings are identical across all three). Doesn't solve the typename
    collision.

B. A lightweight intermediate base (e.g. iso19139_constrained) that apiso, ukgemini, and medin can compose from. The differences
(outputSchema URI, additional queryables, schematron file(s), extended-capabilities content) become subclass overrides.

  • Pros: removes duplication; gives a clean home for future ISO-19139 national profiles (Spain NEM, Germany GDI-DE, Australia ANZLIC, etc.).
  • Cons: meaningful change to apiso internals; needs careful migration to avoid breaking existing apiso deployments.

C. A schematron-validation extension inside apiso, modelled on the existing INSPIRE block. Configurable via something like
config['metadata']['profiles']['ukgemini'] / ['medin'] toggles that load the relevant .sch files for transactional validation but reuse apiso's
outputSchema and queryables.

  • Pros: smallest, least invasive PR.
  • Cons: doesn't surface GEMINI/MEDIN as discoverable profiles in GetCapabilities; reduces value for CSW clients that negotiate by
    outputSchema.
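To make option B more concrete, here is a rough sketch of the subclass-override idea. Every class name, queryable name, XPath and file name below is a hypothetical placeholder chosen for illustration; none of it reflects pycsw's current plugin API or the real apiso mappings.

```python
class Iso19139Constrained:
    """Hypothetical shared base for profiles rooted at gmd:MD_Metadata."""

    name = 'iso19139_constrained'
    typename = 'gmd:MD_Metadata'
    output_schema = 'http://www.isotc211.org/2005/gmd'
    schematron_files = ()  # no extra rules at the base level
    # Illustrative mapping only; the real apiso mappings are far larger.
    extra_queryables = {
        'apiso:Title': ('gmd:identificationInfo//gmd:citation'
                        '//gmd:title/gco:CharacterString'),
    }

    @classmethod
    def queryables(cls):
        """Merge queryables from the base class down to the subclass."""
        merged = {}
        for klass in reversed(cls.__mro__):
            merged.update(vars(klass).get('extra_queryables', {}))
        return merged


class Apiso(Iso19139Constrained):
    name = 'apiso'  # inherits everything, so behaviour is unchanged


class UkGemini(Iso19139Constrained):
    name = 'ukgemini'
    output_schema = 'https://www.agi.org.uk/gemini/2.3'  # candidate URI
    schematron_files = ('gemini-2.3.sch',)  # placeholder file name


class Medin(UkGemini):
    name = 'medin'
    output_schema = 'https://medin.org.uk/discovery-metadata/3.1.2'  # candidate
    schematron_files = UkGemini.schematron_files + ('medin-3.1.2.sch',)
    extra_queryables = {
        # Hypothetical queryable for MEDIN's mandatory vertical extent.
        'medin:VerticalExtentMinimum': (
            'gmd:identificationInfo//gmd:EX_VerticalExtent'
            '/gmd:minimumValue/gco:Real'),
    }


for profile in (Apiso, UkGemini, Medin):
    print(profile.name, profile.output_schema, sorted(profile.queryables()))
```

Having Medin inherit from UkGemini mirrors the stacking described in the background section; whether that should be inheritance or composition is part of what needs agreeing.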

I'd lean toward B done lightweight, staged across three PRs:

  1. Refactor: introduce the small base / extension mechanism with apiso migrated to it as a no-op (existing apiso behaviour preserved, all
    existing apiso tests still pass).
  2. Add ukgemini as the first new consumer, with Schematron validation and the AGI sample records as test fixtures.
  3. Add medin, sharing the GEMINI machinery and adding MEDIN-specific Schematron rules plus a small set of MEDIN-specific queryables (notably
    vertical extent, which MEDIN makes mandatory).

But I'd very much welcome a steer before committing to a shape — happy to do A or C instead if that's preferred.

Specific questions for maintainers

  1. Preferred design option (A, B, C, or something else)?
  2. Synthetic outputSchema URIs: is there an established convention in pycsw for profiles whose source standard doesn't formally define one?
  3. Schematron at runtime: pycsw already depends on lxml, which provides etree.Schematron. Is it acceptable to validate via lxml on transactional
    inserts / harvest, or is there a preferred validation hook?
  4. Bundled Schematron files: both upstream Schematrons are CC-BY 4.0, which is compatible with MIT redistribution provided attribution is preserved.
    Is it acceptable to bundle them under pycsw/plugins/profiles/<name>/schemas/, or should they be optional downloads at deploy time? My preference would be to bundle, with a NOTICE file recording attribution.
  5. write_record synthesis: should the GEMINI/MEDIN profiles re-synthesize records from queryables for a non-full ElementSetName (esn), as apiso
    does, or only echo stored XML? GEMINI/MEDIN have constraints on gmd:metadataStandardName / gmd:metadataStandardVersion that synthesis would need to respect, and a few mandatory elements (e.g. MEDIN vertical extent) that don't currently live in pycsw's core mappings.

Scope of this issue

This issue is to agree the shape only. Once there's a steer on the questions above I'll open the implementing PR(s) and link them back here.

References

pycsw
