atalagk/openEHR-Data-Generator

openEHR Data Generator

For mere mortals :)

This tool generates synthetic openEHR compositions with variation that does not break the original archetype or template constraints. You can supply operational templates (OPT) or your existing canonical compositions, and it produces as many flat or canonical compositions as you want (when the total exceeds 10,000 files, the output can be packaged as a single tar.gz). Useful for testing, demonstrations, or training environments without using real patient data.


Overview

  • Mutation is driven by webtemplate (WT) constraints per rmType, not random guessing
  • Flat format (FLAT) used for generation; canonical format supported for duplication
  • Targets ehrbase via openEHR REST API v1; any openEHR REST API v1 spec-compliant CDR should work
  • Entry point: gen-openehr.py

EHRbase Setup (if needed)

If you don't have access to an openEHR CDR, check the /ehrbase folder for a Docker setup (.env.ehrbase and docker-compose.yml, which I improved from the original EHRbase distribution, e.g. a persistent DB, container health checks, and more).

  • go into /ehrbase folder
  • run docker compose up -d

And voila - in most cases it should be up and running on: http://localhost:8080/ehrbase/rest/openehr/v1


Modes

Mode 1 — Duplicate

Reads canonical JSON compositions from source_models/user_compositions/, strips UIDs (if any), and posts each one N times to the CDR or saves to dist/compositions/. Use this when you have known-good canonical compositions and want to replicate them. The matching OPT must already be in the CDR (you can use Mode 3 to upload it).
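The UID-stripping step can be sketched as a recursive filter over the parsed JSON. This is a minimal sketch, not the tool's actual code: strip_uids is a hypothetical helper, and it assumes every "uid" key at any depth is dropped, whereas the tool may only remove the composition-level uid.

```python
def strip_uids(node):
    """Return a copy of a parsed canonical composition without any "uid" keys.

    Assumption: "uid" is removed at every nesting level, not just the top.
    """
    if isinstance(node, dict):
        return {key: strip_uids(value) for key, value in node.items() if key != "uid"}
    if isinstance(node, list):
        return [strip_uids(item) for item in node]
    return node  # scalars are returned unchanged


if __name__ == "__main__":
    composition = {
        "_type": "COMPOSITION",
        "uid": {"_type": "OBJECT_VERSION_ID", "value": "example::local::1"},
        "content": [{"_type": "OBSERVATION", "name": {"value": "Example"}}],
    }
    print(strip_uids(composition))
```

Removing the uid lets the CDR assign a fresh identifier on each POST, which is what makes posting the same composition N times possible.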

When saving locally (a), if the total composition count exceeds 10,000 the tool asks:

  e.g. 12,000 compositions to save: (a) Individual files / (b) Zip [default]:
  • Default (Enter or b) → single dist/compositions/compositions.tar.gz (gzip compressed)
  • a → individual .json files as before
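The packaging choice above could look roughly like the following sketch. The function name save_compositions and the composition_000001.json member names are hypothetical; only the 10,000 threshold and the compositions.tar.gz file name come from the text.

```python
import io
import json
import tarfile
from pathlib import Path

TAR_THRESHOLD = 10_000  # from the text: above this count the packaging prompt appears


def save_compositions(compositions, out_dir, as_archive):
    """Write compositions as individual .json files or as one gzip-compressed tar."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if as_archive:
        archive_path = out_dir / "compositions.tar.gz"
        with tarfile.open(archive_path, "w:gz") as tar:
            for index, comp in enumerate(compositions, start=1):
                payload = json.dumps(comp).encode("utf-8")
                # Member names are an assumption made for this sketch.
                info = tarfile.TarInfo(name=f"composition_{index:06d}.json")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        return archive_path
    for index, comp in enumerate(compositions, start=1):
        target = out_dir / f"composition_{index:06d}.json"
        target.write_text(json.dumps(comp), encoding="utf-8")
    return out_dir
```

Writing members straight from memory with TarInfo avoids creating thousands of temporary files before compressing them.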

Mode 2 — Generate

Reads flat composition skeletons from source_models/flat_composition_skeletons/, applies WT-driven mutation per rmType, and posts or saves the result. Requires Mode 3 to have been run first to populate skeletons and webtemplates.

When saving locally (a), the tool first asks for the output format:

  Format: (a) Flat [default] / (b) Canonical (via AQL):
  • Flat (default, a): saves the mutated flat JSON directly — no CDR connection needed.
  • Canonical (via AQL) (b): posts each flat composition to the CDR, then fetches the canonical JSON back using paginated AQL (SELECT c FROM EHR ... CONTAINS COMPOSITION c LIMIT 10 OFFSET n) and saves the CDR-returned canonical representation. Requires a live CDR connection.
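The paginated fetch in option (b) can be illustrated with a small generator. This is a sketch under stated assumptions: paginate_aql and run_query are hypothetical names, the caller supplies the SELECT part of the query, and run_query stands in for whatever function sends the AQL to the CDR and returns a list of rows.

```python
from typing import Callable, Iterator


def paginate_aql(base_query: str, run_query: Callable[[str], list], page_size: int = 10) -> Iterator:
    """Yield rows page by page, appending LIMIT/OFFSET to the caller's query."""
    offset = 0
    while True:
        rows = run_query(f"{base_query} LIMIT {page_size} OFFSET {offset}")
        yield from rows
        if len(rows) < page_size:
            break  # a short (or empty) page means there is nothing left
        offset += page_size


if __name__ == "__main__":
    stored = [f"composition-{i}" for i in range(25)]

    def fake_cdr(query: str) -> list:
        # Stand-in for the CDR: reads LIMIT and OFFSET back out of the query text.
        words = query.split()
        limit = int(words[words.index("LIMIT") + 1])
        start = int(words[words.index("OFFSET") + 1])
        return stored[start:start + limit]

    print(len(list(paginate_aql("<your SELECT clause>", fake_cdr))))
```

Stopping on the first short page costs one extra request when the total is an exact multiple of the page size, but it avoids a separate count query.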

The same tar.gz threshold applies: if total compositions exceed 10,000, a packaging prompt appears (same wording as Mode 1).

Mode 3 — Setup

Full environment preparation in one step:

  1. Clears opt_webtemplates/ and flat_composition_skeletons/
  2. Prompts for ehrbase URL and credentials (saved to ehrbase_config.json)
  3. Uploads all .opt files from source_models/opts/ to the CDR
    • 200/201: extracts template_id from Location header
    • 409 (already exists): extracts template_id from OPT XML body
  4. Fetches and saves webtemplates to source_models/opt_webtemplates/
  5. Fetches flat example compositions per WT and saves envelopes to source_models/flat_composition_skeletons/
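For the 409 case in step 3, the template_id has to be read from the OPT file itself. A minimal sketch, assuming the usual OPT 1.4 layout where the root element has a template_id child holding a value element; template_id_from_opt is a hypothetical helper, not the tool's function.

```python
import xml.etree.ElementTree as ET


def _local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def template_id_from_opt(opt_xml: str):
    """Return the text of <template_id><value>...</value></template_id>, or None."""
    root = ET.fromstring(opt_xml)
    for child in root:
        if _local_name(child.tag) != "template_id":
            continue
        for sub in child:
            if _local_name(sub.tag) == "value" and sub.text:
                return sub.text.strip()
    return None


if __name__ == "__main__":
    sample = (
        '<template xmlns="http://schemas.openehr.org/v1">'
        "<template_id><value>example_template.v1</value></template_id>"
        "</template>"
    )
    print(template_id_from_opt(sample))
```

Matching on local names keeps the lookup working whether or not the file declares the openEHR namespace.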

Re-running Mode 3 wipes and regenerates all artefacts. Credentials can be updated at this point.


Mutation Rules (Mode 2)

Mutation is applied per WT node rmType. Keys matching protected path segments are always skipped.

rmType                            Behaviour
DV_QUANTITY                       ±10% jitter on |magnitude; clamped to WT min/max range; |unit untouched
DV_CODED_TEXT (local)             Random pick from WT input code list
DV_CODED_TEXT (openehr)           Untouched
DV_TEXT                           Shuffle words (multi-word); append random hex suffix (single word)
DV_DATE_TIME / DV_DATE / DV_TIME  ±15% of one day (86,400 s)
DV_DURATION                       Untouched
DV_ORDINAL                        Random pick from WT list; sets |ordinal, |value, |code
DV_COUNT                          Random integer within WT validation range
null_flavour (mandatory)          Injected via WT id path (e.g. element/coded_text_value|code); value keys kept
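Two of the rules above, sketched in Python to make the arithmetic concrete. The function names are hypothetical, and the reading of "±15% of one day" as a uniform random offset of at most 12,960 s (about 3.6 hours) is an assumption; the tool's actual distribution may differ.

```python
import random
from datetime import datetime, timedelta

DAY_SECONDS = 86_400


def jitter_magnitude(value, wt_min=None, wt_max=None, pct=0.10, rng=random):
    """Scale value by a random factor in [1 - pct, 1 + pct], then clamp to the WT range."""
    jittered = value * (1 + rng.uniform(-pct, pct))
    if wt_min is not None:
        jittered = max(wt_min, jittered)
    if wt_max is not None:
        jittered = min(wt_max, jittered)
    return jittered


def mutate_quantity(flat, path, wt_min=None, wt_max=None, rng=random):
    """Return a copy of a flat composition with path|magnitude jittered; |unit is left alone."""
    mutated = dict(flat)
    key = f"{path}|magnitude"
    if key in mutated:
        mutated[key] = jitter_magnitude(mutated[key], wt_min, wt_max, rng=rng)
    return mutated


def jitter_datetime(iso_value, pct=0.15, rng=random):
    """Shift an ISO 8601 timestamp by a random offset of at most pct of one day."""
    offset = rng.uniform(-pct, pct) * DAY_SECONDS
    return (datetime.fromisoformat(iso_value) + timedelta(seconds=offset)).isoformat()


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    flat = {"example/temperature|magnitude": 37.0, "example/temperature|unit": "Cel"}
    print(mutate_quantity(flat, "example/temperature", wt_min=30.0, wt_max=45.0, rng=rng))
    print(jitter_datetime("2024-01-15T10:30:00+00:00", rng=rng))
```

Clamping after the jitter means a value sitting at the edge of the WT range stays valid instead of drifting outside it.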

Protected path segments (any key containing these is skipped entirely): category, context, language, territory, composer, _work_flow_id, _guideline_id, _instruction_details, ism_transition, annotations

ism_transition is fully protected because careflow_step, current_state, and transition are tightly coupled — mutating one without the others produces invalid ISM state machine transitions.
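The protection check described above amounts to a substring test on each flat key. A minimal sketch: the segment list is taken from the text, while is_protected and the example keys are hypothetical.

```python
PROTECTED_SEGMENTS = (
    "category", "context", "language", "territory", "composer",
    "_work_flow_id", "_guideline_id", "_instruction_details",
    "ism_transition", "annotations",
)


def is_protected(flat_key: str) -> bool:
    """True if the flat key contains any protected segment and must not be mutated."""
    return any(segment in flat_key for segment in PROTECTED_SEGMENTS)


if __name__ == "__main__":
    # Example keys are made up for illustration.
    for key in (
        "example/context/start_time",
        "example/action/ism_transition/current_state|code",
        "example/observation/any_event:0/temperature|magnitude",
    ):
        print(key, is_protected(key))
```

Because the test is "contains" rather than "equals a path segment", it errs on the side of skipping: a key is left alone whenever a protected name appears anywhere in it.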


Directory Layout

source_models/
  opts/                        # Input: OPT files to upload
  opt_webtemplates/            # Generated by Mode 3: webtemplate JSONs
  flat_composition_skeletons/  # Generated by Mode 3: flat example envelopes
  user_compositions/           # Input: canonical JSONs for Mode 1
dist/
  compositions/                # Output: generated compositions
ehrbase_config.json            # Saved API credentials (gitignored)
ehrbase/

Flat skeleton files are wrapped in an envelope:

{ "template_id": "...", "flat_comp": { ... } }
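Reading these envelopes back is straightforward. The sketch below assumes the skeletons are stored as .json files directly inside the folder; load_skeletons is a hypothetical helper.

```python
import json
from pathlib import Path


def load_skeletons(skeleton_dir):
    """Yield (template_id, flat_comp) pairs from every envelope file in the folder."""
    for path in sorted(Path(skeleton_dir).glob("*.json")):
        envelope = json.loads(path.read_text(encoding="utf-8"))
        yield envelope["template_id"], envelope["flat_comp"]


if __name__ == "__main__":
    for template_id, flat_comp in load_skeletons("source_models/flat_composition_skeletons"):
        print(template_id, len(flat_comp), "flat keys")
```

Keeping template_id next to the flat body means each skeleton carries the one piece of information a flat composition needs when it is posted.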

Requirements

  • Python 3.10+
  • Running ehrbase (or any openEHR REST API v1 compliant CDR)
  • source_models/opts/ populated with your OPT files before running Mode 3

Setup

1. Create and activate a virtual environment

Windows

py -3.12 -m venv venv
.\venv\Scripts\Activate.ps1

Linux / macOS

python3 -m venv venv
source venv/bin/activate

2. Install dependencies

python -m pip install --upgrade pip
python -m pip install -r requirements.txt

3. Run

python gen-openehr.py

Typical Workflow

  1. Place .opt files in source_models/opts/
  2. Run Mode 3 — enter ehrbase URL and credentials once; credentials saved to ehrbase_config.json
  3. Run Mode 2 — choose count per skeleton and destination (local disk or CDR)
  4. Optionally place canonical JSONs in source_models/user_compositions/ and use Mode 1

Notes

  • ehrbase_config.json is gitignored. Re-run Mode 3 to update credentials or URL.
  • Mode 3 wipes opt_webtemplates/ and flat_composition_skeletons/ on every run — any manual edits to skeletons will be lost.
  • dist/compositions/ is wiped at the start of every Mode 1 or Mode 2 local-save run.
  • Concurrency is capped at 10 parallel requests (asyncio semaphore) for all CDR calls.
  • Total elapsed time is always printed on exit: [*] Total time: Xm Ys.
  • Project must be on a local drive; do not store the venv in synced folders (OneDrive, Google Drive).
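The concurrency cap mentioned in the notes follows a common asyncio pattern, sketched here. post_all and post_one are hypothetical names; post_one stands in for whatever coroutine performs a single CDR request.

```python
import asyncio

MAX_PARALLEL = 10  # from the notes: at most 10 requests in flight


async def post_all(payloads, post_one, limit=MAX_PARALLEL):
    """Run post_one for every payload with at most `limit` calls active at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(payload):
        async with semaphore:  # waits here while `limit` calls are already running
            return await post_one(payload)

    # gather returns results in the same order as the payloads
    return await asyncio.gather(*(guarded(p) for p in payloads))


if __name__ == "__main__":
    async def fake_post(payload):
        await asyncio.sleep(0.01)  # stand-in for network latency
        return f"created {payload}"

    print(asyncio.run(post_all(range(25), fake_post))[:3])
```

All tasks are created up front, but the semaphore lets only ten of them pass into post_one at any moment, which keeps the CDR from being flooded.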
