An Agent Skill and reference repository for generating realistic synthetic datasets for tests, demos, sandboxes, seed data, and workflow simulations.
The skill translates user requirements into a validated generation config, then runs a bundled Python CLI to produce CSV, JSON, NDJSON, SQL, HTML, or Markdown output. It supports locale-aware Faker providers, curated regex-backed custom formats, seeded repeatability, and Belgian-specific identifiers such as INSZ and eID. It also offers optional population shaping through weighted distribution segments, simple schema-driven SQL generation from `CREATE TABLE` statements, and human-readable persona bundles for reviewer-friendly browsing.
This repository includes both:
- the runtime-critical skill assets used by an AI agent
- the GitHub-facing documentation, examples, tests, and contribution materials needed to maintain the skill professionally
Core runtime assets:
- `SKILL.md`
- `agents/openai.yaml`
- `scripts/generate_data.py`
- `references/`
Support assets for maintainers and adopters:
- `examples/`
- `tests/`
- `evals/`
- `CONTRIBUTING.md`
This skill is responsible for:
- turning user requirements into a structured synthetic-data config
- validating that config before generation
- generating realistic but fake records to disk
- previewing results in a machine-friendly and agent-friendly way
- documenting supported formats, examples, and extension paths
This skill is not responsible for:
- anonymizing real production datasets
- scraping or cloning live public registries into output files
- maintaining cross-agent shared memory infrastructure
- automatically self-modifying its references without explicit maintainer intent
- Versioned config model with JSON schema reference
- Seeded deterministic runs when repeatability matters
- CSV, JSON, NDJSON, SQL, HTML, and Markdown output
- Structured CLI result summary with preview rows
- Optional population-representativeness layer with weighted segments and subset filters
- Live source-query mode for deriving segments from public datasets at generation time
- Summary reporting for what is and is not distribution-backed
- Segment-aware field types for consistent values such as sex-aligned first names and age-backed birth dates
- Bundled Belgian address catalog support for coherent `street_address` + `postcode` + `city` generation
- Simple SQL schema parsing from `CREATE TABLE` DDL when you want SQL `INSERT` output
- Curated regex-backed custom formats
- Belgian-specific synthetic identifiers
- Example configs for common scenarios
- Tests and trigger-eval prompts for ongoing maintenance
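The seeded-deterministic-runs feature above can be illustrated with a minimal sketch. This is not the real generator (which drives Faker providers); plain `random` and the sample names are stand-ins, but the seeding principle is the same: one seeded RNG drives every field, so the same seed reproduces the same records.

```python
import random

def sample_rows(seed: int, n: int) -> list[dict]:
    # One seeded RNG per run; every field draws from it in a fixed order.
    rng = random.Random(seed)
    first_names = ["An", "Bram", "Celine", "Dries"]  # illustrative only
    return [
        {"first_name": rng.choice(first_names), "age": rng.randint(18, 80)}
        for _ in range(n)
    ]

# Same seed, same records.
assert sample_rows(42, 5) == sample_rows(42, 5)
```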
Add the skill with a compatible Agent Skills client:
```
npx skills add jovd83/lifelike-synthetic-data-generator
```

Or clone it into a local skills directory:

```
git clone https://github.com/jovd83/lifelike-synthetic-data-generator.git
```

Typical local locations include:
- `~/.agents/skills/`
- `~/.cursor/skills/`
- tool-specific local skill directories supported by your agent platform
Install the Python dependencies:
```
python -m pip install -r scripts/requirements.txt
```

Validate an example config:

```
python scripts/generate_data.py --config examples/people-belgium.json --validate-only
```

`examples/people-belgium.json` now demonstrates coherent Belgian address tuples by keeping street, postcode, and city aligned through `belgian_address_component`.
Translate a plain-language persona request into a runnable config:
```
python scripts/translate_persona_request.py --request examples/persona-request-belgium.json --output artifacts/persona-request-belgium.generated.json
```

This translator remains Belgium-first. It can resolve the Belgian target locale from the request language, but it does not yet generate non-Belgian persona configs.
Generate the dataset:
```
python scripts/generate_data.py --config examples/people-belgium.json
```

The script writes the output file defined in the config and prints a JSON summary to stdout with a preview of generated rows.
Generate SQL seed data from a schema-driven config:
```
python scripts/generate_data.py --config examples/people-belgium-sql.json
```

Generate a browsable persona bundle after setting `output.format` to `html` or `markdown` in the config:

```
python scripts/generate_data.py --config your-persona-config.json
```

`examples/persona-belgium.json` and `examples/persona-belgium-html.json` are the richer reference examples for nested personas, Belgian contact details, and correlation-driven profile sections.
When output.format is html or markdown, the generator writes a bundle directory with an index plus one file per persona. HTML bundles include index.html with a table, a short description for each persona, and direct links to the individual persona pages.
Persona bundles now use the config locale for page language metadata and basic UI copy. By default, the bundle renderer also omits sensitive-looking contact, banking, and identifier fields so reviewer-facing pages read like personas instead of raw record dumps. Set output.include_sensitive_fields to true only when you intentionally want a full QA-style bundle.
Optional persona bundle output keys:
- `output.title`: override the default bundle title
- `output.include_sensitive_fields`: include full synthetic identifiers and contact/banking details in HTML or Markdown bundles
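The sensitive-field filtering described above can be sketched as a simple key filter. This is a hypothetical illustration: the exact field names the bundle renderer treats as sensitive, and the prefix-matching approach, are assumptions, not the real implementation in `scripts/generate_data.py`.

```python
# Assumed set of sensitive-looking key prefixes; illustrative only.
SENSITIVE_PREFIXES = ("insz", "eid", "iban", "phone", "email")

def filter_persona(record: dict, include_sensitive: bool = False) -> dict:
    """Drop sensitive-looking fields unless the caller opts in."""
    if include_sensitive:
        return dict(record)
    return {
        key: value
        for key, value in record.items()
        if not key.lower().startswith(SENSITIVE_PREFIXES)
    }

persona = {"first_name": "An", "insz": "85073003328", "email_home": "an@example.be"}
assert filter_persona(persona) == {"first_name": "An"}
```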
The preferred config format is versioned and explicit:
```json
{
  "version": "1.0",
  "locale": "nl_BE",
  "records": 10,
  "seed": 42,
  "output": {
    "format": "csv",
    "path": "artifacts/people-belgium.csv"
  },
  "fields": [
    { "name": "first_name", "type": "first_name" },
    { "name": "last_name", "type": "last_name" },
    { "name": "insz", "type": "belgian_insz" }
  ]
}
```

Reference assets:
- field catalog: `references/field-types.md`
- persona template: `references/persona-template.md`
- persona catalogs: `references/persona_catalogs.json`
- persona profile bundles: `references/persona_profile_bundles.json`
- persona archetypes: `references/persona_archetypes.json`
- Belgian address catalog: `references/belgian_address_catalog.json`
- config schema: `references/schema-config.schema.json`
- reusable regex formats: `references/custom_formats.json`
- representativeness workflow: `references/population-modeling.md`
- source catalog: `references/open_data_sources.json`
Representative datasets can also include a population_model block:
```json
{
  "population_model": {
    "scope": {
      "country": "BE",
      "level": "nuts3",
      "code": "BE100",
      "reference_year": 2023
    },
    "filters": {
      "sex": ["F"]
    },
    "dimensions": [
      { "name": "sex" },
      { "name": "age_band" }
    ],
    "segments": [
      {
        "weight": 0.42,
        "values": {
          "sex": "F",
          "age_band": "Y18T44"
        }
      }
    ]
  }
}
```

That model can be as small or as large as the user needs. If the user only cares about sex balance, model only sex. If they also care about age, geography, education, or income, add only those dimensions and report the resulting coverage explicitly.
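How weighted segments could drive row generation can be sketched with `random.choices`, which samples by relative weight. This is a minimal illustration of the sampling idea, not the generator's actual implementation; the two-segment split below is invented for the example.

```python
import random

# Hypothetical segments in the population_model shape; weights are relative,
# so random.choices normalizes them even if they do not sum to 1.
segments = [
    {"weight": 0.42, "values": {"sex": "F", "age_band": "Y18T44"}},
    {"weight": 0.58, "values": {"sex": "F", "age_band": "Y45T64"}},
]

def sample_segment(rng: random.Random) -> dict:
    weights = [s["weight"] for s in segments]
    return rng.choices(segments, weights=weights, k=1)[0]["values"]

rng = random.Random(42)
rows = [sample_segment(rng) for _ in range(1000)]
share = sum(1 for r in rows if r["age_band"] == "Y18T44") / len(rows)
# With these weights the Y18T44 share lands near 0.42 over 1000 draws.
assert 0.35 < share < 0.50
```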
When a supported public source is available, population_model can use source_query instead of hard-coded segments. See examples/people-brussels-representative-live.json for a live Statbel-backed example.
Legacy config keys (output_format, output_file) are still accepted for backward compatibility.
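The backward-compatibility behavior can be sketched as a small normalization step that folds the legacy keys into the versioned `output` block. The default format fallback and the exact precedence rules here are assumptions for illustration, not the CLI's documented behavior.

```python
def normalize_output(config: dict) -> dict:
    """Fold legacy output_format/output_file keys into an output block."""
    config = dict(config)  # do not mutate the caller's config
    if "output" not in config and (
        "output_format" in config or "output_file" in config
    ):
        config["output"] = {
            "format": config.pop("output_format", "csv"),  # assumed default
            "path": config.pop("output_file", None),
        }
    return config

legacy = {"records": 5, "output_format": "json", "output_file": "out.json"}
assert normalize_output(legacy)["output"] == {"format": "json", "path": "out.json"}
```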
For Belgian datasets that need coherent address tuples rather than independently faked street and city values, use belgian_address_component.
The generator samples from the bundled references/belgian_address_catalog.json catalog and keeps related fields aligned within each row. That means street_address, postcode, city, province, and region can come from the same sampled Belgian locality profile.
Supported params for belgian_address_component:
- `component`: one of `street_address`, `postcode`, `city`, `province`, or `region`
- `profile`: optional row-level cache key so multiple address fields reuse the same sampled address
- `region`: optional fixed region filter such as `VLG`, `WAL`, or `BXL`
- `region_segment_key`: optional population-segment key to read the region from dynamically
- `province`: optional province filter
- `postcode_prefix`: optional postcode prefix filter
Example:
```json
{
  "fields": [
    { "name": "region", "type": "segment_value", "params": { "key": "region" } },
    {
      "name": "street_address",
      "type": "belgian_address_component",
      "params": {
        "profile": "home_address",
        "region_segment_key": "region",
        "component": "street_address"
      }
    },
    {
      "name": "postcode",
      "type": "belgian_address_component",
      "params": {
        "profile": "home_address",
        "region_segment_key": "region",
        "component": "postcode"
      }
    },
    {
      "name": "city",
      "type": "belgian_address_component",
      "params": {
        "profile": "home_address",
        "region_segment_key": "region",
        "component": "city"
      }
    }
  ]
}
```

Use ordinary Faker `street_address` when you only need lifelike text. Use `belgian_address_component` when the Belgian address fields need to stay internally consistent.
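The row-level `profile` cache idea can be sketched as follows: the first field that asks for a given profile samples one catalog entry, and every later field with the same profile reads from that cached entry, so street, postcode, and city stay aligned. The two-entry catalog is invented for the example; the runtime uses the much larger bundled `references/belgian_address_catalog.json`.

```python
import random

# Tiny illustrative catalog; not the bundled Belgian address catalog.
CATALOG = [
    {"street_address": "Kerkstraat 12", "postcode": "9000", "city": "Gent"},
    {"street_address": "Rue Haute 5", "postcode": "1000", "city": "Bruxelles"},
]

def address_component(row_cache: dict, profile: str, component: str,
                      rng: random.Random) -> str:
    # Sample one catalog entry per (row, profile) and reuse it for
    # every component that shares the profile key.
    if profile not in row_cache:
        row_cache[profile] = rng.choice(CATALOG)
    return row_cache[profile][component]

rng = random.Random(1)
cache = {}  # one cache per generated row
street = address_component(cache, "home_address", "street_address", rng)
city = address_component(cache, "home_address", "city", rng)
# Both components come from the same sampled catalog entry.
assert any(e["street_address"] == street and e["city"] == city for e in CATALOG)
```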
If you know the target table schema, the skill can now generate SQL INSERT output directly.
Two workflows are supported:
- Explicit field mapping: keep using `fields` and set `output.format` to `sql`.
- Schema-derived fields: provide `sql_schema.ddl` or `sql_schema.ddl_path` with a simple `CREATE TABLE` statement and let the skill derive fields automatically.
Example:
```json
{
  "records": 6,
  "locale": "nl_BE",
  "seed": 42,
  "sql_schema": {
    "ddl": "CREATE TABLE customer_profiles (first_name VARCHAR(100), last_name VARCHAR(100), email VARCHAR(255), postal_code VARCHAR(10), mobile_phone VARCHAR(20), active BOOLEAN, loyalty_points INTEGER);"
  },
  "output": {
    "format": "sql",
    "path": "artifacts/customer_profiles.sql"
  }
}
```

This parser is intentionally simple. It is designed for straightforward `CREATE TABLE` statements and common column types, not full vendor-specific SQL dialect coverage.
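A parser in the same deliberately simple spirit can be sketched with a regex. This is an illustration of the approach, not the actual parser in `scripts/generate_data.py`; like the real one, it only targets straightforward DDL (for instance, it would mis-split a type with an embedded comma such as `DECIMAL(10,2)`).

```python
import re

def parse_create_table(ddl: str) -> tuple[str, list[tuple[str, str]]]:
    """Parse a single simple CREATE TABLE into (table, [(column, type)])."""
    match = re.search(r"CREATE TABLE\s+(\w+)\s*\((.*)\)\s*;?\s*$",
                      ddl, re.IGNORECASE | re.DOTALL)
    if not match:
        raise ValueError("unsupported DDL")
    table, body = match.group(1), match.group(2)
    columns = []
    for part in body.split(","):  # naive: breaks on commas inside types
        name, col_type = part.strip().split(None, 1)
        columns.append((name, col_type.upper()))
    return table, columns

table, cols = parse_create_table(
    "CREATE TABLE customer_profiles (first_name VARCHAR(100), active BOOLEAN);"
)
assert table == "customer_profiles"
assert cols == [("first_name", "VARCHAR(100)"), ("active", "BOOLEAN")]
```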
The skill can consult or be configured against the following curated public sources. In practice, they serve different purposes: some are strong enough to shape representative distributions, while others are better for locality realism, geography, or address-like formatting.
| Source | Scope | Representative strength | Best for | Typical use in this skill |
|---|---|---|---|---|
| Statbel Open Data API | Belgium | Primary | Age, sex, geography, nationality, education, employment, unemployment, income quintiles, occupation status | First choice for Belgium-specific distribution-backed synthetic datasets |
| Statbel Open Data Files and Geographic Downloads | Belgium | Supporting | Population grids, density-aware geography, statistical sectors, NUTS mappings, REFNIS normalization | Add spatial spread and Belgian code/geography consistency |
| BOSA BeST Address | Belgium | Supporting | Official Belgian street, postcode, city, province, and region combinations | Source used to build the bundled `references/belgian_address_catalog.json` catalog for `belgian_address_component` |
| Eurostat | EU | Primary | Cross-country comparability, regional education, labour-market mix, income and social indicators | Use when the target scope is EU-wide or cross-country rather than only Belgium |
| Belgian Open Data Portal | Belgium | Supporting | Discovery of Belgian municipal, locality, mobility, environment, and administrative datasets | Find supporting Belgian open data beyond Statbel, then combine with stronger statistics if needed |
| WorldPop | Global | Supporting | Gridded population density, spatial weighting, subnational spread | Make dense cities more likely than sparse rural areas and support realistic locality weighting |
| GeoNames | Global | Supporting | Place names, locality vocabularies, geographic hierarchies, admin-code normalization | Improve locality realism and hierarchy consistency |
| OpenAddresses | Global | Supporting | Address formatting, street-level vocabulary | Make addresses look realistic without claiming population representativeness |
| World Bank Data | Global | Supporting | Macro-economic context, national demographic context | Add country-level context, not person-level demographic sampling |
- Statbel Open Data API: strongest current source for Belgian distribution-backed dimensions.
- Statbel Open Data Files and Geographic Downloads: especially helpful when postcode-level or grid-level density matters.
- BOSA BeST Address: best current source for coherent Belgian address tuples; the runtime uses a bundled catalog derived from its official exports.
- Eurostat: strong for region-level and country-level distributions, weaker for street or postcode realism.
- Belgian Open Data Portal: mainly a catalog-discovery surface rather than a direct representative population source.
- WorldPop: strong for spatial weighting, not a replacement for official demographic distributions.
- GeoNames: strong for hierarchical place realism, not enough on its own for population-weighted sampling.
- OpenAddresses: useful for structural address realism, not for proving density.
- World Bank Data: useful for macro context, not direct synthetic person-level representativeness.
- If the request is Belgian and distribution-backed: start with Statbel Open Data API.
- If the request needs coherent Belgian addresses: prefer `belgian_address_component` backed by the bundled catalog derived from BOSA BeST Address.
- If the request needs EU comparability: use Eurostat.
- If the request needs dense-city versus rural spread: add WorldPop or Statbel geographic files.
- If the request needs realistic place names or hierarchy mappings: add GeoNames.
- If the request needs realistic-looking addresses outside that Belgian catalog workflow: add OpenAddresses and/or data.gov.be discoveries.
- If the request only needs a few represented dimensions: only model those dimensions to save tokens and complexity.
- Always report which dimensions are truly distribution-backed and which remain only lifelike.
```
agents/
  openai.yaml
evals/
  trigger-eval-queries.json
examples/
  organizations-us.json
  people-belgium.json
  persona-belgium.json
  people-brussels-representative.json
  people-brussels-representative-live.json
references/
  belgian_address_catalog.json
  custom_formats.json
  field-types.md
  open_data_monitoring.json
  open_data_sources.json
  population-modeling.md
  schema-config.schema.json
scripts/
  check_open_data_updates.py
  generate_data.py
  refresh_open_data_monitoring.py
  requirements.txt
tests/
  test_belgium_evals.py
  test_generate_data.py
  test_skill_creator_eval_export.py
CHANGELOG.md
CONTRIBUTING.md
README.md
SKILL.md
```
Validation and maintenance surfaces included in this repository:
- `tests/test_generate_data.py`: CLI and generation regression coverage
- `tests/test_belgium_evals.py`: Belgium-focused evaluation harness coverage
- `evals/trigger-eval-queries.json`: realistic prompts for skill-trigger evaluation
- `evals/belgium_experiments.py`: fifteen Belgium-focused realism and statistical experiments, including zipcode, degree, occupation, car-brand, and exact-age cases
- `examples/`: representative configs for manual smoke testing
Run the tests:
```
python -m unittest discover -s tests -v
```

Generate the Belgium HTML evaluation bundle:

```
python scripts/run_belgium_evals.py
```

Export the experiment suite into the skill-creator evaluation workspace and render the skill-creator static review HTML:

```
python scripts/export_skill_creator_eval_review.py
```

Check whether new public Statbel datasets or schema changes appeared since the stored snapshot:
```
python scripts/check_open_data_updates.py --source-id statbel-open-data-api
python scripts/check_open_data_updates.py --source-id data-gov-be
python scripts/check_open_data_updates.py --source-id eurostat-api
python scripts/check_open_data_updates.py --source-id geonames
python scripts/check_open_data_updates.py --source-id worldpop
python scripts/check_open_data_updates.py --source-id world-bank-data
```

Refresh the stored monitoring baseline from the live official catalogs:
```
python scripts/refresh_open_data_monitoring.py
```

The curated source list in `references/open_data_sources.json` is an optional realism aid for maintainers. It is meant to guide distribution shaping and source selection, not to imply that live ingestion is fully automated for every dataset.
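The snapshot-comparison idea behind the monitoring scripts can be sketched as a set diff between a stored baseline of dataset identifiers and a freshly fetched catalog. The data shapes and dataset names here are assumptions for illustration, not the actual contents of `references/open_data_monitoring.json`.

```python
def diff_catalogs(baseline: set[str], live: set[str]) -> dict:
    """Report which dataset identifiers appeared or disappeared."""
    return {
        "added": sorted(live - baseline),
        "removed": sorted(baseline - live),
    }

# Hypothetical dataset identifiers, invented for the example.
baseline = {"population_by_sex", "population_by_age"}
live = {"population_by_sex", "population_by_age", "income_quintiles"}
assert diff_catalogs(baseline, live) == {
    "added": ["income_quintiles"],
    "removed": [],
}
```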
Cross-agent memory is intentionally out of scope for this repository. If you need that capability, integrate with a dedicated shared-memory skill rather than expanding this skill into infrastructure it does not own.
This repository now includes references/persona-template.md, a maintainer-facing template for extending the skill from flat person rows toward richer synthetic personas with:
- identity and household members
- an introduction paragraph
- professional, lifestyle, digital, and finance sections
- optional health fields
- a longer biography
The template is intentionally a design contract first. It helps capture user wishes and define a coherent output shape without claiming that all persona-specific narrative and family-generation features are already native in scripts/generate_data.py.
The runtime now also includes foundational persona-oriented field types:
- `object` for nested JSON sections such as `identity` or `contact`
- `array` for repeated nested values such as children or preference lists
- `template` for derived narrative strings such as introductions and biographies
- `age_from_birth_date` for keeping exact age aligned with a generated birth date
- field-level `when` conditions for optional sections such as spouse, children, or vehicle details
- top-level `correlation_rules` for aligning related traits such as income, housing, and mobility
- source-backed `correlation_rules.source_model` for grounding lifestyle traits in copied or live source-derived segments
- top-level `contradiction_checks` for rejecting unrealistic combinations
See examples/persona-belgium.json for a runnable nested persona example.
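The `when`-condition and contradiction-check ideas can be sketched as simple predicates over a generated record. The condition syntax below (field/expected-value pairs and "if A then not B" rules) is an assumption chosen for brevity; the real config syntax is documented in `references/field-types.md`.

```python
def passes_when(record: dict, condition: tuple[str, object]) -> bool:
    """Assumed condition shape: a (field, expected value) pair."""
    field, expected = condition
    return record.get(field) == expected

def violations(record: dict,
               checks: list[tuple[str, object, str, object]]) -> list[str]:
    """Assumed check shape: if field_a == value_a, field_b must not equal value_b."""
    problems = []
    for field_a, value_a, field_b, value_b in checks:
        if record.get(field_a) == value_a and record.get(field_b) == value_b:
            problems.append(f"{field_a}={value_a} contradicts {field_b}={value_b}")
    return problems

record = {"age": 8, "marital_status": "married", "has_spouse": False}
# The spouse section would be skipped for this record...
assert not passes_when(record, ("has_spouse", True))
# ...and the record would be rejected as contradictory.
assert violations(record, [("marital_status", "married", "age", 8)])
```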