Tags on GitHub repos (topics) and Hugging Face repos (card metadata) are free-form text, so
the same concept often appears under multiple spellings: computer-vision, computer vision,
cv. Tag groups normalize this noise so the catalog filter dropdown shows one clean canonical
tag instead of a dozen near-duplicates.
Tag groups live in public/tag-groups.js as a plain JavaScript object:
const TAG_GROUPS = {
"canonical tag": ["raw-tag-1", "raw tag 2", "raw_tag_3"],
...
};- The key (
"canonical tag") is the display tag shown in the UI filter dropdown. - The value array lists every raw API tag that should be normalized to that key.
- Raw tags not present in any array pass through unchanged and appear as-is in the UI.
- Raw tags that contain a colon (e.g.
license:mit,format:parquet) are automatically filtered out as Hugging Face system metadata so they never reach the UI. This can be changed in main.js, in thenormalizeTagfunction. - Raw tags are maintained and matched-against for keyword searching and do appear in repo cards.
When first setting up a catalog from this template you will need to build your custom, public/tag-groups.js.
For the weekly scan to be useful, you need to build an
initial set of tag groups from your organization's existing tags.
Step 1 — Generate the full tag list:
node scripts/export-tags.jsThis writes every raw tag currently used across your GitHub and Hugging Face repos to
scripts/tag-export.txt. Commit that file to main, it becomes the baseline all future
weekly diffs compare against.
Step 2 — Build the initial tag-groups.js:
With a fresh catalog you'll likely have dozens or hundreds of raw tags to group all at once.
Our experience: feeding the full tag-export.txt into an AI assistant (we used Claude's Sonnet 4.6)
produced a solid first draft of canonical groups quickly. However, it needed significant
manual review - the AI tended to miss obvious groupings like singular/plural pairs (image and
images) and sometimes invented canonical names that didn't match how
the tags were actually used.
Our recommendation: use AI output as a springboard, not a final answer. Let it handle the bulk of the structural work, then go through the result yourself with the raw tag list open side-by-side. Pay particular attention to:
- Singular/plural variants (
specimen/specimens,annotation/annotations) - Hyphenated vs. space-separated forms (
animal-behavior/animal behavior) - Acronyms and their expansions (
cv/computer vision,xai/explainable ai) - Closely related concepts that warrant separate canonical tags vs. one broader group
- Any domain-specific or specialized terms. In our case, this was the Hawaiian birds and plants that Claude did not recognize.
The manual pass by those familiar with the terms provides the most value, but is facilitated by the initial grouping.
A GitHub Actions workflow runs every Monday at 08:00 UTC. It fetches all current raw tags from
the GitHub and Hugging Face APIs, diffs them against the committed baseline in
scripts/tag-export.txt, and opens (or updates) a pull request titled
[Tag Scan] New tags detected — review tag-groups.js whenever 5 or more new tags appear.
Prerequisite: The workflow requires a fine-grained PAT stored as a repository secret named
PR_TOKENwith Pull requests: Read and write permission. Without it, the workflow will find new tags and push the branch successfully, but fail when attempting to open the PR.
You should update public/tag-groups.js when that PR is opened or updated.
-
Open the Tag Scan PR — find it in the repo's Pull Requests tab (title starts with
[Tag Scan]). The PR body lists every raw tag that is new relative toscripts/tag-export.txt. -
Open
public/tag-groups.jsin an editor. -
For each new tag in the list, decide:
Situation Action It's a variant of an existing concept (e.g. bird-detectionwhenobject detectionexists)Add the raw tag to the existing group's array It represents a new concept that groups with other existing or new tags Add a new key + array, under the appropriate section comment It's a Hugging Face system tag (contains a colon, e.g. license:mit)No edit - these are filtered out by the app and will not reach the UI (see above for how to not filter these out) It's a one-off with no grouping candidates No edit - it passes through unchanged and appears as-is in the UI -
Commit the updated
tag-groups.jsto the PR branch (tag-scan/auto). -
Merge the PR once all tags are addressed.
To generate a fresh snapshot of all current raw tags outside of the weekly workflow:
node scripts/export-tags.jsOutput is written to scripts/tag-export.txt (one tag per line, sorted, deduplicated).
This is also the baseline file the weekly workflow diffs against, so if you run the script and
commit the result to main, subsequent scans will compare against that new baseline.