glossa-lab

Agentic computational linguistics research platform for statistical analysis, decipherment, and hypothesis testing of ancient and unknown writing systems — with a primary focus on the Indus Script (Mahadevan corpus, Holdat LLC dataset) using methods developed by Dr. Andreas Fuls (TU Berlin / ICIT).

Built and maintained by Layer1Labs Silicon, Inc.

Overview

Glossa Lab is a production research tool combining a Python backend, React frontend, and Windows/Linux/macOS service support. It provides an end-to-end environment for:

Corpus management — upload, register, inspect, and sanitise sign-sequence corpora
Statistical analysis — entropy, Zipf, positional profiles (T/I/M), writing-system classification
Decipherment experiments — SA-based sign-to-phoneme hypothesis generation, benchmarks vs known scripts
Experiment Builder — composable graph experiments using atomic nodes (no coding required); new Evidence Graph category with 7 nodes for comparative literature analysis
Study Builder — multi-experiment research workflows as visual graphs
Glossa AI — embedded research assistant that runs analyses, proposes hypotheses, and navigates the tool
Discovery engine — continuous literature discovery across arXiv, EuropePMC, CrossRef, DOAJ and more
Evidence Graph — per-project literature library, automated paper sweep (configurable via sweep.yaml), claim extraction, cross-hypothesis falsification matrix, and hidden hypothesis generation
AI Provider Registry — unified management of cloud (OpenAI, Anthropic, Mistral, Google…), local (Ollama), and self-hosted (vLLM) AI backends with model scoring and smart assignment
Reports & Data — PDF, Markdown, JSON, CSV export of all results

System architecture

[ Tray ] ─────┐
              │
[ Frontend ] ─┼──→ [ Backend Service (FastAPI) ] ──→ [ Pipelines / Jobs / Models ]
              │              │
[ CLI / Dev ] ┘         [ SQLite DB ]
                              │
                    [ Provider Registry ] ──→ [ Cloud / Ollama / vLLM ]

Key principles

The backend is the source of truth
The tray and frontend are interfaces, not runtime owners
All communication occurs through explicit REST APIs
Service lifecycle is deterministic and observable — every background process logs START/COMPLETE

Components

Backend (Python / FastAPI)

REST API + background job engine
SQLite database (providers, model scores, discovery items, experiments, studies)
AI provider registry with test/probe on startup and on-demand
HuggingFace Open LLM Leaderboard sync (nightly) + static fallback scores
Discovery engine with 10+ fetchers (arXiv, EuropePMC, CrossRef, PubMed, DOAJ…)
RAG index for research context injection
Ollama auto-detection and lifecycle management

Frontend (React / TypeScript / Vite)

Built artefact (frontend/dist/) is committed to the repo so the server only needs git pull — no Node.js required on the deployment target.

Key panels:

Provider Registry — add/test/manage AI providers; badges: 🦙 Ollama · ☁️ Cloud · ⚡ vLLM/Custom · 🤗 HuggingFace
Model Assignments — assign primary/fallback models per bucket (Reasoning / Conversational / Long-form / Global) with draft/apply workflow, scores, filter, and swap
Experiment Builder — visual DAG editor with Evidence Graph palette category (7 nodes)
Study Builder — multi-experiment research workflows (accessible via Projects)
Discovery View — literature feed with → Evidence import action for Indus/Harappan items
Evidence Graph — three-tab workspace: Library (PDF upload, URL import), Claims (filterable), Sweep (configurable sweep + candidate import)
Foundation Check — research integrity dashboard (17 checks; must be PASS before external communication)
Bottom Panel — structured Logs (JSON → human-readable), Jobs, Terminal

Tray (Windows/macOS)

Local control surface. Start/stop/restart backend, open UI, quick status.

Agent-Stack (layer1labs server — separate repo)

Three vLLM services on NVIDIA RTX PRO 5000 Blackwell (48 GB GDDR7):

l1-nexus (port 8000) — cpatonn/Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit — primary coding/agentic
l1-glossa (port 8001) — Qwen/Qwen3-14B — research/long-context reasoning
l1-embed (port 8002) — BAAI/bge-m3 — embeddings (RAG)

Access via Tailscale (100.118.107.3). Repo: layer1labs/agent-stack.

Repository structure

glossa-lab/
├─ AGENTS.md            ← agent operating rules (read first, every session)
├─ LEDGER.md            ← session ledger (sole continuity authority)
├─ README.md
├─ CITATIONS.md         ← citation registry for all research data
├─ setup-os.cmd         ← canonical start/stop/restart (Windows)
├─ shell.cmd            ← tool wrapper (pytest, ruff, python — Windows)
├─ shell.sh             ← tool wrapper (Linux/macOS)
├─ .github/
│  └─ workflows/ci.yml  ← GitHub Actions CI (pytest + Playwright + evidence scripts)
├─ backend/
│  ├─ glossa_lab/       ← FastAPI app + all Python modules
│  │  ├─ api/           ← REST route modules
│  │  │  └─ indus_evidence.py ← Evidence Graph API (library, claims, sweep)
│  │  ├─ experiments/   ← ExperimentBase subclasses + graph JSONs
│  │  ├─ experiment_graph_indus_evidence.py ← 7 Evidence Graph atomic nodes
│  │  ├─ discovery/     ← literature discovery engine + fetchers
│  │  ├─ data/          ← corpora, anchor sets, LM files (cited per H18)
│  │  └─ model_intelligence.py ← HF leaderboard sync + scoring
│  ├─ reports/          ← experiment results, phase syntheses
│  ├─ data/          ← corpus data files (indus_cisi_corpus.json, dravidian_tamil_lm.json, ...)
│  └─ scripts/          ← utility and research scripts (phase44_*.py, build_*.py, ...)
├─ frontend/
│  ├─ src/              ← React source
│  │  └─ components/IndusEvidenceView.tsx ← Evidence Graph three-tab workspace
│  └─ dist/             ← built artefact (committed for server deploy)
├─ glossa-indus/        ← Indus Evidence Graph data store
│  ├─ config/sweep.yaml ← per-project sweep configuration (editable)
│  ├─ literature/       ← registered papers (JSON metadata)
│  ├─ claims/           ← extracted claims per document
│  ├─ hypotheses/       ← hypothesis model YAMLs
│  ├─ raw/user_uploads/ ← user-uploaded PDFs
│  └─ scripts/          ← intake + claims extraction pipeline
├─ tray/                ← system tray app
├─ docs/
│  ├─ USER_GUIDE.md
│  ├─ user-manual.md
│  ├─ architecture.md
│  └─ research/         ← decipherment research docs
├─ services/            ← systemd/launchd/Windows service definitions
└─ corpora/             ← external corpus downloads (gitignored, ~3 GB)

Quick start

Windows

# First-time install (registers autostart, installs deps)
setup-os.cmd install

# Start backend + tray
setup-os.cmd start

# Verify
curl.exe -sf http://localhost:8001/api/v1/health

Linux (systemd)

cd backend && python3 -m venv venv && venv/bin/pip install -e .
sudo systemctl start glossa-lab
curl -sf http://localhost:8001/api/v1/health

Open http://localhost:8001 in your browser.

Development workflow

All non-trivial work follows the proposal-first cycle in AGENTS.md. Frontend changes require a rebuild before they are visible:

cd frontend && npm run build
# Verify served bundle:
curl.exe -sf http://localhost:8001/ | Select-String 'index-[A-Za-z0-9]+\.js'

Research governance

H18 — Every data file must have _citation traceable to CITATIONS.md
H19 — Foundation check must PASS before external communication
Current: 13 PASS (archived) / 0 FAIL (GET /api/v1/research/foundation-check)
V8-V24 decipherment campaign archived 2026-05-17; INDUS_FINAL_ANCHORS.json (137 anchors) preserved

Documentation

File	Purpose
`AGENTS.md`	Agent rules, start/stop commands, hard rules
`LEDGER.md`	Session ledger — sole continuity authority
`CITATIONS.md`	Research data citation registry
`docs/USER_GUIDE.md`	Full user guide (all panels) including Evidence Graph
`docs/architecture.md`	System architecture including Evidence Graph layer
`docs/REQUIREMENTS.md`	Formal requirements (R1–R16, incl. R14 Evidence Graph, R15 DB reliability, R16 CI/CD)
`docs/TEST_SPEC.md`	Test specification (TEST-IEA, TEST-EV, TEST-PW-EG, TEST-CI)
`docs/research/`	Decipherment research documents
`docs/guides/`	How-to guides (experiments, pipelines, studies)
`glossa-indus/LEDGER.md`	Evidence Graph batch work log

Current research status (May 2026 — Phase-44)

137 verified anchors (7 HIGH, 54 MEDIUM, 75 LOW; 196 nīr-placeholder entries from V8-V24 archive removed)
7 HIGH-confidence readings: M342=ay/ā, M176=an/aṇ, M099=kol/koḷ, M062=erutu, M045=yānai, M016=kaḷiṟu, M006=puli
CISI corpus rebuilt: 179 inscriptions / 1003 tokens / 182 distinct signs
Dravidian LM expanded: 944 bigrams (from 184) via TamilTB v0.1 integration
Phase-44 T1 (M342 genitive): UNCERTAIN — cross-site Jaccard 0.429; anchor signs confirmed in genitive context
Phase-44 T2 (M99 phonetic): SUPPORTED — kol/koḷ (DEDR 2173/2174); M267→M099 title formula 84×
Phase-43 (May 2026): 231.9σ positional structure confirmed; Hunt tripartite formula 59× lift
Evidence Graph (May 2026): 11 papers registered, 22 claims extracted across Parpola/FSW/Yadav/Roif/Hunt
TB correlation: 0.907 (post M267 correction)
V8-V24 autonomous campaign archived 2026-05-17; INDUS_FINAL_ANCHORS.json preserved at 137 entries

Status

Production — active research. Backend and frontend fully operational at http://localhost:8001.

Name		Name	Last commit message	Last commit date
Latest commit History 533 Commits
.github		.github
analysis		analysis
backend		backend
communications		communications
crosswalks		crosswalks
data-import		data-import
data_normalized		data_normalized
data_raw		data_raw
docs		docs
frontend		frontend
glossa-corpus		glossa-corpus
glossa-indus		glossa-indus
ip		ip
predictions		predictions
publication		publication
reports		reports
scripts		scripts
services		services
snapshots		snapshots
tools		tools
tray		tray
validation		validation
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
AGENT.md		AGENT.md
AGENTS.md		AGENTS.md
CITATIONS.md		CITATIONS.md
LEDGER.md		LEDGER.md
LICENSE		LICENSE
README.md		README.md
analyze_fuls_ebooks.py		analyze_fuls_ebooks.py
generate_fuls_report.py		generate_fuls_report.py
ocr_mahadevan.py		ocr_mahadevan.py
process_paste.py		process_paste.py
scaffold.yml		scaffold.yml
setup-os.cmd		setup-os.cmd
setup-os.sh		setup-os.sh
shell.cmd		shell.cmd
shell.sh		shell.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

glossa-lab

Overview

System architecture

Key principles

Components

Backend (Python / FastAPI)

Frontend (React / TypeScript / Vite)

Tray (Windows/macOS)

Agent-Stack (layer1labs server — separate repo)

Repository structure

Quick start

Windows

Linux (systemd)

Development workflow

Research governance

Documentation

Current research status (May 2026 — Phase-44)

Status

About

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

glossa-lab

Overview

System architecture

Key principles

Components

Backend (Python / FastAPI)

Frontend (React / TypeScript / Vite)

Tray (Windows/macOS)

Agent-Stack (layer1labs server — separate repo)

Repository structure

Quick start

Windows

Linux (systemd)

Development workflow

Research governance

Documentation

Current research status (May 2026 — Phase-44)

Status

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages