Agentic computational linguistics research platform for statistical analysis, decipherment, and hypothesis testing of ancient and unknown writing systems — with a primary focus on the Indus Script (Mahadevan corpus, Holdat LLC dataset) using methods developed by Dr. Andreas Fuls (TU Berlin / ICIT).
Built and maintained by Layer1Labs Silicon, Inc.
Glossa Lab is a production research tool combining a Python backend, React frontend, and Windows/Linux/macOS service support. It provides an end-to-end environment for:
- Corpus management — upload, register, inspect, and sanitise sign-sequence corpora
- Statistical analysis — entropy, Zipf, positional profiles (T/I/M), writing-system classification
- Decipherment experiments — SA-based sign-to-phoneme hypothesis generation, benchmarks vs known scripts
- Experiment Builder — composable graph experiments using atomic nodes (no coding required); new Evidence Graph category with 7 nodes for comparative literature analysis
- Study Builder — multi-experiment research workflows as visual graphs
- Glossa AI — embedded research assistant that runs analyses, proposes hypotheses, and navigates the tool
- Discovery engine — continuous literature discovery across arXiv, EuropePMC, CrossRef, DOAJ and more
- Evidence Graph — per-project literature library, automated paper sweep (configurable via `sweep.yaml`), claim extraction, cross-hypothesis falsification matrix, and hidden hypothesis generation
- AI Provider Registry — unified management of cloud (OpenAI, Anthropic, Mistral, Google…), local (Ollama), and self-hosted (vLLM) AI backends with model scoring and smart assignment
- Reports & Data — PDF, Markdown, JSON, CSV export of all results
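To illustrate the statistical analysis item in the list above, here is a minimal sketch of two of the listed measures: unigram Shannon entropy and a positional profile. It is not Glossa Lab's implementation. The function names, the list-of-lists corpus shape, and the reading of T/I/M as terminal/initial/medial are assumptions made for this example, and the toy corpus uses placeholder labels rather than real sign IDs.

```python
import math
from collections import Counter


def shannon_entropy(sequences):
    """Unigram Shannon entropy, in bits per sign, over every token in the corpus."""
    counts = Counter(sign for seq in sequences for sign in seq)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def positional_profile(sequences):
    """Count how often each sign occurs in initial (I), medial (M), terminal (T) position.

    Assumption: each sequence is stored in reading order. A one-sign
    sequence is counted as initial only.
    """
    profile = {}
    for seq in sequences:
        last = len(seq) - 1
        for i, sign in enumerate(seq):
            if i == 0:
                slot = "I"
            elif i == last:
                slot = "T"
            else:
                slot = "M"
            profile.setdefault(sign, Counter())[slot] += 1
    return profile


# Toy corpus with placeholder labels (not real inscriptions).
toy_corpus = [["A", "B", "C"], ["A", "B"], ["D", "C"]]
print(round(shannon_entropy(toy_corpus), 3))
print(positional_profile(toy_corpus))
```

A sign that is concentrated in one slot (here `A` is always initial) is the kind of pattern a positional profile is meant to surface.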
[ Tray ] ─────┐
              │
[ Frontend ] ─┼──→ [ Backend Service (FastAPI) ] ──→ [ Pipelines / Jobs / Models ]
              │                  │
[ CLI / Dev ] ┘            [ SQLite DB ]
                                 │
                       [ Provider Registry ] ──→ [ Cloud / Ollama / vLLM ]
- The backend is the source of truth
- The tray and frontend are interfaces, not runtime owners
- All communication occurs through explicit REST APIs
- Service lifecycle is deterministic and observable — every background process logs START/COMPLETE
- REST API + background job engine
- SQLite database (providers, model scores, discovery items, experiments, studies)
- AI provider registry with test/probe on startup and on-demand
- HuggingFace Open LLM Leaderboard sync (nightly) + static fallback scores
- Discovery engine with 10+ fetchers (arXiv, EuropePMC, CrossRef, PubMed, DOAJ…)
- RAG index for research context injection
- Ollama auto-detection and lifecycle management
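The principle that every background process logs START/COMPLETE can be sketched with a small context manager. This is an illustrative pattern only, not the backend's actual job engine: the logger name `glossa_lab.jobs`, the helper name `tracked`, and the extra FAILED line are assumptions introduced for this example.

```python
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, chosen for this sketch.
logger = logging.getLogger("glossa_lab.jobs")


@contextmanager
def tracked(name):
    """Log START when the block begins and COMPLETE (or FAILED) when it ends."""
    started = time.monotonic()
    logger.info("START %s", name)
    try:
        yield
    except Exception:
        # FAILED is an assumption of this sketch; the README only names START/COMPLETE.
        logger.exception("FAILED %s after %.2fs", name, time.monotonic() - started)
        raise
    else:
        logger.info("COMPLETE %s in %.2fs", name, time.monotonic() - started)
```

Wrapping a job body as `with tracked("leaderboard-sync"): ...` (a hypothetical job name) pairs every START line with exactly one terminal line, which is what makes the lifecycle observable from the logs alone.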
The built artefact (frontend/dist/) is committed to the repo, so the server only needs git pull; Node.js is not required on the deployment target.
Key panels:
- Provider Registry — add/test/manage AI providers; badges: 🦙 Ollama · ☁️ Cloud · ⚡ vLLM/Custom · 🤗 HuggingFace
- Model Assignments — assign primary/fallback models per bucket (Reasoning / Conversational / Long-form / Global) with draft/apply workflow, scores, filter, and swap
- Experiment Builder — visual DAG editor with `Evidence Graph` palette category (7 nodes)
- Study Builder — multi-experiment research workflows (accessible via Projects)
- Discovery View — literature feed with `→ Evidence` import action for Indus/Harappan items
- Evidence Graph — three-tab workspace: Library (PDF upload, URL import), Claims (filterable), Sweep (configurable sweep + candidate import)
- Foundation Check — research integrity dashboard (17 checks; must be PASS before external communication)
- Bottom Panel — structured Logs (JSON → human-readable), Jobs, Terminal
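The primary/fallback idea behind the Model Assignments panel can be illustrated with a short resolution routine. This is a sketch under stated assumptions, not the registry's real logic: the `Assignment` type, the `resolve_model` function, the lowercase bucket keys, and the rule that the Global bucket acts as a catch-all are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class Assignment:
    """Primary and optional fallback model for one bucket (hypothetical shape)."""
    primary: str
    fallback: Optional[str] = None


def resolve_model(bucket: str,
                  assignments: Dict[str, Assignment],
                  available: Set[str]) -> Optional[str]:
    """Return the first available model: bucket primary, bucket fallback, then Global.

    Assumption: "global" is consulted only after the requested bucket has
    no available model.
    """
    for key in (bucket, "global"):
        assignment = assignments.get(key)
        if assignment is None:
            continue
        for model in (assignment.primary, assignment.fallback):
            if model and model in available:
                return model
    return None


# Placeholder model names, not real provider entries.
assignments = {
    "reasoning": Assignment(primary="model-a", fallback="model-b"),
    "global": Assignment(primary="model-c"),
}
print(resolve_model("reasoning", assignments, {"model-b", "model-c"}))
```

Keeping resolution a pure function of the assignment table and the set of reachable models makes a draft easy to evaluate before it is applied.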
The tray app is a local control surface: start, stop, or restart the backend, open the UI, and check status at a glance.
Three vLLM services on NVIDIA RTX PRO 5000 Blackwell (48 GB GDDR7):
- l1-nexus (port 8000) — `cpatonn/Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit` — primary coding/agentic
- l1-glossa (port 8001) — `Qwen/Qwen3-14B` — research/long-context reasoning
- l1-embed (port 8002) — `BAAI/bge-m3` — embeddings (RAG)
Access via Tailscale (100.118.107.3). Repo: layer1labs/agent-stack.
glossa-lab/
├─ AGENTS.md ← agent operating rules (read first, every session)
├─ LEDGER.md ← session ledger (sole continuity authority)
├─ README.md
├─ CITATIONS.md ← citation registry for all research data
├─ setup-os.cmd ← canonical start/stop/restart (Windows)
├─ shell.cmd ← tool wrapper (pytest, ruff, python — Windows)
├─ shell.sh ← tool wrapper (Linux/macOS)
├─ .github/
│ └─ workflows/ci.yml ← GitHub Actions CI (pytest + Playwright + evidence scripts)
├─ backend/
│ ├─ glossa_lab/ ← FastAPI app + all Python modules
│ │ ├─ api/ ← REST route modules
│ │ │ └─ indus_evidence.py ← Evidence Graph API (library, claims, sweep)
│ │ ├─ experiments/ ← ExperimentBase subclasses + graph JSONs
│ │ ├─ experiment_graph_indus_evidence.py ← 7 Evidence Graph atomic nodes
│ │ ├─ discovery/ ← literature discovery engine + fetchers
│ │ ├─ data/ ← corpora, anchor sets, LM files (cited per H18)
│ │ └─ model_intelligence.py ← HF leaderboard sync + scoring
│ ├─ reports/ ← experiment results, phase syntheses
│ ├─ data/ ← corpus data files (indus_cisi_corpus.json, dravidian_tamil_lm.json, ...)
│ └─ scripts/ ← utility and research scripts (phase44_*.py, build_*.py, ...)
├─ frontend/
│ ├─ src/ ← React source
│ │ └─ components/IndusEvidenceView.tsx ← Evidence Graph three-tab workspace
│ └─ dist/ ← built artefact (committed for server deploy)
├─ glossa-indus/ ← Indus Evidence Graph data store
│ ├─ config/sweep.yaml ← per-project sweep configuration (editable)
│ ├─ literature/ ← registered papers (JSON metadata)
│ ├─ claims/ ← extracted claims per document
│ ├─ hypotheses/ ← hypothesis model YAMLs
│ ├─ raw/user_uploads/ ← user-uploaded PDFs
│ └─ scripts/ ← intake + claims extraction pipeline
├─ tray/ ← system tray app
├─ docs/
│ ├─ USER_GUIDE.md
│ ├─ user-manual.md
│ ├─ architecture.md
│ └─ research/ ← decipherment research docs
├─ services/ ← systemd/launchd/Windows service definitions
└─ corpora/ ← external corpus downloads (gitignored, ~3 GB)
# First-time install (registers autostart, installs deps)
setup-os.cmd install
# Start backend + tray
setup-os.cmd start
# Verify
curl.exe -sf http://localhost:8001/api/v1/health
# Linux (systemd): install into a virtualenv, start the service, verify
cd backend && python3 -m venv venv && venv/bin/pip install -e .
sudo systemctl start glossa-lab
curl -sf http://localhost:8001/api/v1/health
Open http://localhost:8001 in your browser.
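The verification step can also be scripted, which is useful when the service takes a few seconds to come up. The sketch below polls the health endpoint named in the commands above using only the standard library. The helper names, the retry count, and the delay are assumptions for illustration; the response body is not inspected because its format is not documented here.

```python
import time
import urllib.request

HEALTH_URL = "http://localhost:8001/api/v1/health"


def is_healthy(url=HEALTH_URL, timeout=2.0):
    """Return True if the endpoint answers with a 2xx status.

    Like `curl -f`, HTTP errors count as failure: urlopen raises
    HTTPError (a subclass of OSError) for 4xx/5xx responses.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_until_healthy(probe=is_healthy, attempts=10, delay=1.0):
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` right after starting the service returns `True` as soon as the backend responds, or `False` once the attempts are exhausted.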
All non-trivial work follows the proposal-first cycle in AGENTS.md. Frontend changes require a rebuild before they are visible:
cd frontend && npm run build
# Verify served bundle:
curl.exe -sf http://localhost:8001/ | Select-String 'index-[A-Za-z0-9]+\.js'
- H18 — Every data file must have `_citation` traceable to `CITATIONS.md`
- H19 — Foundation check must PASS before external communication
- Current: 13 PASS (archived) / 0 FAIL (`GET /api/v1/research/foundation-check`)
- V8-V24 decipherment campaign archived 2026-05-17; INDUS_FINAL_ANCHORS.json (137 anchors) preserved
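Rule H18 lends itself to a quick local check. The following sketch lists JSON files that lack a `_citation` entry. It rests on assumptions not confirmed by this README: that `_citation` is a top-level key, that only `.json` files need checking, and that a non-empty value is sufficient. It does not verify that the citation actually appears in `CITATIONS.md`, and it is not the project's foundation check.

```python
import json
from pathlib import Path


def files_missing_citation(data_dir):
    """Return JSON files under `data_dir` with no non-empty top-level `_citation`."""
    missing = []
    for path in sorted(Path(data_dir).rglob("*.json")):
        try:
            payload = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # An unreadable file cannot prove it is cited, so report it too.
            missing.append(path)
            continue
        if not (isinstance(payload, dict) and payload.get("_citation")):
            missing.append(path)
    return missing
```

Running `files_missing_citation("backend/data")` from the repository root would return an empty list when every JSON data file carries the key.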
| File | Purpose |
|---|---|
| `AGENTS.md` | Agent rules, start/stop commands, hard rules |
| `LEDGER.md` | Session ledger — sole continuity authority |
| `CITATIONS.md` | Research data citation registry |
| `docs/USER_GUIDE.md` | Full user guide (all panels) including Evidence Graph |
| `docs/architecture.md` | System architecture including Evidence Graph layer |
| `docs/REQUIREMENTS.md` | Formal requirements (R1–R16, incl. R14 Evidence Graph, R15 DB reliability, R16 CI/CD) |
| `docs/TEST_SPEC.md` | Test specification (TEST-IEA, TEST-EV, TEST-PW-EG, TEST-CI) |
| `docs/research/` | Decipherment research documents |
| `docs/guides/` | How-to guides (experiments, pipelines, studies) |
| `glossa-indus/LEDGER.md` | Evidence Graph batch work log |
- 137 verified anchors (7 HIGH, 54 MEDIUM, 75 LOW; 196 nīr-placeholder entries from V8-V24 archive removed)
- 7 HIGH-confidence readings: M342=ay/ā, M176=an/aṇ, M099=kol/koḷ, M062=erutu, M045=yānai, M016=kaḷiṟu, M006=puli
- CISI corpus rebuilt: 179 inscriptions / 1003 tokens / 182 distinct signs
- Dravidian LM expanded: 944 bigrams (from 184) via TamilTB v0.1 integration
- Phase-44 T1 (M342 genitive): UNCERTAIN — cross-site Jaccard 0.429; anchor signs confirmed in genitive context
- Phase-44 T2 (M99 phonetic): SUPPORTED — kol/koḷ (DEDR 2173/2174); M267→M099 title formula 84×
- Phase-43 (May 2026): 231.9σ positional structure confirmed; Hunt tripartite formula 59× lift
- Evidence Graph (May 2026): 11 papers registered, 22 claims extracted across Parpola/FSW/Yadav/Roif/Hunt
- TB correlation: 0.907 (post M267 correction)
- V8-V24 autonomous campaign archived 2026-05-17; INDUS_FINAL_ANCHORS.json preserved at 137 entries
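The cross-site Jaccard figure quoted for Phase-44 T1 is a set-overlap measure. As a reminder of the definition only, the sketch below computes Jaccard similarity for two sets. The README does not state which sets were compared across sites, so the inputs are placeholder labels, not project data, and the function name is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch: two empty sets score 0
    return len(a & b) / len(union)


# Placeholder labels standing in for whatever is compared between two sites.
site_x = {"A", "B", "C"}
site_y = {"B", "C", "D"}
print(jaccard(site_x, site_y))  # 2 shared / 4 total = 0.5
```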
Production — active research. Backend and frontend fully operational at http://localhost:8001.