A full-stack analytics app that lets users query a retail dataset in plain English. GPT-5 translates the input into ClickHouse SQL using a context-free grammar (CFG) that constrains the model's output to syntactically valid, safe SELECT statements.
```
User (browser)
      │
      ▼
Frontend — React + Vite + TypeScript
      │  POST /api/query { natural_language }
      ▼
Backend — Python + FastAPI
      │
      ├── LLM Service (llm.py)
      │     GPT-5 Responses API
      │     tools: [{ type: "custom", format: { type: "grammar", syntax: "lark", definition: CFG } }]
      │     → grammar-constrained SQL string
      │
      ├── CFG Grammar (cfg_grammar.py)  ◄── see below
      │
      └── ClickHouse Service (clickhouse.py)
            executes SQL against the train_clean view
            returns columns + rows
      │
      ▼
Frontend renders StatCard (single value) or paginated ResultsTable
```
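The LLM step above can be sketched as plain Python. This is a minimal sketch of the request payload the backend might assemble; the tool name, description, and placeholder grammar are illustrative assumptions, not the real `llm.py` or `cfg_grammar.py`:

```python
# Sketch of the Responses API payload built by the LLM service (llm.py).
# Field names mirror the tool spec shown in the diagram; the grammar
# definition and tool name here are placeholders, not the real CFG.

SQL_GRAMMAR = r'''
start: "SELECT" /.+/
'''  # placeholder; the real grammar lives in cfg_grammar.py


def build_query_payload(natural_language: str) -> dict:
    """Assemble a Responses API request whose custom-tool output is
    constrained by a Lark grammar."""
    return {
        "model": "gpt-5",
        "input": natural_language,
        "tools": [{
            "type": "custom",
            "name": "clickhouse_sql",  # hypothetical tool name
            "description": "Emit one safe SELECT over train_clean",
            "format": {
                "type": "grammar",
                "syntax": "lark",
                "definition": SQL_GRAMMAR,
            },
        }],
    }


payload = build_query_payload("total sales by region")
tool = payload["tools"][0]
print(tool["format"]["syntax"])  # -> lark
```

The payload would then be sent with the OpenAI SDK's `responses.create`; the dict above only shows the shape of the grammar-constrained tool entry.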
The grammar is the core safety and correctness mechanism. It is written in Lark EBNF and passed directly to GPT-5 via the Responses API. The model's token sampler is constrained by a pushdown automaton over the grammar — it is physically unable to emit tokens that violate the rules.
What the grammar enforces:
| Rule | Effect |
|---|---|
| `select_stmt` starts with `SELECT` | No INSERT, UPDATE, DELETE, DROP, CREATE |
| `FROM train_clean` is hardcoded | Cannot query any other table |
| `column_ref: COLUMN_NAME` | Only whitelisted columns can be referenced |
| `BLOCKED_NAME` terminal | `customer_name`, `credit_card`, `ssn`, `password` can never appear — not even as aliases |
| `agg_func` whitelist | Only `sum`, `count`, `avg`, `min`, `max` |
| Date functions | Only `subtractHours`, `subtractDays`, `toMonth`, `toYear`, etc. |
| No subqueries, JOINs, HAVING, UNION | Grammar has no production rules for them |
Key design: string terminals in Lark have higher priority than regex terminals. `BLOCKED_NAME: "customer_name" | "credit_card" | ...` causes the lexer to match those tokens before `IDENTIFIER` can, making them unusable anywhere in the grammar — column, alias, or string value.
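A condensed sketch of what such a grammar could look like. This is illustrative only, with an abbreviated column whitelist and optional clauses omitted; the real `cfg_grammar.py` is more complete:

```lark
start: select_stmt
select_stmt: "SELECT" select_list "FROM" "train_clean" group_clause?
select_list: select_item ("," select_item)*
select_item: column_ref | agg_func "(" column_ref ")"
group_clause: "GROUP" "BY" column_ref
agg_func: "sum" | "count" | "avg" | "min" | "max"
column_ref: COLUMN_NAME

// abbreviated whitelist; the real grammar lists every train_clean column
COLUMN_NAME: "region" | "category" | "sales" | "order_date"
// declared so the lexer claims these strings before any identifier terminal can
BLOCKED_NAME: "customer_name" | "credit_card" | "ssn" | "password"

%import common.WS
%ignore WS
```

Because the grammar simply has no productions for `JOIN`, subqueries, or DML, the constrained sampler can never reach those constructs.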
ClickHouse view: `train_clean` — Global Superstore retail dataset, 4 years of orders.
| Column | Type |
|---|---|
| order_id, customer_id, product_id | String |
| order_date, ship_date | DateTime |
| region, category, sub_category, segment | String |
| ship_mode, city, state, country | String |
| product_name | String |
| sales | Float64 |
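For orientation, an illustrative query of the kind the grammar permits against this view (assumed shape, using the aggregate and date-function whitelists above):

```sql
SELECT region, sum(sales) AS total_sales
FROM train_clean
WHERE order_date >= subtractDays(now(), 365)
GROUP BY region
ORDER BY total_sales DESC
LIMIT 10
```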
A script-based eval framework that tests the system across three dimensions. No test framework required — plain Python.
```
evals/
├── helpers.py            # run_query(nl) + normalize(sql)
├── test_consistency.py   # semantic equivalents → identical SQL
├── test_correctness.py   # SQL contains expected structural elements
├── test_safety.py        # adversarial inputs + CFG enforcement
└── run_evals.py          # runner with summary + exit code
```
Run:

```shell
cd evals
python run_evals.py
```

Categories:
- Consistency — semantically equivalent phrasings (e.g. "total sales by region" vs "show sales per region") produce identical SQL after normalization
- Correctness — generated SQL contains expected fragments (`sum(`, `group by`, `subtractdays`, etc.)
- Safety — adversarial inputs cannot produce destructive SQL, JOINs, PII column references, or out-of-scope table access
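The consistency check relies on a normalization step before comparing SQL strings. A minimal sketch of what `normalize` in `helpers.py` could do (assumed behavior; the real implementation may differ):

```python
import re


def normalize(sql: str) -> str:
    """Canonicalize generated SQL so semantically identical outputs
    compare equal: lowercase, collapse whitespace, strip a trailing
    semicolon. (Assumed behavior; the real helpers.py may differ.)"""
    sql = sql.strip().rstrip(";")
    sql = re.sub(r"\s+", " ", sql)  # collapse runs of whitespace/newlines
    return sql.lower()


a = normalize("SELECT region,  sum(sales)\nFROM train_clean GROUP BY region;")
b = normalize("select region, sum(sales) from train_clean group by region")
print(a == b)  # -> True
```

Comparing normalized strings means "total sales by region" and "show sales per region" count as consistent even if the model varies casing or whitespace between runs.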
Current score: 20/20
```shell
# Backend
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # fill in OPENAI_API_KEY + ClickHouse vars
uvicorn app.main:app --reload --port 8000
```

```shell
# Frontend
cd frontend
npm install
cp .env.example .env   # set VITE_API_BASE_URL=http://localhost:8000
npm run dev
```

Two Railway services:
| Service | Root Dir | Build | Start |
|---|---|---|---|
| Backend | `backend/` | `pip install -r requirements.txt` | `uvicorn app.main:app --host 0.0.0.0 --port $PORT` |
| Frontend | `frontend/` | `npm run build` | `npm run start` |
Set `VITE_API_BASE_URL` to the backend's Railway URL before the frontend build. Set `CORS_ORIGINS` in the backend to the frontend's Railway URL.
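Collected as a sketch, the deployment-time variables described above (hostnames are placeholders for the actual Railway service URLs):

```
# Frontend service (must be set before `npm run build`)
VITE_API_BASE_URL=https://<backend-service>.up.railway.app

# Backend service
CORS_ORIGINS=https://<frontend-service>.up.railway.app
OPENAI_API_KEY=sk-...
```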