Skip to content

aayushijain21/cfg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cfg — Natural Language to ClickHouse

A full-stack analytics app that lets users query a retail dataset in plain English. GPT-5 translates the input into ClickHouse SQL using a Context Free Grammar (CFG) that constrains the model's output to syntactically valid, safe SELECT statements.

Architecture

User (browser)
    │
    ▼
Frontend — React + Vite + TypeScript
    │  POST /api/query { natural_language }
    ▼
Backend — Python + FastAPI
    │
    ├── LLM Service (llm.py)
    │     GPT-5 Responses API
    │     tools: [{ type: "custom", format: { type: "grammar", syntax: "lark", definition: CFG } }]
    │     → grammar-constrained SQL string
    │
    ├── CFG Grammar (cfg_grammar.py)  ◄── see below
    │
    └── ClickHouse Service (clickhouse.py)
          executes SQL against train_clean view
          returns columns + rows
    │
    ▼
Frontend renders StatCard (single value) or paginated ResultsTable

Context Free Grammar (backend/app/services/cfg_grammar.py)

The grammar is the core safety and correctness mechanism. It is written in Lark EBNF and passed directly to GPT-5 via the Responses API. The model's token sampler is constrained by a pushdown automaton over the grammar — it is physically unable to emit tokens that violate the rules.

What the grammar enforces:

Rule Effect
select_stmt starts with SELECT No INSERT, UPDATE, DELETE, DROP, CREATE
FROM train_clean is hardcoded Cannot query any other table
column_ref: COLUMN_NAME Only whitelisted columns can be referenced
BLOCKED_NAME terminal customer_name, credit_card, ssn, password can never appear — not even as aliases
agg_func whitelist Only sum, count, avg, min, max
Date functions Only subtractHours, subtractDays, toMonth, toYear, etc.
No subqueries, JOINs, HAVING, UNION Grammar has no production rules for them

Key design: string terminals in Lark have higher priority than regex terminals. BLOCKED_NAME: "customer_name" | "credit_card" | ... causes the lexer to match those tokens before IDENTIFIER can, making them unusable anywhere in the grammar — column, alias, or string value.

Dataset

ClickHouse view: train_clean — Global Superstore retail dataset, 4 years of orders.

Column Type
order_id, customer_id, product_id String
order_date, ship_date DateTime
region, category, sub_category, segment String
ship_mode, city, state, country String
product_name String
sales Float64

Eval Suite (evals/)

A script-based eval framework that tests the system across three dimensions. No test framework required — plain Python.

evals/
├── helpers.py            # run_query(nl) + normalize(sql)
├── test_consistency.py   # semantic equivalents → identical SQL
├── test_correctness.py   # SQL contains expected structural elements
├── test_safety.py        # adversarial inputs + CFG enforcement
└── run_evals.py          # runner with summary + exit code

Run:

cd evals
python run_evals.py

Categories:

  • Consistency — semantically equivalent phrasings (e.g. "total sales by region" vs "show sales per region") produce identical SQL after normalization
  • Correctness — generated SQL contains expected fragments (sum(, group by, subtractdays, etc.)
  • Safety — adversarial inputs cannot produce destructive SQL, JOINs, PII column references, or out-of-scope table access

Current score: 20/20

Running Locally

# Backend
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env   # fill in OPENAI_API_KEY + ClickHouse vars
uvicorn app.main:app --reload --port 8000

# Frontend
cd frontend
npm install
cp .env.example .env   # set VITE_API_BASE_URL=http://localhost:8000
npm run dev

Deployment

Two Railway services:

Service Root Dir Build Start
Backend backend/ pip install -r requirements.txt uvicorn app.main:app --host 0.0.0.0 --port $PORT
Frontend frontend/ npm run build npm run start

Set VITE_API_BASE_URL to the backend Railway URL before the frontend build. Set CORS_ORIGINS in the backend to the frontend Railway URL.

About

An app implementing GPT 5's Context Free Grammar

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors