Extract structured data from PDFs, HTML files, and URLs into JSON or CSV.
smelt captures tables from any document, uses the Anthropic API to infer a typed schema, and outputs clean structured data. The LLM only names and types the columns — all extraction and coercion are deterministic Go.
```shell
go install github.com/akdavidsson/smelt@latest
```

Or build from source:

```shell
git clone https://github.com/akdavidsson/smelt
cd smelt
go build -o smelt .
```

Set your API key and run:

```shell
export ANTHROPIC_API_KEY=sk-ant-...
smelt https://en.wikipedia.org/wiki/List_of_countries_by_GDP_\(PPP\)_per_capita
```

Usage:

```
smelt [input] [flags]
```
input can be a local file path, an HTTP/HTTPS URL, or omitted to read from stdin.
```shell
# Extract from a URL (auto-selects the largest table)
smelt https://en.wikipedia.org/wiki/List_of_countries_by_GDP_\(PPP\)_per_capita

# Output as CSV
smelt report.html --format csv

# Save to a file
smelt annual_report.pdf --format csv --output data.csv

# Use a query hint to guide table selection and schema naming
smelt https://example.com/financials.html --query "revenue by region"

# Inspect all tables without an API key, then pick one
smelt report.html --raw
smelt report.html --table 2

# Extract every table as a JSON array
smelt https://en.wikipedia.org/wiki/List_of_S%26P_500_companies --all

# Print only the inferred schema
smelt data.html --schema

# Read from stdin
curl -s https://example.com/data.html | smelt --format csv

# Use a specific model
smelt data.html --model claude-opus-4-6

# JavaScript-rendered pages (React, Next.js, etc.)
smelt https://www.transfermarkt.com/premier-league/tabelle/wettbewerb/GB1 --headless

# Extra wait for slow SPAs that load data after idle
smelt https://example.com/spa --headless --wait 5
```

| Flag | Short | Default | Description |
|---|---|---|---|
| --format | -f | json | Output format: json, csv, parquet |
| --output | -o | | Write output to file instead of stdout |
| --query | -q | | Natural language hint for schema inference and table selection |
| --table | | 0 (auto) | Select Nth table by index (1-based); see --raw for indices |
| --all | | | Extract all tables; outputs a JSON array of {name, context, records} |
| --schema | | | Print the inferred schema as JSON and exit |
| --raw | | | Print extracted regions to stderr and exit (no API key required) |
| --model | | claude-sonnet-4-6 | Anthropic model to use (overrides config) |
| --headless | | | Fetch URL using headless Chromium (handles JS-rendered pages); auto-downloads Chromium if not present |
| --wait | | 0 | Extra seconds to wait after page idle, for SPAs with slow async loading (use with --headless) |
| --verbose | -v | | Enable verbose logging to stderr |
| --ocr | | | Enable OCR (not yet implemented) |
```
Input (file / URL / stdin)
          |
          v
Capture: parse all tables --> []Region   (pure Go: goquery for HTML, pdfcpu for PDF)
          |
          v
Select region: --table N | --all | auto (largest + query-matched)
          |
          v
Infer schema via Anthropic API   (single API call, JSON output)
          |
          v
Extract records against schema   (pure Go, soft type coercion)
          |
          v
Write JSON / CSV to stdout or file
```
Only one API call is made per run (or one per table with --all). The LLM receives a text sample of the table and returns a JSON schema with column names, types, and nullability. It never sees the full document.
```json
{
  "name": "gdp_ppp_per_capita",
  "description": "Countries ranked by GDP (PPP) per capita",
  "columns": [
    {"name": "rank", "type": "int", "nullable": false},
    {"name": "country", "type": "string", "nullable": false},
    {"name": "gdp_per_capita", "type": "int", "nullable": true}
  ]
}
```

Supported column types: string, int, float, bool, date, datetime.
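As a rough illustration, these six types map naturally onto Go's standard parsers. The sketch below is an assumption about how a cell could be converted, not smelt's actual internals; the function name and date layouts are invented.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// parseValue converts one raw cell string into a Go value for the given
// schema column type. Illustrative only: names and layouts are assumptions.
func parseValue(raw, colType string) (any, error) {
	switch colType {
	case "int":
		return strconv.ParseInt(raw, 10, 64)
	case "float":
		return strconv.ParseFloat(raw, 64)
	case "bool":
		return strconv.ParseBool(raw)
	case "date":
		return time.Parse("2006-01-02", raw)
	case "datetime":
		return time.Parse(time.RFC3339, raw)
	default: // "string" and anything unknown pass through unchanged
		return raw, nil
	}
}

func main() {
	v, _ := parseValue("1395", "int")
	fmt.Println(v) // 1395
	d, _ := parseValue("2024-01-31", "date")
	fmt.Println(d.(time.Time).Format("2006-01-02")) // 2024-01-31
}
```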
Use --raw to list all tables found in a document without making an API call:
```
$ smelt report.html --raw
--- Region 1 (Summary): table: 3 cols x 5 rows ---
...
--- Region 2 (Revenue by Quarter): table: 4 cols x 12 rows ---
...
```
Then extract a specific one:
```shell
smelt report.html --table 2
```

Or extract all at once:

```shell
smelt report.html --all
```

--all outputs a JSON array:

```
[
  {
    "name": "summary",
    "context": "Summary",
    "records": [...]
  },
  {
    "name": "revenue_by_quarter",
    "context": "Revenue by Quarter",
    "records": [...]
  }
]
```

Many modern sites (React, Next.js, Vue, etc.) load their table data client-side via JavaScript. A plain HTTP fetch returns an empty shell with no table content.
Use --headless to launch a real Chromium browser, execute the JavaScript, and return the fully rendered HTML:
```shell
smelt https://www.transfermarkt.com/premier-league/tabelle/wettbewerb/GB1 --headless
```

Chromium is auto-downloaded to ~/.cache/rod/ on first use if not already installed.

For SPAs that continue loading data after the initial idle signal, add --wait N:

```shell
smelt https://example.com/dashboard --headless --wait 5
```

Note: some sites actively detect and block headless browsers. In those cases smelt will still find any data embedded in the page's initial HTML (e.g. Next.js __NEXT_DATA__ JSON), but fully dynamic content cannot be retrieved without a real browser session.
The --query flag does two things:
- Boosts the score of tables whose heading matches your query terms, so --query "revenue" prefers a table titled "Revenue by Region" over a navigation table.
- Passes the hint to the LLM for more accurate schema naming.
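A toy sketch of what such a heading boost could look like. The scoring function, weights, and table sizes here are invented for illustration; smelt's real selection logic may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// scoreRegion ranks a table by cell count, doubling the score for every
// query term that appears in its heading. Purely illustrative numbers.
func scoreRegion(rows, cols int, heading, query string) int {
	score := rows * cols
	h := strings.ToLower(heading)
	for _, term := range strings.Fields(strings.ToLower(query)) {
		if strings.Contains(h, term) {
			score *= 2
		}
	}
	return score
}

func main() {
	// The matching heading wins despite the smaller table.
	fmt.Println(scoreRegion(12, 4, "Revenue by Region", "revenue")) // 96
	fmt.Println(scoreRegion(10, 3, "Site Navigation", "revenue"))   // 30
}
```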
```shell
smelt https://example.com/report.html --query "annual revenue by product line"
```

smelt reads configuration from the environment and an optional config file.
Environment variable:

```shell
export ANTHROPIC_API_KEY=sk-ant-...
```

Config file (~/.smelt/config.yaml):

```yaml
api_key: sk-ant-...
model: claude-opus-4-6
```

Environment variables take precedence over the config file. The --model flag takes precedence over both.
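The precedence chain amounts to a simple fall-through: first non-empty value wins. A minimal sketch, where the helper name and built-in default are assumptions; only the documented order (flag over environment over file) is taken from the text above.

```go
package main

import "fmt"

// resolve returns the first non-empty value in documented precedence
// order: command-line flag, then environment, then config file, then
// a built-in default. Illustrative only.
func resolve(flagVal, envVal, fileVal, def string) string {
	for _, v := range []string{flagVal, envVal, fileVal} {
		if v != "" {
			return v
		}
	}
	return def
}

func main() {
	// --model beats the config file's model entry.
	fmt.Println(resolve("claude-opus-4-6", "", "claude-sonnet-4-6", "claude-sonnet-4-6"))
	// With no flag or env set, the config file wins over the default.
	fmt.Println(resolve("", "", "claude-opus-4-6", "claude-sonnet-4-6"))
}
```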
- stdout — structured data only (JSON or CSV)
- stderr — warnings, verbose logs, and --raw region dumps
This makes smelt pipeline-friendly:
```shell
smelt https://example.com/data.html --format csv | csvkit | ...
smelt report.pdf | jq '.[] | select(.value > 1000)'
```

Type coercion is soft: if a value cannot be parsed to the inferred type, smelt emits a warning on stderr and falls back to the raw string (or null for nullable columns), rather than aborting.
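That fallback behavior might look roughly like this for an int column. This is a sketch under assumed names; only the warn-on-stderr and fall-back semantics come from the text above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// coerceInt applies "soft" coercion: on parse failure it warns on stderr
// and returns nil for a nullable column, or the raw string otherwise,
// instead of aborting the run. Names are illustrative.
func coerceInt(raw string, nullable bool) any {
	if n, err := strconv.ParseInt(raw, 10, 64); err == nil {
		return n
	}
	fmt.Fprintf(os.Stderr, "warning: cannot coerce %q to int\n", raw)
	if nullable {
		return nil
	}
	return raw
}

func main() {
	fmt.Println(coerceInt("1395", false)) // 1395
	fmt.Println(coerceInt("n/a", true))   // <nil>
	fmt.Println(coerceInt("n/a", false))  // n/a
}
```

Keeping the warning on stderr means a bad cell never corrupts the structured stream on stdout.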
- Go 1.24+
- ANTHROPIC_API_KEY (not required for --raw)
