A production-grade Apache/Nginx access log analyzer purpose-built for SEO and crawl-budget analysis on multi-gigabyte log files. Streams logs in constant memory, identifies 70+ bot families (including modern AI crawlers like GPTBot, ClaudeBot, PerplexityBot, Bytespider), runs reverse-DNS verification to detect spoofed bots, and ships with a React dashboard for filtering, sorting, and visualizing 24 distinct reports.
Tested at scale: a single 16 GB log file with 50.9 million lines completes end-to-end in roughly 2 hours on a consumer laptop, producing 24 CSV reports plus an interactive web UI — without any external Python dependencies beyond Flask for the API.
Most off-the-shelf log analyzers fall into two categories: cloud SaaS that requires uploading raw logs (privacy issues, slow, expensive at scale), or generic GUI tools that load the entire file into memory (impossible for multi-GB files) and ship with bot databases that are stuck in 2020 — missing every AI crawler that has appeared in the last three years.
This tool was built to solve a specific, real problem: understanding which bots are actually consuming your crawl budget and bandwidth at the URL/IP level, on production-scale logs, with a UI you can actually navigate. The 11 GB and 16 GB log files used during development came from real Turkish e-commerce sites (freshscarfs.com, manuka.com.tr) and surfaced concrete findings — for example, a single broken image template pattern was responsible for 4 million 404 errors on one site, and 51% of "unknown" bot traffic turned out to be Meta's ad crawler hiding under a generic bot user-agent fragment.
- **AI / LLM crawlers** — `GPTBot`, `ChatGPT-User`, `OAI-SearchBot`, `ChatGPT`, `ClaudeBot`, `anthropic-ai`, `Claude-Web`, `PerplexityBot`, `Perplexity-User`, `Google-Extended`, `Bytespider` (TikTok), `CCBot` (Common Crawl), `Meta-ExternalAgent` (Llama), `FacebookBot`, `Amazonbot`, `Applebot-Extended`, `cohere-ai`, `Diffbot`, `YouBot`, `PetalBot` (Huawei), `ImagesiftBot`, `omgilibot`, `TikTokSpider`
- **Search engines** — `Googlebot` and all sub-crawlers (`Smartphone`, `Image`, `Video`, `News`), `AdsBot-Google`, `AdsBot-Google-Mobile`, `Bingbot`, `BingPreview`, `YandexBot`, `YandexImages`, `YandexMobileBot`, `DuckDuckBot`, `Applebot`, `Baiduspider`, `Sogou`, `Naverbot`/`Yeti`, `Seznambot`, `Qwantify`, `Qwantbot`, `coccocbot`, `coccocbot-image`
- **SEO crawlers** — `AhrefsBot`, `SemrushBot`, `MJ12bot`, `DotBot`, `MajesticSEO`, `BLEXBot`, `DataForSeoBot`, `Screaming Frog SEO Spider`, `SerpstatBot`, `SiteAuditBot`
- **Social previews** — `facebookexternalhit`, `meta-externalads`, `meta-webindexer`, `Twitterbot`, `LinkedInBot`, `Pinterest`, `Slackbot`, `Discordbot`, `TelegramBot`, `WhatsApp`, `Embedly`, `Skype URI Preview`, `Snap URL Preview`, `AdBot`
- **Uptime / verification** — `UptimeRobot`, `Pingdom`, `StatusCake`, `Site24x7`, `NewRelicPinger`, `AASA-Bot` (Apple Universal Links), `AwarioBot` (social listening)
- **Generic fallbacks** — UAs containing `crawler`, `spider`, or `bot` that match no specific entry are surfaced as `Bilinmeyen Crawler`/`Spider`/`Bot` ("Bilinmeyen" is Turkish for "unknown") so you can quickly identify what's hiding in the long tail
The matching engine fixes a class of bugs common in homegrown analyzers where generic keywords like `bot` match before `googlebot`, mislabeling all major bots. Order is enforced specific-to-generic at module load time.
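To make the ordering rule concrete, here is a minimal sketch. `BOT_DEFINITIONS` and `identify_bot` are real names from `bots.py`, but the tuple shape and the load-time guard shown here are illustrative assumptions, not the actual implementation:

```python
from typing import Optional

# Illustrative data shape; the real BOT_DEFINITIONS in bots.py is far longer.
BOT_DEFINITIONS = [
    ("Googlebot", "googlebot"),        # specific entries first...
    ("AhrefsBot", "ahrefsbot"),
    ("Bilinmeyen Crawler", "crawler"),
    ("Bilinmeyen Spider", "spider"),
    ("Bilinmeyen Bot", "bot"),         # ...generic fallbacks last
]

GENERIC_TOKENS = {"bot", "crawler", "spider"}

# Module-load-time guard: a generic token must never appear before a more
# specific token that contains it ("bot" would otherwise swallow "googlebot").
for i, (_, token) in enumerate(BOT_DEFINITIONS):
    if token in GENERIC_TOKENS:
        for _, later in BOT_DEFINITIONS[i + 1:]:
            assert token not in later, f"{token!r} would shadow {later!r}"

def identify_bot(user_agent: str) -> Optional[str]:
    """Return the first (most specific) matching bot name, or None."""
    ua = user_agent.lower()
    for name, token in BOT_DEFINITIONS:
        if token in ua:
            return name
    return None
```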
Forward-confirmed PTR lookups for Googlebot, Bingbot, YandexBot, YandexMobileBot, DuckDuckBot, Applebot, and Baiduspider. Detects spoofed crawlers that mimic legitimate user-agent strings — surfaced as a `verified=False` row with the rejection reason. Cached via `functools.lru_cache` to avoid DNS storms on large logs; capped at the top-1000 most-requesting search-bot IPs for bounded runtime.
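A minimal sketch of the forward-confirmed flow (`verify_bot_cached` is the real name in `bots.py`, but the suffix table and return format here are illustrative assumptions):

```python
import socket
from functools import lru_cache

# Suffixes a PTR record must end with for the bot to count as genuine.
# Illustrative subset; the real table covers all seven verified bot families.
PTR_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}

@lru_cache(maxsize=4096)  # each IP is resolved at most once per run
def verify_bot_cached(ip, bot_name):
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then an
    A-record lookup of the PTR hostname must resolve back to the same IP."""
    suffixes = PTR_SUFFIXES.get(bot_name)
    if not suffixes:
        return True, "no-verification-rule"
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse (PTR) lookup
    except socket.herror:
        return False, "no-ptr-record"
    if not host.endswith(suffixes):
        return False, "ptr-outside-official-domain:" + host
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward confirmation
    except socket.gaierror:
        return False, "forward-lookup-failed"
    if ip not in addrs:
        return False, "forward-mismatch"
    return True, host
```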
| Category | Reports |
|---|---|
| Overview | Hourly time series · Hourly anomaly detection (2σ baseline; see the sketch after this table) |
| Bots | Bot summary · Categories · AI bot detail · Daily bot crawls · Mobile vs desktop split (search bots) · DNS verification results |
| Traffic | Top crawled URLs · URLs with query parameters · Per-IP request rate (peak RPS, peak RPM, sustained rate — for scraper / DoS detection) · Referer analysis |
| Errors | 404 errors · 5xx errors (separate, not bucketed) · 3xx redirects · Other 4xx (401/403/410/429) · Soft 404 candidates (HTTP 200 with suspiciously small response body) |
| Bandwidth | Per category · Top URLs · Top IPs · Per bot name · Crawl budget breakdown (HTML vs image vs script vs CSS vs font vs media vs document) |
| SEO | Sitemap URLs not crawled (uncrawled URLs) · robots.txt violations |
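The 2σ baseline in the hourly anomaly report can be read as flagging hours that deviate from the mean hourly request count by more than two standard deviations. A sketch under that assumption (not the report's actual code):

```python
from statistics import mean, stdev

def flag_anomalous_hours(hourly_counts):
    """Flag hours more than two standard deviations from the mean.
    `hourly_counts` maps 'YYYY-MM-DD HH' -> request count."""
    counts = list(hourly_counts.values())
    if len(counts) < 2:
        return []                      # stdev needs at least two data points
    mu, sigma = mean(counts), stdev(counts)
    return [
        (hour, count)
        for hour, count in sorted(hourly_counts.items())
        if sigma > 0 and abs(count - mu) > 2 * sigma
    ]
```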
The core analyzer reads logs line-by-line, dispatching each parsed record to 8 stateful analyzer classes. Memory grows only with unique URLs/IPs/referers — not with total request count — as the sketch after the numbers below illustrates. Real-world peak RSS:
- 11.3 GB / 31.8M lines → ~600 MB peak
- 16.2 GB / 50.9M lines → ~2 GB peak
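The constant-memory behaviour comes from the dispatch pattern: one line is parsed at a time and only per-key aggregates survive. A simplified sketch (the real regex lives in `python_log_analyzer.py`; the analyzer class and its `feed` method are illustrative assumptions about the interface):

```python
import re
from collections import Counter

# Simplified Combined Log Format pattern; a stand-in for the real one.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

class TopUrlAnalyzer:
    """Example of the stateful-analyzer shape: state is one Counter keyed by
    URL, so RSS tracks unique URLs rather than total line count."""
    def __init__(self):
        self.hits = Counter()

    def feed(self, record):
        self.hits[record["url"]] += 1

def run_analysis(log_path, analyzers):
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:                  # one line in memory at a time
            m = LINE_RE.match(line)
            if not m:
                continue                 # skip malformed lines rather than abort
            record = m.groupdict()
            for analyzer in analyzers:   # the real pipeline fans out to 8 of these
                analyzer.feed(record)
```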
`--sitemap` accepts either a local CSV file path or an `http(s)://` URL pointing to an XML sitemap (single `<urlset>` or `<sitemapindex>` with up to 200 child sitemaps, depth-2 max). Handles gzip-compressed responses transparently. Used to compute the "sitemap URLs not crawled by bots" report — a critical SEO metric that surfaces orphan content.
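A stdlib-only sketch of that fetch-and-parse path, consistent with the project's no-external-dependencies constraint (`load_sitemap_urls` is a hypothetical name; the actual loader lives in `python_log_analyzer.py`):

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def _fetch_xml(url):
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = resp.read()
    if data[:2] == b"\x1f\x8b":          # transparent gzip handling
        data = gzip.decompress(data)
    return ET.fromstring(data)

def load_sitemap_urls(url, max_children=200):
    """Return all <loc> URLs from a <urlset>, or from up to `max_children`
    child sitemaps of a <sitemapindex> (depth 2, matching the CLI's limits)."""
    root = _fetch_xml(url)
    if root.tag == NS + "sitemapindex":
        children = [loc.text.strip() for loc in root.iter(NS + "loc")]
        urls = []
        for child in children[:max_children]:
            child_root = _fetch_xml(child)   # depth 2: children must be <urlset>s
            urls.extend(loc.text.strip() for loc in child_root.iter(NS + "loc"))
        return urls
    return [loc.text.strip() for loc in root.iter(NS + "loc")]
```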
- Stack: React 18 + Vite 5 + TypeScript (strict) + Tailwind CSS 3 + TanStack Query 5 + TanStack Table 8 + Recharts + React Router 6 + Lucide-React
- Layout: Sidebar with 6 categorized navigation groups; main content area; responsive desktop-first
- Per-report features:
- Server-side pagination (100 rows/page, configurable)
- Multi-column sorting (click headers, descending toggle on second click)
- Debounced full-text search across all string columns
- Locale-aware number formatting (`1.234.567,89` Turkish style)
- URL truncation with hover-to-reveal full path
- IP columns rendered in monospace
- Interactive charts (bar, horizontal bar, line, pie) — clicking a bar applies that label as a search filter on the table below
- One-click CSV download of the original file
- Live progress streaming: Server-Sent Events feed real-time line counts and log tail to the frontend during analysis runs; no polling
- Multi-job UI: Side-by-side comparison of past analyses (different sites, different dates) without re-running
| Endpoint | Purpose |
|---|---|
| `POST /api/jobs` | Start a new analysis (JSON body with log path + site URL + sitemap, or multipart upload for small files) |
| `GET /api/jobs` | List all jobs with status, progress, duration |
| `GET /api/jobs/:id` | Job detail including grouped report listing |
| `GET /api/jobs/:id/reports/:filename` | Paginated/filterable/sortable JSON view of a CSV report |
| `GET /api/jobs/:id/reports/:filename/download` | Raw CSV download |
| `GET /api/jobs/:id/summary` | Aggregated KPIs across multiple CSVs |
| `GET /api/jobs/:id/stream` | Server-Sent Events stream for live progress |
| `DELETE /api/jobs/:id` | Delete job and its output directory |
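For example, starting a job from a script and following its progress over SSE might look like this (stdlib only; the JSON field names `log_path`, `site_url`, and `sitemap`, and the `id` key in the response, are assumptions; check `server/app.py` for the exact schema):

```python
import json
import urllib.request

BASE = "http://localhost:8000"   # whichever port run.sh picked (see server/.port)

body = json.dumps({
    "log_path": "/var/log/nginx/access.log",   # field names are illustrative
    "site_url": "https://www.example.com/",
    "sitemap": "https://www.example.com/sitemap.xml",
}).encode()

req = urllib.request.Request(
    f"{BASE}/api/jobs", data=body,
    headers={"Content-Type": "application/json"}, method="POST",
)
with urllib.request.urlopen(req) as resp:
    job = json.load(resp)

# Follow live progress: SSE payload lines are prefixed with "data: ".
with urllib.request.urlopen(f"{BASE}/api/jobs/{job['id']}/stream") as stream:
    for raw in stream:
        line = raw.decode("utf-8").strip()
        if line.startswith("data: "):
            print(line[len("data: "):])
```

Since the dashboard is built on these same endpoints, anything the UI shows can also be scripted.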
- Path traversal protection via filename allowlist
- Background job execution via `ThreadPoolExecutor(max_workers=1)` so memory-heavy 16 GB jobs don't compete
- Auto-port selection: tries 8000–8019, picks the first free port, never kills existing processes (see the sketch after this list)
- CORS enabled for any localhost origin (development convenience)
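A sketch of the auto-port idea (the real logic lives in `server/port_picker.py` and may differ in detail):

```python
import socket

def pick_free_port(start=8000, end=8019):
    """Bind-test each candidate port and return the first free one."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue      # port taken: leave the existing process running
            return port
    raise RuntimeError("no free port in 8000-8019")
```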
- `investigate_unknown_bots.py` — streams a log and clusters `Bilinmeyen Bot/Crawler/Spider` traffic by user-agent and IP, then runs reverse-DNS on the top IPs to identify what's hiding in the long tail. Useful when adding new bots to `bots.py`.
- `patch_bot_reports.py` — re-classifies bot-related reports against an updated `bots.py` without rerunning the full 24-report pipeline. Useful after extending the bot dictionary; reduces 50-minute reruns to ~37 minutes.
- `ip_bot_analyzer.py` — standalone CLI for IP × bot breakdown with rate analysis. Lighter weight than the full analyzer when only IP-level data is needed.
```bash
cd server
pip install -r requirements.txt   # flask, flask-cors — that's it
```

```bash
cd web
npm install
```

Python 3.9+ and Node 18+ recommended.
```bash
python3 python_log_analyzer.py \
  --log-file /path/to/access.log \
  --site-url https://www.example.com/ \
  --output-dir analiz_sonuclari/ \
  --sitemap https://www.example.com/sitemap.xml \
  --skip-bot-verification \
  --soft-404-threshold 512 \
  --top-urls-limit 500
```

Flags:

- `--log-file` (required) — Path to an Apache/Nginx Combined Log file
- `--site-url` — Origin used for the `robots.txt` fetch and absolute URL resolution
- `--sitemap` — CSV path (one URL per line) or XML sitemap URL (HTTP(S))
- `--skip-bot-verification` — Skip DNS lookups (recommended for first runs on large logs; verification can be re-run on demand later)
- `--soft-404-threshold` — Bytes below which an HTTP 200 response is flagged as a soft-404 candidate (default: 512)
- `--top-urls-limit` — Number of top URLs in the "most-crawled" report (default: 500)
In two terminals:
```bash
# Terminal 1 — backend
bash server/run.sh
# Picks first free port from 8000–8019; writes choice to server/.port
```

```bash
# Terminal 2 — frontend
bash web/run.sh
# Reads backend port from server/.port; Vite picks first free port from 5174 onward
```

Open the URL Vite prints (typically http://localhost:5174). From the dashboard:
- Click Yeni Analiz (New Analysis) and provide the log file path on disk plus the site URL and optional sitemap URL
- Watch real-time progress on the job detail page
- When complete, navigate the 24 reports via the sidebar; each opens with a chart on top and a filterable/sortable table below
```
.
├── bots.py # 70+ bot definitions, identify_bot, verify_bot_cached
├── analyzers.py # 8 streaming analyzer classes
├── python_log_analyzer.py # Main CLI orchestrating bots.py + analyzers.py
├── ip_bot_analyzer.py # Standalone IP × bot CLI with rate analysis
├── investigate_unknown_bots.py # Helper: profile unknown UAs and IPs
├── patch_bot_reports.py # Helper: re-classify bot reports without full rerun
├── server/ # Flask REST API + job runner
│ ├── app.py # All HTTP endpoints
│ ├── jobs.py # Subprocess + thread + meta.json persistence
│ ├── reports.py # CSV → JSON with filter/sort/paginate
│ ├── port_picker.py # Auto-port selection (8000–8019)
│ ├── requirements.txt
│ └── run.sh
└── web/ # Vite + React + TypeScript dashboard
├── package.json
├── vite.config.ts # Proxies /api → backend port from ../server/.port
├── src/
│ ├── api/{client,types}.ts
│ ├── components/ # DataTable, KpiCard, EmptyState, charts/*
│ ├── lib/format.ts # Turkish locale number/byte/date formatters
│ └── pages/ # Dashboard, Jobs, JobDetail, ReportView, Upload, reportConfigs
    └── run.sh
```
| Log size | Lines | Duration | Peak RSS | Reports |
|---|---|---|---|---|
| 11.3 GB | 31.8M | ~50 min | ~600 MB | 23 |
| 16.2 GB | 50.9M | ~2h 3min | ~2 GB | 24 |
Throughput is dominated by Python regex parsing (~30k lines/sec single-threaded). For very large logs (50 GB+), consider running on machines with at least 4 GB free RAM and a fast SSD.
Using the tool on a 16 GB Turkish e-commerce log surfaced findings that would be invisible to traditional analyzers:
- 4 million 404 errors — top URLs all matched a `//{filename}.jpg` double-slash pattern, indicating a templating bug where `<img src="/{path}">` should have been `<img src="{path}">`. Discovered in 5 minutes via the dashboard's 404 report.
- 9.1% of all traffic was bot traffic (vs the typical 2-4% for e-commerce), with `facebookexternalhit` alone accounting for 2.94 million requests — an indicator of heavy Meta ads / Instagram traffic generating Open Graph preview fetches.
- 51% of the initially "unknown" bot traffic was Meta's ad crawler (`meta-externalads`) and Google's `AdsBot-Google` — both legitimate but absent from most public bot databases. After adding them to `bots.py`, the unknown bucket shrank by 97% (296,000 → 9,000 requests) without rerunning the full 50-minute analysis (used the `patch_bot_reports.py` helper).
- ChatGPT-User: 8,619 requests — direct evidence that real users were asking ChatGPT to browse the site, providing a measurable signal of AI-driven referral traffic.
- Apache/Nginx Combined Log format only. Custom log formats require regex modification in `python_log_analyzer.py` and `ip_bot_analyzer.py` (kept in sync between both scripts).
- Single-machine, single-process. Designed for one analysis at a time on a workstation, not a distributed cluster. The `ThreadPoolExecutor(max_workers=1)` is intentional to prevent multiple 16 GB jobs from thrashing memory.
- Python stdlib regex. A C-extension parser like `regex` or rewriting the hot path in Rust/Go would yield a 5-10x throughput improvement, but the project's "no external Python dependencies" constraint is intentional for portability.
MIT — see LICENSE file (if present), otherwise feel free to use, modify, and distribute.
Bug reports, new bot definitions, and report extensions are welcome. The bot definition list in `bots.py` is the easiest place to contribute — add a new entry to `BOT_DEFINITIONS` with a real user-agent example and run `python3 bots.py` to verify the self-test passes.
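A hypothetical example of what a contribution might look like, reusing the illustrative tuple shape from the matching-engine sketch above (the real `BOT_DEFINITIONS` schema may differ):

```python
# Hypothetical entry shape; consult BOT_DEFINITIONS in bots.py for the real schema.
# New entries must sit ahead of the generic fallbacks so that the "bot" token
# does not swallow the new, more specific token.
BOT_DEFINITIONS.insert(0, ("ExampleBot", "examplebot"))

# A real user-agent sample lets the self-test (python3 bots.py) exercise it:
assert identify_bot(
    "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot)"
) == "ExampleBot"
```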