A production-grade Apache/Nginx access log analyzer purpose-built for SEO and crawl-budget analysis on multi-gigabyte log files. Streams logs in constant memory, identifies 70+ bot families (including modern AI crawlers like GPTBot, ClaudeBot, PerplexityBot, Bytespider), runs reverse-DNS verification to detect spoofed bots, and ships with a React dashboard for filtering, sorting, and visualizing 24 distinct reports.
Tested at scale: a single 16 GB log file with 50.9 million lines completes end-to-end in roughly 2 hours on a consumer laptop, producing 24 CSV reports plus an interactive web UI — without any external Python dependencies beyond Flask for the API.
Most off-the-shelf log analyzers fall into two categories: cloud SaaS that requires uploading raw logs (privacy issues, slow, expensive at scale), or generic GUI tools that load the entire file into memory (impossible for multi-GB files) and ship with bot databases that are stuck in 2020 — missing every AI crawler that has appeared in the last three years.
This tool was built to solve a specific, real problem: understanding which bots are actually consuming your crawl budget and bandwidth at the URL/IP level, on production-scale logs, with a UI you can actually navigate. The 11 GB and 16 GB log files used during development came from real Turkish e-commerce sites (freshscarfs.com, manuka.com.tr) and surfaced concrete findings — for example, a single broken image template pattern was responsible for 4 million 404 errors on one site, and 51% of "unknown" bot traffic turned out to be Meta's ad crawler hiding under a generic bot user-agent fragment.
- **AI / LLM crawlers** — `GPTBot`, `ChatGPT-User`, `OAI-SearchBot`, `ChatGPT`, `ClaudeBot`, `anthropic-ai`, `Claude-Web`, `PerplexityBot`, `Perplexity-User`, `Google-Extended`, `Bytespider` (TikTok), `CCBot` (Common Crawl), `Meta-ExternalAgent` (Llama), `FacebookBot`, `Amazonbot`, `Applebot-Extended`, `cohere-ai`, `Diffbot`, `YouBot`, `PetalBot` (Huawei), `ImagesiftBot`, `omgilibot`, `TikTokSpider`
- **Search engines** — `Googlebot` and all sub-crawlers (`Smartphone`, `Image`, `Video`, `News`), `AdsBot-Google`, `AdsBot-Google-Mobile`, `Bingbot`, `BingPreview`, `YandexBot`, `YandexImages`, `YandexMobileBot`, `DuckDuckBot`, `Applebot`, `Baiduspider`, `Sogou`, `Naverbot`/`Yeti`, `Seznambot`, `Qwantify`, `Qwantbot`, `coccocbot`, `coccocbot-image`
- **SEO crawlers** — `AhrefsBot`, `SemrushBot`, `MJ12bot`, `DotBot`, `MajesticSEO`, `BLEXBot`, `DataForSeoBot`, `Screaming Frog SEO Spider`, `SerpstatBot`, `SiteAuditBot`
- **Social previews** — `facebookexternalhit`, `meta-externalads`, `meta-webindexer`, `Twitterbot`, `LinkedInBot`, `Pinterest`, `Slackbot`, `Discordbot`, `TelegramBot`, `WhatsApp`, `Embedly`, `Skype URI Preview`, `Snap URL Preview`, `AdBot`
- **Uptime / verification** — `UptimeRobot`, `Pingdom`, `StatusCake`, `Site24x7`, `NewRelicPinger`, `AASA-Bot` (Apple Universal Links), `AwarioBot` (social listening)
- **Generic fallbacks** — UAs containing `crawler`, `spider`, or `bot` that match no specific entry are surfaced as `Bilinmeyen Crawler`/`Spider`/`Bot` ("Bilinmeyen" is Turkish for "unknown") so you can quickly identify what's hiding in the long tail
The matching engine fixes a class of bugs common in homegrown analyzers where generic keywords like `bot` match before `googlebot`, mislabeling all major bots. Order is enforced specific-to-generic at module load time.
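To make the ordering rule concrete, here is a minimal sketch. `BOT_DEFINITIONS` and `identify_bot` are real names from `bots.py`, but the tuple shape and the load-time guard shown here are illustrative assumptions, not the actual implementation:

```python
from typing import Optional

# Illustrative data shape; the real BOT_DEFINITIONS in bots.py is far longer.
BOT_DEFINITIONS = [
    ("Googlebot", "googlebot"),        # specific entries first...
    ("AhrefsBot", "ahrefsbot"),
    ("Bilinmeyen Crawler", "crawler"),
    ("Bilinmeyen Spider", "spider"),
    ("Bilinmeyen Bot", "bot"),         # ...generic fallbacks last
]

GENERIC_TOKENS = {"bot", "crawler", "spider"}

# Module-load-time guard: a generic token must never appear before a more
# specific token that contains it ("bot" would otherwise swallow "googlebot").
for i, (_, token) in enumerate(BOT_DEFINITIONS):
    if token in GENERIC_TOKENS:
        for _, later in BOT_DEFINITIONS[i + 1:]:
            assert token not in later, f"{token!r} would shadow {later!r}"

def identify_bot(user_agent: str) -> Optional[str]:
    """Return the first (most specific) matching bot name, or None."""
    ua = user_agent.lower()
    for name, token in BOT_DEFINITIONS:
        if token in ua:
            return name
    return None
```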
Forward-confirmed PTR lookups for Googlebot, Bingbot, YandexBot, YandexMobileBot, DuckDuckBot, Applebot, and Baiduspider. Detects spoofed crawlers that mimic legitimate user-agent strings — surfaced as a `verified=False` row with the rejection reason. Cached via `functools.lru_cache` to avoid DNS storms on large logs; capped at the top-1000 most-requesting search-bot IPs for bounded runtime.
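A minimal sketch of the forward-confirmed flow (`verify_bot_cached` is the real name in `bots.py`, but the suffix table and return format here are illustrative assumptions):

```python
import socket
from functools import lru_cache

# Suffixes a PTR record must end with for the bot to count as genuine.
# Illustrative subset; the real table covers all seven verified bot families.
PTR_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
}

@lru_cache(maxsize=4096)  # each IP is resolved at most once per run
def verify_bot_cached(ip, bot_name):
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, then an
    A-record lookup of the PTR hostname must resolve back to the same IP."""
    suffixes = PTR_SUFFIXES.get(bot_name)
    if not suffixes:
        return True, "no-verification-rule"
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse (PTR) lookup
    except socket.herror:
        return False, "no-ptr-record"
    if not host.endswith(suffixes):
        return False, "ptr-outside-official-domain:" + host
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward confirmation
    except socket.gaierror:
        return False, "forward-lookup-failed"
    if ip not in addrs:
        return False, "forward-mismatch"
    return True, host
```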
| Category | Reports |
|---|---|
| Overview | Hourly time series · Hourly anomaly detection (2σ baseline; see the sketch after this table) |
| Bots | Bot summary · Categories · AI bot detail · Daily bot crawls · Mobile vs desktop split (search bots) · DNS verification results |
| Traffic | Top crawled URLs · URLs with query parameters · Per-IP request rate (peak RPS, peak RPM, sustained rate — for scraper / DoS detection) · Referer analysis |
| Errors | 404 errors · 5xx errors (separate, not bucketed) · 3xx redirects · Other 4xx (401/403/410/429) · Soft 404 candidates (HTTP 200 with suspiciously small response body) |
| Bandwidth | Per category · Top URLs · Top IPs · Per bot name · Crawl budget breakdown (HTML vs image vs script vs CSS vs font vs media vs document) |
| SEO | Sitemap URLs not crawled (uncrawled URLs) · robots.txt violations |
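The 2σ baseline in the hourly anomaly report can be read as flagging hours that deviate from the mean hourly request count by more than two standard deviations. A sketch under that assumption (not the report's actual code):

```python
from statistics import mean, stdev

def flag_anomalous_hours(hourly_counts):
    """Flag hours more than two standard deviations from the mean.
    `hourly_counts` maps 'YYYY-MM-DD HH' -> request count."""
    counts = list(hourly_counts.values())
    if len(counts) < 2:
        return []                      # stdev needs at least two data points
    mu, sigma = mean(counts), stdev(counts)
    return [
        (hour, count)
        for hour, count in sorted(hourly_counts.items())
        if sigma > 0 and abs(count - mu) > 2 * sigma
    ]
```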
The core analyzer reads logs line-by-line, dispatching each parsed record to 8 stateful analyzer classes. Memory grows only with unique URLs/IPs/referers — not with total request count — as the sketch after the numbers below illustrates. Real-world peak RSS:
- 11.3 GB / 31.8M lines → ~600 MB peak
- 16.2 GB / 50.9M lines → ~2 GB peak
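The constant-memory behaviour comes from the dispatch pattern: one line is parsed at a time and only per-key aggregates survive. A simplified sketch (the real regex lives in `python_log_analyzer.py`; the analyzer class and its `feed` method are illustrative assumptions about the interface):

```python
import re
from collections import Counter

# Simplified Combined Log Format pattern; a stand-in for the real one.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

class TopUrlAnalyzer:
    """Example of the stateful-analyzer shape: state is one Counter keyed by
    URL, so RSS tracks unique URLs rather than total line count."""
    def __init__(self):
        self.hits = Counter()

    def feed(self, record):
        self.hits[record["url"]] += 1

def run_analysis(log_path, analyzers):
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:                  # one line in memory at a time
            m = LINE_RE.match(line)
            if not m:
                continue                 # skip malformed lines rather than abort
            record = m.groupdict()
            for analyzer in analyzers:   # the real pipeline fans out to 8 of these
                analyzer.feed(record)
```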
`--sitemap` accepts either a local CSV file path or an `http(s)://` URL pointing to an XML sitemap (single `<urlset>` or `<sitemapindex>` with up to 200 child sitemaps, depth-2 max). Handles gzip-compressed responses transparently. Used to compute the "sitemap URLs not crawled by bots" report — a critical SEO metric that surfaces orphan content.
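A stdlib-only sketch of that fetch-and-parse path, consistent with the project's no-external-dependencies constraint (`load_sitemap_urls` is a hypothetical name; the actual loader lives in `python_log_analyzer.py`):

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def _fetch_xml(url):
    req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = resp.read()
    if data[:2] == b"\x1f\x8b":          # transparent gzip handling
        data = gzip.decompress(data)
    return ET.fromstring(data)

def load_sitemap_urls(url, max_children=200):
    """Return all <loc> URLs from a <urlset>, or from up to `max_children`
    child sitemaps of a <sitemapindex> (depth 2, matching the CLI's limits)."""
    root = _fetch_xml(url)
    if root.tag == NS + "sitemapindex":
        children = [loc.text.strip() for loc in root.iter(NS + "loc")]
        urls = []
        for child in children[:max_children]:
            child_root = _fetch_xml(child)   # depth 2: children must be <urlset>s
            urls.extend(loc.text.strip() for loc in child_root.iter(NS + "loc"))
        return urls
    return [loc.text.strip() for loc in root.iter(NS + "loc")]
```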
- Stack: React 18 + Vite 5 + TypeScript (strict) + Tailwind CSS 3 + TanStack Query 5 + TanStack Table 8 + Recharts + React Router 6 + Lucide-React
- Layout: Sidebar with 6 categorized navigation groups; main content area; responsive desktop-first
- Per-report features:
- Server-side pagination (100 rows/page, configurable)
- Multi-column sorting (click headers, descending toggle on second click)
- Debounced full-text search across all string columns
- Locale-aware number formatting (`1.234.567,89` Turkish style)
- URL truncation with hover-to-reveal full path
- IP columns rendered in monospace
- Interactive charts (bar, horizontal bar, line, pie) — clicking a bar applies that label as a search filter on the table below
- One-click CSV download of the original file
- Live progress streaming: Server-Sent Events feed real-time line counts and log tail to the frontend during analysis runs; no polling
- Multi-job UI: Side-by-side comparison of past analyses (different sites, different dates) without re-running
| Endpoint | Purpose |
|---|---|
| `POST /api/jobs` | Start a new analysis (JSON body with log path + site URL + sitemap, or multipart upload for small files) |
| `GET /api/jobs` | List all jobs with status, progress, duration |
| `GET /api/jobs/:id` | Job detail including grouped report listing |
| `GET /api/jobs/:id/reports/:filename` | Paginated/filterable/sortable JSON view of a CSV report |
| `GET /api/jobs/:id/reports/:filename/download` | Raw CSV download |
| `GET /api/jobs/:id/summary` | Aggregated KPIs across multiple CSVs |
| `GET /api/jobs/:id/stream` | Server-Sent Events stream for live progress |
| `DELETE /api/jobs/:id` | Delete job and its output directory |
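For example, starting a job from a script and following its progress over SSE might look like this (stdlib only; the JSON field names `log_path`, `site_url`, and `sitemap`, and the `id` key in the response, are assumptions; check `server/app.py` for the exact schema):

```python
import json
import urllib.request

BASE = "http://localhost:8000"   # whichever port run.sh picked (see server/.port)

body = json.dumps({
    "log_path": "/var/log/nginx/access.log",   # field names are illustrative
    "site_url": "https://www.example.com/",
    "sitemap": "https://www.example.com/sitemap.xml",
}).encode()

req = urllib.request.Request(
    f"{BASE}/api/jobs", data=body,
    headers={"Content-Type": "application/json"}, method="POST",
)
with urllib.request.urlopen(req) as resp:
    job = json.load(resp)

# Follow live progress: SSE payload lines are prefixed with "data: ".
with urllib.request.urlopen(f"{BASE}/api/jobs/{job['id']}/stream") as stream:
    for raw in stream:
        line = raw.decode("utf-8").strip()
        if line.startswith("data: "):
            print(line[len("data: "):])
```

Since the dashboard is built on these same endpoints, anything the UI shows can also be scripted.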
- Path traversal protection via filename allowlist
- Background job execution via `ThreadPoolExecutor(max_workers=1)` so memory-heavy 16 GB jobs don't compete
- Auto-port selection: tries 8000–8019, picks the first free port, never kills existing processes (see the sketch after this list)
- CORS enabled for any localhost origin (development convenience)
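A sketch of the auto-port idea (the real logic lives in `server/port_picker.py` and may differ in detail):

```python
import socket

def pick_free_port(start=8000, end=8019):
    """Bind-test each candidate port and return the first free one."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue      # port taken: leave the existing process running
            return port
    raise RuntimeError("no free port in 8000-8019")
```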
- `investigate_unknown_bots.py` — streams a log and clusters `Bilinmeyen Bot/Crawler/Spider` traffic by user-agent and IP, then runs reverse-DNS on the top IPs to identify what's hiding in the long tail. Useful when adding new bots to `bots.py`.
- `patch_bot_reports.py` — re-classifies bot-related reports against an updated `bots.py` without rerunning the full 24-report pipeline. Useful after extending the bot dictionary; reduces 50-minute reruns to ~37 minutes.
- `ip_bot_analyzer.py` — standalone CLI for IP × bot breakdown with rate analysis. Lighter weight than the full analyzer when only IP-level data is needed.
```bash
cd server
pip install -r requirements.txt   # flask, flask-cors — that's it
```

```bash
cd web
npm install
```

Python 3.9+ and Node 18+ recommended.
```bash
python3 python_log_analyzer.py \
  --log-file /path/to/access.log \
  --site-url https://www.example.com/ \
  --output-dir analiz_sonuclari/ \
  --sitemap https://www.example.com/sitemap.xml \
  --skip-bot-verification \
  --soft-404-threshold 512 \
  --top-urls-limit 500
```

Flags:

- `--log-file` (required) — Path to an Apache/Nginx Combined Log file
- `--site-url` — Origin used for the `robots.txt` fetch and absolute URL resolution
- `--sitemap` — CSV path (one URL per line) or XML sitemap URL (HTTP(S))
- `--skip-bot-verification` — Skip DNS lookups (recommended for first runs on large logs; verification can be re-run on demand later)
- `--soft-404-threshold` — Bytes below which an HTTP 200 response is flagged as a soft-404 candidate (default: 512)
- `--top-urls-limit` — Number of top URLs in the "most-crawled" report (default: 500)
In two terminals:
```bash
# Terminal 1 — backend
bash server/run.sh
# Picks first free port from 8000–8019; writes choice to server/.port
```

```bash
# Terminal 2 — frontend
bash web/run.sh
# Reads backend port from server/.port; Vite picks first free port from 5174 onward
```

Open the URL Vite prints (typically http://localhost:5174). From the dashboard:
- Click Yeni Analiz (New Analysis) and provide the log file path on disk plus the site URL and optional sitemap URL
- Watch real-time progress on the job detail page
- When complete, navigate the 24 reports via the sidebar; each opens with a chart on top and a filterable/sortable table below
```
.
├── bots.py # 70+ bot definitions, identify_bot, verify_bot_cached
├── analyzers.py # 8 streaming analyzer classes
├── python_log_analyzer.py # Main CLI orchestrating bots.py + analyzers.py
├── ip_bot_analyzer.py # Standalone IP × bot CLI with rate analysis
├── investigate_unknown_bots.py # Helper: profile unknown UAs and IPs
├── patch_bot_reports.py # Helper: re-classify bot reports without full rerun
├── server/ # Flask REST API + job runner
│ ├── app.py # All HTTP endpoints
│ ├── jobs.py # Subprocess + thread + meta.json persistence
│ ├── reports.py # CSV → JSON with filter/sort/paginate
│ ├── port_picker.py # Auto-port selection (8000–8019)
│ ├── requirements.txt
│ └── run.sh
└── web/ # Vite + React + TypeScript dashboard
├── package.json
├── vite.config.ts # Proxies /api → backend port from ../server/.port
├── src/
│ ├── api/{client,types}.ts
│ ├── components/ # DataTable, KpiCard, EmptyState, charts/*
│ ├── lib/format.ts # Turkish locale number/byte/date formatters
│ └── pages/ # Dashboard, Jobs, JobDetail, ReportView, Upload, reportConfigs
    └── run.sh
```
| Log size | Lines | Duration | Peak RSS | Reports |
|---|---|---|---|---|
| 11.3 GB | 31.8M | ~50 min | ~600 MB | 23 |
| 16.2 GB | 50.9M | ~2h 3min | ~2 GB | 24 |
Throughput is dominated by Python regex parsing (~30k lines/sec single-threaded). For very large logs (50 GB+), consider running on machines with at least 4 GB free RAM and a fast SSD.
Using the tool on a 16 GB Turkish e-commerce log surfaced findings that would be invisible to traditional analyzers:
- 4 million 404 errors — top URLs all matched a `//{filename}.jpg` double-slash pattern, indicating a templating bug where `<img src="/{path}">` should have been `<img src="{path}">`. Discovered in 5 minutes via the dashboard's 404 report.
- 9.1% of all traffic was bot traffic (vs the typical 2-4% for e-commerce), with `facebookexternalhit` alone accounting for 2.94 million requests — an indicator of heavy Meta ads / Instagram traffic generating Open Graph preview fetches.
- 51% of the initially "unknown" bot traffic was Meta's ad crawler (`meta-externalads`) and Google's `AdsBot-Google` — both legitimate but absent from most public bot databases. After adding them to `bots.py`, the unknown bucket shrank by 97% (296,000 → 9,000 requests) without rerunning the full 50-minute analysis (used the `patch_bot_reports.py` helper).
- ChatGPT-User: 8,619 requests — direct evidence that real users were asking ChatGPT to browse the site, providing a measurable signal of AI-driven referral traffic.
- Apache/Nginx Combined Log format only. Custom log formats require regex modification in `python_log_analyzer.py` and `ip_bot_analyzer.py` (kept in sync between both scripts).
- Single-machine, single-process. Designed for one analysis at a time on a workstation, not a distributed cluster. The `ThreadPoolExecutor(max_workers=1)` is intentional to prevent multiple 16 GB jobs from thrashing memory.
- Python stdlib regex. A C-extension parser like `regex` or rewriting the hot path in Rust/Go would yield a 5-10x throughput improvement, but the project's "no external Python dependencies" constraint is intentional for portability.
MIT — see LICENSE file (if present), otherwise feel free to use, modify, and distribute.
Bug reports, new bot definitions, and report extensions are welcome. The bot definition list in `bots.py` is the easiest place to contribute — add a new entry to `BOT_DEFINITIONS` with a real user-agent example and run `python3 bots.py` to verify the self-test passes.
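A hypothetical example of what a contribution might look like, reusing the illustrative tuple shape from the matching-engine sketch above (the real `BOT_DEFINITIONS` schema may differ):

```python
# Hypothetical entry shape; consult BOT_DEFINITIONS in bots.py for the real schema.
# New entries must sit ahead of the generic fallbacks so that the "bot" token
# does not swallow the new, more specific token.
BOT_DEFINITIONS.insert(0, ("ExampleBot", "examplebot"))

# A real user-agent sample lets the self-test (python3 bots.py) exercise it:
assert identify_bot(
    "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot)"
) == "ExampleBot"
```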