TCG Website Health Checker

一個專業的政府網站健康度檢測工具，提供全面的網站內容更新時間新舊程度分析、內外部連結有效性檢查。

💼 本專案於臺北市政府資訊局服替代役期間開發

✨ 主要功能

🕷️ 智能深度爬蟲：自動發現 sitemap 保留原始網站結構，支援多層級網站爬取
📅 智能日期識別：支援民國年、中英日期等多種格式
🔗 連結有效性檢查：檢測內部頁面和外部連結狀態
📊 專業報告生成：輸出 JSON 摘要和詳細 Excel 統計報告
🚀 高效能處理：支援多進程並行處理和記憶體自動管理
🎯 重複檢測：自動跳過重複頁面和分頁列表
📧 自動郵寄報告：完成後自動打包並發送結果
☁️ 雲端部署支援：支援 GCP 自動化部署與關機

🚀 快速開始

安裝依賴

# 克隆專案
git clone https://github.com/dayoxiao/TCGweb-health-checker.git
cd TCGweb-health-checker

# 安裝 Python 依賴
pip install -r requirements.txt

# 安裝 Playwright 瀏覽器
playwright install chromium

設定網站清單

編輯 config/websites.csv 檔案，支援進階配置：

URL,name,depth,save_html,pagination
https://example.com,範例網站,,,
https://another-site.com,另一個網站,2,TRUE,FALSE
https://special-site.org,特殊網站,1,FALSE,TRUE

欄位說明：

URL: 網站的完整網址（必要）
name: 網站的名稱（選擇性，用於建立資料夾名稱）
depth: 該網站的爬取深度（選擇性，空白使用全域設定）
save_html: 是否儲存HTML檔案（TRUE/FALSE，空白使用全域設定）
pagination: 是否啟用分頁爬取（TRUE/FALSE，空白使用全域設定）

📋 使用參數

參數	說明	預設值
`--depth`	爬蟲最大深度	2
`--config`	網站設定檔路徑	`config/websites.csv`
`--concurrent`	並行處理數量	2
`--no-save-html`	不儲存 HTML 檔案（提升效能）	False
`--no-pagination`	禁用分頁爬取	False
`--max-mem-mb`	subprocess記憶體上限 (MB)	1024

執行檢測

# 本地執行基本檢測（預設深度2層、存html、遇到分頁列表會自動往下爬）
python main.py

# 快速檢測（僅首頁和 sitemap 往下一層）
python main.py --depth 1 --concurrent 2 --no-save-html --no-pagination

# 通用大規模檢測（往下三層，不存html只爬沒有分頁參數的列表第一頁）
python main.py --depth 3 --concurrent 2 --no-save-html --no-pagination

# 雲端執行版本（有自動關機gcp vm內容）
python gcp_main_mpfast.py          # pool管理的簡單模式 (每爬完一個網站重啟一個新的subprocess)
python gcp_main_mpselfqueue.py     # 自管理模式 (按max-mem-mb監測每一個subprocess，超過重啟)

📊 輸出報告

1. Excel 統計報告

位置：output/website_summary_report_YYYY-MM.xlsx

包含以下統計資訊：

網站基本資訊（名稱、URL）
頁面統計（總數、最後更新日期...）
內容時效性（一年前內容數量和比例）
連結狀態（失效內部頁面、失效外部連結數...）
執行資訊（爬取耗時、爬取日期）

2. 詳細頁面摘要檔案

位置：assets/{網站名稱}/page_summary.json

{
  "page_summary": {
    "https://example.com/": {
      "title": "範例網站首頁",
      "last_updated": "2024-09-24",
      "filepath": "[未儲存] 範例網站首頁.html",
      "status": 200,
      "depth": 0,
      "source_page": null
    },
    "https://example.com/about": {
      "title": "關於我們",
      "last_updated": "2024-08-15",
      "filepath": "[未儲存] 關於我們.html",
      "status": 200,
      "depth": 1,
      "source_page": {
        "title": "範例網站首頁",
        "url": "https://example.com/"
      }
    }
  },
  "external_links": {
    "https://external-site.com": {
      "status": 200,
      "source_page": {
        "title": "關於我們",
        "url": "https://example.com/about"
      }
    }
  }
}

3. 錯誤連結報告

自動生成：

error_pages.csv - 失效內部頁面
error_external_links.csv - 失效外部連結

🏗️ 系統架構

├── main.py                    # 本地執行主程式
├── gcp_main_*.py              # 雲端部署版本
├── crawler/
│   └── web_crawler.py         # 網站爬蟲引擎
├── analyzer/
│   └── date_extraction.py     # 智能日期識別
├── reporter/
│   └── report_generation*.py  # 報告生成系統
├── utils/
│   ├── extract_problematic_links.py  # 錯誤連結提取
│   ├── log_writer.py          # 日誌記錄工具
│   └── email_reporter.py      # 自動郵寄功能
├── config/
│   └── websites.csv           # 網站設定檔
├── assets/                    # 爬取資料儲存
└── output/                    # 報告輸出

系統需求

Python 3.8+
充足的磁碟空間（至少20MB）
穩定的網路連線
記憶體：建議每個process預留 4GB 以上空間供執行

☁️ 雲端部署

支援 Google Cloud Platform 自動化部署：

# 設定環境
bash setup-environment.sh

# 啟動雲端執行（含自動關機）
bash run-crawler-fast.sh      # 簡單模式
bash run-crawler-selfqueue.sh # 自管理模式

完全自動化部署可使用 startup script 放置在 vm 開機指令：

startup-script-fast.sh / startup-script-selfqueue.sh
VM 啟動時自動調用上述腳本執行環境安裝 → 程式執行 → 關機

📋 專案資訊

版本: 1.1.0
最後更新: 2025年
維護者: dayoxiao

⭐ 如果這個專案對你有幫助，請給個星星支持一下！

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TCG Website Health Checker

✨ 主要功能

🚀 快速開始

安裝依賴

設定網站清單

📋 使用參數

執行檢測

📊 輸出報告

1. Excel 統計報告

2. 詳細頁面摘要檔案

3. 錯誤連結報告

🏗️ 系統架構

系統需求

☁️ 雲端部署

📋 專案資訊

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 27 Commits
analyzer		analyzer
config		config
crawler		crawler
reporter		reporter
utils		utils
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
gcp_main.py		gcp_main.py
gcp_main_mpfast.py		gcp_main_mpfast.py
gcp_main_mpselfqueue.py		gcp_main_mpselfqueue.py
main.py		main.py
requirements.txt		requirements.txt
run-crawler-fast.sh		run-crawler-fast.sh
run-crawler-selfqueue.sh		run-crawler-selfqueue.sh
run_background_nohup.sh		run_background_nohup.sh
setup-environment.sh		setup-environment.sh
startup-script-fast.sh		startup-script-fast.sh
startup-script-selfqueue.sh		startup-script-selfqueue.sh

Folders and files

Latest commit

History

Repository files navigation

TCG Website Health Checker

✨ 主要功能

🚀 快速開始

安裝依賴

設定網站清單

📋 使用參數

執行檢測

📊 輸出報告

1. Excel 統計報告

2. 詳細頁面摘要檔案

3. 錯誤連結報告

🏗️ 系統架構

系統需求

☁️ 雲端部署

📋 專案資訊

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages