Add shared utilities for dashboard and integration tests
- Implemented utility functions in `utils.py` for formatting uptime, checking node status, and reading P2P status.
- Created an integration test suite in `integration_dashboard.py` to validate functionality across all dashboard tabs, including overview, crawl, search, network, credits, and settings.
- Added checks for UI components, data rendering, and key bindings, ensuring robust testing of the dashboard interface.
Every module, class, and function must have **one clear responsibility**.

- Extract repeated patterns (e.g., "crawl → index → optionally vector-index") into named helper functions.
- **CLI commands**: Thin wrappers that delegate to library code. No business logic in Click handlers; they should only parse arguments, call library functions, and format output.
- **MCP tool handlers**: Same as CLI; dispatch to service-layer functions, don't inline business logic.
- **Dashboard panels**: Read data from caches or public APIs. Never access private attributes (`_conn`, `_db`) of library classes.
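The thin-wrapper rule can be sketched as follows. `crawl_url_service` and the output format are illustrative stand-ins, not the project's real API; in the actual codebase the delegated function would live in the library layer.

```python
import click


# Hypothetical service-layer function; in the real project the business logic
# would live in the library (e.g. a services module), not beside the command.
def crawl_url_service(url: str) -> dict:
    return {"url": url, "pages": 3}


@click.command()
@click.argument("url")
def crawl(url: str) -> None:
    """Thin wrapper: parse arguments, delegate, format output."""
    result = crawl_url_service(url)  # no business logic in the handler itself
    click.echo(f"crawled {result['pages']} pages from {result['url']}")
```

If `crawl` ever grows conditionals about storage or indexing, that logic belongs in the service function, where the MCP handler and dashboard can reuse it.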
When reviewing code, ask: _"If I change X, what else breaks?"_ If the answer includes unrelated concerns, the code violates SRP and should be refactored.
- Import order: stdlib → third-party → local (enforced by ruff/isort).
- Prefer `pathlib.Path` over `os.path`.
### CI Failure Prevention (Lessons Learned)
The following errors have caused CI failures. **Always check for these before committing:**
| Error Code | Description | Prevention |
|------------|-------------|------------|
| **E501** | Line too long (>88 chars) | Run `ruff format .` before commit. For Click `help=` strings, use multi-line concatenation: `help=("line1 " "line2")`. For long f-strings, break into variables first. |
| **I001** | Import block unsorted | Always group imports: stdlib → third-party → local, alphabetically within each group. Run `ruff check --fix` to auto-sort. Never add `import time` below `from dataclasses import dataclass`. |
| **F541** | f-string without placeholders | Don't write `f"plain string"`; remove the `f` prefix if there are no `{…}` expressions. |
| **F841** | Local variable assigned but never used | Remove unused variables or prefix with `_` if intentionally unused (e.g., `_unused = func()`). |
| **F401** | Module imported but unused | Remove unused imports. If imported for side effects or re-export, add `# noqa: F401`. |
| **F821** | Undefined name used | Ensure all referenced names are imported or defined. Check spelling of variable names. |
**Common pitfalls:**
- Adding a new `import` at the end of an import block instead of in alphabetical order → **I001**.
- Writing Click `help="..."` strings that exceed 88 chars → **E501**. Split into `help=("part1 " "part2")`.
- Copy-pasting code with f-strings but removing the interpolated variables → **F541**.
- Forgetting to remove debug `import subprocess` or `import pdb` → **F401**.
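The E501-safe `help=` pattern relies on Python's implicit string concatenation: adjacent string literals fuse into one string at compile time. The text below is illustrative:

```python
# Implicit string concatenation: each source line stays under the 88-character
# limit, while the program sees a single one-line string at runtime.
help_text = (
    "Crawl a URL and add it to the local index. "
    "Respects robots.txt and per-domain politeness delays."
)
```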
### No Private API Access in Consumers
Library classes (`LocalStore`, `CreditLedger`, etc.) expose public methods for data access. **Never** access private attributes like `store._conn` or `ledger._conn` in CLI, MCP, or dashboard code. If a needed query doesn't have a public API, add one to the library class first.
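The pattern can be sketched with a minimal stand-in. The table schema and method body are assumptions for illustration; only `get_top_domains()` is named in this document:

```python
import sqlite3


class LocalStore:
    """Minimal stand-in for the real LocalStore; only the pattern matters."""

    def __init__(self) -> None:
        # Private connection: CLI, MCP, and dashboard code must never touch it.
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE pages (url TEXT, domain TEXT)")
        self._conn.executemany(
            "INSERT INTO pages VALUES (?, ?)",
            [
                ("https://a.example/1", "a.example"),
                ("https://a.example/2", "a.example"),
                ("https://b.example/1", "b.example"),
            ],
        )

    def get_top_domains(self, limit: int = 10) -> list[tuple[str, int]]:
        """Public accessor: consumers call this instead of poking store._conn."""
        cur = self._conn.execute(
            "SELECT domain, COUNT(*) AS n FROM pages "
            "GROUP BY domain ORDER BY n DESC LIMIT ?",
            (limit,),
        )
        return cur.fetchall()
```

A dashboard panel then calls `store.get_top_domains(5)`, and the SQL stays in one place if the schema changes.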
### Shared Utilities: No Duplication
Utility functions must exist in exactly one place:
- **Dashboard utilities** (`_format_uptime`, `_get_peer_id`, `_is_node_running`, `_read_p2p_status`): use `infomesh/dashboard/utils.py`.
- **Domain-extraction SQL**: use `LocalStore.get_top_domains()`; don't inline raw SQL in dashboard code.
- **Node status assembly** (store stats + P2P status + credit stats): use `services.py` orchestration; don't duplicate across CLI, MCP, and dashboard.
### Pre-commit Checks (Required)
Before every commit, **both lint and format checks must pass**:
If lint errors are found, fix them before committing.

- `uv run pytest`: run tests.
- `uv run infomesh start`: run the application.
### Documentation Sync (Required)
Every code change that affects **user-facing behavior, API surface, or configuration** must be accompanied by corresponding documentation updates. Do not consider a task complete until all relevant docs are updated.

- **Bilingual**: All documentation exists in both English (`docs/en/`) and Korean (`docs/ko/`). Both must be updated simultaneously.
- **copilot-instructions.md**: This file is the single source of truth for AI assistants. Keep it synchronized with the actual codebase behavior.
- **Commit message**: Use the `docs:` prefix for documentation-only changes. When a feature commit includes doc updates, use `feat:` (the docs update is part of the feature).
- **Checklist**: Before marking a task complete, verify: (1) EN docs updated, (2) KO docs updated, (3) copilot-instructions updated if applicable.
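The bilingual check can be automated. The helper below is hypothetical (not part of the project) and assumes the `docs/en/` and `docs/ko/` trees use matching filenames:

```python
from pathlib import Path


def docs_out_of_sync(root: Path) -> set[str]:
    """Return doc filenames present in only one language tree."""
    en = {p.name for p in (root / "docs" / "en").glob("*.md")}
    ko = {p.name for p in (root / "docs" / "ko").glob("*.md")}
    return en ^ ko  # symmetric difference: files missing a translation
```

An empty result means every English page has a Korean counterpart and vice versa.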
## Architecture Guidelines
### P2P / DHT

- Default politeness: ≤1 request/second per domain.
- **Crawl-Delay**: Honors the `Crawl-delay` directive in robots.txt. Per-domain delay is applied automatically and capped at 60 seconds.
- **Sitemap discovery**: Extracts `Sitemap:` URLs from robots.txt and automatically schedules discovered URLs for crawling.
- **Canonical tag**: Recognizes `<link rel="canonical">`. If a page declares a different canonical URL, the crawler skips indexing and schedules the canonical URL instead.
- **Retry with backoff**: Transient HTTP 5xx errors and network failures trigger up to 2 retries with exponential backoff (1s, 2s). SSRF-blocked URLs are never retried.
- Use `trafilatura` for content extraction. If trafilatura returns `None`, skip the page.
- Store raw text + metadata (title, URL, crawl timestamp, language).
- **Seed strategy**: Bundled curated seed lists by category (tech docs, academic, encyclopedia, etc.) + Common Crawl URL import + DHT-assigned URLs + user `crawl_url()` submissions + link following.
- **Crawl lock**: Before crawling, publish `hash(url) = CRAWLING` to DHT to prevent race conditions. Timeout after 5 minutes.
- **SPA/JS rendering**: Phase 0 focuses on static HTML. For JS-heavy pages, use `js_required` DHT tag to delegate to Playwright-capable nodes (Phase 4).
- **Bandwidth limits**: Default ≤5 Mbps upload / 10 Mbps download for P2P. Configurable via `~/.infomesh/config.toml`. Max 5 concurrent crawl connections per node.
- **`crawl_url()` rate limiting**: 60 URLs/hr per node, 10 pending URLs/domain, depth unlimited by default (0=unlimited, configurable).
- **Force re-crawl**: `crawl_url(url, force=True)` bypasses URL dedup to re-crawl previously visited pages. Useful for refreshing stale content or discovering new child links after depth limits were changed.
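The retry-with-backoff rule above can be sketched as follows. `fetch_with_retry` and its callable parameter are illustrative names, not the project's API; the 5xx set, retry count, and 1s/2s delays come from the rule itself:

```python
import time

TRANSIENT_STATUSES = {500, 502, 503, 504}


def fetch_with_retry(fetch, url, retries=2, base_delay=1.0):
    """Retry transient 5xx responses with exponential backoff (1s, 2s by default).

    `fetch` is any callable returning a (status, body) pair. SSRF checks would
    run before this function so blocked URLs are never retried.
    """
    delay = base_delay
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status not in TRANSIENT_STATUSES or attempt == retries:
            return status, body
        time.sleep(delay)  # back off before the next attempt
        delay *= 2
```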
### Indexing
The MCP server exposes these tools:

| `search(query, limit)` | Full network search, merges local + remote results |
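The merge step behind `search` can be sketched as follows. The hit schema (`url`, `score` keys) and ranking rule are assumptions for illustration:

```python
def merge_results(local, remote, limit=10):
    """Merge local and remote hits, dedupe by URL, keep the higher score."""
    best = {}
    for hit in list(local) + list(remote):
        url = hit["url"]
        if url not in best or hit["score"] > best[url]["score"]:
            best[url] = hit  # first sighting, or a better-scored duplicate
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:limit]
```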
---

**`docs/en/02-architecture.md`** (7 additions, 2 deletions)
Crawling rules:
- Always strictly respect `robots.txt`
- Default politeness: ≤1 request/second per domain
- **Crawl-Delay**: Honors the `Crawl-delay` directive in robots.txt. Per-domain delay is applied automatically and capped at 60 seconds to prevent abuse.
- **Sitemap discovery**: Extracts `Sitemap:` URLs from robots.txt and automatically schedules discovered URLs for crawling.
- **Canonical tag**: Recognizes `<link rel="canonical">` in HTML. If a page declares a different canonical URL, the crawler skips indexing the current page and schedules the canonical URL instead, preventing duplicate content in the index.
- **Retry with backoff**: Transient HTTP errors (5xx) and network failures trigger automatic retries (up to 2 retries with exponential backoff: 1s, 2s). SSRF-blocked URLs are never retried.
- Storage: raw text + metadata (title, URL, crawl timestamp, language)
- **Crawl lock**: Before crawling, publish `hash(url) = CRAWLING` to DHT to prevent multiple nodes crawling the same URL. Lock timeout: 5 minutes.
- **SPA/JS rendering**: Most content is extractable from static HTML. For JavaScript-heavy pages, a `js_required` DHT tag triggers delegation to nodes with Playwright/headless browser capability. Phase 0 (MVP) focuses on static HTML only.
- **Bandwidth limits**: Default ≤5 Mbps upload / 10 Mbps download for P2P traffic. Configurable via `~/.infomesh/config.toml`. Crawl concurrency: max 5 simultaneous connections per node (adjustable).
- **Force re-crawl**: `crawl_url(url, force=True)` bypasses URL dedup to re-crawl previously visited pages. Useful for refreshing stale content or discovering new child links after depth limits were changed.
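The crawl-lock rule can be sketched against an in-memory stand-in for the DHT. The `put`/`get` interface, key derivation, and class names here are assumptions; only the `CRAWLING` value and 5-minute timeout come from the rule above:

```python
import hashlib
import time


class InMemoryDht:
    """Tiny stand-in for the DHT put/get interface; illustrative only."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, float]] = {}

    def put(self, key: str, value: str, ttl: float) -> None:
        self._entries[key] = (value, time.monotonic() + ttl)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None or entry[1] < time.monotonic():
            return None  # missing or expired, i.e. the lock timed out
        return entry[0]


LOCK_TTL = 300.0  # the 5-minute lock timeout from the rule above


def try_acquire_crawl_lock(dht: InMemoryDht, url: str) -> bool:
    """Claim a URL for crawling; returns False if another node holds it."""
    key = hashlib.sha256(url.encode()).hexdigest()
    if dht.get(key) == "CRAWLING":
        return False
    dht.put(key, "CRAWLING", ttl=LOCK_TTL)
    return True
```

Because the entry expires after `LOCK_TTL`, a node that crashes mid-crawl releases the URL automatically.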