Commit 7b6a77b

Add shared utilities for dashboard and integration tests
- Implemented utility functions in `utils.py` for formatting uptime, checking node status, and reading P2P status.
- Created an integration test suite in `integration_dashboard.py` to validate functionality across all dashboard tabs, including overview, crawl, search, network, credits, and settings.
- Added checks for UI components, data rendering, and key bindings, ensuring robust testing of the dashboard interface.
1 parent: 5b9a78c

64 files changed: 3,153 additions & 765 deletions


.github/copilot-instructions.md

Lines changed: 60 additions & 3 deletions
```diff
@@ -81,7 +81,7 @@ infomesh/
 │  │  ├── worker.py      # Async crawl workers
 │  │  ├── scheduler.py   # URL assignment (DHT-based)
 │  │  ├── parser.py      # HTML → text extraction
-│  │  ├── robots.py      # robots.txt compliance
+│  │  ├── robots.py      # robots.txt compliance + sitemap + crawl-delay
 │  │  ├── dedup.py       # Deduplication pipeline (URL, SHA-256, SimHash)
 │  │  ├── simhash.py     # SimHash near-duplicate detection
 │  │  ├── seeds.py       # Seed URL management & category selection
```
```diff
@@ -184,6 +184,7 @@ Every module, class, and function must have **one clear responsibility**.
 - Extract repeated patterns (e.g., "crawl → index → optionally vector-index") into named helper functions.
 - **CLI commands**: Thin wrappers that delegate to library code. No business logic in Click handlers — they should only parse arguments, call library functions, and format output.
 - **MCP tool handlers**: Same as CLI — dispatch to service-layer functions, don't inline business logic.
+- **Dashboard panels**: Read data from caches or public APIs. Never access private attributes (`_conn`, `_db`) of library classes.
 
 When reviewing code, ask: _"If I change X, what else breaks?"_ If the answer includes unrelated concerns, the code violates SRP and should be refactored.
 
```
```diff
@@ -194,6 +195,36 @@ When reviewing code, ask: _"If I change X, what else breaks?"_ If the answer inc
 - Import order: stdlib → third-party → local (enforced by ruff/isort).
 - Prefer `pathlib.Path` over `os.path`.
 
+### CI Failure Prevention (Lessons Learned)
+
+The following errors have caused CI failures. **Always check for these before committing:**
+
+| Error Code | Description | Prevention |
+|------------|-------------|------------|
+| **E501** | Line too long (>88 chars) | Run `ruff format .` before commit. For Click `help=` strings, use multi-line concatenation: `help=("line1 " "line2")`. For long f-strings, break into variables first. |
+| **I001** | Import block unsorted | Always group imports: stdlib → third-party → local, alphabetically within each group. Run `ruff check --fix` to auto-sort. Never add `import time` below `from dataclasses import dataclass`. |
+| **F541** | f-string without placeholders | Don't write `f"plain string"` — remove the `f` prefix if there are no `{…}` expressions. |
+| **F841** | Local variable assigned but never used | Remove unused variables or prefix with `_` if intentionally unused (e.g., `_unused = func()`). |
+| **F401** | Module imported but unused | Remove unused imports. If imported for side effects or re-export, add `# noqa: F401`. |
+| **F821** | Undefined name used | Ensure all referenced names are imported or defined. Check spelling of variable names. |
+
+**Common pitfalls:**
+- Adding a new `import` at the end of an import block instead of in alphabetical order → **I001**.
+- Writing Click `help="..."` strings that exceed 88 chars → **E501**. Split into `help=("part1 " "part2")`.
+- Copy-pasting code with f-strings but removing the interpolated variables → **F541**.
+- Forgetting to remove debug `import subprocess` or `import pdb` → **F401**.
+
+### No Private API Access in Consumers
+
+Library classes (`LocalStore`, `CreditLedger`, etc.) expose public methods for data access. **Never** access private attributes like `store._conn` or `ledger._conn` in CLI, MCP, or dashboard code. If a needed query doesn't have a public API, add one to the library class first.
+
+### Shared Utilities — No Duplication
+
+Utility functions must exist in exactly one place:
+- **Dashboard utilities** (`_format_uptime`, `_get_peer_id`, `_is_node_running`, `_read_p2p_status`): use `infomesh/dashboard/utils.py`.
+- **Domain-extraction SQL**: use `LocalStore.get_top_domains()` — don't inline raw SQL in dashboard code.
+- **Node status assembly** (store stats + P2P status + credit stats): use `services.py` orchestration — don't duplicate across CLI, MCP, and dashboard.
+
 ### Pre-commit Checks (Required)
 
 Before every commit, **both lint and format checks must pass**:
```
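The E501 and F541 fixes the table prescribes can be sketched in plain Python — the names below are illustrative, not taken from the repo:

```python
# E501 fix: implicit string concatenation keeps every source line under the
# 88-character limit while still producing one help string at compile time.
CRAWL_HELP = (
    "Add a URL to the network and crawl it. "
    "Depth 0 means unlimited; rate limits and dedup control breadth."
)

def describe_crawl(depth: int) -> str:
    # F541 fix: a plain string when there is nothing to interpolate...
    header = "starting crawl"          # not f"starting crawl"
    # ...and an f-string only where a {...} placeholder actually appears.
    return f"{header} (depth={depth})"
```

In a real Click command the concatenated constant would simply be passed as `help=CRAWL_HELP`.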
```diff
@@ -259,6 +290,27 @@ If lint errors are found, fix them before committing:
 - `uv run pytest` — run tests.
 - `uv run infomesh start` — run the application.
 
+### Documentation Sync (Required)
+
+Every code change that affects **user-facing behavior, API surface, or configuration** must be accompanied by corresponding documentation updates. Do not consider a task complete until all relevant docs are updated.
+
+**Mandatory update targets:**
+
+| Change Type | Docs to Update |
+|-------------|----------------|
+| New feature / behavior change | `docs/en/` + `docs/ko/` (relevant section), `.github/copilot-instructions.md` |
+| MCP tool schema change (params, output) | `docs/en/10-mcp-integration.md` + `docs/ko/10-mcp-integration.md`, copilot-instructions MCP Tools table |
+| CLI flag / command change | `docs/en/` + `docs/ko/` (relevant section), `README.md` if applicable |
+| Config option change | `docs/en/` + `docs/ko/` (relevant section), copilot-instructions |
+| Credit system / trust change | `docs/en/03-credit-system.md` + `docs/ko/03-credit-system.md`, copilot-instructions |
+| Architecture / protocol change | `docs/en/02-architecture.md` + `docs/ko/02-architecture.md`, copilot-instructions |
+
+**Rules:**
+- **Bilingual**: All documentation exists in both English (`docs/en/`) and Korean (`docs/ko/`). Both must be updated simultaneously.
+- **copilot-instructions.md**: This file is the single source of truth for AI assistants. Keep it synchronized with the actual codebase behavior.
+- **Commit message**: Use the `docs:` prefix for documentation-only changes. When a feature commit includes doc updates, use `feat:` (the docs update is part of the feature).
+- **Checklist**: Before marking a task complete, verify: (1) EN docs updated, (2) KO docs updated, (3) copilot-instructions updated if applicable.
+
 ## Architecture Guidelines
 
 ### P2P / DHT
```
```diff
@@ -273,14 +325,19 @@ If lint errors are found, fix them before committing:
 
 - Always respect `robots.txt` — implement strict opt-out compliance.
 - Default politeness: ≤1 request/second per domain.
+- **Crawl-Delay**: Honors the `Crawl-delay` directive in robots.txt. Per-domain delay is applied automatically and capped at 60 seconds.
+- **Sitemap discovery**: Extracts `Sitemap:` URLs from robots.txt and automatically schedules discovered URLs for crawling.
+- **Canonical tag**: Recognizes `<link rel="canonical">`. If a page declares a different canonical URL, the crawler skips indexing and schedules the canonical URL instead.
+- **Retry with backoff**: Transient HTTP 5xx errors and network failures trigger up to 2 retries with exponential backoff (1s, 2s). SSRF-blocked URLs are never retried.
 - Use `trafilatura` for content extraction. If trafilatura returns `None`, skip the page.
 - Store raw text + metadata (title, URL, crawl timestamp, language).
 - **Seed strategy**: Bundled curated seed lists by category (tech docs, academic, encyclopedia, etc.) + Common Crawl URL import + DHT-assigned URLs + user `crawl_url()` submissions + link following.
 - **Deduplication**: 3-layer approach — URL normalization (canonical), exact dedup (SHA-256 content hash on DHT), near-dedup (SimHash, Hamming distance ≤ 3).
 - **Crawl lock**: Before crawling, publish `hash(url) = CRAWLING` to DHT to prevent race conditions. Timeout after 5 minutes.
 - **SPA/JS rendering**: Phase 0 focuses on static HTML. For JS-heavy pages, use `js_required` DHT tag to delegate to Playwright-capable nodes (Phase 4).
 - **Bandwidth limits**: Default ≤5 Mbps upload / 10 Mbps download for P2P. Configurable via `~/.infomesh/config.toml`. Max 5 concurrent crawl connections per node.
-- **`crawl_url()` rate limiting**: 60 URLs/hr per node, 10 pending URLs/domain, max depth=3.
+- **`crawl_url()` rate limiting**: 60 URLs/hr per node, 10 pending URLs/domain, depth unlimited by default (0=unlimited, configurable).
+- **Force re-crawl**: `crawl_url(url, force=True)` bypasses URL dedup to re-crawl previously visited pages. Useful for refreshing stale content or discovering new child links after depth limits were changed.
 
 ### Indexing
 
```
```diff
@@ -306,7 +363,7 @@ The MCP server exposes these tools:
 | `search(query, limit)` | Full network search, merges local + remote results |
 | `search_local(query, limit)` | Local-only search (works offline) |
 | `fetch_page(url)` | Return full text for a URL (from index or live crawl) |
-| `crawl_url(url, depth)` | Add a URL to the network and crawl it |
+| `crawl_url(url, depth, force)` | Add a URL to the network and crawl it. `force=True` bypasses dedup. |
 | `network_stats()` | Network status: peer count, index size, credits |
 
 ### Local LLM Summarization
```
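Per the thin-wrapper rule in this file, the updated `crawl_url` tool might be dispatched roughly like this. All names here are hypothetical; the real handler delegates to the repo's service layer:

```python
def handle_crawl_url(services, params: dict) -> dict:
    """Thin MCP wrapper: parse params, delegate, format output.

    `services` stands in for the service-layer object; no business logic
    lives here, matching the CLI/MCP handler guideline.
    """
    url = params["url"]
    depth = int(params.get("depth", 0))        # 0 = unlimited
    force = bool(params.get("force", False))   # bypass URL dedup
    result = services.crawl_url(url, depth=depth, force=force)
    return {"url": url, "force": force, **result}
```

Because the handler only forwards arguments, it can be tested with a fake service object and no network at all.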

.github/release.yml

Lines changed: 53 additions & 0 deletions
```diff
@@ -0,0 +1,53 @@
+# GitHub Releases — auto-generated release notes configuration
+# https://docs.github.com/en/repositories/releasing-projects-on-github/automatically-generated-release-notes
+
+changelog:
+  exclude:
+    labels:
+      - skip-changelog
+    authors:
+      - dependabot
+      - github-actions[bot]
+
+  categories:
+    - title: "🚀 New Features"
+      labels:
+        - enhancement
+        - feature
+        - feat
+
+    - title: "🐛 Bug Fixes"
+      labels:
+        - bug
+        - fix
+        - bugfix
+
+    - title: "📖 Documentation"
+      labels:
+        - documentation
+        - docs
+
+    - title: "⚡ Performance"
+      labels:
+        - performance
+        - perf
+
+    - title: "🔒 Security"
+      labels:
+        - security
+
+    - title: "🧪 Tests"
+      labels:
+        - test
+        - tests
+
+    - title: "🏗️ Infrastructure & CI"
+      labels:
+        - ci
+        - infrastructure
+        - build
+        - chore
+
+    - title: "🔄 Other Changes"
+      labels:
+        - "*"
```
.github/workflows/auto-release.yml

Lines changed: 80 additions & 5 deletions
```diff
@@ -93,6 +93,12 @@ jobs:
           echo "Updated pyproject.toml:"
           head -5 pyproject.toml
 
+      - name: Update version in __init__.py
+        run: |
+          sed -i 's/^__version__ = ".*"/__version__ = "${{ steps.version.outputs.version }}"/' infomesh/__init__.py
+          echo "Updated __init__.py:"
+          grep __version__ infomesh/__init__.py
+
       - name: Build package
         run: uv build
 
```
```diff
@@ -106,19 +112,88 @@
         run: |
           git config user.name "github-actions[bot]"
           git config user.email "github-actions[bot]@users.noreply.github.com"
-          git add pyproject.toml
+          git add pyproject.toml infomesh/__init__.py
           git commit -m "chore: release v${{ steps.version.outputs.version }} [auto-release]"
           git tag "${{ steps.version.outputs.tag }}"
           git push origin main --follow-tags
 
+      - name: Generate release notes from commits
+        id: notes
+        run: |
+          # Get previous tag
+          prev_tag=$(git describe --tags --abbrev=0 HEAD^ 2>/dev/null || echo "")
+
+          if [ -z "$prev_tag" ]; then
+            range="HEAD"
+          else
+            range="${prev_tag}..HEAD"
+          fi
+
+          # Categorize commits by conventional-commit prefix
+          features=""
+          fixes=""
+          docs=""
+          others=""
+
+          while IFS= read -r line; do
+            # Skip auto-release commits and empty lines
+            if [ -z "$line" ] || echo "$line" | grep -q '\[auto-release\]'; then
+              continue
+            fi
+            # Strip conventional-commit prefix: feat(scope): msg → msg
+            if echo "$line" | grep -qi '^feat'; then
+              msg=$(echo "$line" | sed -E 's/^feat([(][^)]*[)])?[[:space:]]*:[[:space:]]*//')
+              features="${features}- ${msg}\n"
+            elif echo "$line" | grep -qi '^fix'; then
+              msg=$(echo "$line" | sed -E 's/^fix([(][^)]*[)])?[[:space:]]*:[[:space:]]*//')
+              fixes="${fixes}- ${msg}\n"
+            elif echo "$line" | grep -qi '^docs'; then
+              msg=$(echo "$line" | sed -E 's/^docs([(][^)]*[)])?[[:space:]]*:[[:space:]]*//')
+              docs="${docs}- ${msg}\n"
+            else
+              others="${others}- ${line}\n"
+            fi
+          done <<< "$(git log "${range}" --pretty=format:'%s' --no-merges)"
+
+          # Build release body
+          body=""
+          if [ -n "$features" ]; then
+            body="${body}## 🚀 New Features\n${features}\n"
+          fi
+          if [ -n "$fixes" ]; then
+            body="${body}## 🐛 Bug Fixes\n${fixes}\n"
+          fi
+          if [ -n "$docs" ]; then
+            body="${body}## 📖 Documentation\n${docs}\n"
+          fi
+          if [ -n "$others" ]; then
+            body="${body}## 🔄 Other Changes\n${others}\n"
+          fi
+
+          # If no categorized commits, fall back to auto-generated
+          if [ -z "$body" ]; then
+            echo "use_generate=true" >> "$GITHUB_OUTPUT"
+          else
+            # Write to file (handles multiline safely)
+            printf '%b' "$body" > /tmp/release-notes.md
+            echo "use_generate=false" >> "$GITHUB_OUTPUT"
+          fi
+
       - name: Create GitHub Release
         env:
           GH_TOKEN: ${{ github.token }}
         run: |
-          gh release create "${{ steps.version.outputs.tag }}" \
-            dist/* \
-            --title "v${{ steps.version.outputs.version }}" \
-            --generate-notes
+          if [ "${{ steps.notes.outputs.use_generate }}" = "true" ]; then
+            gh release create "${{ steps.version.outputs.tag }}" \
+              dist/* \
+              --title "v${{ steps.version.outputs.version }}" \
+              --generate-notes
+          else
+            gh release create "${{ steps.version.outputs.tag }}" \
+              dist/* \
+              --title "v${{ steps.version.outputs.version }}" \
+              --notes-file /tmp/release-notes.md
+          fi
 
       - name: Publish to PyPI
         uses: pypa/gh-action-pypi-publish@release/v1
```
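For readers less fluent in sed, the prefix-stripping step in the workflow is equivalent to roughly this Python. It is illustrative only (the workflow runs the sed expression itself), and edge cases such as a subject like `feature: ...` may be categorized differently by the shell's looser `grep -qi '^feat'` match:

```python
import re

# Matches a conventional-commit prefix such as "feat:", "fix(scope): ",
# or "docs : " at the start of a commit subject, mirroring the sed
# expression in the workflow step.
PREFIX = re.compile(r"^(feat|fix|docs)(\([^)]*\))?\s*:\s*", re.IGNORECASE)

def strip_prefix(subject: str) -> str:
    # Subjects without a recognized prefix are returned unchanged,
    # matching the workflow's "Other Changes" fallback.
    return PREFIX.sub("", subject, count=1)
```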

docs/en/02-architecture.md

Lines changed: 7 additions & 2 deletions
```diff
@@ -101,11 +101,16 @@ Workload distribution:
 Crawling rules:
 - Always strictly respect `robots.txt`
 - Default politeness: ≤1 request/second per domain
+- **Crawl-Delay**: Honors the `Crawl-delay` directive in robots.txt. Per-domain delay is applied automatically and capped at 60 seconds to prevent abuse.
+- **Sitemap discovery**: Extracts `Sitemap:` URLs from robots.txt and automatically schedules discovered URLs for crawling.
+- **Canonical tag**: Recognizes `<link rel="canonical">` in HTML. If a page declares a different canonical URL, the crawler skips indexing the current page and schedules the canonical URL instead — preventing duplicate content in the index.
+- **Retry with backoff**: Transient HTTP errors (5xx) and network failures trigger automatic retries (up to 2 retries with exponential backoff: 1s, 2s). SSRF-blocked URLs are never retried.
 - Content extraction: `trafilatura` primary, `BeautifulSoup` fallback
 - Storage: raw text + metadata (title, URL, crawl timestamp, language)
 - **Crawl lock**: Before crawling, publish `hash(url) = CRAWLING` to DHT to prevent multiple nodes crawling the same URL. Lock timeout: 5 minutes.
 - **SPA/JS rendering**: Most content is extractable from static HTML. For JavaScript-heavy pages, a `js_required` DHT tag triggers delegation to nodes with Playwright/headless browser capability. Phase 0 (MVP) focuses on static HTML only.
 - **Bandwidth limits**: Default ≤5 Mbps upload / 10 Mbps download for P2P traffic. Configurable via `~/.infomesh/config.toml`. Crawl concurrency: max 5 simultaneous connections per node (adjustable).
+- **Force re-crawl**: `crawl_url(url, force=True)` bypasses URL dedup to re-crawl previously visited pages. Useful for refreshing stale content or discovering new child links after depth limits were changed.
 
 ---
 
@@ -288,7 +293,7 @@ log_level = "info"          # debug, info, warning, error
 [crawl]
 max_concurrent = 5          # simultaneous HTTP connections
 politeness_delay = 1.0      # seconds between requests to same domain
-max_depth = 3               # link-following depth limit
+max_depth = 0               # 0 = unlimited (rate limits & dedup control breadth)
 
 [network]
 upload_limit_mbps = 5       # P2P upload bandwidth cap
@@ -331,7 +336,7 @@ infomesh search --local "query"   # Local-only search
 
 # Management
 infomesh config show              # Display current configuration
-infomesh config set crawl.max_depth 5
+infomesh config set crawl.max_depth 10   # set hard depth limit (0=unlimited)
 infomesh keys export              # Export keys for backup
 infomesh keys rotate              # Rotate node identity key
 
```
docs/en/09-console-dashboard.md

Lines changed: 4 additions & 1 deletion
````diff
@@ -24,7 +24,8 @@ mobile terminal apps (Termux, Blink, etc.), and low-spec server environments.
 │  │ State:    🟢 Running  │  │ RAM:  ██████░░░░  62%  │  │
 │  │ Uptime:   3d 14h 22m  │  │ Disk: ████████░░  81%  │  │
 │  │ Version:  0.1.0       │  │ Net↑: 2.1/5.0 Mbps     │  │
-│  │ Data dir: ~/.info...  │  │ Net↓: 4.3/10.0 Mbps    │  │
+│  │ GitHub:   user@e...   │  │ Net↓: 4.3/10.0 Mbps    │  │
+│  │ Data dir: ~/.info...  │  │                        │  │
 │  └───────────────────────┘  └────────────────────────┘  │
 │                                                         │
 │  ┌─ Activity (last 1h) ──────────────────────────────┐  │
@@ -44,6 +45,8 @@ mobile terminal apps (Termux, Blink, etc.), and low-spec server environments.
 ```
 
 > **Implementation Notes**: NodeInfoPanel shows Data dir instead of Peers.
+> GitHub email is auto-detected from `git config user.email` and shown if available;
+> displayed as `not connected` otherwise. The value is resolved once and cached.
 > ResourcePanel displays CPU/RAM when `psutil` is installed, N/A otherwise.
 > Resource bar colors auto-switch based on usage (≥90% red, ≥70% yellow).
 
````

docs/en/10-mcp-integration.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@ so your AI assistant can search the web through your own decentralized index.
 | `search` | Search the P2P network (local + distributed) | `query` (string), `limit` (int, default 10) |
 | `search_local` | Search local index only (works offline) | `query` (string), `limit` (int, default 10) |
 | `fetch_page` | Fetch full text of a URL (cached or live) | `url` (string) |
-| `crawl_url` | Crawl a URL and add to the index | `url` (string), `depth` (int, default 0, max 3) |
+| `crawl_url` | Crawl a URL and add to the index | `url` (string), `depth` (int, default 0), `force` (bool, default false) |
 | `network_stats` | Node status: index size, peers, credits | *(none)* |
 
 ---
```
