Add shared utilities for dashboard and integration tests
- Implemented utility functions in `utils.py` for formatting uptime, checking node status, and reading P2P status.
- Created an integration test suite in `integration_dashboard.py` to validate functionality across all dashboard tabs, including overview, crawl, search, network, credits, and settings.
- Added checks for UI components, data rendering, and key bindings, ensuring robust testing of the dashboard interface.
Every module, class, and function must have **one clear responsibility**.

- Extract repeated patterns (e.g., "crawl → index → optionally vector-index") into named helper functions.
- **CLI commands**: Thin wrappers that delegate to library code. No business logic in Click handlers; they should only parse arguments, call library functions, and format output.
- **MCP tool handlers**: Same as CLI; dispatch to service-layer functions, don't inline business logic.
- **Dashboard panels**: Read data from caches or public APIs. Never access private attributes (`_conn`, `_db`) of library classes.
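The thin-wrapper rule can be sketched as follows. `crawl_url_service` and the output format are illustrative stand-ins, not the project's real API; in the actual codebase the delegated function would live in the library layer.

```python
import click


# Hypothetical service-layer function; in the real project the business logic
# would live in the library (e.g. a services module), not beside the command.
def crawl_url_service(url: str) -> dict:
    return {"url": url, "pages": 3}


@click.command()
@click.argument("url")
def crawl(url: str) -> None:
    """Thin wrapper: parse arguments, delegate, format output."""
    result = crawl_url_service(url)  # no business logic in the handler itself
    click.echo(f"crawled {result['pages']} pages from {result['url']}")
```

If `crawl` ever grows conditionals about storage or indexing, that logic belongs in the service function, where the MCP handler and dashboard can reuse it.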
When reviewing code, ask: _"If I change X, what else breaks?"_ If the answer includes unrelated concerns, the code violates SRP and should be refactored.
- Import order: stdlib → third-party → local (enforced by ruff/isort).
- Prefer `pathlib.Path` over `os.path`.
### CI Failure Prevention (Lessons Learned)
The following errors have caused CI failures. **Always check for these before committing:**
| Error Code | Description | Prevention |
|------------|-------------|------------|
| **E501** | Line too long (>88 chars) | Run `ruff format .` before commit. For Click `help=` strings, use multi-line concatenation: `help=("line1 " "line2")`. For long f-strings, break into variables first. |
| **I001** | Import block unsorted | Always group imports: stdlib → third-party → local, alphabetically within each group. Run `ruff check --fix` to auto-sort. Never add `import time` below `from dataclasses import dataclass`. |
| **F541** | f-string without placeholders | Don't write `f"plain string"`; remove the `f` prefix if there are no `{…}` expressions. |
| **F841** | Local variable assigned but never used | Remove unused variables or prefix with `_` if intentionally unused (e.g., `_unused = func()`). |
| **F401** | Module imported but unused | Remove unused imports. If imported for side effects or re-export, add `# noqa: F401`. |
| **F821** | Undefined name used | Ensure all referenced names are imported or defined. Check spelling of variable names. |
**Common pitfalls:**
- Adding a new `import` at the end of an import block instead of in alphabetical order → **I001**.
- Writing Click `help="..."` strings that exceed 88 chars → **E501**. Split into `help=("part1 " "part2")`.
- Copy-pasting code with f-strings but removing the interpolated variables → **F541**.
- Forgetting to remove debug `import subprocess` or `import pdb` → **F401**.
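The E501-safe `help=` pattern relies on Python's implicit string concatenation: adjacent string literals fuse into one string at compile time. The text below is illustrative:

```python
# Implicit string concatenation: each source line stays under the 88-character
# limit, while the program sees a single one-line string at runtime.
help_text = (
    "Crawl a URL and add it to the local index. "
    "Respects robots.txt and per-domain politeness delays."
)
```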
### No Private API Access in Consumers
Library classes (`LocalStore`, `CreditLedger`, etc.) expose public methods for data access. **Never** access private attributes like `store._conn` or `ledger._conn` in CLI, MCP, or dashboard code. If a needed query doesn't have a public API, add one to the library class first.
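The pattern can be sketched with a minimal stand-in. The table schema and method body are assumptions for illustration; only `get_top_domains()` is named in this document:

```python
import sqlite3


class LocalStore:
    """Minimal stand-in for the real LocalStore; only the pattern matters."""

    def __init__(self) -> None:
        # Private connection: CLI, MCP, and dashboard code must never touch it.
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE pages (url TEXT, domain TEXT)")
        self._conn.executemany(
            "INSERT INTO pages VALUES (?, ?)",
            [
                ("https://a.example/1", "a.example"),
                ("https://a.example/2", "a.example"),
                ("https://b.example/1", "b.example"),
            ],
        )

    def get_top_domains(self, limit: int = 10) -> list[tuple[str, int]]:
        """Public accessor: consumers call this instead of poking store._conn."""
        cur = self._conn.execute(
            "SELECT domain, COUNT(*) AS n FROM pages "
            "GROUP BY domain ORDER BY n DESC LIMIT ?",
            (limit,),
        )
        return cur.fetchall()
```

A dashboard panel then calls `store.get_top_domains(5)`, and the SQL stays in one place if the schema changes.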
### Shared Utilities: No Duplication
Utility functions must exist in exactly one place:
- **Dashboard utilities** (`_format_uptime`, `_get_peer_id`, `_is_node_running`, `_read_p2p_status`): use `infomesh/dashboard/utils.py`.
- **Domain-extraction SQL**: use `LocalStore.get_top_domains()`; don't inline raw SQL in dashboard code.
- **Node status assembly** (store stats + P2P status + credit stats): use `services.py` orchestration; don't duplicate across CLI, MCP, and dashboard.
### Pre-commit Checks (Required)
Before every commit, **both lint and format checks must pass**:
If lint errors are found, fix them before committing.

- `uv run pytest`: run tests.
- `uv run infomesh start`: run the application.
### Documentation Sync (Required)
Every code change that affects **user-facing behavior, API surface, or configuration** must be accompanied by corresponding documentation updates. Do not consider a task complete until all relevant docs are updated.

- **Bilingual**: All documentation exists in both English (`docs/en/`) and Korean (`docs/ko/`). Both must be updated simultaneously.
- **copilot-instructions.md**: This file is the single source of truth for AI assistants. Keep it synchronized with the actual codebase behavior.
- **Commit message**: Use the `docs:` prefix for documentation-only changes. When a feature commit includes doc updates, use `feat:` (the docs update is part of the feature).
- **Checklist**: Before marking a task complete, verify: (1) EN docs updated, (2) KO docs updated, (3) copilot-instructions updated if applicable.
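The bilingual check can be automated. The helper below is hypothetical (not part of the project) and assumes the `docs/en/` and `docs/ko/` trees use matching filenames:

```python
from pathlib import Path


def docs_out_of_sync(root: Path) -> set[str]:
    """Return doc filenames present in only one language tree."""
    en = {p.name for p in (root / "docs" / "en").glob("*.md")}
    ko = {p.name for p in (root / "docs" / "ko").glob("*.md")}
    return en ^ ko  # symmetric difference: files missing a translation
```

An empty result means every English page has a Korean counterpart and vice versa.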
## Architecture Guidelines
### P2P / DHT

- Default politeness: ≤1 request/second per domain.
- **Crawl-Delay**: Honors the `Crawl-delay` directive in robots.txt. Per-domain delay is applied automatically and capped at 60 seconds.
- **Sitemap discovery**: Extracts `Sitemap:` URLs from robots.txt and automatically schedules discovered URLs for crawling.
- **Canonical tag**: Recognizes `<link rel="canonical">`. If a page declares a different canonical URL, the crawler skips indexing and schedules the canonical URL instead.
- **Retry with backoff**: Transient HTTP 5xx errors and network failures trigger up to 2 retries with exponential backoff (1s, 2s). SSRF-blocked URLs are never retried.
- Use `trafilatura` for content extraction. If trafilatura returns `None`, skip the page.
- Store raw text + metadata (title, URL, crawl timestamp, language).
- **Seed strategy**: Bundled curated seed lists by category (tech docs, academic, encyclopedia, etc.) + Common Crawl URL import + DHT-assigned URLs + user `crawl_url()` submissions + link following.
- **Crawl lock**: Before crawling, publish `hash(url) = CRAWLING` to DHT to prevent race conditions. Timeout after 5 minutes.
- **SPA/JS rendering**: Phase 0 focuses on static HTML. For JS-heavy pages, use `js_required` DHT tag to delegate to Playwright-capable nodes (Phase 4).
- **Bandwidth limits**: Default ≤5 Mbps upload / 10 Mbps download for P2P. Configurable via `~/.infomesh/config.toml`. Max 5 concurrent crawl connections per node.
- **`crawl_url()` rate limiting**: 60 URLs/hr per node, 10 pending URLs/domain, depth unlimited by default (0=unlimited, configurable).
- **Force re-crawl**: `crawl_url(url, force=True)` bypasses URL dedup to re-crawl previously visited pages. Useful for refreshing stale content or discovering new child links after depth limits were changed.
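The retry-with-backoff rule above can be sketched as follows. `fetch_with_retry` and its callable parameter are illustrative names, not the project's API; the 5xx set, retry count, and 1s/2s delays come from the rule itself:

```python
import time

TRANSIENT_STATUSES = {500, 502, 503, 504}


def fetch_with_retry(fetch, url, retries=2, base_delay=1.0):
    """Retry transient 5xx responses with exponential backoff (1s, 2s by default).

    `fetch` is any callable returning a (status, body) pair. SSRF checks would
    run before this function so blocked URLs are never retried.
    """
    delay = base_delay
    for attempt in range(retries + 1):
        status, body = fetch(url)
        if status not in TRANSIENT_STATUSES or attempt == retries:
            return status, body
        time.sleep(delay)  # back off before the next attempt
        delay *= 2
```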
### Indexing
The MCP server exposes these tools:

| `search(query, limit)` | Full network search, merges local + remote results |
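The merge step behind `search` can be sketched as follows. The hit schema (`url`, `score` keys) and ranking rule are assumptions for illustration:

```python
def merge_results(local, remote, limit=10):
    """Merge local and remote hits, dedupe by URL, keep the higher score."""
    best = {}
    for hit in list(local) + list(remote):
        url = hit["url"]
        if url not in best or hit["score"] > best[url]["score"]:
            best[url] = hit  # first sighting, or a better-scored duplicate
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)
    return ranked[:limit]
```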
---

**`docs/en/02-architecture.md`** (7 additions, 2 deletions)
Crawling rules:
- Always strictly respect `robots.txt`
- Default politeness: ≤1 request/second per domain
- **Crawl-Delay**: Honors the `Crawl-delay` directive in robots.txt. Per-domain delay is applied automatically and capped at 60 seconds to prevent abuse.
- **Sitemap discovery**: Extracts `Sitemap:` URLs from robots.txt and automatically schedules discovered URLs for crawling.
- **Canonical tag**: Recognizes `<link rel="canonical">` in HTML. If a page declares a different canonical URL, the crawler skips indexing the current page and schedules the canonical URL instead, preventing duplicate content in the index.
- **Retry with backoff**: Transient HTTP errors (5xx) and network failures trigger automatic retries (up to 2 retries with exponential backoff: 1s, 2s). SSRF-blocked URLs are never retried.
- Storage: raw text + metadata (title, URL, crawl timestamp, language)
- **Crawl lock**: Before crawling, publish `hash(url) = CRAWLING` to DHT to prevent multiple nodes crawling the same URL. Lock timeout: 5 minutes.
- **SPA/JS rendering**: Most content is extractable from static HTML. For JavaScript-heavy pages, a `js_required` DHT tag triggers delegation to nodes with Playwright/headless browser capability. Phase 0 (MVP) focuses on static HTML only.
- **Bandwidth limits**: Default ≤5 Mbps upload / 10 Mbps download for P2P traffic. Configurable via `~/.infomesh/config.toml`. Crawl concurrency: max 5 simultaneous connections per node (adjustable).
- **Force re-crawl**: `crawl_url(url, force=True)` bypasses URL dedup to re-crawl previously visited pages. Useful for refreshing stale content or discovering new child links after depth limits were changed.
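The crawl-lock rule can be sketched against an in-memory stand-in for the DHT. The `put`/`get` interface, key derivation, and class names here are assumptions; only the `CRAWLING` value and 5-minute timeout come from the rule above:

```python
import hashlib
import time


class InMemoryDht:
    """Tiny stand-in for the DHT put/get interface; illustrative only."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, float]] = {}

    def put(self, key: str, value: str, ttl: float) -> None:
        self._entries[key] = (value, time.monotonic() + ttl)

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None or entry[1] < time.monotonic():
            return None  # missing or expired, i.e. the lock timed out
        return entry[0]


LOCK_TTL = 300.0  # the 5-minute lock timeout from the rule above


def try_acquire_crawl_lock(dht: InMemoryDht, url: str) -> bool:
    """Claim a URL for crawling; returns False if another node holds it."""
    key = hashlib.sha256(url.encode()).hexdigest()
    if dht.get(key) == "CRAWLING":
        return False
    dht.put(key, "CRAWLING", ttl=LOCK_TTL)
    return True
```

Because the entry expires after `LOCK_TTL`, a node that crashes mid-crawl releases the URL automatically.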