# searcher.py does not use i18n regex patterns — Japanese (and other non-English) search is degraded

## Summary

`searcher.py` does not reference the `i18n` module at all. The BM25 tokenizer uses a hardcoded regex (`\w{2,}`) and does not apply language-specific stop words or topic patterns from `i18n/*.json`. This means the Japanese stop words, quote patterns, and action patterns defined in `ja.json` are unused during search, degrading search quality for Japanese content.

## Environment

- MemPalace: v3.3.0
- Python: 3.11
- OS: Windows 11
- Palace content: ~2,700 drawers, primarily Japanese conversation logs

## Steps to Reproduce

1. Mine Japanese conversation logs with `mempalace mine <dir> --mode convos`
2. Search for a specific Japanese term: `mempalace search "レッサーパンダ"`
3. Observe that unrelated content is returned with high match scores, and that results containing the exact term are not ranked above that noise

## Expected Behavior

- BM25 should strip Japanese stop words (は, が, を, に, で, etc.) defined in `ja.json` before scoring
- Exact keyword matches in Japanese should be strongly boosted
- Language-specific regex patterns (`topic_pattern`, `action_pattern`) should be used for closet generation and search

## Actual Behavior

- `searcher.py` does not import or call anything from `mempalace.i18n`
  - Verified against the source text of `searcher.py`: `'i18n' in text` → `False`, `'load_lang' in text` → `False`
- The tokenizer `_TOKEN_RE = re.compile(r"\w{2,}", re.UNICODE)` handles Unicode but does not benefit from language-specific stop word removal
- Japanese stop words remain in the token stream, diluting BM25 IDF scores
- `i18n/__init__.py` defaults to English on import (`load_lang("en")`) and `config.py` has no `lang` property to override this
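To illustrate the tokenizer behavior described above, here is a minimal standalone sketch that applies the same pattern quoted from `searcher.py` to Japanese text. The helper name `raw_tokens` and the sample strings are hypothetical and exist only for this illustration; they are not part of MemPalace.

```python
import re

# Same pattern the issue quotes from searcher.py.
_TOKEN_RE = re.compile(r"\w{2,}", re.UNICODE)


def raw_tokens(text: str) -> list[str]:
    """Return the regex matches with no language-specific filtering."""
    return _TOKEN_RE.findall(text)


# Unsegmented Japanese: the whole run of word characters is one token,
# so the particle は stays embedded in it.
unsegmented = raw_tokens("レッサーパンダは可愛い。")

# Space-separated Japanese: single-character particles are dropped by {2,},
# but multi-character function words such as です pass through unfiltered.
spaced = raw_tokens("これ は ペン です")

print(unsegmented)
print(spaced)
```

Nothing in this path consults `ja.json`, so whatever the regex matches goes to BM25 scoring as-is.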

## Suggested Fix

1. Add a `lang` field to `config.json` (e.g., `"lang": "ja"`)
2. Have `MempalaceConfig` expose this field and call `load_lang()` during initialization
3. In `searcher.py`, use `i18n.get_regex()["stop_words"]` to filter tokens in `_tokenize()` before BM25 scoring
4. Optionally use `topic_pattern` and `action_pattern` for closet generation in the relevant modules
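Steps 1 to 3 could look roughly like the sketch below. This is not the actual MemPalace code: `read_lang` and `tokenize` are hypothetical names, the stop words are passed in as a plain iterable because the issue does not specify the shape of `i18n.get_regex()["stop_words"]`, and the sample stop words are illustrative rather than taken from `ja.json`. Lowercasing is also an assumption.

```python
import re
from typing import Iterable

# Same pattern the issue quotes from searcher.py.
_TOKEN_RE = re.compile(r"\w{2,}", re.UNICODE)


def read_lang(config: dict, default: str = "en") -> str:
    """Return the configured language code, falling back to English.

    `config` stands in for the parsed contents of config.json (assumption).
    """
    lang = config.get("lang", default)
    return lang if isinstance(lang, str) and lang else default


def tokenize(text: str, stop_words: Iterable[str] = ()) -> list[str]:
    """Tokenize with the existing regex, then drop language stop words."""
    stops = {w.lower() for w in stop_words}
    return [t for t in _TOKEN_RE.findall(text.lower()) if t not in stops]


# Hypothetical usage: in the real fix the stop words would come from the
# i18n data for read_lang(config), not from a literal list.
lang = read_lang({"lang": "ja"})
tokens = tokenize("これ は ペン です", stop_words=["これ", "です"])
print(lang, tokens)
```

Passing the stop words in as an argument keeps the tokenizer independent of how the `i18n` module is loaded, so the same function can be tested for each supported language.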

## Additional Context

The `i18n/ja.json` file is well-structured and contains useful patterns. The infrastructure is all there — it just needs to be wired into the search pipeline. This would also benefit the other 7 supported languages (fr, ko, es, de, zh-CN, zh-TW, en).
