searcher.py does not use i18n regex patterns — Japanese (and other non-English) search is degraded
Summary
searcher.py does not reference the i18n module at all. The BM25 tokenizer uses a hardcoded regex (\w{2,}) and does not apply language-specific stop words or topic patterns from i18n/*.json. This means the Japanese stop words, quote patterns, and action patterns defined in ja.json are unused during search, degrading search quality for Japanese content.
Environment
- MemPalace: v3.3.0
- Python: 3.11
- OS: Windows 11
- Palace content: ~2,700 drawers, primarily Japanese conversation logs
Steps to Reproduce
- Mine Japanese conversation logs with mempalace mine <dir> --mode convos
- Search for a specific Japanese term: mempalace search "レッサーパンダ"
- Observe that unrelated content receives high match scores, so relevant results (those containing the exact term) are ranked alongside noise
Expected Behavior
- BM25 should strip Japanese stop words (は, が, を, に, で, etc.) defined in ja.json before scoring
- Exact keyword matches in Japanese should be strongly boosted
- Language-specific regex patterns (topic_pattern, action_pattern) should be used for closet generation and search
Actual Behavior
- searcher.py does not import or call anything from mempalace.i18n
  - Verified with: 'i18n' in text → False, 'load_lang' in text → False
- The tokenizer _TOKEN_RE = re.compile(r"\w{2,}", re.UNICODE) handles Unicode but does not benefit from language-specific stop word removal
  - Japanese stop words remain in the token stream, diluting BM25 IDF scores
- i18n/__init__.py defaults to English on import (load_lang("en")) and config.py has no lang property to override this
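To illustrate the tokenizer behavior described above, here is a minimal sketch that applies the same regex to English and Japanese input. The tokenize() helper is a hypothetical stand-in, not the actual function in searcher.py, and the sample sentences are made up for the example.

```python
import re

# Same pattern the issue quotes from searcher.py.
_TOKEN_RE = re.compile(r"\w{2,}", re.UNICODE)

def tokenize(text: str) -> list[str]:
    """Hypothetical helper: return runs of two or more word characters."""
    return _TOKEN_RE.findall(text)

# English: stop words such as "The" and "is" survive as separate tokens.
english = tokenize("The red panda is cute")

# Japanese is written without spaces, so the particle は stays fused
# to its neighbours and the whole phrase becomes a single token.
japanese = tokenize("レッサーパンダは可愛い")

print(english)   # ['The', 'red', 'panda', 'is', 'cute']
print(japanese)  # ['レッサーパンダは可愛い']
```

In both cases nothing language-specific is removed before scoring, which is consistent with the noisy ranking reported above.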
Suggested Fix
- Add a lang field to config.json (e.g., "lang": "ja")
- Have MempalaceConfig expose this field and call load_lang() during initialization
- In searcher.py, use i18n.get_regex()["stop_words"] to filter tokens in _tokenize() before BM25 scoring
- Optionally use topic_pattern and action_pattern for closet generation in the relevant modules
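The first three steps could look roughly like the sketch below. It is self-contained and does not call the real mempalace API: STOP_WORDS_BY_LANG, read_lang() and tokenize() are hypothetical names, the stop word lists are illustrative rather than copied from i18n/*.json, and it assumes the stop words are available as a plain set of strings (the actual return value of get_regex()["stop_words"] may have a different shape).

```python
import json
import re

_TOKEN_RE = re.compile(r"\w{2,}", re.UNICODE)

# Hypothetical stand-in for the stop words the i18n module would load
# from i18n/*.json; the real lists and their format may differ.
STOP_WORDS_BY_LANG = {
    "en": frozenset({"the", "is", "and", "of"}),
    "ja": frozenset({"は", "が", "を", "に", "で", "です", "から"}),
}

def read_lang(config_text: str) -> str:
    """Return the configured language, falling back to English."""
    return json.loads(config_text).get("lang", "en")

def tokenize(text: str, lang: str = "en") -> list[str]:
    """Tokenize, then drop tokens that are stop words for the language."""
    stop_words = STOP_WORDS_BY_LANG.get(lang, frozenset())
    return [tok for tok in _TOKEN_RE.findall(text) if tok not in stop_words]

lang = read_lang('{"lang": "ja"}')
tokens = tokenize("レッサーパンダ です から 可愛い", lang)
print(tokens)  # ['レッサーパンダ', '可愛い']
```

The sample input is space-separated so that the stop words appear as tokens of their own. Because the \w{2,} pattern already discards single-character tokens, set-based filtering of this kind only affects stop words of two or more characters.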
Additional Context
The i18n/ja.json file is well-structured and contains useful patterns. The infrastructure is all there — it just needs to be wired into the search pipeline. This would also benefit the other 7 supported languages (fr, ko, es, de, zh-CN, zh-TW, en).