searcher.py does not use i18n regex patterns: Japanese (and other non-English) search is degraded #973

@m2a8-sketch

Description

Summary

searcher.py does not reference the i18n module at all. The BM25 tokenizer uses a hardcoded regex (\w{2,}) and does not apply language-specific stop words or topic patterns from i18n/*.json. This means the Japanese stop words, quote patterns, and action patterns defined in ja.json are unused during search, degrading search quality for Japanese content.

Environment

  • MemPalace: v3.3.0
  • Python: 3.11
  • OS: Windows 11
  • Palace content: ~2,700 drawers, primarily Japanese conversation logs

Steps to Reproduce

  1. Mine Japanese conversation logs with mempalace mine <dir> --mode convos
  2. Search for a specific Japanese term: mempalace search "レッサーパンダ"
  3. Observe that results include unrelated content with high match scores, while relevant results (containing the exact term) are ranked alongside noise

Expected Behavior

  • BM25 should strip Japanese stop words (は, が, を, に, で, etc.) defined in ja.json before scoring
  • Exact keyword matches in Japanese should be strongly boosted
  • Language-specific regex patterns (topic_pattern, action_pattern) should be used for closet generation and search

Actual Behavior

  • searcher.py does not import or call anything from mempalace.i18n
    • Verified with: 'i18n' in text evaluates to False, and 'load_lang' in text evaluates to False
  • The tokenizer _TOKEN_RE = re.compile(r"\w{2,}", re.UNICODE) handles Unicode but does not benefit from language-specific stop word removal
  • Japanese stop words remain in the token stream, diluting BM25 IDF scores
  • i18n/__init__.py defaults to English on import (load_lang("en")) and config.py has no lang property to override this
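The substring check quoted above can be reproduced with a few lines of Python. This is only a sketch: the helper name i18n_references is hypothetical, and the sample string stands in for the real contents of searcher.py, which would have to be read from wherever the package is installed.

```python
def i18n_references(source_text: str, needles=("i18n", "load_lang")) -> dict[str, bool]:
    """Report, for each needle, whether it occurs anywhere in the source text."""
    return {needle: needle in source_text for needle in needles}


# Stand-in for the module source; only the tokenizer line is taken from the issue.
sample = 'import re\n_TOKEN_RE = re.compile(r"\\w{2,}", re.UNICODE)\n'

result = i18n_references(sample)
print(result)
```

If both values are False for the real file, the module contains no textual reference to the i18n package at all, which is the claim made above.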

Suggested Fix

  1. Add a lang field to config.json (e.g., "lang": "ja")
  2. Have MempalaceConfig expose this field and call load_lang() during initialization
  3. In searcher.py, use i18n.get_regex()["stop_words"] to filter tokens in _tokenize() before BM25 scoring
  4. Optionally use topic_pattern and action_pattern for closet generation in the relevant modules
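A minimal sketch of step 3, assuming the stop words are available as a set of strings. The function name tokenize and the JA_STOP_WORDS set are hypothetical stand-ins; the real list would come from ja.json through whatever i18n.get_regex() returns, whose exact shape is not shown in this issue. Only the _TOKEN_RE pattern is taken from the report.

```python
import re

# Tokenizer pattern as quoted in this issue.
_TOKEN_RE = re.compile(r"\w{2,}", re.UNICODE)

# Hypothetical stop words for illustration; not the contents of ja.json.
JA_STOP_WORDS = frozenset({"これ", "です", "ます"})


def tokenize(text: str, stop_words: frozenset = frozenset()) -> list[str]:
    """Split text into tokens of two or more word characters, dropping stop words."""
    return [tok for tok in _TOKEN_RE.findall(text) if tok not in stop_words]


# Pre-split text: stop words are exact tokens and are removed.
spaced = tokenize("これ レッサーパンダ です", JA_STOP_WORDS)

# Unsegmented text: the whole run is one token, so nothing is filtered.
unspaced = tokenize("これはレッサーパンダです", JA_STOP_WORDS)

print(spaced)
print(unspaced)
```

The second call shows a limitation worth keeping in mind when wiring this in: with \w{2,}, an unsegmented Japanese sentence is a single token, and one-character particles such as は never appear as tokens on their own. An exact-match filter therefore only helps where the text is already split into words, so some form of segmentation may be needed before the stop word list has any effect.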

Additional Context

The i18n/ja.json file is well-structured and contains useful patterns. The infrastructure is all there — it just needs to be wired into the search pipeline. This would also benefit the other 7 supported languages (fr, ko, es, de, zh-CN, zh-TW, en).

Labels: bug