
Title: Search does not work for non-English languages (Russian, etc.) — default embedding model is English-only #712

@Speccy-Rom

Problem

▎ MemPalace uses ChromaDB with its default embedding function, which is based on the all-MiniLM-L6-v2 model from sentence-transformers. This model is trained primarily on English text and produces poorly discriminative vectors for
▎ non-English languages (Russian, Chinese, Arabic, etc.).

▎ As a result, semantic search queries in Russian return irrelevant results with very low similarity scores (0.04–0.18), making the tool practically unusable for non-English projects or users who store
▎ memories in their native language.

▎ Steps to reproduce
▎ 1. Mine a project that contains Russian-language text, or store a drawer with Russian content via mempalace_add_drawer
▎ 2. Search for it in Russian: mempalace search "запрос на русском"
▎ 3. Results are unrelated, similarity scores < 0.2
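The steps above can also be reproduced outside MemPalace with ChromaDB alone. The following is a minimal sketch, not MemPalace's actual code path: the collection name, document texts, and ids are invented for illustration, and it assumes the collection uses cosine distance, so that similarity is 1 minus the returned distance. How MemPalace itself derives its similarity score is not confirmed here.

```python
# Standalone repro sketch: ChromaDB's default embedder on Russian text.
# Collection name, ids, and documents are hypothetical examples.

def distances_to_similarities(distances):
    """Convert cosine distances (as returned by a query) to similarities."""
    return [round(1.0 - d, 4) for d in distances]


def run_repro():
    # Imported lazily so the helper above works without chromadb installed.
    import chromadb

    client = chromadb.EphemeralClient()  # in-memory, nothing is persisted
    collection = client.create_collection(
        name="ru_repro",
        metadata={"hnsw:space": "cosine"},  # assumption: cosine distance
    )
    collection.add(
        ids=["d1", "d2", "d3"],
        documents=[
            "Как настроить подключение к базе данных",  # database connection setup
            "Рецепт борща со свёклой и капустой",        # borscht recipe
            "Расписание поездов до Санкт-Петербурга",    # train timetable
        ],
    )
    # Query: "database connection settings"
    result = collection.query(
        query_texts=["настройка соединения с базой данных"],
        n_results=3,
    )
    similarities = distances_to_similarities(result["distances"][0])
    for doc, sim in zip(result["documents"][0], similarities):
        print(f"{sim:.4f}  {doc}")
```

Calling `run_repro()` prints the three documents with their similarity to the query. With an English-centric model the ranking and the scores may not separate the relevant document from the unrelated ones.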

▎ Expected behavior

▎ Search should work across languages, or at minimum there should be a way to configure a multilingual embedding model.

▎ Suggested fix

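▎ Allow configuring the embedding model in ~/.mempalace/config.json, defaulting to a multilingual model such as paraphrase-multilingual-mpnet-base-v2 (supports 50+ languages; note that it is considerably larger than all-MiniLM-L6-v2 and
▎ produces 768-dimensional vectors rather than 384-dimensional ones). Alternatively, use multilingual-e5-small for a lighter option that keeps 384-dimensional output. In either case, existing collections would need to be re-embedded after switching models.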

▎ {
▎   "embedding_model": "paraphrase-multilingual-mpnet-base-v2"
▎ }
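One possible way to wire this setting in is sketched below. It is an illustration under assumptions, not MemPalace's implementation: the function names `load_embedding_model` and `make_embedding_function` are hypothetical, and the fallback default is kept at the current model so that existing palaces keep working unless the user opts in. The only library call used is ChromaDB's `SentenceTransformerEmbeddingFunction`, which requires sentence-transformers to be installed.

```python
import json
from pathlib import Path

# Assumed default and config location; the key name follows the proposal above.
DEFAULT_MODEL = "all-MiniLM-L6-v2"
CONFIG_PATH = Path.home() / ".mempalace" / "config.json"


def load_embedding_model(path=CONFIG_PATH, default=DEFAULT_MODEL):
    """Return the configured model name, or the default if unset or unreadable."""
    try:
        config = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, bad encoding, or invalid JSON
        return default
    if not isinstance(config, dict):
        return default
    name = config.get("embedding_model")
    return name if isinstance(name, str) and name.strip() else default


def make_embedding_function(model_name):
    """Build a ChromaDB embedding function for the given model name."""
    # Lazy import: needs chromadb and sentence-transformers at call time only.
    from chromadb.utils import embedding_functions

    return embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name
    )
```

The result of `make_embedding_function(load_embedding_model())` could then be passed as `embedding_function` when a collection is created or opened. A collection embedded with one model cannot be queried meaningfully with another, so a model change would have to be paired with re-indexing.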

Metadata

    Labels

    area/i18n: Multilingual, Unicode, non-English embeddings
    area/search: Search and retrieval
    enhancement: New feature or request
