This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
karukan is a Linux Japanese Input Method system consisting of three Rust crates:
- karukan-engine: Core library — romaji-to-hiragana conversion, neural kana-kanji conversion via llama.cpp, system dictionary, learning cache
- karukan-cli: CLI tools and server — dictionary builder, Sudachi converter, dict viewer, AJIMEE-Bench, HTTP API server
- karukan-im: fcitx5 IME addon using karukan-engine for Japanese input on Linux
This project uses a Cargo workspace. All commands are run from the repository root.
```bash
cargo build --release          # Build all crates
cargo test --workspace         # Run all tests

cargo build -p karukan-engine --release
cargo test -p karukan-engine   # includes integration tests (model auto-downloaded on first run)

cargo build -p karukan-cli --release
```
```bash
# Start the server (auto-downloads models from HuggingFace)
cargo run --release --bin karukan-server

# Build dictionary from JSON or Mozc TSV
cargo run --release --bin karukan-dict -- build input.json -o dict.bin

# Build scored dictionary from Sudachi CSV
cargo run --release --bin sudachi-dict -- input.csv -o scored.json

# Dictionary viewer (web UI + CLI search)
cargo run --release --bin karukan-dict -- view dict.bin

# AJIMEE-Bench evaluation
cargo run --release --bin ajimee-bench -- evaluation_items.json

cargo build -p karukan-im --release
cargo test -p karukan-im
```
```bash
# Build and install fcitx5 addon
cd karukan-im/fcitx5-addon

# Option A: System install (sudo required, no FCITX_ADDON_DIRS needed)
cmake -B build -DCMAKE_INSTALL_PREFIX=/usr
cmake --build build -j
sudo cmake --install build

# Option B: User-local install (no sudo, requires FCITX_ADDON_DIRS)
cmake -B build -DCMAKE_INSTALL_PREFIX=$HOME/.local
cmake --build build -j
cmake --install build
```

```bash
cargo fmt --all          # Format all crates
cargo clippy --workspace # Lint all crates
```

karukan-engine:
- `lib.rs` — Library entry point and re-exports
- `romaji/` — Romaji-to-hiragana conversion
  - `trie.rs` — Trie data structure
  - `rules.rs` — 200+ conversion rules
  - `converter.rs` — FSM converter
- `kanji/` — Kana-kanji conversion via llama.cpp
  - `backend.rs` — Backend + KanaKanjiConverter
  - `llamacpp.rs` — GGUF inference
  - `hf_download.rs` — HuggingFace model download
  - `model_config.rs` — models.toml registry
  - `error.rs` — KanjiError type
- `dict.rs` — Double-array trie system dictionary
- `learning.rs` — Learning cache (user conversion history, TSV persistence, recency+frequency scoring)
- `kana.rs` — Hiragana/katakana utilities

karukan-cli:
- `bin/dict.rs` — Dictionary tool: build (JSON or Mozc TSV → binary) and view (web UI + CLI search)
- `bin/sudachi_dict.rs` — Sudachi dictionary → scored JSON converter
- `bin/server.rs` — Axum HTTP API server
- `bin/ajimee_bench.rs` — AJIMEE-Bench evaluation
- `static/` — Web UI assets for server and dict-viewer

karukan-im:
- `core/engine/` — IMEEngine state machine (Empty → Composing → Conversion)
  - `mod.rs` — Main InputMethodEngine struct and core processing logic
  - `types.rs` — EngineConfig, EngineResult, EngineAction, Converters, ConversionStrategy
  - `input.rs` — Key input handling for Composing state
  - `input_buffer.rs` — Input buffer (hiragana text + cursor position)
  - `conversion.rs` — Conversion mode handling
  - `cursor.rs` — Cursor movement
  - `display.rs` — Preedit text display
  - `mode.rs` — Mode switching (katakana, alphabet, live conversion)
  - `init.rs` — Model loading, dictionary setup, learning cache init
  - `strategy.rs` — Conversion strategy determination and adaptive model selection
  - `tests.rs` — Engine unit tests
- `core/preedit.rs` — Preedit composition with cursor support
- `core/candidate.rs` — Candidate list with pagination support
- `core/keycode.rs` — Key symbol definitions and key event handling
- `core/state.rs` — Engine state definitions
- `config/settings.rs` — User settings (~/.config/karukan-im/config.toml)
- `ffi.rs` — C FFI for fcitx5 C++ addon
- `fcitx5-addon/src/karukan.cpp` — C++ fcitx5 wrapper
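The trie + FSM design of `romaji/` can be illustrated with a minimal greedy longest-match sketch. The rule table and function names below are illustrative only, not the engine's actual API; the real `rules.rs` has 200+ rules and the real converter also tracks pending input:

```rust
use std::collections::HashMap;

// Tiny illustrative rule table; the real rules.rs defines 200+ rules.
fn rules() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("a", "あ"), ("i", "い"), ("u", "う"),
        ("ka", "か"), ("ki", "き"), ("shi", "し"),
        ("nn", "ん"),
    ])
}

// Greedy longest-prefix match over ASCII romaji input; characters
// that match no rule pass through unchanged.
fn romaji_to_hiragana(input: &str) -> String {
    let table = rules();
    let mut out = String::new();
    let mut rest = input;
    while !rest.is_empty() {
        let max_len = 3.min(rest.len()); // longest rule here is 3 chars
        let hit = (1..=max_len)
            .rev()
            .find_map(|len| table.get(&rest[..len]).map(|kana| (len, *kana)));
        match hit {
            Some((len, kana)) => {
                out.push_str(kana);
                rest = &rest[len..];
            }
            None => {
                out.push_str(&rest[..1]);
                rest = &rest[1..];
            }
        }
    }
    out
}

fn main() {
    assert_eq!(romaji_to_hiragana("kakishi"), "かきし");
    assert_eq!(romaji_to_hiragana("unnki"), "うんき");
    println!("ok");
}
```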
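The Empty → Composing → Conversion flow handled by `core/engine/` can be sketched as an enum-based transition function. State and key names here are hypothetical simplifications; the actual definitions in `core/state.rs` and `core/keycode.rs` are richer:

```rust
// Illustrative states; the actual definitions live in core/state.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EngineState {
    Empty,      // no pending input
    Composing,  // romaji being composed into hiragana
    Conversion, // kana-kanji candidates being selected
}

// Illustrative key events; the real keycode.rs handles many more.
enum Key {
    Char,  // printable key starts/extends composition
    Space, // triggers conversion
    Enter, // commits and returns to Empty
}

// Simplified Empty → Composing → Conversion transitions.
fn next_state(state: EngineState, key: Key) -> EngineState {
    use EngineState::*;
    match (state, key) {
        (Empty, Key::Char) => Composing,
        (Composing, Key::Char) => Composing,
        (Composing, Key::Space) => Conversion,
        (_, Key::Enter) => Empty,
        (s, _) => s,
    }
}

fn main() {
    let s = next_state(EngineState::Empty, Key::Char);
    let s = next_state(s, Key::Space);
    assert_eq!(s, EngineState::Conversion);
    assert_eq!(next_state(s, Key::Enter), EngineState::Empty);
    println!("ok");
}
```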
- IMEEngine uses a state machine: Empty → Composing → Conversion
- `input_buf: InputBuffer` in IMEEngine is the source of truth for hiragana text (the `.text` field holds the composed hiragana, `.cursor_pos` tracks the cursor position)
- RomajiConverter accumulates output; it is consumed into `input_buf` via delta tracking
- Models use jinen format with special Unicode tokens (U+EE00–U+EE02) from the Private Use Area; model input is katakana (hiragana is converted to katakana before inference)
- Model registry defined in `karukan-engine/models.toml`; default models use Q5_K_M quantization
- Learning cache records user-selected conversions and boosts them on subsequent conversions; candidate priority: Learning → User Dictionary → Model → System Dictionary → Fallback
- Learning cache is persisted as TSV (`~/.local/share/karukan-im/learning.tsv`); saved on deactivate and engine free, not on every commit
- Learning score uses a recency-weighted formula (mozc-inspired): `recency * 10.0 + ln(1 + frequency)`; eviction removes the lowest-scoring entries when over `max_entries` (default: 10,000)
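Since model input is katakana, composed hiragana must be transliterated before inference. A standard approach (sketched below; not necessarily the exact `kana.rs` implementation) shifts each hiragana code point up by 0x60 into the katakana block:

```rust
// Convert hiragana to katakana by shifting code points. Hiragana
// (U+3041..=U+3096) sits exactly 0x60 below the corresponding
// katakana block; all other characters pass through unchanged.
fn hiragana_to_katakana(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{3041}'..='\u{3096}' => char::from_u32(c as u32 + 0x60).unwrap_or(c),
            other => other,
        })
        .collect()
}

fn main() {
    // "かんじ" (hiragana) becomes "カンジ" (katakana) before inference.
    assert_eq!(hiragana_to_katakana("かんじ"), "カンジ");
    println!("{}", hiragana_to_katakana("へんかんする"));
}
```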
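The learning-cache scoring and eviction rules above can be sketched as follows. The struct and function names are hypothetical, not the actual `learning.rs` API; only the formula and the `max_entries` default come from this document:

```rust
// Hypothetical cache entry; field names are illustrative only.
struct Entry {
    surface: String, // user-selected conversion result
    recency: f64,    // recency weight (higher = used more recently)
    frequency: u64,  // times the user picked this conversion
}

impl Entry {
    // mozc-inspired score: recency * 10.0 + ln(1 + frequency)
    fn score(&self) -> f64 {
        self.recency * 10.0 + (1.0 + self.frequency as f64).ln()
    }
}

// Drop the lowest-scoring entries once the cache exceeds max_entries
// (default 10,000 in the engine).
fn evict(entries: &mut Vec<Entry>, max_entries: usize) {
    if entries.len() > max_entries {
        entries.sort_by(|a, b| b.score().total_cmp(&a.score()));
        entries.truncate(max_entries);
    }
}

fn main() {
    let mut cache = vec![
        Entry { surface: "漢字".into(), recency: 1.0, frequency: 5 },
        Entry { surface: "感じ".into(), recency: 0.1, frequency: 1 },
    ];
    evict(&mut cache, 1);
    assert_eq!(cache.len(), 1);
    assert_eq!(cache[0].surface, "漢字");
    println!("ok");
}
```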
Model training is handled by the separate karukan-jinen Python project (not in this repository). It trains GPT-2-based models for kana-kanji conversion using the jinen format and outputs GGUF files for use with karukan-engine.