[Bug] NGramGPULanguageModel.from_arpa() fails with AssertionError on valid 6-gram ARPA files
Environment
- NeMo version: 2.7.3
- PyTorch: 2.x (CUDA)
- GPU: NVIDIA H100 80GB
- OS: Ubuntu 22.04
- KenLM: Built from source (latest)
Description
NGramGPULanguageModel.from_arpa() crashes with an AssertionError when loading valid 6-gram ARPA language models generated by KenLM's lmplz. The same ARPA files load and query successfully with the kenlm Python bindings.
This blocks shallow fusion for domain-adapted ASR — a key technique for improving WER in specialized domains like Air Traffic Control.
Steps to Reproduce
import nemo.collections.asr as nemo_asr
# Load any ASR model (e.g., Parakeet TDT)
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
# Build a 6-gram ARPA LM with KenLM
# lmplz -o 6 --prune 0 0 1 1 2 2 < corpus.txt > my_6gram.arpa
# Try to use it for shallow fusion
from omegaconf import open_dict
with open_dict(model.cfg.decoding):
    model.cfg.decoding.strategy = "tsd"
    model.cfg.decoding.beam = {
        "beam_size": 8,
        "search_type": "tsd",
        "ngram_lm_model": "/path/to/my_6gram.arpa",
        "ngram_lm_alpha": 0.3,
        "return_best_hypothesis": True,
    }
model.change_decoding_strategy(model.cfg.decoding)
# ^ crashes here
Error
Traceback (most recent call last):
  File ".../nemo/collections/asr/parts/submodules/ngram_lm_batched.py", line 440, in _add_ngram
    assert len(ngram.symbols) == self._cur_order
AssertionError
The assertion in _add_ngram expects each n-gram entry to have exactly self._cur_order symbols, but the parser appears to misparse some entries — likely due to handling of backoff weights, whitespace, or end-of-section markers in the ARPA format.
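To narrow down which entries trip the check, a standalone scan of the ARPA file can list every line whose word count does not match its section order. This is a minimal sketch and not NeMo's parser: find_malformed_ngrams is a hypothetical helper, and it assumes the layout lmplz writes (log10 probability, a tab, space-separated words, then an optional tab and backoff weight).

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches section headers such as "\1-grams:" or "\6-grams:".
SECTION_RE = re.compile(r"^\\(\d+)-grams:$")


def find_malformed_ngrams(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (line_number, section_order, line) for suspicious n-gram entries.

    Hypothetical diagnostic helper; assumes tab-separated fields as
    described in the lead-in, which may not hold for every ARPA writer.
    """
    order = 0  # 0 means we have not reached an n-gram section yet
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines separate sections
        if line == "\\end\\":
            break
        match = SECTION_RE.match(line)
        if match:
            order = int(match.group(1))
            continue
        if order == 0:
            continue  # "\data\" header and "ngram N=count" lines
        fields = line.split("\t")
        # Expect 2 fields (prob, words) or 3 (prob, words, backoff).
        words = fields[1].split() if len(fields) in (2, 3) else []
        if len(words) != order:
            yield lineno, order, line


# Usage (path is a placeholder):
# with open("/path/to/my_6gram.arpa", encoding="utf-8") as f:
#     for lineno, order, line in find_malformed_ngrams(f):
#         print(f"line {lineno}: expected {order} words: {line!r}")
```

If this scan reports nothing on a file that still fails in from_arpa, the mismatch is more likely in how the loader tokenizes or maps symbols than in the file layout itself.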
What We Tried
- Binary .bin files → UnicodeDecodeError (expected: from_arpa wants text)
- ARPA with prepended KenLM build log → AssertionError: assert line == "\\data\\" (fixed by stripping header)
- Clean ARPA files (valid \data\ header, correct format) → AssertionError: len(ngram.symbols) == self._cur_order
- Multiple ARPA files of different sizes (27K, 82K, 127K sentences) — all fail the same way
- Verified files are valid with kenlm Python bindings:
import kenlm
lm = kenlm.Model("my_6gram.arpa") # loads fine
lm.score("DELTA FIVE SEVEN CLIMB FLIGHT LEVEL THREE FIVE ZERO") # works
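For reference, the header-stripping step from the second item above can be sketched as follows. strip_preamble is a hypothetical helper name, and the only assumption is that the real ARPA content starts at the first line consisting solely of \data\.

```python
from typing import Iterable, List

ARPA_HEADER = "\\data\\"


def strip_preamble(lines: Iterable[str]) -> List[str]:
    """Return the lines from the first ARPA header line onward.

    Hypothetical helper: drops any build-log output that was written
    ahead of the header, e.g. when stderr was merged into the output file.
    """
    kept: List[str] = []
    seen_header = False
    for line in lines:
        if not seen_header and line.strip() == ARPA_HEADER:
            seen_header = True
        if seen_header:
            kept.append(line)
    if not seen_header:
        raise ValueError("no ARPA header line found")
    return kept


# Usage (paths are placeholders):
# with open("raw.arpa", encoding="utf-8") as src:
#     cleaned = strip_preamble(src)
# with open("my_6gram.arpa", "w", encoding="utf-8") as dst:
#     dst.writelines(cleaned)
```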
Expected Behavior
NGramGPULanguageModel.from_arpa() should successfully parse valid KenLM-generated 6-gram ARPA files and enable shallow fusion during beam search decoding.
Suspected Root Cause
The ARPA parser in ngram_lm_batched.py may have been tested primarily with lower-order n-grams (3-gram, 4-gram). With 6-gram models:
- Lines in higher-order sections may have edge cases (e.g., backoff weight formatting, empty lines between sections) that the parser doesn't handle
- The symbol count check may fail due to how the parser splits n-gram lines containing tab-separated probability, words, and backoff weight
Impact
This bug blocks shallow fusion for anyone doing domain-adapted ASR with higher-order language models. In our case, we built a 127K-sentence ATC domain LM to improve WER from 6.09% toward sub-6% on Air Traffic Control transcription. KenLM shallow fusion typically provides 5-15% relative WER improvement in domain-specific ASR, but we cannot use it due to this parser bug.
Workaround
None currently for GPU-accelerated shallow fusion within NeMo. The kenlm Python bindings can load the same files for post-hoc rescoring, but this doesn't integrate with NeMo's beam search decoding pipeline.
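For completeness, the post-hoc rescoring fallback looks roughly like the sketch below. It is not shallow fusion and not a NeMo API: rescore_nbest is a hypothetical helper, the n-best list is assumed to be (text, score) pairs with natural-log ASR scores, and the LM scorer is passed in as a callable so that kenlm.Model.score (which returns log10 probabilities) can be plugged in.

```python
import math
from typing import Callable, List, Tuple

LN10 = math.log(10.0)  # converts log10 scores to natural log


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_log10_score: Callable[[str], float],
    alpha: float = 0.3,
) -> List[Tuple[str, float]]:
    """Re-rank hypotheses by ASR score plus a weighted LM score.

    Hypothetical helper: assumes ASR scores are natural-log values and
    that lm_log10_score returns a log10 probability for a whole sentence.
    """
    rescored = [
        (text, asr_score + alpha * LN10 * lm_log10_score(text))
        for text, asr_score in nbest
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Usage with the kenlm bindings (path and n-best list are placeholders):
# import kenlm
# lm = kenlm.Model("my_6gram.arpa")
# best_text, best_score = rescore_nbest(nbest, lm.score, alpha=0.3)[0]
```

Because the LM only sees complete hypotheses, it cannot recover words that beam search already pruned, which is why this does not replace in-decoder fusion.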
Related Files
nemo/collections/asr/parts/submodules/ngram_lm_batched.py — _add_ngram() method, line ~440
nemo/collections/asr/parts/submodules/ngram_lm_batched.py — from_arpa() class method