[Bug] NGramGPULanguageModel.from_arpa() fails with AssertionError on valid 6-gram ARPA files #15715

@amargolin78

Environment

  • NeMo version: 2.7.3
  • PyTorch: 2.x (CUDA)
  • GPU: NVIDIA H100 80GB
  • OS: Ubuntu 22.04
  • KenLM: Built from source (latest)

Description

NGramGPULanguageModel.from_arpa() crashes with an AssertionError when loading valid 6-gram ARPA language models generated by KenLM's lmplz. The same ARPA files load and query successfully with the kenlm Python bindings.

This blocks shallow fusion for domain-adapted ASR — a key technique for improving WER in specialized domains like Air Traffic Control.

Steps to Reproduce

import nemo.collections.asr as nemo_asr

# Load any ASR model (e.g., Parakeet TDT)
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Build a 6-gram ARPA LM with KenLM
# lmplz -o 6 --prune 0 0 1 1 2 2 < corpus.txt > my_6gram.arpa

# Try to use it for shallow fusion
from omegaconf import open_dict
with open_dict(model.cfg.decoding):
    model.cfg.decoding.strategy = "tsd"
    model.cfg.decoding.beam = {
        "beam_size": 8,
        "search_type": "tsd",
        "ngram_lm_model": "/path/to/my_6gram.arpa",
        "ngram_lm_alpha": 0.3,
        "return_best_hypothesis": True,
    }
model.change_decoding_strategy(model.cfg.decoding)
# ^ crashes here

Error

Traceback (most recent call last):
  File ".../nemo/collections/asr/parts/submodules/ngram_lm_batched.py", line 440, in _add_ngram
    assert len(ngram.symbols) == self._cur_order
AssertionError

The assertion in _add_ngram requires each n-gram entry to have exactly self._cur_order symbols, so the parser appears to be producing some entries with a different symbol count. Likely causes are its handling of backoff weights, whitespace, or end-of-section markers in the ARPA format.

What We Tried

  1. Binary .bin files → UnicodeDecodeError (expected, since from_arpa wants text)
  2. ARPA with prepended KenLM build log → AssertionError: assert line == "\\data\\" (fixed by stripping the log lines ahead of the \data\ header)
  3. Clean ARPA files (valid \data\ header, correct format) → AssertionError: len(ngram.symbols) == self._cur_order
  4. Multiple ARPA files of different sizes (27K, 82K, 127K sentences) — all fail the same way
  5. Verified files are valid with kenlm Python bindings:
    import kenlm
    lm = kenlm.Model("my_6gram.arpa")  # loads fine
    lm.score("DELTA FIVE SEVEN CLIMB FLIGHT LEVEL THREE FIVE ZERO")  # works
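The cleanup in item 2 amounts to dropping everything that precedes the \data\ line. A minimal sketch, assuming the build log sits entirely ahead of the header; strip_arpa_preamble is a hypothetical helper name, not part of NeMo or KenLM:

```python
def strip_arpa_preamble(src_path: str, dst_path: str) -> int:
    """Copy an ARPA file, dropping all lines before the first \\data\\ header.

    Returns the number of dropped lines. Hypothetical helper for illustration.
    """
    dropped = 0
    header_seen = False
    with open(src_path, encoding="utf-8") as fin, \
            open(dst_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if not header_seen:
                if line.strip() == "\\data\\":
                    header_seen = True
                else:
                    dropped += 1  # log output or other preamble
                    continue
            fout.write(line)
    if not header_seen:
        raise ValueError(f"no \\data\\ header found in {src_path}")
    return dropped
```

Writing to a separate destination path keeps the original lmplz output available for comparison.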

Expected Behavior

NGramGPULanguageModel.from_arpa() should successfully parse valid KenLM-generated 6-gram ARPA files and enable shallow fusion during beam search decoding.

Suspected Root Cause

The ARPA parser in ngram_lm_batched.py may have been tested primarily with lower-order n-grams (3-gram, 4-gram). With 6-gram models:

  • Lines in higher-order sections may have edge cases (e.g., backoff weight formatting, empty lines between sections) that the parser doesn't handle
  • The symbol count check may fail due to how the parser splits n-gram lines containing tab-separated probability, words, and backoff weight
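To narrow down which entries trigger the assertion, the file can be scanned independently of NeMo. The sketch below is a diagnostic only and does not reproduce NeMo's parser: it assumes the tab-separated layout that lmplz writes (log10 probability, space-separated words, optional backoff weight) and reports entries whose word count differs from the order of the enclosing section. find_order_mismatches is a hypothetical name.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches section headers such as "\1-grams:" or "\6-grams:".
SECTION_RE = re.compile(r"^\\(\d+)-grams:$")


def find_order_mismatches(
    lines: Iterable[str],
) -> Iterator[Tuple[int, int, int, str]]:
    """Yield (line_no, section_order, n_words, line) for suspicious entries."""
    order = None
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines separate sections
        if line == "\\end\\":
            break
        match = SECTION_RE.match(line)
        if match:
            order = int(match.group(1))
            continue
        if order is None:
            continue  # still inside the \data\ header ("ngram N=count" lines)
        fields = line.split("\t")
        # Assumed layout: fields[0] = log10 prob, fields[1] = words,
        # fields[2] = backoff weight (absent in the highest-order section).
        # split() breaks on any whitespace, as a naive parser would.
        n_words = len(fields[1].split()) if len(fields) > 1 else 0
        if n_words != order:
            yield line_no, order, n_words, line
```

Running it as `list(find_order_mismatches(open("my_6gram.arpa", encoding="utf-8")))` lists the offending lines, if any. An empty result would suggest the problem lies in how the parser tracks section boundaries rather than in the entries themselves.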

Impact

This bug blocks shallow fusion for anyone doing domain-adapted ASR with higher-order language models. In our case, we built a 127K-sentence ATC domain LM to improve WER from 6.09% toward sub-6% on Air Traffic Control transcription. KenLM shallow fusion typically provides 5-15% relative WER improvement in domain-specific ASR, but we cannot use it due to this parser bug.

Workaround

None currently for GPU-accelerated shallow fusion within NeMo. The kenlm Python bindings can load the same files for post-hoc rescoring, but this doesn't integrate with NeMo's beam search decoding pipeline.
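For reference, the post-hoc rescoring fallback can be sketched as follows. This assumes an n-best list of (text, acoustic score) pairs is already available from the decoder. rescore_nbest and its parameters are hypothetical, and lm_score is any callable that returns a log-probability for a sentence, for example the score method of a kenlm.Model (which returns log10 values, so alpha has to absorb the difference in log base).

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    alpha: float = 0.3,
) -> List[Tuple[str, float]]:
    """Re-rank hypotheses by acoustic score + alpha * LM score, best first."""
    rescored = [(text, am + alpha * lm_score(text)) for text, am in nbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

Unlike shallow fusion, this can only reorder hypotheses that already survived beam search, which is why it is not a substitute for the in-decoder integration.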

Related Files

  • nemo/collections/asr/parts/submodules/ngram_lm_batched.py: _add_ngram() method, line ~440
  • nemo/collections/asr/parts/submodules/ngram_lm_batched.py: from_arpa() class method
