# [Bug] `NGramGPULanguageModel.from_arpa()` fails with AssertionError on valid 6-gram ARPA files

## Environment

- **NeMo version**: 2.7.3
- **PyTorch**: 2.x (CUDA)
- **GPU**: NVIDIA H100 80GB
- **OS**: Ubuntu 22.04
- **KenLM**: Built from source (latest)

## Description

`NGramGPULanguageModel.from_arpa()` crashes with an `AssertionError` when loading valid 6-gram ARPA language models generated by KenLM's `lmplz`. The same ARPA files load and query successfully with the `kenlm` Python bindings.

This blocks **shallow fusion for domain-adapted ASR**, a key technique for reducing WER in specialized domains such as Air Traffic Control.

## Steps to Reproduce

```python
import nemo.collections.asr as nemo_asr

# Load any ASR model (e.g., Parakeet TDT)
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Build a 6-gram ARPA LM with KenLM
# lmplz -o 6 --prune 0 0 1 1 2 2 < corpus.txt > my_6gram.arpa

# Try to use it for shallow fusion
from omegaconf import open_dict
with open_dict(model.cfg.decoding):
    model.cfg.decoding.strategy = "tsd"
    model.cfg.decoding.beam = {
        "beam_size": 8,
        "search_type": "tsd",
        "ngram_lm_model": "/path/to/my_6gram.arpa",
        "ngram_lm_alpha": 0.3,
        "return_best_hypothesis": True,
    }
model.change_decoding_strategy(model.cfg.decoding)
# ^ crashes here
```

## Error

```
Traceback (most recent call last):
  File ".../nemo/collections/asr/parts/submodules/ngram_lm_batched.py", line 440, in _add_ngram
    assert len(ngram.symbols) == self._cur_order
AssertionError
```

The assertion in `_add_ngram` expects every n-gram entry in the section being read to have exactly `self._cur_order` symbols. The parser appears to misparse some entries, possibly because of how it handles backoff weights, whitespace, or end-of-section markers in the ARPA format. We have not yet identified which line triggers the failure.

## What We Tried

1. **Binary `.bin` files** → `UnicodeDecodeError` (expected, since `from_arpa` reads text)
2. **ARPA with prepended KenLM build log** → `AssertionError` on `assert line == "\\data\\"` (fixed by stripping everything before the `\data\` header)
3. **Clean ARPA files** (valid `\data\` header, correct format) → `AssertionError` on `assert len(ngram.symbols) == self._cur_order`
4. **Multiple ARPA files** built from corpora of different sizes (27K, 82K, 127K sentences): all fail the same way
5. **Verified files are valid** with `kenlm` Python bindings:
   ```python
   import kenlm
   lm = kenlm.Model("my_6gram.arpa")  # loads fine
   lm.score("DELTA FIVE SEVEN CLIMB FLIGHT LEVEL THREE FIVE ZERO")  # works
   ```
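For reference, the header fix from step 2 can be sketched as follows. This is a minimal illustration, not the exact script we ran; the function name `strip_arpa_preamble` is hypothetical. It copies the file and drops every line that precedes the first `\data\` line.

```python
def strip_arpa_preamble(src: str, dst: str) -> int:
    """Copy src to dst, dropping all lines before the first `\\data\\` line.

    Returns the number of dropped lines. Hypothetical helper, shown only to
    illustrate the fix described in step 2.
    """
    dropped = 0
    found = False
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if not found:
                if line.strip() == "\\data\\":
                    found = True  # ARPA content starts here
                else:
                    dropped += 1  # build-log line, skip it
                    continue
            fout.write(line)
    if not found:
        raise ValueError(f"no \\data\\ header found in {src}")
    return dropped
```

Files cleaned this way get past the `\data\` assertion but still fail with the symbol-count assertion described in step 3.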

## Expected Behavior

`NGramGPULanguageModel.from_arpa()` should successfully parse valid KenLM-generated 6-gram ARPA files and enable shallow fusion during beam search decoding.

## Suspected Root Cause

The ARPA parser in `ngram_lm_batched.py` may have been tested primarily with lower-order n-grams (3-gram, 4-gram). With 6-gram models:
- Lines in higher-order sections may have edge cases (e.g., backoff weight formatting, empty lines between sections) that the parser doesn't handle
- The symbol count check may fail due to how the parser splits n-gram lines containing tab-separated probability, words, and backoff weight
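To help narrow this down, here is a small standalone checker, independent of NeMo, that reports ARPA lines whose word count does not match the section they appear in. It assumes the usual ARPA layout of `log10_prob<TAB>w1 ... wn[<TAB>backoff]`; the name `check_arpa` is hypothetical, and the code does not replicate NeMo's parsing logic.

```python
import re
from collections import Counter

SECTION_RE = re.compile(r"^\\(\d+)-grams:$")    # e.g. \6-grams:
COUNT_RE = re.compile(r"^ngram\s+(\d+)=(\d+)$")  # e.g. ngram 6=12345


def _is_float(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_arpa(lines):
    """Return (declared_counts, actual_counts, bad_lines) for an ARPA file.

    bad_lines holds (line_number, section_order, word_count, raw_line) for
    every entry that is not `prob<TAB>words[<TAB>backoff]` with exactly
    `section_order` words.
    """
    declared, actual, bad = {}, Counter(), []
    order = None  # order of the section currently being read
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        stripped = line.strip()
        if not stripped or stripped == "\\data\\":
            continue
        if stripped == "\\end\\":
            break
        count = COUNT_RE.match(stripped)
        if count and order is None:
            declared[int(count.group(1))] = int(count.group(2))
            continue
        section = SECTION_RE.match(stripped)
        if section:
            order = int(section.group(1))
            continue
        fields = line.split("\t")
        words = fields[1].split() if len(fields) > 1 else []
        if order is None:
            bad.append((lineno, None, len(words), line))  # entry before any section
            continue
        actual[order] += 1
        numeric_ok = all(_is_float(f) for f in (fields[0], *fields[2:]))
        if len(fields) not in (2, 3) or not numeric_ok or len(words) != order:
            bad.append((lineno, order, len(words), line))
    return declared, dict(actual), bad


# Usage (path is a placeholder):
# with open("/path/to/my_6gram.arpa", encoding="utf-8") as f:
#     declared, actual, bad = check_arpa(f)
# print(declared, actual, bad[:10])
```

If this reports no bad lines and the declared and actual counts agree, the file is structurally consistent under these assumptions, which would point toward how `from_arpa()` tokenizes entries rather than toward the file itself.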

## Impact

This bug blocks shallow fusion for anyone doing domain-adapted ASR with higher-order language models. In our case, we built an ATC domain LM from 127K sentences to reduce WER on Air Traffic Control transcription from 6.09% to below 6%. KenLM shallow fusion typically provides 5-15% relative WER improvement in domain-specific ASR, but we cannot use it because of this parser bug.

## Workaround

None currently for GPU-accelerated shallow fusion within NeMo. The `kenlm` Python bindings can load the same files for post-hoc rescoring, but this doesn't integrate with NeMo's beam search decoding pipeline.
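The post-hoc rescoring mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not something NeMo provides: `rescore_nbest` is a hypothetical helper, the n-best list is assumed to be pairs of `(text, acoustic_log_score)`, and the LM scorer is passed in as a callable so that the `kenlm` bindings can be plugged in.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    hyps: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    alpha: float = 0.3,
) -> List[Tuple[str, float]]:
    """Re-rank hypotheses by acoustic_score + alpha * lm_score(text).

    Returns the hypotheses sorted best-first with their combined scores.
    Note: kenlm returns log10 probabilities, so any scale mismatch with the
    acoustic scores is assumed to be absorbed into alpha.
    """
    rescored = [(text, am + alpha * lm_score(text)) for text, am in hyps]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Usage with the kenlm bindings (paths and inputs are placeholders):
# import kenlm
# lm = kenlm.Model("my_6gram.arpa")
# best_text, best_score = rescore_nbest(
#     nbest, lambda s: lm.score(s, bos=True, eos=True)
# )[0]
```

Because this only re-ranks finished hypotheses, it cannot recover words that beam search has already pruned, which is why it is not a substitute for shallow fusion.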

## Related Files

- `nemo/collections/asr/parts/submodules/ngram_lm_batched.py` — `_add_ngram()` method, line ~440
- `nemo/collections/asr/parts/submodules/ngram_lm_batched.py` — `from_arpa()` class method


[Bug] NGramGPULanguageModel.from_arpa() fails with AssertionError on valid 6-gram ARPA files #15715
