TransformerTokenizer reads attributes from raw backend that modern transformers doesn't set #1847

@Thump604

Description

Problem

outlines.models.mlxlm.MLXLM.__init__ wraps the tokenizer via TransformerTokenizer(tokenizer._tokenizer). The TransformerTokenizer.__init__ then reads eos_token_id, eos_token, and all_special_tokens from the raw tokenizers.Tokenizer backend:

self.eos_token_id = self.tokenizer.eos_token_id
self.eos_token = self.tokenizer.eos_token
self.special_tokens = set(self.tokenizer.all_special_tokens)

In modern transformers versions (tested with tokenizers 0.25+), the raw tokenizers.Tokenizer backend does not have these attributes; they exist only on the PreTrainedTokenizerFast wrapper. This causes:

AttributeError: 'tokenizers.Tokenizer' object has no attribute 'eos_token_id'

Reproduction

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B", trust_remote_code=True)
print(hasattr(tok, "eos_token_id"))         # True
print(hasattr(tok._tokenizer, "eos_token_id"))  # False

Then:

from outlines.models import from_mlxlm
model = ...  # any mlx model
from_mlxlm(model, tok)  # AttributeError

Environment

  • outlines 1.2.12
  • outlines_core 0.2.14
  • transformers 4.52+
  • tokenizers 0.25+
  • Platform: macOS, Apple Silicon (MLX)

Suggested fix

TransformerTokenizer.__init__ should read from the wrapper (tokenizer), not from tokenizer._tokenizer, for attributes that only exist on the wrapper:

# Instead of:
self.eos_token_id = self.tokenizer.eos_token_id

# Use the original wrapper:
self.eos_token_id = getattr(original_tokenizer, "eos_token_id", None)

Or check both the raw backend and the wrapper.
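A minimal sketch of such a fallback lookup, using stand-in classes in place of the real backend and wrapper (the helper name and the token values are illustrative, not outlines or transformers API):

```python
def lookup(backend, wrapper, attr, default=None):
    # Prefer the raw backend if it happens to expose the attribute,
    # otherwise fall back to the PreTrainedTokenizerFast-style wrapper.
    if hasattr(backend, attr):
        return getattr(backend, attr)
    return getattr(wrapper, attr, default)

class FakeBackend:
    """Stands in for tokenizers.Tokenizer: no special-token attributes."""

class FakeWrapper:
    """Stands in for the wrapper, which carries the attributes."""
    eos_token_id = 151645
    eos_token = "<|im_end|>"

backend, wrapper = FakeBackend(), FakeWrapper()
print(lookup(backend, wrapper, "eos_token_id"))  # 151645
print(lookup(backend, wrapper, "pad_token_id"))  # None
```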

Workaround

Patching the raw backend before calling from_mlxlm avoids the crash:

inner = tokenizer._tokenizer
for attr in ("eos_token_id", "eos_token", "all_special_tokens"):
    if not hasattr(inner, attr):
        setattr(inner, attr, getattr(tokenizer, attr, None))
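The patch above can be exercised without transformers installed, using SimpleNamespace stand-ins for the wrapper and its backend (token values are illustrative only):

```python
from types import SimpleNamespace

# Stand-ins for the PreTrainedTokenizerFast wrapper and its raw backend.
inner = SimpleNamespace()
tokenizer = SimpleNamespace(
    _tokenizer=inner,
    eos_token_id=151645,
    eos_token="<|im_end|>",
    all_special_tokens=["<|im_end|>", "<|endoftext|>"],
)

# Same loop as the workaround: copy missing attributes onto the backend.
for attr in ("eos_token_id", "eos_token", "all_special_tokens"):
    if not hasattr(inner, attr):
        setattr(inner, attr, getattr(tokenizer, attr, None))

print(inner.eos_token_id)  # 151645
```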

Additional: FSM state transition with MLX models

After patching the tokenizer, the FSM compiles, but the first generated token from Qwen 3.5 models hits "No next state found for the current state: 256 with token ID: 198". This may be a separate vocabulary-to-FSM mapping issue, where the FSM is built against a vocabulary that doesn't match the model's actual token space.
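One quick sanity check for that hypothesis is to look for tokenizer vocabulary IDs that fall outside the model's embedding range. The helper below is a hypothetical diagnostic with toy data, not outlines code:

```python
def out_of_range_tokens(vocab, model_vocab_size):
    # Token IDs present in the tokenizer vocabulary that the model's
    # output head can never actually score (id >= vocab size).
    return sorted(tid for tid in vocab.values() if tid >= model_vocab_size)

# Toy vocabulary: three tokens, one of which exceeds the model's vocab size.
vocab = {"\n": 198, "hello": 15339, "<|extra|>": 152064}
print(out_of_range_tokens(vocab, model_vocab_size=152064))  # [152064]
```

If this returns a non-empty list for the real tokenizer and model, the FSM and the model are indeed working over different token spaces.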
