## Problem

`outlines.models.mlxlm.MLXLM.__init__` wraps the tokenizer via `TransformerTokenizer(tokenizer._tokenizer)`. `TransformerTokenizer.__init__` then reads `eos_token_id`, `eos_token`, and `all_special_tokens` from the raw `tokenizers.Tokenizer` backend:

```python
self.eos_token_id = self.tokenizer.eos_token_id
self.eos_token = self.tokenizer.eos_token
self.special_tokens = set(self.tokenizer.all_special_tokens)
```

In modern versions of transformers (tested with tokenizers 0.25+), the raw `tokenizers.Tokenizer` backend does NOT have these attributes; they exist only on the `PreTrainedTokenizerFast` wrapper. This causes:

```
AttributeError: 'tokenizers.Tokenizer' object has no attribute 'eos_token_id'
```
## Reproduction

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B", trust_remote_code=True)
print(hasattr(tok, "eos_token_id"))             # True
print(hasattr(tok._tokenizer, "eos_token_id"))  # False
```

Then:

```python
from outlines.models import from_mlxlm

model = ...  # any mlx model
from_mlxlm(model, tok)  # AttributeError
```
## Environment
- outlines 1.2.12
- outlines_core 0.2.14
- transformers 4.52+
- tokenizers 0.25+
- Platform: macOS, Apple Silicon (MLX)
## Suggested fix

`TransformerTokenizer.__init__` should read attributes that exist only on the wrapper from the wrapper (`tokenizer`), not from `tokenizer._tokenizer`:

```python
# Instead of:
self.eos_token_id = self.tokenizer.eos_token_id

# Use the original wrapper:
self.eos_token_id = getattr(original_tokenizer, "eos_token_id", None)
```

Or check both the raw backend and the wrapper.
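A sketch of the "check both" approach. The helper name and structure here are illustrative, not the actual outlines internals:

```python
def resolve_tokenizer_attr(backend, wrapper, name, default=None):
    """Read `name` from the raw backend if it exposes it, otherwise fall
    back to the PreTrainedTokenizerFast wrapper, otherwise `default`."""
    for obj in (backend, wrapper):
        if hasattr(obj, name):
            return getattr(obj, name)
    return default
```

In `TransformerTokenizer.__init__` this would become something like `self.eos_token_id = resolve_tokenizer_attr(self.tokenizer, original_tokenizer, "eos_token_id")`, assuming the original wrapper is kept around.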
## Workaround

Patch the raw backend before calling `from_mlxlm`:

```python
inner = tokenizer._tokenizer
for attr in ("eos_token_id", "eos_token", "all_special_tokens"):
    if not hasattr(inner, attr):
        setattr(inner, attr, getattr(tokenizer, attr, None))
```
## Additional: FSM state transition with MLX models

After patching the tokenizer, the FSM compiles but fails on the first generated token from Qwen 3.5 models with:

```
No next state found for the current state: 256 with token ID: 198
```

This may be a separate vocabulary-to-FSM mapping issue, where the FSM is built from a vocabulary that doesn't match the model's actual token space.
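A quick diagnostic for the suspected mismatch. This helper is hypothetical; `model_vocab_size` would come from the model's embedding-table size, and `tokenizer` only needs a `get_vocab()` method, as on `PreTrainedTokenizerFast`:

```python
def vocab_matches_model(tokenizer, model_vocab_size):
    """Compare the tokenizer's full vocabulary (base vocab plus added
    tokens) against the model's embedding-table size. If the model can
    emit token IDs outside the vocabulary the FSM was compiled with,
    errors like "No next state found ..." are expected."""
    tok_size = len(tokenizer.get_vocab())
    return tok_size == model_vocab_size
```

If this returns `False`, the FSM and the model disagree on the token space, which would point at the vocabulary-mapping theory rather than the tokenizer-attribute bug above.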