About Pretraining? #149

kadirnar · 2025-04-13T15:02:38Z

kadirnar
Apr 13, 2025

default_config.yml:
8xH100-80G-PCIe-NVLink

{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 8,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}

Error Message:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 79.11 GiB of which 13.94 MiB is free. Process 153750 has 12.04 GiB memory in use. Process 153753 has 11.47 GiB memory in use. Process 153751 has 8.29 GiB memory in use. Process 153755 has 11.66 GiB memory in use. Including non-PyTorch memory, this process has 6.04 GiB memory in use. Process 153756 has 9.97 GiB memory in use. Process 153754 has 9.51 GiB memory in use. Process 153749 has 10.07 GiB memory in use. Of the allocated memory 5.59 GiB is allocated by PyTorch, and 1.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

config.yml:

# Model
model_name: "meta-llama/Llama-3.2-3B-Instruct"  # Replace with your base model must be compatible with the tokenizer and transformers library
tokenizer_name: "meta-llama/Llama-3.2-3B-Instruct"

# Training Args
epochs: 1
batch_size: 1
number_processes: 8
pad_token: 128263
save_steps: 12000
learning_rate: 5.0e-5
ratio: 1:1  # Start with 1:1 for TTS and gradually decrease to 0:1

# Datasets
text_QA_dataset: "kadirnar/Emilia-DE-B000000"  #  Emilia De Question Answering Dataset
TTS_dataset: "kadirnar/Emilia-DE-Orpheus"  # Emilia De dataset

# Naming and paths
save_folder: "checkpoints"
project_name: "de-tts-orpheus"
run_name: "de-tts-pretrain"

kadirnar · 2025-04-13T15:09:29Z

kadirnar
Apr 13, 2025
Author

I get this error when testing the Llama-1b model:

Error Message:

[rank6]:     return wrapper_cls(module, **kwargs)
[rank6]:   File "/mnt/kadirnar/trainer/Orpheus-TTS/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
[rank6]:     _init_param_handle_from_module(
[rank6]:   File "/mnt/kadirnar/trainer/Orpheus-TTS/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 618, in _init_param_handle_from_module
[rank6]:     state.compute_device = _get_compute_device(
[rank6]:   File "/mnt/kadirnar/trainer/Orpheus-TTS/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 1082, in _get_compute_device
[rank6]:     raise ValueError(
[rank6]: ValueError: Inconsistent compute device and `device_id` on rank 6: cuda:0 vs cuda:6
Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:00<00:00, 99.10it/s]
Resolving data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:00<00:00, 198677.56it/s]
Loading dataset shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 45/45 [00:00<00:00, 3422.56it/s]:

config.yml:

model_name: "meta-llama/Llama-3.2-1B-Instruct"  # Replace with your base model must be compatible with the tokenizer and transformers library
tokenizer_name: "meta-llama/Llama-3.2-1B-Instruct"

0 replies

amuvarma13 · 2025-04-13T16:38:41Z

amuvarma13
Apr 13, 2025
Maintainer

@amuvarma13

default_config.yml: 8xH100-80G-PCIe-NVLink

{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 8,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}

Error Message:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 79.11 GiB of which 13.94 MiB is free. Process 153750 has 12.04 GiB memory in use. Process 153753 has 11.47 GiB memory in use. Process 153751 has 8.29 GiB memory in use. Process 153755 has 11.66 GiB memory in use. Including non-PyTorch memory, this process has 6.04 GiB memory in use. Process 153756 has 9.97 GiB memory in use. Process 153754 has 9.51 GiB memory in use. Process 153749 has 10.07 GiB memory in use. Of the allocated memory 5.59 GiB is allocated by PyTorch, and 1.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

config.yml:

# Model
model_name: "meta-llama/Llama-3.2-3B-Instruct"  # Replace with your base model must be compatible with the tokenizer and transformers library
tokenizer_name: "meta-llama/Llama-3.2-3B-Instruct"

# Training Args
epochs: 1
batch_size: 1
number_processes: 8
pad_token: 128263
save_steps: 12000
learning_rate: 5.0e-5
ratio: 1:1  # Start with 1:1 for TTS and gradually decrease to 0:1

# Datasets
text_QA_dataset: "kadirnar/Emilia-DE-B000000"  #  Emilia De Question Answering Dataset
TTS_dataset: "kadirnar/Emilia-DE-Orpheus"  # Emilia De dataset

# Naming and paths
save_folder: "checkpoints"
project_name: "de-tts-orpheus"
run_name: "de-tts-pretrain"

What are your sequence lengths and are you using fsdp?

1 reply

kadirnar Apr 13, 2025
Author

I haven't made any changes to the train and config files. I only changed the dataset name. The train code has Fsdp.

I wrote the same dataset for:

# Datasets
text_QA_dataset: "kadirnar/Emilia-DE-B000000-Orpheus" 
TTS_dataset: "kadirnar/Emilia-DE-B000000-Orpheus"

Dataset:

kadirnar · 2025-04-13T20:15:33Z

kadirnar
Apr 13, 2025
Author

@amuvarma13 I'm starting the model training but it's not loading on any GPU. I tried with a single GPU and it worked. Could there be an issue with the multi-GPU code?

5 replies

kadirnar Apr 13, 2025
Author

I tried with 8xA100. It still didn't work.

amuvarma13 Apr 13, 2025
Maintainer

whats your start command - it should be accelerate launch train.py

kadirnar Apr 13, 2025
Author

I'm trying this to create a config file for 8xA100.

accelerate config default

I'm trying this code for the train.

accelerate launch train.py

amuvarma13 Apr 13, 2025
Maintainer

Can you paste train.py

kadirnar Apr 14, 2025
Author

@amuvarma13

I edited the config file and multi gpu worked. I will try again on a different 8xH100 server.
default_config.yml:

{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": true,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "bf16",
  "num_machines": 1,
  "num_processes": 8,
  "rdzv_backend": "static",
  "same_network": true,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false,
  "fsdp_config": {
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "fsdp_backward_prefetch_policy": "BACKWARD_PRE",
    "fsdp_forward_prefetch": true,
    "fsdp_offload_params": false,
    "fsdp_sharding_strategy": 1,
    "fsdp_state_dict_type": "FULL_STATE_DICT",
    "fsdp_transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"]
  },
  "deepspeed_config": {
    "zero_stage": 0
  },
  "dynamo_config": {
    "dynamo_backend": "no"
  }
}

kadirnar · 2025-04-14T21:59:11Z

kadirnar
Apr 14, 2025
Author

@amuvarma13 I'm using the Emilia-De dataset and it's been 30k steps but the results are very bad. Am I making a mistake in the inference code?

pretrain code: https://colab.research.google.com/drive/10v9MIEbZOr_3V8ZcPAIh8MN7q2LjcstS?usp=sharing

Output:

2 replies

kadirnar Apr 14, 2025
Author

@amuvarma13 I used the inference code that I used for finetuning and it worked.

In this blog post, I see that it was trained for 30k steps for German. However, I'm getting poor results for 30k steps. The model is still training.

Blog: https://canopylabs.ai/releases/orpheus_can_speak_any_language
Dataset: https://huggingface.co/datasets/kadirnar/Emilia-DE
Base Model: https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct

amuvarma13 Apr 15, 2025
Maintainer

we started with the english pretrained model for german (hence not that many steps) - not the base llama weights. You need a loss that is ~4 (or even less) for a good pretrained model. Most likely you just need to train on more data - also for efficiency id recommend packing sequences together into 8192 length sequences. As this will train your model on the same data on the same hardware more efficiently.

If I remember correctly the English pretrained model was around 1500 h100 hours but you can get a decent model starting from scratch probably even with 500 h100 hours (assuming your training set up is as efficient as possible).

also the text dataset is just text absolutely no audio going off the loss curve im guess that is also a text-speech dataset. The llama loss on text is generally around 2 or less, and it should aim to stay flattish.

kadirnar · 2025-04-15T14:59:32Z

kadirnar
Apr 15, 2025
Author

@amuvarma13

The model training is complete but it didn't save the tokenizer file. I'm getting this error. Where should I download the tokenizer file from? Or how can I create it? I trained the Llama-3.2-1B model. Therefore, downloading from the Orpheus model page might be incorrect.

OSError: Can't load tokenizer for 'Orpheus-Pretrain-De/checkpoint-250000/checkpoint-250000'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Orpheus-Pretrain-De/checkpoint-250000/checkpoint-250000' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

1 reply

kadirnar Apr 15, 2025
Author

@amuvarma13 I think you wrote in the wrong place. #123 (reply in thread)

I downloaded the tokenizer files from the canopylabs/orpheus-3b-0.1-pretrained model page, copied them to my trained model folder, and got a CUDA error.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 75
     72 all_attention_masks = torch.cat(all_attention_masks, dim=0)
     74 # Move tensors to GPU
---> 75 input_ids = all_padded_tensors.to("cuda")
     76 attention_mask = all_attention_masks.to("cuda")
     78 # Run inference with Unsloth

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This might be because I trained with Llama3.2-1B. I'm using this for the tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer.save_pretrained("Orpheus-Pretrain-De/checkpoint-250000/checkpoint-250000")

Then I run the inference code. And I save it again.

kadirnar · 2025-04-15T20:10:37Z

kadirnar
Apr 15, 2025
Author

@amuvarma13
I ran the voice clone code using canopylabs/orpheus-3b-0.1-pretrained and it worked very well. However, the model I pretrained myself doesn't produce any sound. Could it be a tokenizer-related issue? How can I fix this?

Output:

text: Man könnte sagen, ich sei für diese Aufgabe prädestiniert.
voice: https://voca.ro/1lBITGTMkBVr

1 reply

amuvarma13 Apr 15, 2025
Maintainer

I doubt its a tokeniser issue but its not impossible more likely something else though:

is it terminating the generation early or just producing silence
What was the final loss you got for your training?

kadirnar · 2025-04-28T12:28:52Z

kadirnar
Apr 28, 2025
Author

My train.py:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, AutoTokenizer
import numpy as np
from torch.distributed.fsdp.fully_sharded_data_parallel import FullStateDictConfig
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP, FullStateDictConfig, StateDictType)
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
import yaml
import wandb
from huggingface_hub import HfApi

config_file = "config.yaml"

with open(config_file, "r") as file:
    config = yaml.safe_load(file)

dsn1 = config["text_QA_dataset"]
dsn2 = config["TTS_dataset"]

model_name = config["model_name"]
tokenizer_name = config["tokenizer_name"]

run_name = config["run_name"]
project_name = config["project_name"]
base_repo_id = config["save_folder"]

epochs = config["epochs"]
batch_size = config["batch_size"]
save_steps = config["save_steps"]
pad_token = config["pad_token"]
number_processes = config["number_processes"]
learning_rate = config["learning_rate"]
config_ratio = config["ratio"]




class BatchedRatioDataset(Dataset):
    def __init__(self, dataset1, dataset2, batch_total, ratio=config_ratio):
        self.dataset1 = dataset1
        self.dataset2 = dataset2
        self.batch_total = batch_total
        self.ratio = ratio  

        num_cycles_ds1 = len(dataset1) // (batch_total * ratio)
        num_cycles_ds2 = len(dataset2) // batch_total
        self.num_cycles = min(num_cycles_ds1, num_cycles_ds2)

        self.length = self.num_cycles * (ratio + 1) * batch_total

    def __len__(self):
        print("accessing length", self.length)
        return int(self.length)

    def __getitem__(self, index):
        # Compute the cycle length in terms of samples.
        cycle_length = (self.ratio + 1) * self.batch_total
        cycle = index // cycle_length
        pos_in_cycle = index % cycle_length

        if pos_in_cycle < self.ratio * self.batch_total:
            batch_in_cycle = pos_in_cycle // self.batch_total
            sample_in_batch = pos_in_cycle % self.batch_total
            ds1_index = cycle * self.ratio * self.batch_total + batch_in_cycle * self.batch_total + sample_in_batch
            return self.dataset1[ds1_index]
        else:
            # We are in the dataset2 batch for this cycle.
            sample_in_batch = pos_in_cycle - self.ratio * self.batch_total
            ds2_index = cycle * self.batch_total + sample_in_batch
            return self.dataset2[ds2_index]



class AlternatingDistributedSampler(DistributedSampler):
    def __init__(self, dataset, num_replicas=None, rank=None, shuffle=False):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)
        self.shuffle = shuffle

    def __iter__(self):
        indices = list(range(len(self.dataset)))
        indices = indices[self.rank:self.total_size:self.num_replicas]
        return iter(indices)


class FSDPTrainer(Trainer):
    def __init__(self, *args, log_ratio=config_ratio, **kwargs):
        super().__init__(*args, **kwargs)
        self.repo_id = base_repo_id
        self.api = HfApi()

        self.log_ratio = log_ratio
        self.text_step  = 0
        self.audio_step = 0

    def get_train_dataloader(self):
        sampler = AlternatingDistributedSampler(
            self.train_dataset,
            num_replicas=torch.distributed.get_world_size(),
            rank=torch.distributed.get_rank(),
            shuffle=False,
        )

        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            sampler=sampler,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=0,
            pin_memory=self.args.dataloader_pin_memory,
        )

    def log(self, logs, start_time=None):
        super().log(logs, start_time)
        if self.is_world_process_zero():
            global_step = self.state.global_step
            # Each cycle is (log_ratio + 1) steps: first log_ratio steps for text_loss, then one for audio_loss.
            cycle_length = self.log_ratio + 1
            
            # Only log to wandb if 'loss' is in the logs dictionary
            if "loss" in logs:
                if (global_step % cycle_length) + self.log_ratio - 1 < self.log_ratio:
                    wandb.log({"audio_loss": logs["loss"], "audio_step": self.audio_step})
                    self.audio_step += 1
                else:
                    wandb.log({"text_loss": logs["loss"], "text_step": self.text_step})
                    self.text_step += 1

    def save_model(self, output_dir=None, _internal_call=False):
        if output_dir is None:
            output_dir = self.args.output_dir
        self.save_and_push_model(output_dir)

    def save_and_push_model(self, output_dir):
        save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
        with FSDP.state_dict_type(self.model, StateDictType.FULL_STATE_DICT, save_policy):
            cpu_state_dict = self.model.state_dict()
        self.model.save_pretrained(output_dir, state_dict=cpu_state_dict)


def data_collator(features):
    # max_length = 2656 # set a crop based on vram - ideally you have stacked all sequences to the same length
    # from 3b on 8 h100s fsdp, at bf16, 8192 works well.
    input_ids = [f["input_ids"] for f in features]

    if any("attention_mask" not in f for f in features):
        attention_mask = [[1]*len(ids) for ids in input_ids]
    else:
        attention_mask = [f["attention_mask"] for f in features]

    if any("labels" not in f for f in features):
        labels = input_ids
    else:
        labels = [f["labels"] for f in features]

    input_ids = torch.nn.utils.rnn.pad_sequence([torch.tensor(
        i, dtype=torch.long) for i in input_ids], batch_first=True, padding_value=pad_token)
    attention_mask = torch.nn.utils.rnn.pad_sequence([torch.tensor(
        m, dtype=torch.long) for m in attention_mask], batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence([torch.tensor(
        l, dtype=torch.long) for l in labels], batch_first=True, padding_value=-100)

    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


wandb.init(project=project_name, name=run_name)


import accelerate

# Setup accelerate (this initializes distributed environment)
accelerator = accelerate.Accelerator()
device = accelerator.device

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# Initialize model with proper dtype for Flash Attention 2.0 
# explicitly initializing on GPU, then letting FSDP handle the rest
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Explicitly set dtype for Flash Attention
)

# Initialize model on first GPU to make Flash Attention happy
if accelerator.is_local_main_process:
    print(f"Pre-initializing model on {device} before FSDP")
model = model.to(device)

number_add_tokens = 7 * 4096 + 10
new_tokens = [f"<custom_token_{i}>" for i in range(0, number_add_tokens + 1)]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))


ds1 = load_dataset(dsn1, split="train")
ds2 = load_dataset(dsn2, split="train")


batch_total = batch_size * number_processes
train_dataset = BatchedRatioDataset(ds1, ds2, batch_total, ratio=config_ratio)


training_args = TrainingArguments(
    overwrite_output_dir=True,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    logging_steps=1,
    bf16=True,
    output_dir=f"./{base_repo_id}",
    fsdp="auto_wrap",
    report_to="wandb",
    save_steps=save_steps,
    remove_unused_columns=True,
    learning_rate=learning_rate,
    lr_scheduler_type="cosine", 
)


trainer = FSDPTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
    log_ratio=config_ratio
)

trainer.train()

1 reply

kadirnar Apr 28, 2025
Author

#93

kadirnar · 2025-04-29T00:09:55Z

kadirnar
Apr 29, 2025
Author

@amuvarma13 I have prepared a 500GB Japanese dataset. I performed data preprocessing using the 3b-pretrain model. I want to train Qwen3. Should I perform tokenization (data preprocessing) again for this? The training of the model has started but I'm not sure if it will give accurate results. What do you suggest?

Qwen3-0.6B:

16 replies

kadirnar May 7, 2025
Author

I didn't get a cuda memory error in the 0.6B model. However, the loss value isn't good.

amuvarma13 May 7, 2025
Maintainer

What lengths are your input_ids sequences?

kadirnar May 8, 2025
Author

When I start the training, this appears in the log section:

accessing length 2067328

I printed the input_ids value. It keeps generating different data.

accessing length 2067328
accessing length 2067328
input_ids.shape: torch.Size([1, 283])
input_ids.shape: torch.Size([1, 545])
input_ids.shape: torch.Size([1, 652])
input_ids.shape: torch.Size([1, 1293])input_ids.shape: torch.Size([1, 227])

input_ids.shape: torch.Size([1, 455])
input_ids.shape: torch.Size([1, 880])
input_ids.shape: torch.Size([1, 249])

Should I create the input_ids graph for all the data?

kadirnar May 10, 2025
Author

@amuvarma13

input_ids.shape values when I get a cuda memory error:

Am I getting a cuda memory error because it's processing 14079 pieces of valuable data?

amuvarma13 May 10, 2025
Maintainer

yeah 14079 is the issue its too long for the vram you have access too - id either remove ids over a certain length (for flash attention 2 bf16 llama 3b and 80gb vram, fsdp over 8 gpus thats ids over 8192) there will be some number depending on your set up. Or you can crop ids,labels,attn mask etc to a certain length in a data collator but yes the issue is the length of that specific input_ids row

athenasaurav · 2025-05-05T09:17:44Z

athenasaurav
May 5, 2025

Hello @kadirnar @amuvarma13

I am trying to do an extended training canopylabs/orpheus-3b-0.1-pretrained in a new language and then further finetune it in that language using multi speakers styles. Now Just to make sure if i m doing it write i have created something dummy and entire code for text_QA_dataset (TTS_dataset i already understand as i have done Finetuning couple of times)

Here is the link to a dummy QA dataset in English. Its not my actual dataset which is in different language

Then i have used following code to tokenize is as mentioned in Issue#37 :

import os
from datasets import load_dataset
from transformers import AutoTokenizer
from huggingface_hub import login

print("Starting text QA dataset tokenization...")

# HuggingFace login (token hidden for security)
hf_token = "hf_xxx"  # <-- Replace with your token securely
login(token=hf_token)
print("Successfully logged in to HuggingFace!")

dataset_name = "athenasaurav/dummy_qa_dataset"
name_to_push_dataset_to = "athenasaurav/dummy_qa_dataset_tokenized"
tokenizer_name = "canopylabs/orpheus-3b-0.1-pretrained"
max_length = 8192

# Special token IDs (from Orpheus codebase)
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_human = tokeniser_length + 3
end_of_human = tokeniser_length + 4
start_of_ai = tokeniser_length + 5
end_of_ai = tokeniser_length + 6
pad_token = tokeniser_length + 7

print("\nLoading dataset and tokenizer...")
ds = load_dataset(dataset_name, split="train")
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
print(f"Loaded dataset with {len(ds)} examples")

def qa_to_input_ids(example):
    question = example["Question"].strip()
    answer = example["Answer"].strip()
    question_ids = tokenizer.encode(question, add_special_tokens=False)
    answer_ids = tokenizer.encode(answer, add_special_tokens=False)
    input_ids = (
        [start_of_human]
        + [start_of_text]
        + question_ids
        + [end_of_text]
        + [end_of_human]
        + [start_of_ai]
        + [start_of_text]
        + answer_ids
        + [end_of_text]
        + [end_of_ai]
    )
    if len(input_ids) < max_length:
        input_ids += [pad_token] * (max_length - len(input_ids))
    else:
        input_ids = input_ids[:max_length]
    example["input_ids"] = input_ids
    example["labels"] = input_ids
    example["attention_mask"] = [1 if i < len(input_ids) - (max_length - len(input_ids)) else 0 for i in range(max_length)]
    return example

print("\nTokenizing dataset...")
ds = ds.map(qa_to_input_ids, remove_columns=["Question", "Answer"])
print("Tokenization complete!")

print("\nCleaning up dataset columns...")
columns_to_keep = ["input_ids", "labels", "attention_mask"]
columns_to_remove = [col for col in ds.column_names if col not in columns_to_keep]
ds = ds.remove_columns(columns_to_remove)
print(f"Final columns: {ds.column_names}")

print(f"\nPushing dataset to hub at {name_to_push_dataset_to}...")
ds.push_to_hub(name_to_push_dataset_to)
print("Dataset successfully pushed to hub!")

And here is the final Tokenized QA dataset required for Extended training of canopylabs/orpheus-3b-0.1-pretrained

Please let me know if this is correct? This will hopefully help everyone who is trying to find a complete tutorial for Pretraining.

2 replies

kadirnar May 7, 2025
Author

That's correct. You can train with this dataset. However, it's a very small dataset. You should use a larger dataset.

athenasaurav May 12, 2025

Thanks @kadirnar

ahmedbadawy11 · 2025-05-05T14:44:58Z

ahmedbadawy11
May 5, 2025

Hi @kadirnar

I saw that you were facing an issue with inference using the pretrained Orpheus TTS on Arabic. I'm encountering a similar problem the generated audio files are empty. Could you please share your inference code?

5 replies

kadirnar May 5, 2025
Author

Hi @kadirnar

I saw that you were facing an issue with inference using the pretrained Orpheus TTS on Arabic. I'm encountering a similar problem the generated audio files are empty. Could you please share your inference code?

which tokenizer are you using?

ahmedbadawy11 May 6, 2025

canopylabs/orpheus-3b-0.1-pretrained

kadirnar May 6, 2025
Author

I don't know but it must be related to the tokenizer.

ahmedbadawy11 May 7, 2025

Thank you kadirnar

athenasaurav May 12, 2025

So i was also trying Arabic without pretraining and it seems that finetuning does works in these languages as @amuvarma13 mentioned that pretrained model are trained on English Dataset.

@ahmedbadawy11 did you pretrained in Arabic and if yes what was your dataset size?

kadirnar · 2025-06-28T17:33:54Z

kadirnar
Jun 28, 2025
Author

@amuvarma13

The config file of the Orpheus model has the pad_token value set to 128263. This is 7 more than the normal value. Default value: 128256 Is this an error? Or is it adding 7 to the original value because the training code does 7*4096 + 10?

Should I make this value 151936 + 7 while training Qwen? Or should this value remain constant?

0 replies

About Pretraining? #149

Uh oh!

Uh oh!

kadirnar Apr 13, 2025

Replies: 11 comments · 34 replies

Uh oh!

kadirnar Apr 13, 2025 Author

Uh oh!

amuvarma13 Apr 13, 2025 Maintainer

Uh oh!

kadirnar Apr 13, 2025 Author

Uh oh!

kadirnar Apr 13, 2025 Author

Uh oh!

kadirnar Apr 13, 2025 Author

Uh oh!

amuvarma13 Apr 13, 2025 Maintainer

Uh oh!

kadirnar Apr 13, 2025 Author

Uh oh!

amuvarma13 Apr 13, 2025 Maintainer

Uh oh!

kadirnar Apr 14, 2025 Author

Uh oh!

kadirnar Apr 14, 2025 Author

Uh oh!

kadirnar Apr 14, 2025 Author

Uh oh!

amuvarma13 Apr 15, 2025 Maintainer

Uh oh!

kadirnar Apr 15, 2025 Author

Uh oh!

Uh oh!

kadirnar Apr 15, 2025 Author

Uh oh!

Uh oh!

kadirnar Apr 15, 2025 Author

Uh oh!

amuvarma13 Apr 15, 2025 Maintainer

Uh oh!

kadirnar Apr 28, 2025 Author

Uh oh!

kadirnar Apr 28, 2025 Author

Uh oh!

Uh oh!

kadirnar Apr 29, 2025 Author

Uh oh!

kadirnar May 7, 2025 Author

Uh oh!

amuvarma13 May 7, 2025 Maintainer

Uh oh!

Uh oh!

kadirnar May 8, 2025 Author

Uh oh!

Uh oh!

kadirnar May 10, 2025 Author

Uh oh!

amuvarma13 May 10, 2025 Maintainer

Uh oh!

athenasaurav May 5, 2025

Uh oh!

kadirnar May 7, 2025 Author

Uh oh!

athenasaurav May 12, 2025

Uh oh!

ahmedbadawy11 May 5, 2025

Uh oh!

kadirnar
Apr 13, 2025

Replies: 11 comments 34 replies

kadirnar
Apr 13, 2025
Author

amuvarma13
Apr 13, 2025
Maintainer

kadirnar Apr 13, 2025
Author

kadirnar
Apr 13, 2025
Author

kadirnar Apr 13, 2025
Author

amuvarma13 Apr 13, 2025
Maintainer

kadirnar Apr 13, 2025
Author

amuvarma13 Apr 13, 2025
Maintainer

kadirnar Apr 14, 2025
Author

kadirnar
Apr 14, 2025
Author

kadirnar Apr 14, 2025
Author

amuvarma13 Apr 15, 2025
Maintainer

kadirnar
Apr 15, 2025
Author

kadirnar Apr 15, 2025
Author

kadirnar
Apr 15, 2025
Author

amuvarma13 Apr 15, 2025
Maintainer

kadirnar
Apr 28, 2025
Author

kadirnar Apr 28, 2025
Author

kadirnar
Apr 29, 2025
Author

kadirnar May 7, 2025
Author

amuvarma13 May 7, 2025
Maintainer

kadirnar May 8, 2025
Author

kadirnar May 10, 2025
Author

amuvarma13 May 10, 2025
Maintainer

athenasaurav
May 5, 2025

kadirnar May 7, 2025
Author

ahmedbadawy11
May 5, 2025