Replies: 11 comments 34 replies
-
I get this error when testing the Llama-1b model:

Error Message:

[rank6]:     return wrapper_cls(module, **kwargs)
[rank6]:   File "/mnt/kadirnar/trainer/Orpheus-TTS/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 509, in __init__
[rank6]:     _init_param_handle_from_module(
[rank6]:   File "/mnt/kadirnar/trainer/Orpheus-TTS/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 618, in _init_param_handle_from_module
[rank6]:     state.compute_device = _get_compute_device(
[rank6]:   File "/mnt/kadirnar/trainer/Orpheus-TTS/.venv/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 1082, in _get_compute_device
[rank6]:     raise ValueError(
[rank6]: ValueError: Inconsistent compute device and `device_id` on rank 6: cuda:0 vs cuda:6

Resolving data files: 100%|██████████| 45/45 [00:00<00:00, 99.10it/s]
Resolving data files: 100%|██████████| 45/45 [00:00<00:00, 198677.56it/s]
Loading dataset shards: 100%|██████████| 45/45 [00:00<00:00, 3422.56it/s]

config.yml:
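(Editor's note: this FSDP error means the module's parameters sit on cuda:0 while FSDP was told to use cuda:6 on rank 6. A minimal sketch of the usual remedy, assuming a torchrun/accelerate launch that sets LOCAL_RANK per process; pin each rank to its own GPU before any .to(...) call and before FSDP wrapping. This is an illustration, not the verified fix for this script.)

# Hypothetical sketch: pin every process to its own GPU before FSDP wraps the model.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)            # rank 6 now defaults to cuda:6
device = torch.device(f"cuda:{local_rank}")

# Move the model on every rank (not only the local main process), so the
# compute device FSDP infers from the parameters matches its device_id.
# model = model.to(device)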
-
What are your sequence lengths, and are you using FSDP?
-
@amuvarma13 I'm starting model training, but the model doesn't load onto any GPU. With a single GPU it worked. Could there be an issue with the multi-GPU code?
-
@amuvarma13 I'm using the Emilia-De dataset, and after 30k steps the results are still very bad. Am I making a mistake in the inference code?

Pretrain code: https://colab.research.google.com/drive/10v9MIEbZOr_3V8ZcPAIh8MN7q2LjcstS?usp=sharing

Output:
-
The model training is complete, but it didn't save the tokenizer file, so I'm getting the error below. Where should I download the tokenizer file from, or how can I create it? I trained the Llama-3.2-1B model, so downloading it from the Orpheus model page might be incorrect.

OSError: Can't load tokenizer for 'Orpheus-Pretrain-De/checkpoint-250000/checkpoint-250000'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'Orpheus-Pretrain-De/checkpoint-250000/checkpoint-250000' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
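(Editor's note: since the script's save_model only writes model weights, one way to recover is to rebuild the tokenizer the checkpoint was trained with and save it into the checkpoint directory. A sketch, assuming the base model and the 7 * 4096 + 10 custom tokens from the train.py posted later in this thread; the base-model ID is an assumption.)

# Hypothetical recovery sketch: recreate the training tokenizer and save it.
from transformers import AutoTokenizer

base = "meta-llama/Llama-3.2-1B"  # assumed base model
ckpt = "Orpheus-Pretrain-De/checkpoint-250000/checkpoint-250000"

tokenizer = AutoTokenizer.from_pretrained(base)
# Re-add the custom tokens exactly as in training (see train.py below).
number_add_tokens = 7 * 4096 + 10
tokenizer.add_tokens([f"<custom_token_{i}>" for i in range(number_add_tokens + 1)])
tokenizer.save_pretrained(ckpt)  # the checkpoint now loads with AutoTokenizer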
-
@amuvarma13 Output: text: Man könnte sagen, ich sei für diese Aufgabe prädestiniert. (German: "One could say I'm predestined for this task.")
-
My train.py:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments, AutoTokenizer
import numpy as np
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP, FullStateDictConfig, StateDictType)
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
import yaml
import wandb
from huggingface_hub import HfApi

config_file = "config.yaml"
with open(config_file, "r") as file:
    config = yaml.safe_load(file)

dsn1 = config["text_QA_dataset"]
dsn2 = config["TTS_dataset"]
model_name = config["model_name"]
tokenizer_name = config["tokenizer_name"]
run_name = config["run_name"]
project_name = config["project_name"]
base_repo_id = config["save_folder"]
epochs = config["epochs"]
batch_size = config["batch_size"]
save_steps = config["save_steps"]
pad_token = config["pad_token"]
number_processes = config["number_processes"]
learning_rate = config["learning_rate"]
config_ratio = config["ratio"]


class BatchedRatioDataset(Dataset):
    """Interleaves dataset1 and dataset2 so that each cycle yields
    `ratio` global batches from dataset1 followed by one from dataset2."""

    def __init__(self, dataset1, dataset2, batch_total, ratio=config_ratio):
        self.dataset1 = dataset1
        self.dataset2 = dataset2
        self.batch_total = batch_total
        self.ratio = ratio
        num_cycles_ds1 = len(dataset1) // (batch_total * ratio)
        num_cycles_ds2 = len(dataset2) // batch_total
        self.num_cycles = min(num_cycles_ds1, num_cycles_ds2)
        self.length = self.num_cycles * (ratio + 1) * batch_total

    def __len__(self):
        print("accessing length", self.length)
        return int(self.length)

    def __getitem__(self, index):
        # Compute the cycle length in terms of samples.
        cycle_length = (self.ratio + 1) * self.batch_total
        cycle = index // cycle_length
        pos_in_cycle = index % cycle_length
        if pos_in_cycle < self.ratio * self.batch_total:
            # We are in one of the dataset1 batches for this cycle.
            batch_in_cycle = pos_in_cycle // self.batch_total
            sample_in_batch = pos_in_cycle % self.batch_total
            ds1_index = cycle * self.ratio * self.batch_total + batch_in_cycle * self.batch_total + sample_in_batch
            return self.dataset1[ds1_index]
        else:
            # We are in the dataset2 batch for this cycle.
            sample_in_batch = pos_in_cycle - self.ratio * self.batch_total
            ds2_index = cycle * self.batch_total + sample_in_batch
            return self.dataset2[ds2_index]


class AlternatingDistributedSampler(DistributedSampler):
    def __init__(self, dataset, num_replicas=None, rank=None, shuffle=False):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)
        self.shuffle = shuffle

    def __iter__(self):
        # Deterministic strided assignment so every rank draws its samples
        # from the same global batch at each step.
        indices = list(range(len(self.dataset)))
        indices = indices[self.rank:self.total_size:self.num_replicas]
        return iter(indices)


class FSDPTrainer(Trainer):
    def __init__(self, *args, log_ratio=config_ratio, **kwargs):
        super().__init__(*args, **kwargs)
        self.repo_id = base_repo_id
        self.api = HfApi()
        self.log_ratio = log_ratio
        self.text_step = 0
        self.audio_step = 0

    def get_train_dataloader(self):
        sampler = AlternatingDistributedSampler(
            self.train_dataset,
            num_replicas=torch.distributed.get_world_size(),
            rank=torch.distributed.get_rank(),
            shuffle=False,
        )
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            sampler=sampler,
            collate_fn=self.data_collator,
            drop_last=self.args.dataloader_drop_last,
            num_workers=0,
            pin_memory=self.args.dataloader_pin_memory,
        )

    def log(self, logs, start_time=None):
        super().log(logs, start_time)
        if self.is_world_process_zero():
            global_step = self.state.global_step
            # Each cycle is (log_ratio + 1) steps: first log_ratio steps for text_loss, then one for audio_loss.
            cycle_length = self.log_ratio + 1
            # Only log to wandb if 'loss' is in the logs dictionary.
            if "loss" in logs:
                # This condition is true exactly when global_step % cycle_length == 0,
                # i.e. on the audio step of the cycle.
                if (global_step % cycle_length) + self.log_ratio - 1 < self.log_ratio:
                    wandb.log({"audio_loss": logs["loss"], "audio_step": self.audio_step})
                    self.audio_step += 1
                else:
                    wandb.log({"text_loss": logs["loss"], "text_step": self.text_step})
                    self.text_step += 1

    def save_model(self, output_dir=None, _internal_call=False):
        if output_dir is None:
            output_dir = self.args.output_dir
        self.save_and_push_model(output_dir)

    def save_and_push_model(self, output_dir):
        # Gather the full (unsharded) state dict on rank 0 before saving.
        save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
        with FSDP.state_dict_type(self.model, StateDictType.FULL_STATE_DICT, save_policy):
            cpu_state_dict = self.model.state_dict()
            self.model.save_pretrained(output_dir, state_dict=cpu_state_dict)


def data_collator(features):
    # max_length = 2656  # set a crop based on vram - ideally you have stacked all sequences to the same length
    # from 3b on 8 h100s fsdp, at bf16, 8192 works well.
    input_ids = [f["input_ids"] for f in features]
    if any("attention_mask" not in f for f in features):
        attention_mask = [[1] * len(ids) for ids in input_ids]
    else:
        attention_mask = [f["attention_mask"] for f in features]
    if any("labels" not in f for f in features):
        labels = input_ids
    else:
        labels = [f["labels"] for f in features]
    input_ids = torch.nn.utils.rnn.pad_sequence([torch.tensor(
        i, dtype=torch.long) for i in input_ids], batch_first=True, padding_value=pad_token)
    attention_mask = torch.nn.utils.rnn.pad_sequence([torch.tensor(
        m, dtype=torch.long) for m in attention_mask], batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence([torch.tensor(
        l, dtype=torch.long) for l in labels], batch_first=True, padding_value=-100)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


wandb.init(project=project_name, name=run_name)

import accelerate
# Setup accelerate (this initializes the distributed environment).
accelerator = accelerate.Accelerator()
device = accelerator.device

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# Initialize model with proper dtype for Flash Attention 2.0,
# explicitly initializing on GPU, then letting FSDP handle the rest.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Explicitly set dtype for Flash Attention
)

# Initialize model on first GPU to make Flash Attention happy.
if accelerator.is_local_main_process:
    print(f"Pre-initializing model on {device} before FSDP")
    model = model.to(device)

number_add_tokens = 7 * 4096 + 10
new_tokens = [f"<custom_token_{i}>" for i in range(0, number_add_tokens + 1)]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

ds1 = load_dataset(dsn1, split="train")
ds2 = load_dataset(dsn2, split="train")
batch_total = batch_size * number_processes
train_dataset = BatchedRatioDataset(ds1, ds2, batch_total, ratio=config_ratio)

training_args = TrainingArguments(
    overwrite_output_dir=True,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    logging_steps=1,
    bf16=True,
    output_dir=f"./{base_repo_id}",
    fsdp="auto_wrap",
    report_to="wandb",
    save_steps=save_steps,
    remove_unused_columns=True,
    learning_rate=learning_rate,
    lr_scheduler_type="cosine",
)

trainer = FSDPTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
    log_ratio=config_ratio,
)

trainer.train()
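(Editor's note: a quick sanity check of the interleaving logic in BatchedRatioDataset above, using toy lists in place of the HF datasets; batch_total and ratio are chosen purely for illustration, and ratio is passed explicitly so the config default is not needed.)

# With batch_total=2 and ratio=2, each cycle should yield two ds1 batches
# followed by one ds2 batch.
ds1 = [f"text_{i}" for i in range(8)]
ds2 = [f"audio_{i}" for i in range(4)]
ds = BatchedRatioDataset(ds1, ds2, batch_total=2, ratio=2)
print([ds[i] for i in range(len(ds))])
# ['text_0', 'text_1', 'text_2', 'text_3', 'audio_0', 'audio_1',
#  'text_4', 'text_5', 'text_6', 'text_7', 'audio_2', 'audio_3']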
-
@amuvarma13 I have prepared a 500 GB Japanese dataset and performed the data preprocessing using the 3b-pretrain model. Now I want to train Qwen3. Should I redo the tokenization (data preprocessing) for it? The training has started, but I'm not sure it will give accurate results. What do you suggest?
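(Editor's note on the re-tokenization question: if the preprocessing baked token IDs into the dataset using the Llama-based 3b-pretrain tokenizer, those IDs will not line up with Qwen3's vocabulary, so re-tokenizing is the safe route. A sketch of a quick check; the two model IDs are public HF repos and only assumptions about this setup.)

# The same string maps to different IDs under different tokenizers, so a
# dataset pre-tokenized for one base model is not reusable for another.
from transformers import AutoTokenizer

llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")  # assumed 3b base
qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")           # assumed Qwen3 target

text = "こんにちは、世界"
print(llama_tok.encode(text))  # different IDs...
print(qwen_tok.encode(text))   # ...and likely a different number of tokens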
-
Hello @kadirnar @amuvarma13, I am trying to do an extended training. Here is the link to a dummy QA dataset in English; it's not my actual dataset, which is in a different language. I then used the following code to tokenize it, as mentioned in Issue #37: And here is the final tokenized QA dataset required for the extended training. Please let me know if this is correct? This will hopefully help everyone who is trying to find a complete tutorial for pretraining.
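(Editor's note: the dataset links and tokenization code in the comment above did not survive extraction. For orientation only, here is a generic sketch of what tokenizing a QA dataset for this pipeline tends to look like; the repo names, column names, and prompt format are pure assumptions, not the Issue #37 recipe.)

# Hypothetical QA tokenization sketch; adapt to the actual recipe from Issue #37.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # assumed base
ds = load_dataset("my_user/my_qa_dataset", split="train")             # hypothetical repo

def tokenize(example):
    # Assumed "question"/"answer" columns; EOS marks the end of each pair.
    text = example["question"] + "\n" + example["answer"] + tokenizer.eos_token
    ids = tokenizer(text)["input_ids"]
    return {"input_ids": ids, "labels": ids}

ds = ds.map(tokenize, remove_columns=ds.column_names)
ds.push_to_hub("my_user/my_qa_tokenized")  # hypothetical target repo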
-
Hi @kadirnar, I saw that you were facing an issue with inference using the pretrained Orpheus TTS on Arabic. I'm encountering a similar problem: the generated audio files are empty. Could you please share your inference code?
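(Editor's note: when the generated wavs come out empty, one useful first check is whether the model is emitting audio tokens at all before blaming the vocoder. A rough diagnostic sketch; the checkpoint path is a placeholder, and the custom-token ID range mirrors the train.py above, which is an assumption about the checkpoint.)

# Diagnostic sketch: count custom audio tokens in a generation before decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/your/checkpoint"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16).cuda()

inputs = tokenizer("Test sentence.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

# Custom tokens were appended at the end of the vocab during training,
# so IDs at or above <custom_token_0> should be audio tokens.
first_custom_id = tokenizer.convert_tokens_to_ids("<custom_token_0>")
n_audio = (out[0] >= first_custom_id).sum().item()
print(f"{n_audio} audio tokens generated")  # 0 here usually explains empty wavs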
-
@amuvarma13
8xH100-80G-PCIe-NVLink

default_config.yml:

{
  "compute_environment": "LOCAL_MACHINE",
  "debug": false,
  "distributed_type": "MULTI_GPU",
  "downcast_bf16": false,
  "enable_cpu_affinity": false,
  "machine_rank": 0,
  "main_training_function": "main",
  "mixed_precision": "no",
  "num_machines": 1,
  "num_processes": 8,
  "rdzv_backend": "static",
  "same_network": false,
  "tpu_use_cluster": false,
  "tpu_use_sudo": false,
  "use_cpu": false
}

Error Message:
config.yml:
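(Editor's note: one mismatch stands out here: the launcher config uses distributed_type: MULTI_GPU and mixed_precision: no, while the training script asks for FSDP and bf16. A sketch of an accelerate config that matches the script; exact field names vary across accelerate versions, so treat this as a starting point rather than the verified fix.)

# default_config.yaml (sketch, assuming 8 local GPUs and a recent accelerate)
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
mixed_precision: bf16
num_machines: 1
num_processes: 8
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: FULL_STATE_DICT
use_cpu: false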