Hello, I'm wondering how and where the sequence gets sharded along the sequence dimension for CP. Some implementations in Megatron have get_batch_on_this_cp_rank(), where it is clear that sharding happens at the data level, before the distributed attention computation.
Going through pretrain_qwen.py, it does not seem that sharding happens along the sequence dimension anywhere at the data level. The only place where sharding occurs is in get_pos_emb_on_this_cp_rank in Megatron-LM/megatron/core/models/common/embeddings/rotary_pos_embedding.py, and that pertains only to the positional embeddings, which actually have length seq_len*CP_world_size.
Does it happen in the attention backend?
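For reference, the data-level sharding I mean is roughly the following (a plain-Python sketch, not Megatron's actual code; the real get_batch_on_this_cp_rank operates on torch tensors, and the 2*CP_world_size-chunk zigzag layout for balancing causal-attention work is my reading of it — the helper name shard_on_this_cp_rank is illustrative):

```python
def shard_on_this_cp_rank(seq, cp_size, cp_rank):
    """Zigzag-shard a sequence across CP ranks.

    The sequence is split into 2*cp_size equal chunks; rank r keeps
    chunks r and (2*cp_size - 1 - r), so each rank gets one "early"
    and one "late" chunk and causal-attention work stays balanced.
    """
    chunk_len = len(seq) // (2 * cp_size)
    chunks = [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(2 * cp_size)]
    return chunks[cp_rank] + chunks[2 * cp_size - 1 - cp_rank]

# seq_len=8, cp_size=2: each rank holds seq_len/cp_size = 4 tokens.
tokens = list(range(8))
print(shard_on_this_cp_rank(tokens, cp_size=2, cp_rank=0))  # chunks 0 and 3 -> [0, 1, 6, 7]
print(shard_on_this_cp_rank(tokens, cp_size=2, cp_rank=1))  # chunks 1 and 2 -> [2, 3, 4, 5]
```

My question is where (if anywhere) this step happens in this codebase.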
It is important for developers who are trying to adapt the implementation to understand this.
Thanks!