Hello, I'm wondering how and where the sequence gets sharded along the sequence dimension for CP. Some implementations in Megatron have get_batch_on_this_cp_rank(), where it is clear that sharding happens at the data level, before the distributed attention computation.
Going through pretrain_qwen.py, it does not seem that sharding happens along the sequence dimension anywhere at the data level. The only place where sharding occurs is in get_pos_emb_on_this_cp_rank in Megatron-LM/megatron/core/models/common/embeddings/rotary_pos_embedding.py, and that pertains only to the positional embeddings, which actually have length seq_len*CP_world_size.
Does it happen in the attention backend?
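For reference, the data-level sharding I mean is roughly the following (a plain-Python sketch, not Megatron's actual code; the real get_batch_on_this_cp_rank operates on torch tensors, and the 2*CP_world_size-chunk zigzag layout for balancing causal-attention work is my reading of it — the helper name shard_on_this_cp_rank is illustrative):

```python
def shard_on_this_cp_rank(seq, cp_size, cp_rank):
    """Zigzag-shard a sequence across CP ranks.

    The sequence is split into 2*cp_size equal chunks; rank r keeps
    chunks r and (2*cp_size - 1 - r), so each rank gets one "early"
    and one "late" chunk and causal-attention work stays balanced.
    """
    chunk_len = len(seq) // (2 * cp_size)
    chunks = [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(2 * cp_size)]
    return chunks[cp_rank] + chunks[2 * cp_size - 1 - cp_rank]

# seq_len=8, cp_size=2: each rank holds seq_len/cp_size = 4 tokens.
tokens = list(range(8))
print(shard_on_this_cp_rank(tokens, cp_size=2, cp_rank=0))  # chunks 0 and 3 -> [0, 1, 6, 7]
print(shard_on_this_cp_rank(tokens, cp_size=2, cp_rank=1))  # chunks 1 and 2 -> [2, 3, 4, 5]
```

My question is where (if anywhere) this step happens in this codebase.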
It is important for developers who are trying to adapt the implementation to understand this.
Thanks!