Sharding along the sequence dim in CP for qwen_2.5_vl #727

@bodsul

Description

Hello, I am wondering how and where the input gets sharded along the sequence dimension for context parallelism (CP). Some implementations in Megatron have get_batch_on_this_cp_rank(), where it is clear that sharding happens at the data level, before the distributed attention computation.

Going through pretrain_qwen.py, it does not seem that sharding along the sequence dimension happens anywhere at the data level. The only place where sharding occurs is in get_pos_emb_on_this_cp_rank in Megatron-LM/megatron/core/models/common/embeddings/rotary_pos_embedding.py, and that pertains to the positional embeddings, which actually have length seq_len * CP_world_size.
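For context, the data-level sharding referenced above can be sketched as follows. This is a hedged illustration of the scheme used by Megatron's get_batch_on_this_cp_rank(): each sequence is split into 2 * cp_size chunks, and rank r keeps chunks r and (2 * cp_size - 1 - r) so that causal-attention work is load-balanced across ranks. The function name and seq_dim default here are my own, not the qwen_2.5_vl code path:

```python
import torch

def shard_along_sequence_sketch(tokens, cp_size, cp_rank, seq_dim=1):
    """Sketch of Megatron-style CP sharding along the sequence dimension.

    The sequence is viewed as 2 * cp_size equal chunks; rank r keeps
    chunk r and chunk (2 * cp_size - 1 - r). Pairing a "cheap" early
    chunk with an "expensive" late chunk balances causal-attention cost.
    """
    num_chunks = 2 * cp_size
    seq_len = tokens.shape[seq_dim]
    assert seq_len % num_chunks == 0, "seq_len must divide evenly into 2*cp_size chunks"

    # Reshape the sequence axis into (num_chunks, chunk_len).
    view = tokens.view(
        *tokens.shape[:seq_dim],
        num_chunks,
        seq_len // num_chunks,
        *tokens.shape[seq_dim + 1:],
    )
    # Select this rank's two chunks and flatten them back into one axis.
    index = torch.tensor([cp_rank, num_chunks - 1 - cp_rank], device=tokens.device)
    selected = view.index_select(seq_dim, index)
    return selected.reshape(
        *tokens.shape[:seq_dim], -1, *tokens.shape[seq_dim + 1:]
    )

# Example: batch of 2 sequences of length 8, cp_size=2 -> each rank holds length 4.
tokens = torch.arange(16).view(2, 8)
local = shard_along_sequence_sketch(tokens, cp_size=2, cp_rank=0)
# Rank 0 keeps chunks 0 and 3 of each sequence, e.g. [0, 1, 6, 7] for the first row.
```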

Does it happen in the attention backend?

Understanding this is important for developers who are trying to adapt the implementation.

Thanks!
