get_rank returns 0 regardless of the rank when running aiaccel-train in an ABCI environment. #556

@KanaiYuma-aist

Describe the bug
When aiaccel-train is run inside an ABCI job, get_rank returns 0 (the default) regardless of the process's actual rank.

If get_rank is changed to also check LOCAL_RANK, as shown below, it returns the correct rank.

def get_rank(default: int = 0) -> int:
    for key in [
        "LOCAL_RANK",  # PyTorch Lightning
        "RANK",  # torchrun / deepspeed / pytorch launcher
        "OMPI_COMM_WORLD_RANK",  # OpenMPI
        "PMI_RANK",  # MPICH / Intel MPI
        "MV2_COMM_WORLD_RANK",  # MVAPICH2
        "SLURM_PROCID",  # Slurm
    ]:
        ...  # rest of get_rank is unchanged (omitted in this report)

Expected behavior
get_rank should return the correct rank for each process inside an ABCI job as well.
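
To check this per process, a small diagnostic like the one below could be run from each rank inside the job to see which rank-related variables the launcher actually sets. The helper name `rank_env_snapshot` and the `RANK_KEYS` constant are hypothetical and not part of aiaccel; the variable names are the ones listed in the snippet above.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Candidate variables, in the same order as the proposed get_rank lookup.
RANK_KEYS = (
    "LOCAL_RANK",
    "RANK",
    "OMPI_COMM_WORLD_RANK",
    "PMI_RANK",
    "MV2_COMM_WORLD_RANK",
    "SLURM_PROCID",
)


def rank_env_snapshot(environ: Mapping[str, str] | None = None) -> dict[str, str]:
    """Return the rank-related variables that are set, in lookup order."""
    env = os.environ if environ is None else environ
    return {key: env[key] for key in RANK_KEYS if key in env}


if __name__ == "__main__":
    # Print once per process; compare the output across ranks.
    print(rank_env_snapshot())
```

If only LOCAL_RANK shows up in the output, that would be consistent with the behavior described in this report.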

Additional context
Reference: PyTorch Lightning's _get_rank
https://github.com/Lightning-AI/pytorch-lightning/blob/c05cadbe5be3bfa8bacbd9d7e912fa2e456413ff/src/lightning/fabric/utilities/rank_zero.py#L35-L44
