[REVIEW] Add all-to-all, scatter, and scatterv functions to raft::comms_t#3002

Open
seunghwak wants to merge 15 commits into rapidsai:main from seunghwak:fea_scatter

@seunghwak commented Apr 14, 2026

ncclAlltoAll, ncclScatter, and ncclGather were added in NCCL 2.28.3.

Previously, all-to-all was performed using multi-cast and gather was emulated using send/receive. raft::comms_t also lacked scatter and scatterv.
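For readers unfamiliar with the data layout, here is a host-only sketch of what an all-to-all exchange computes. It assumes the usual MPI_Alltoall-style layout (each rank sends `count` elements to every rank, packed in rank order). `simulate_alltoall` is a hypothetical helper for illustration, not a raft or NCCL function, and it runs on plain host vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of all-to-all (hypothetical helper, not a raft or NCCL call).
// send[r] is rank r's send buffer: `count` elements for each destination rank.
// The returned recv[r] holds `count` elements from each source rank.
std::vector<std::vector<int>> simulate_alltoall(const std::vector<std::vector<int>>& send,
                                                std::size_t count)
{
  const std::size_t n_ranks = send.size();
  std::vector<std::vector<int>> recv(n_ranks, std::vector<int>(n_ranks * count));
  for (std::size_t src = 0; src < n_ranks; ++src) {
    assert(send[src].size() == n_ranks * count);
    for (std::size_t dst = 0; dst < n_ranks; ++dst) {
      // Chunk `dst` of rank `src` lands in chunk `src` of rank `dst`.
      for (std::size_t i = 0; i < count; ++i) {
        recv[dst][src * count + i] = send[src][dst * count + i];
      }
    }
  }
  return recv;
}
```

With `count == 1` this is a transpose of the rank-by-rank matrix of values, which makes the result easy to check in a test.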

This PR adds alltoall (ncclAlltoAll wrapper), scatter (ncclScatter wrapper), and scatterv (implemented using send/receive).
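As a rough illustration of how a scatterv can be composed from point-to-point transfers, the host-only sketch below models the root sending a variable-sized chunk to each rank. It is not the code in this PR: `simulate_scatterv` is a hypothetical helper, and it assumes the chunks are packed back to back in rank order in the root's buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only model of scatterv built from point-to-point transfers
// (hypothetical helper; it does not reproduce the code in this PR).
// `root_send` is the root's buffer and `counts[r]` is the number of elements for rank r.
// Assumption: chunks are packed back to back in rank order.
std::vector<std::vector<int>> simulate_scatterv(const std::vector<int>& root_send,
                                                const std::vector<std::size_t>& counts)
{
  // Exclusive prefix sum of counts gives each rank's displacement in root_send.
  std::vector<std::size_t> displs(counts.size());
  std::size_t offset = 0;
  for (std::size_t r = 0; r < counts.size(); ++r) {
    displs[r] = offset;
    offset += counts[r];
  }
  assert(offset == root_send.size());

  std::vector<std::vector<int>> recv(counts.size());
  for (std::size_t r = 0; r < counts.size(); ++r) {
    // Models one "send" from the root paired with one "receive" on rank r.
    recv[r].assign(root_send.begin() + displs[r], root_send.begin() + displs[r] + counts[r]);
  }
  return recv;
}
```

A rank with a count of zero simply receives an empty chunk, which is the case a fixed-size scatter cannot express.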

This PR also updates gather to directly call ncclGather (instead of emulating gather using send/receive).

@seunghwak added the enhancement, 3 - Ready for Review, cpp, python, non-breaking, and improvement labels Apr 14, 2026
@seunghwak changed the title from "[WIP] Add all-to-all, scatter, and scatterv functions to raft::comms_t" to "[REVIEW] Add all-to-all, scatter, and scatterv functions to raft::comms_t" Apr 14, 2026
@aamijar moved this to In Progress in Unstructured Data Processing Apr 21, 2026
@tarang-jain left a comment

Thanks! These seem to be much needed. A few comments from my end.

Comment thread cpp/include/raft/comms/detail/test.hpp Outdated
stream));

communicator.alltoall(temp_d.data(), recv_d.data(), 1, stream);
communicator.sync_stream(stream);
Contributor

Why do we need to sync the communicator here? The operations after this are ordered on the same stream, right?

Contributor Author

Yes, many tests in this file have unnecessary stream synchronizations. I had just followed the convention, but I have now deleted all of them in this file (except the necessary ones after copying data from device to host).
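The ordering argument can be illustrated with a toy, host-only model of an in-order stream. This is not CUDA or raft code: `toy_stream` and `run_without_intermediate_sync` are hypothetical names, and a real stream runs work asynchronously rather than deferring it until synchronization. The only property modeled here is that operations submitted to one stream execute in submission order, so a single synchronization before the host reads the result is enough.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Toy model of an in-order stream (hypothetical; not a CUDA or raft type).
// enqueue() only records work; synchronize() runs it in submission order.
class toy_stream {
 public:
  void enqueue(std::function<void()> op) { pending_.push(std::move(op)); }

  void synchronize()
  {
    while (!pending_.empty()) {
      pending_.front()();
      pending_.pop();
    }
  }

 private:
  std::queue<std::function<void()>> pending_;
};

// Submits "collective, then copy to host" with no synchronization in between
// and synchronizes once before the host reads the result.
int run_without_intermediate_sync()
{
  toy_stream stream;
  int device_value = 0;  // stands in for device memory
  int host_value   = -1; // stands in for the host-side result

  stream.enqueue([&] { device_value = 42; });          // stands in for the collective
  stream.enqueue([&] { host_value = device_value; });  // stands in for the device-to-host copy

  // In this model nothing has run yet, so the host must not read host_value here.
  assert(host_value == -1);

  stream.synchronize();  // the one necessary sync: the host is about to read host_value
  return host_value;
}
```

Because the copy is queued behind the collective on the same stream, it observes the collective's result without any synchronization between the two.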

Comment thread cpp/include/raft/comms/detail/test.hpp Outdated
rmm::device_uvector<int> temp_d(communicator.get_size(), stream);
rmm::device_uvector<int> recv_d(communicator.get_size(), stream);

RAFT_CUDA_TRY(cudaMemcpyAsync(temp_d.data(),
Contributor

Please use raft::copy instead of direct calls to the CUDA function

Contributor Author

Thanks. I used cudaMemcpyAsync to be consistent with the rest of the file, but it would be better to replace cudaMemcpyAsync with raft calls throughout the file.

I replaced cudaMemcpyAsync with raft copy calls (raft::update_host and raft::update_device).

Comment thread cpp/include/raft/comms/detail/test.hpp Outdated
if (communicator.get_rank() == root) {
std::vector<int> sends(communicator.get_size(), communicator.get_rank());
std::fill(sends.begin(), sends.end(), root);
RAFT_CUDA_TRY(cudaMemcpyAsync(
Contributor

Again, use raft::copy here.

Contributor Author

Done.

Comment thread cpp/include/raft/comms/detail/test.hpp Outdated

int temp_h = -1; // Verify more than one byte is being sent
RAFT_CUDA_TRY(
cudaMemcpyAsync(&temp_h, recv_d.data(), sizeof(int), cudaMemcpyDeviceToHost, stream));
Contributor

Here as well: use raft::copy.

Contributor Author

Done.

Comment thread cpp/include/raft/comms/detail/test.hpp Outdated
recv_d.size(),
root,
stream);
communicator.sync_stream(stream);
Contributor

Again here -- is there a specific reason why you sync the whole stream after the multi-gpu operation?

Contributor Author

No, I was just following the convention in this file. There is no need for it, so I deleted all the stream synchronizations after communication functions.

Labels

3 - Ready for Review, cpp, enhancement, improvement, non-breaking, python

Projects

Status: In Progress

4 participants