UCT/CUDA_COPY/MD: Preserve registration extent for multi-handle VMM by nicolasnoble · Pull Request #11461 · openucx/ucx

nicolasnoble · 2026-05-14T16:45:27Z

Multi-handle CUDA VMM allocations map several cuMemCreate handles
contiguously into a single VA range via cuMemMap. On such allocations
cuMemGetAddressRange() returns only the bounds of the cuMem handle
containing the base pointer, not the full mapped extent.

uct_cuda_copy_md_sync_memops_get_address_range() unconditionally used
that potentially-shorter alloc_length to set mem_info->alloc_length
when UCX_CUDA_COPY_REG_WHOLE_ALLOC was on or auto, silently shrinking
the caller-requested registration extent. The truncated length flowed
into cuMemGetHandleForAddressRange() and ibv_reg_dmabuf_mr(),
producing an MR that only covered the first cuMem handle's physical
pages. Any subsequent RDMA operation crossing a chunk boundary failed
with IBV_WC_LOC_PROT_ERR; downstream this surfaced as
UCS_ERR_CONNECTION_RESET via the SRQ-attached QP being flushed.

Preserve the caller's requested extent when the underlying VMM
allocation is smaller. CUDA correctly exports a dmabuf spanning the
full multi-handle VA range when asked with the full size. The
whole-allocation expansion stays valid for single-handle VMM and for
multi-handle cases where the request already fits within one handle.

Runtime workaround on affected releases:
UCX_CUDA_COPY_REG_WHOLE_ALLOC=off.

Minimal UCP-only reproducer:
https://gist.github.com/nicolasnoble/e0e57eb5a1b902057ae3d1df59c039cf

Verified end-to-end on UCX 1.20.0, CUDA 12.x, ConnectX-7 +
nvidia-peermem: a 31.2 GB / 820-tensor multi-handle VMM region
transfers cross-node in 1.2 s at 212.4 Gbps with the fix, where
vanilla UCX aborts on the first WQE with Local protection error.

For multi-handle CUDA VMM allocations (multiple cuMemCreate handles mapped contiguously into a single virtual address range via cuMemMap), cuMemGetAddressRange() returns only the bounds of the cuMem handle containing the base pointer, not the full mapped extent. uct_cuda_copy_md_sync_memops_get_address_range() unconditionally used that potentially-shorter alloc_length to set mem_info->alloc_length when UCX_CUDA_COPY_REG_WHOLE_ALLOC was on or auto, silently shrinking the caller-requested registration extent. The truncated length flowed into cuMemGetHandleForAddressRange() and ibv_reg_dmabuf_mr(), producing a memory region that only covered the first cuMem handle's physical pages. Any subsequent RDMA operation that crossed a chunk boundary failed with IBV_WC_LOC_PROT_ERR; downstream this surfaced as UCS_ERR_CONNECTION_RESET via the SRQ-attached QP being flushed. Preserve the caller's requested extent when the underlying VMM allocation is smaller than the requested registration. CUDA exports a dmabuf spanning the full multi-handle VA range when asked with the full size; the whole-allocation expansion was only valid for single-handle allocations.

tvegas1 · 2026-05-18T08:16:14Z

+     * to cover only the first chunk, and any RDMA that crossed a chunk
+     * boundary would fail with IBV_WC_LOC_PROT_ERR. Preserve the caller's
+     * requested extent in that case. */
+    if (is_vmm && (alloc_length < length)) {


maybe we can fix the resolution to handle virtual memory like below (with suggested tested test):

diff --git a/src/uct/cuda/cuda_copy/cuda_copy_md.c b/src/uct/cuda/cuda_copy/cuda_copy_md.c index 8735bc067..887eac78f 100644 --- a/src/uct/cuda/cuda_copy/cuda_copy_md.c +++ b/src/uct/cuda/cuda_copy/cuda_copy_md.c @@ -600,9 +600,15 @@ static void uct_cuda_copy_md_sync_memops_get_address_range( goto out_ctx_pop; } - if (UCT_CUDADRV_FUNC_LOG_DEBUG( - cuMemGetAddressRange(&base_address, &alloc_length, address)) != - UCS_OK) { + if (UCT_CUDADRV_FUNC_LOG_DEBUG(cuPointerGetAttribute( + &base_address, CU_POINTER_ATTRIBUTE_RANGE_START_ADDR, + address)) != UCS_OK) { + goto out_ctx_pop; + } + + if (UCT_CUDADRV_FUNC_LOG_DEBUG(cuPointerGetAttribute( + &alloc_length, CU_POINTER_ATTRIBUTE_RANGE_SIZE, + address)) != UCS_OK) { goto out_ctx_pop; } diff --git a/test/gtest/uct/cuda/test_cuda_ipc_md.cc b/test/gtest/uct/cuda/test_cuda_ipc_md.cc index 4dfea9630..b043fafa0 100644 --- a/test/gtest/uct/cuda/test_cuda_ipc_md.cc +++ b/test/gtest/uct/cuda/test_cuda_ipc_md.cc @@ -555,3 +555,84 @@ UCS_TEST_F(test_cuda_ipc_cache_lru, stale_destroy_while_in_use) { EXPECT_EQ(0UL, m_cache->total_size); EXPECT_TRUE(ucs_list_is_empty(&m_cache->lru_list)); } + + +class test_cuda_copy_md : public test_md { +protected: + static void alloc_vmm(size_t num_chunks, CUdeviceptr *va_p, + std::vector<CUmemGenericAllocationHandle> *handles_p, + size_t *chunk_size_p) + { + CUmemAllocationProp prop = {}; + CUmemAccessDesc access = {}; + CUdevice cu_device; + size_t granularity; + + ASSERT_EQ(CUDA_SUCCESS, cuCtxGetDevice(&cu_device)); + + prop.type = CU_MEM_ALLOCATION_TYPE_PINNED; + prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE; + prop.location.id = cu_device; + + ASSERT_EQ(CUDA_SUCCESS, + cuMemGetAllocationGranularity( + &granularity, &prop, + CU_MEM_ALLOC_GRANULARITY_MINIMUM)); + + *chunk_size_p = granularity; + size_t total = granularity * num_chunks; + + ASSERT_EQ(CUDA_SUCCESS, + cuMemAddressReserve(va_p, total, granularity, 0, 0)); + + handles_p->resize(num_chunks); + for (size_t i = 0; i < num_chunks; i++) { + ASSERT_EQ(CUDA_SUCCESS, + cuMemCreate(&(*handles_p)[i], granularity, &prop, 0)); + ASSERT_EQ(CUDA_SUCCESS, + cuMemMap(*va_p + i * granularity, granularity, 0, + (*handles_p)[i], 0)); + } + + access.location = prop.location; + access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE; + ASSERT_EQ(CUDA_SUCCESS, cuMemSetAccess(*va_p, total, &access, 1)); + } + + static void free_vmm(CUdeviceptr va, + std::vector<CUmemGenericAllocationHandle> &handles, + size_t chunk_size) + { + size_t total = chunk_size * handles.size(); + for (size_t i = 0; i < handles.size(); i++) { + cuMemUnmap(va + i * chunk_size, chunk_size); + cuMemRelease(handles[i]); + } + cuMemAddressFree(va, total); + } +}; + +UCS_TEST_P(test_cuda_copy_md, vmm_multi_handle_range) { + const size_t num_chunks = 3; + CUdeviceptr va; + std::vector<CUmemGenericAllocationHandle> handles; + size_t chunk_size; + + alloc_vmm(num_chunks, &va, &handles, &chunk_size); + + uct_md_mem_attr_t mem_attr; + mem_attr.field_mask = UCT_MD_MEM_ATTR_FIELD_MEM_TYPE | + UCT_MD_MEM_ATTR_FIELD_BASE_ADDRESS | + UCT_MD_MEM_ATTR_FIELD_ALLOC_LENGTH; + + ASSERT_UCS_OK(uct_md_mem_query(md(), (void*)(va + chunk_size), chunk_size, + &mem_attr)); + + EXPECT_EQ(UCS_MEMORY_TYPE_CUDA, mem_attr.mem_type); + EXPECT_LE((uintptr_t)mem_attr.base_address, (uintptr_t)va); + EXPECT_GE(mem_attr.alloc_length, chunk_size * num_chunks); + + free_vmm(va, handles, chunk_size); +} + +_UCT_MD_INSTANTIATE_TEST_CASE(test_cuda_copy_md, cuda_cpy);

@tvegas1 Is having reservation size > mapped length not something that should be handled? I think the original code in master would shrink the length to the allocation size but your suggested change would use the reservation size with CU_POINTER_ATTRIBUTE_RANGE_SIZE

@nicolasnoble Probably worth to add a test for the multiple chunks regardless of the approach

good point, I first assumed we wanted to find biggest range but maybe we can keep limiting it to physical area anyways, since this reg whole alloc is for registration optimization afaiu

Test added in 322b366, taking your fixture as-is. Keeping the mapped-extent approach over the cuPointerGetAttribute switch per Tomer's reasoning - reg_whole_alloc is a registration optimization, widening into the unmapped reservation would be the wrong direction.

Correction on the previous: CI caught that 322b366 asserts behavior matching your original cuPointerGetAttribute(RANGE_SIZE) idea, not the mapped-extent approach we kept - querying one chunk at a chunk-aligned offset means cuMemGetAddressRange returns exactly that chunk, the is_vmm guard never fires, and the assertions fail.

Rewrote in 5e24655 to query the full multi-chunk range from va, so caller asks for chunk_size * num_chunks > what cuMemGetAddressRange returns, the guard fires, and alloc_length is preserved at the caller's extent. Dropped the base_address check since the guard returns early before mem_info->base_address is updated.

tvegas1 · 2026-05-18T08:17:03Z

@rakhmets / @tomerg-nvidia

Add a gtest for the multi-handle VMM case fixed by the previous commit. Allocate three cuMem handles contiguously into one VA range via cuMemAddressReserve + cuMemCreate + cuMemMap, query an offset address with the size of one chunk, and assert that uct_md_mem_query preserves the caller's requested extent (alloc_length >= chunk_size * num_chunks) rather than shrinking to cuMemGetAddressRange's per-handle bounds. Suggested by tvegas1 in the PR review thread.

The previous test queried one chunk-size at a chunk-aligned offset, which makes cuMemGetAddressRange return exactly that chunk and the new is_vmm guard never fires. Those assertions only held under the withdrawn cuPointerGetAttribute(RANGE_SIZE) approach. Query the full multi-handle range instead so the guard fires: caller asks for chunk_size * num_chunks, cuMemGetAddressRange returns one chunk, alloc_length < length, the guard preserves the caller's extent. Drop the base_address assertion since the guard returns early before mem_info->base_address is updated.

nicolasnoble mentioned this pull request May 14, 2026

feat(vmm-arena): per-allocation cuMemCreate + 16 TiB VA + 1-MR dmabuf ai-dynamo/modelexpress#282

Merged

AUTHORS: Add Nicolas 'Pixel' Noble

c55b21e

tvegas1 reviewed May 18, 2026

View reviewed changes

nicolasnoble added 2 commits May 27, 2026 11:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UCT/CUDA_COPY/MD: Preserve registration extent for multi-handle VMM#11461

UCT/CUDA_COPY/MD: Preserve registration extent for multi-handle VMM#11461
nicolasnoble wants to merge 4 commits into
openucx:masterfrom
nicolasnoble:nnoble/cuda-copy-vmm-multi-handle

nicolasnoble commented May 14, 2026

Uh oh!

tvegas1 May 18, 2026

Uh oh!

tomerg-nvidia May 18, 2026 •

edited

Loading

Uh oh!

tvegas1 May 18, 2026

Uh oh!

nicolasnoble May 27, 2026

Uh oh!

nicolasnoble May 28, 2026

Uh oh!

tvegas1 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nicolasnoble commented May 14, 2026

Uh oh!

tvegas1 May 18, 2026

Choose a reason for hiding this comment

Uh oh!

tomerg-nvidia May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tvegas1 May 18, 2026

Choose a reason for hiding this comment

Uh oh!

nicolasnoble May 27, 2026

Choose a reason for hiding this comment

Uh oh!

nicolasnoble May 28, 2026

Choose a reason for hiding this comment

Uh oh!

tvegas1 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tomerg-nvidia May 18, 2026 •

edited

Loading