Skip to content

UCT/CUDA_COPY/MD: Preserve registration extent for multi-handle VMM#11461

Open
nicolasnoble wants to merge 4 commits into
openucx:masterfrom
nicolasnoble:nnoble/cuda-copy-vmm-multi-handle
Open

UCT/CUDA_COPY/MD: Preserve registration extent for multi-handle VMM#11461
nicolasnoble wants to merge 4 commits into
openucx:masterfrom
nicolasnoble:nnoble/cuda-copy-vmm-multi-handle

Conversation

@nicolasnoble
Copy link
Copy Markdown

Multi-handle CUDA VMM allocations map several cuMemCreate handles
contiguously into a single VA range via cuMemMap. On such allocations
cuMemGetAddressRange() returns only the bounds of the cuMem handle
containing the base pointer, not the full mapped extent.

uct_cuda_copy_md_sync_memops_get_address_range() unconditionally used
that potentially-shorter alloc_length to set mem_info->alloc_length
when UCX_CUDA_COPY_REG_WHOLE_ALLOC was on or auto, silently shrinking
the caller-requested registration extent. The truncated length flowed
into cuMemGetHandleForAddressRange() and ibv_reg_dmabuf_mr(),
producing an MR that only covered the first cuMem handle's physical
pages. Any subsequent RDMA operation crossing a chunk boundary failed
with IBV_WC_LOC_PROT_ERR; downstream this surfaced as
UCS_ERR_CONNECTION_RESET via the SRQ-attached QP being flushed.

Preserve the caller's requested extent when the underlying VMM
allocation is smaller. CUDA correctly exports a dmabuf spanning the
full multi-handle VA range when asked with the full size. The
whole-allocation expansion stays valid for single-handle VMM and for
multi-handle cases where the request already fits within one handle.

Runtime workaround on affected releases:
UCX_CUDA_COPY_REG_WHOLE_ALLOC=off.

Minimal UCP-only reproducer:
https://gist.github.com/nicolasnoble/e0e57eb5a1b902057ae3d1df59c039cf

Verified end-to-end on UCX 1.20.0, CUDA 12.x, ConnectX-7 +
nvidia-peermem: a 31.2 GB / 820-tensor multi-handle VMM region
transfers cross-node in 1.2 s at 212.4 Gbps with the fix, where
vanilla UCX aborts on the first WQE with Local protection error.

For multi-handle CUDA VMM allocations (multiple cuMemCreate handles
mapped contiguously into a single virtual address range via cuMemMap),
cuMemGetAddressRange() returns only the bounds of the cuMem handle
containing the base pointer, not the full mapped extent.

uct_cuda_copy_md_sync_memops_get_address_range() unconditionally used
that potentially-shorter alloc_length to set mem_info->alloc_length
when UCX_CUDA_COPY_REG_WHOLE_ALLOC was on or auto, silently shrinking
the caller-requested registration extent. The truncated length flowed
into cuMemGetHandleForAddressRange() and ibv_reg_dmabuf_mr(), producing
a memory region that only covered the first cuMem handle's physical
pages. Any subsequent RDMA operation that crossed a chunk boundary
failed with IBV_WC_LOC_PROT_ERR; downstream this surfaced as
UCS_ERR_CONNECTION_RESET via the SRQ-attached QP being flushed.

Preserve the caller's requested extent when the underlying VMM
allocation is smaller than the requested registration. CUDA exports a
dmabuf spanning the full multi-handle VA range when asked with the
full size; the whole-allocation expansion was only valid for
single-handle allocations.
* to cover only the first chunk, and any RDMA that crossed a chunk
* boundary would fail with IBV_WC_LOC_PROT_ERR. Preserve the caller's
* requested extent in that case. */
if (is_vmm && (alloc_length < length)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can fix the resolution to handle virtual memory like below (with suggested tested test):

diff --git a/src/uct/cuda/cuda_copy/cuda_copy_md.c b/src/uct/cuda/cuda_copy/cuda_copy_md.c
index 8735bc067..887eac78f 100644
--- a/src/uct/cuda/cuda_copy/cuda_copy_md.c
+++ b/src/uct/cuda/cuda_copy/cuda_copy_md.c
@@ -600,9 +600,15 @@ static void uct_cuda_copy_md_sync_memops_get_address_range(
         goto out_ctx_pop;
     }

-    if (UCT_CUDADRV_FUNC_LOG_DEBUG(
-                cuMemGetAddressRange(&base_address, &alloc_length, address)) !=
-        UCS_OK) {
+    if (UCT_CUDADRV_FUNC_LOG_DEBUG(cuPointerGetAttribute(
+                &base_address, CU_POINTER_ATTRIBUTE_RANGE_START_ADDR,
+                address)) != UCS_OK) {
+        goto out_ctx_pop;
+    }
+
+    if (UCT_CUDADRV_FUNC_LOG_DEBUG(cuPointerGetAttribute(
+                &alloc_length, CU_POINTER_ATTRIBUTE_RANGE_SIZE,
+                address)) != UCS_OK) {
         goto out_ctx_pop;
     }

diff --git a/test/gtest/uct/cuda/test_cuda_ipc_md.cc b/test/gtest/uct/cuda/test_cuda_ipc_md.cc
index 4dfea9630..b043fafa0 100644
--- a/test/gtest/uct/cuda/test_cuda_ipc_md.cc
+++ b/test/gtest/uct/cuda/test_cuda_ipc_md.cc
@@ -555,3 +555,84 @@ UCS_TEST_F(test_cuda_ipc_cache_lru, stale_destroy_while_in_use) {
     EXPECT_EQ(0UL, m_cache->total_size);
     EXPECT_TRUE(ucs_list_is_empty(&m_cache->lru_list));
 }
+
+
+class test_cuda_copy_md : public test_md {
+protected:
+    static void alloc_vmm(size_t num_chunks, CUdeviceptr *va_p,
+                          std::vector<CUmemGenericAllocationHandle> *handles_p,
+                          size_t *chunk_size_p)
+    {
+        CUmemAllocationProp prop = {};
+        CUmemAccessDesc access   = {};
+        CUdevice cu_device;
+        size_t granularity;
+
+        ASSERT_EQ(CUDA_SUCCESS, cuCtxGetDevice(&cu_device));
+
+        prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
+        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
+        prop.location.id   = cu_device;
+
+        ASSERT_EQ(CUDA_SUCCESS,
+                  cuMemGetAllocationGranularity(
+                          &granularity, &prop,
+                          CU_MEM_ALLOC_GRANULARITY_MINIMUM));
+
+        *chunk_size_p = granularity;
+        size_t total  = granularity * num_chunks;
+
+        ASSERT_EQ(CUDA_SUCCESS,
+                  cuMemAddressReserve(va_p, total, granularity, 0, 0));
+
+        handles_p->resize(num_chunks);
+        for (size_t i = 0; i < num_chunks; i++) {
+            ASSERT_EQ(CUDA_SUCCESS,
+                      cuMemCreate(&(*handles_p)[i], granularity, &prop, 0));
+            ASSERT_EQ(CUDA_SUCCESS,
+                      cuMemMap(*va_p + i * granularity, granularity, 0,
+                               (*handles_p)[i], 0));
+        }
+
+        access.location = prop.location;
+        access.flags    = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
+        ASSERT_EQ(CUDA_SUCCESS, cuMemSetAccess(*va_p, total, &access, 1));
+    }
+
+    static void free_vmm(CUdeviceptr va,
+                          std::vector<CUmemGenericAllocationHandle> &handles,
+                          size_t chunk_size)
+    {
+        size_t total = chunk_size * handles.size();
+        for (size_t i = 0; i < handles.size(); i++) {
+            cuMemUnmap(va + i * chunk_size, chunk_size);
+            cuMemRelease(handles[i]);
+        }
+        cuMemAddressFree(va, total);
+    }
+};
+
+UCS_TEST_P(test_cuda_copy_md, vmm_multi_handle_range) {
+    const size_t num_chunks = 3;
+    CUdeviceptr va;
+    std::vector<CUmemGenericAllocationHandle> handles;
+    size_t chunk_size;
+
+    alloc_vmm(num_chunks, &va, &handles, &chunk_size);
+
+    uct_md_mem_attr_t mem_attr;
+    mem_attr.field_mask = UCT_MD_MEM_ATTR_FIELD_MEM_TYPE |
+                          UCT_MD_MEM_ATTR_FIELD_BASE_ADDRESS |
+                          UCT_MD_MEM_ATTR_FIELD_ALLOC_LENGTH;
+
+    ASSERT_UCS_OK(uct_md_mem_query(md(), (void*)(va + chunk_size), chunk_size,
+                                   &mem_attr));
+
+    EXPECT_EQ(UCS_MEMORY_TYPE_CUDA, mem_attr.mem_type);
+    EXPECT_LE((uintptr_t)mem_attr.base_address, (uintptr_t)va);
+    EXPECT_GE(mem_attr.alloc_length, chunk_size * num_chunks);
+
+    free_vmm(va, handles, chunk_size);
+}
+
+_UCT_MD_INSTANTIATE_TEST_CASE(test_cuda_copy_md, cuda_cpy);

Copy link
Copy Markdown
Contributor

@tomerg-nvidia tomerg-nvidia May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tvegas1 Is having reservation size > mapped length not something that should be handled? I think the original code in master would shrink the length to the allocation size but your suggested change would use the reservation size with CU_POINTER_ATTRIBUTE_RANGE_SIZE

@nicolasnoble Probably worth to add a test for the multiple chunks regardless of the approach

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I first assumed we wanted to find biggest range but maybe we can keep limiting it to physical area anyways, since this reg whole alloc is for registration optimization afaiu

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test added in 322b366, taking your fixture as-is. Keeping the mapped-extent approach over the cuPointerGetAttribute switch per Tomer's reasoning - reg_whole_alloc is a registration optimization, widening into the unmapped reservation would be the wrong direction.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction on the previous: CI caught that 322b366 asserts behavior matching your original cuPointerGetAttribute(RANGE_SIZE) idea, not the mapped-extent approach we kept - querying one chunk at a chunk-aligned offset means cuMemGetAddressRange returns exactly that chunk, the is_vmm guard never fires, and the assertions fail.

Rewrote in 5e24655 to query the full multi-chunk range from va, so caller asks for chunk_size * num_chunks > what cuMemGetAddressRange returns, the guard fires, and alloc_length is preserved at the caller's extent. Dropped the base_address check since the guard returns early before mem_info->base_address is updated.

@tvegas1
Copy link
Copy Markdown
Contributor

tvegas1 commented May 18, 2026

@rakhmets / @tomerg-nvidia

Add a gtest for the multi-handle VMM case fixed by the previous
commit. Allocate three cuMem handles contiguously into one VA
range via cuMemAddressReserve + cuMemCreate + cuMemMap, query an
offset address with the size of one chunk, and assert that
uct_md_mem_query preserves the caller's requested extent
(alloc_length >= chunk_size * num_chunks) rather than shrinking
to cuMemGetAddressRange's per-handle bounds.

Suggested by tvegas1 in the PR review thread.
The previous test queried one chunk-size at a chunk-aligned offset,
which makes cuMemGetAddressRange return exactly that chunk and the
new is_vmm guard never fires. Those assertions only held under the
withdrawn cuPointerGetAttribute(RANGE_SIZE) approach.

Query the full multi-handle range instead so the guard fires:
caller asks for chunk_size * num_chunks, cuMemGetAddressRange
returns one chunk, alloc_length < length, the guard preserves the
caller's extent. Drop the base_address assertion since the guard
returns early before mem_info->base_address is updated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants