Merged
2 changes: 2 additions & 0 deletions rfcs/proposed/numa_support/README.md
@@ -139,6 +139,8 @@ This [sub-proposal is supported](../../supported/numa_support/create-numa-arenas
Define allocators or other features that simplify the process of allocating or placing data onto
specific NUMA nodes.

[Interleaved allocation](interleaved-allocation.md) can be a useful kind of NUMA-aware allocation.

### Simplified approaches to associate task distribution with data placement

As discussed earlier, NUMA-aware allocation is just the first step in optimizing for NUMA architectures.
78 changes: 78 additions & 0 deletions rfcs/proposed/numa_support/interleaved-allocation.md
@@ -0,0 +1,78 @@
# API to allocate memory interleaved between NUMA nodes

*Note:* This document is a sub-RFC of the [umbrella RFC about improving NUMA
support](README.md).

## Motivation

There are two kinds of NUMA-related performance bottlenecks: increased latency from
accessing a remote node, and bandwidth-limited simultaneous access from different CPUs to
a single NUMA memory node. A well-known way to mitigate both is to distribute the
memory objects that are accessed from different CPUs across NUMA nodes in a way
that matches the access pattern. If the access pattern is complex enough, a simple
round-robin distribution can be good enough. The distribution can be achieved either by
relying on the first-touch policy of NUMA memory allocation or via a special
platform-dependent API. Generally, the latter requires less overhead.

## Requirements for the public API

A free stateless function, similar to malloc, is sufficient for the allocation of large
blocks of memory, contiguous in the address space. To guide the mapping of memory
across NUMA nodes, two additional parameters are proposed: `interleaving step`
and `the list of NUMA nodes to get the memory from`. This function allocates whole
memory pages and does not employ internal caching. If smaller and repetitive allocations
are needed, then `std::pmr` or other solutions should be used.
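
Because the function hands out whole pages, a requested size is effectively rounded up to
a multiple of the page size. A minimal sketch of that rounding, assuming the caller
supplies the page size; the helper name `round_up_to_pages` is hypothetical and not part
of the proposal:

```c++
#include <cstddef>

// Hypothetical helper, not part of the proposed API.
// Rounds `size` up to the nearest multiple of `page_size`.
// Assumes page_size is non-zero and that the sum does not overflow.
inline std::size_t round_up_to_pages(std::size_t size, std::size_t page_size) {
    return (size + page_size - 1) / page_size * page_size;
}
```

With 4 KiB pages, a request for 4097 bytes would therefore occupy two pages, which is one
reason small, repetitive allocations are better served by a caching allocator.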

`interleaving step` is the size of the contiguous memory block taken from a particular NUMA
node; it has page granularity. Currently there are no clear use cases for granularity more
than the page size.

`list of nodes for allocation` is a `std::vector<tbb::numa_node_id>`, to be compatible with
the value returned from `tbb::info::numa_nodes()`. `libnuma` supports a subset of NUMA nodes
for allocation, but those nodes are loaded equally. Having a `vector` allows us to express an
unbalanced load. Example: allocation over the list of nodes [3, 0, 3] takes 2/3 of the memory
from node 3 and 1/3 from node 0.

One use case for the `list of nodes` argument is running parallel work on a subset
of nodes and therefore getting memory only from those nodes.

The most common usage of the allocation function is expected to pass only the `size` parameter.
In this case, `interleaving_step` defaults to the page size and memory is allocated on all
NUMA nodes.
Comment on lines +40 to +41
Contributor

@aleksei-fedotov aleksei-fedotov Apr 22, 2026

I believe using a named constant instead of a zero as the default argument can improve code readability.
Something like

const size_t page_size = []() {
    // Find out what page size is and return it
    size_t found_page_size = ...;
    return found_page_size;
}();

Then we can specify it as the default value for interleaving_step parameter:

void *alloc_interleaved(size_t size, size_t interleaving_step = page_size,
                        const std::vector<tbb::numa_node_id> *nodes = nullptr);

Contributor Author

Can we do it like this instead? No named constants, and no need to decide which argument more often has a non-default value.

TBB_EXPORT void *__TBB_EXPORTED_FUNC alloc_interleaved(size_t size, size_t interleaving_step,
                        const tbb::detail::d1::numa_node_id *nodes, size_t nodes_count);

inline void *alloc_interleaved(size_t size) {
    return alloc_interleaved(size, 0, nullptr, 0);
}

inline void *alloc_interleaved(size_t size, const std::vector<tbb::detail::d1::numa_node_id> &nodes) {
    return alloc_interleaved(size, 0, nodes.data(), nodes.size());
}

inline void *alloc_interleaved(size_t size, size_t interleaving_step) {
    return alloc_interleaved(size, interleaving_step, nullptr, 0);
}

inline void *alloc_interleaved(size_t size, size_t interleaving_step,
                               const std::vector<tbb::detail::d1::numa_node_id> &nodes) {
    return alloc_interleaved(size, interleaving_step, nodes.data(), nodes.size());
}

Contributor

@aleksei-fedotov aleksei-fedotov Apr 22, 2026

I think this is a good proposal! I like the part where we don't need to define which argument is more often used as all the variants are available.

When I look at this, only two disadvantages come to my mind, which might not be important to address at all:

  1. It is not clear what 0 and nullptr mean. In this case, named constants, which do not even have to be exposed to the user, can help, as they would be used in both the interface and the implementation. They are only needed for a code reader, to keep the intentions clear.
  2. This approach might lead to an exponentially growing number of functions. If we ever need one more parameter for this function, we will end up with eight different functions. However, this is not a big deal, as adding one more parameter will probably never happen.

Contributor Author

Thank you, good points.


The following functions are provided to illustrate the conceptual API, not yet as the
recommended new API.

```c++
void *alloc_interleaved(size_t size, size_t interleaving_step = 0,
const std::vector<tbb::numa_node_id> *nodes = nullptr);
Comment on lines +47 to +48
Contributor

I saw the discussion regarding the API and have seen that the RFC now illustrates rather than mandates the interface, but I still believe we can write:

Suggested change
- void *alloc_interleaved(size_t size, size_t interleaving_step = 0,
- const std::vector<tbb::numa_node_id> *nodes = nullptr);
+ void *alloc_interleaved(size_t size, size_t interleaving_step = 0,
+ const std::vector<tbb::numa_node_id>& nodes = tbb::info::numa_nodes());

As far as I remember, tbb::info::numa_nodes() cannot return an empty vector, but rather a vector containing only -1. However, in either case it seems that the implementation may just fall back to a single NUMA-unspecified allocation call.

Contributor Author

This version of the default argument leads to a malloc/free on each alloc_interleaved(16*1024) call. That may be negligible because of the syscall made internally. But could we eliminate the excessive allocations?

Contributor

The default argument might be an empty vector; I believe (and we can check) that the major standard library implementations avoid allocation in that case.

But I think two overloads would be "cleaner" and perhaps a bit more efficient than any kind of default argument.

Contributor Author

> The default argument might be an empty vector; I believe (and we can check) that the major standard library implementations avoid allocation in that case.

No allocation, but there is a check against zero after returning from the function, because the const & can become non-empty and we must call the dtor.

So a set of overloads is probably the lesser evil.

Comment on lines +47 to +48
Contributor

@aleksei-fedotov aleksei-fedotov Apr 22, 2026

I wonder which parameter with a default argument is going to be specified more often. Could it be that users specify different NUMA nodes more often than they change the default interleaving step? In that case, it is reasonable to make the NUMA list the second function parameter, i.e. the first parameter with a default argument, and make interleaving_step the last one.

Contributor

With two overloads, either parameter can be omitted.

void free_interleaved(void *ptr, size_t size);
```

## Implementation details

Under Linux, only allocations with default interleaving can be supported via HWLOC. Other
interleaving steps require direct libnuma usage, which creates yet another run-time
dependency. Using `move_pages`, it is possible to implement allocation with a constant number
of system calls with respect to the allocation size.

Under Windows, starting with Windows 10 and Windows Server 2016,
`VirtualAlloc2(MEM_REPLACE_PLACEHOLDER)` can be used to provide the desired interleaving, but
the number of system calls is proportional to the allocation size. For older Windows versions,
either a fallback to `VirtualAlloc` or manual touching from threads pre-pinned to NUMA nodes
can be used.

There is no NUMA memory support under macOS, so the implementation can only fall back to
`malloc`.

## Open Questions

When can a non-default `interleaving step` be used?

The `size` argument of `free_interleaved()` appeared because what we have is wrappers over
`mmap`/`munmap`, and there is no place to put the size after memory is allocated. We can
put it in, say, an internal cumap. Does that look useful?
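
To make the trade-off concrete, here is a minimal sketch of such an internal size table.
It uses a mutex-protected `std::unordered_map` as a stand-in for a concurrent map, and the
names `size_registry`, `remember`, and `forget` are hypothetical. The allocation path would
record the size, and a `free_interleaved(void *ptr)` without the `size` argument would look
it up before unmapping:

```c++
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical internal table mapping an allocation's base address to its size.
class size_registry {
public:
    // Records the size of a freshly mapped block.
    void remember(void *ptr, std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        sizes_[ptr] = size;
    }

    // Removes the entry for `ptr` and returns its size, or 0 if `ptr` is unknown.
    std::size_t forget(void *ptr) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = sizes_.find(ptr);
        if (it == sizes_.end()) return 0;
        std::size_t size = it->second;
        sizes_.erase(it);
        return size;
    }

private:
    std::mutex mutex_;
    std::unordered_map<void *, std::size_t> sizes_;
};
```

The cost is one lookup under a lock per allocation and per deallocation, plus the memory
for the table; whether that is worth a simpler `free_interleaved` signature is exactly the
question raised above.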

The semantics of an even distribution of data between NUMA nodes is straightforward: to equally
balance work between the nodes. Why might someone want to distribute data unequally? Can
it be a form of fine-tuning, such as “node 0 is already loaded with access to static data, let’s
decrease the load a little”?