Add sub-RFC for interleaving memory allocation #2032
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,78 @@ | ||||||||||
| # API to allocate memory interleaved between NUMA nodes | ||||||||||
|
|
||||||||||
| *Note:* This document is a sub-RFC of the [umbrella RFC about improving NUMA | ||||||||||
| support](README.md). | ||||||||||
|
|
||||||||||
| ## Motivation | ||||||||||
|
|
||||||||||
| There are two kinds of NUMA-related performance bottlenecks: increased latency due to | ||||||||||
| access to a remote node, and bandwidth-limited simultaneous access from different CPUs to | ||||||||||
| a single NUMA memory node. A well-known way to mitigate both is to distribute memory | ||||||||||
| objects that are accessed from different CPUs across NUMA nodes in a way that matches the | ||||||||||
| access pattern. If the access pattern is complex enough, a simple | ||||||||||
| round-robin distribution can be good enough. The distribution can be achieved either by | ||||||||||
| employing a first-touch policy of NUMA memory allocation or via a special platform-dependent | ||||||||||
| API. Generally, the latter requires less overhead. | ||||||||||
|
|
||||||||||
| ## Requirements for the public API | ||||||||||
|
|
||||||||||
| A free stateless function, similar to malloc, is sufficient for the allocation of large | ||||||||||
| blocks of memory, contiguous in the address space. To guide the mapping of memory | ||||||||||
| across NUMA nodes, two additional parameters are proposed: `interleaving step` | ||||||||||
| and `the list of NUMA nodes to get the memory from`. This function allocates whole | ||||||||||
| memory pages and does not employ internal caching. If smaller and repetitive allocations | ||||||||||
| are needed, then `std::pmr` or other solutions should be used. | ||||||||||
|
|
||||||||||
| `interleaving step` is the size of the contiguous memory block taken from a particular NUMA | ||||||||||
| node; it has page granularity. Currently there are no clear use cases for a granularity coarser | ||||||||||
| than the page size. | ||||||||||
|
|
||||||||||
| `list of nodes for allocation` is a `std::vector<tbb::numa_node_id>`, to be compatible with the | ||||||||||
| value returned from `tbb::info::numa_nodes()`. `libnuma` supports a subset of NUMA nodes for | ||||||||||
| allocation, but those nodes are loaded equally. Having a `vector` allows us to express an | ||||||||||
| unbalanced load. Example: allocation over the list of nodes [3, 0, 3] takes 2/3 of the memory from | ||||||||||
| node 3 and 1/3 from node 0. | ||||||||||
|
|
||||||||||
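
The weighting effect of a repeated node in the list can be illustrated with a small sketch. It assumes a plain round-robin mapping of interleaving steps onto the list entries, which the RFC implies but does not spell out. `numa_node_id` is a local stand-in for `tbb::numa_node_id`, and `node_for_offset` and `steps_on_node` are hypothetical helper names, not part of any proposed API.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for tbb::numa_node_id so the sketch compiles without oneTBB.
using numa_node_id = int;

// Returns the node that would back the byte at `offset`, assuming interleaving
// steps are assigned to the entries of `nodes` in round-robin order.
// `step` is the interleaving step in bytes; `nodes` may contain repeated entries.
inline numa_node_id node_for_offset(std::size_t offset, std::size_t step,
                                    const std::vector<numa_node_id>& nodes) {
    assert(step > 0 && !nodes.empty());
    return nodes[(offset / step) % nodes.size()];
}

// Counts how many interleaving steps of a `size`-byte block land on `node`.
inline std::size_t steps_on_node(std::size_t size, std::size_t step,
                                 const std::vector<numa_node_id>& nodes,
                                 numa_node_id node) {
    std::size_t count = 0;
    for (std::size_t offset = 0; offset < size; offset += step) {
        if (node_for_offset(offset, step, nodes) == node) ++count;
    }
    return count;
}
```

With the list [3, 0, 3] and a block of six steps, four steps map to node 3 and two to node 0, which matches the 2/3 and 1/3 split described above.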
| One use case for the `list of nodes` argument is running parallel work on a subset | ||||||||||
| of nodes and therefore getting memory only from those nodes. | ||||||||||
|
|
||||||||||
| The most common usage of the allocation function is expected to pass only the `size` parameter. | ||||||||||
| In this case, `interleaving_step` defaults to the page size and memory is allocated on all | ||||||||||
| NUMA nodes. | ||||||||||
|
Comment on lines +40 to +41
Contributor
I believe using a named constant instead of a zero as the default argument can improve code readability.

    const size_t page_size = []() {
        // Find out what page size is and return it
        size_t found_page_size = ...;
        return found_page_size;
    }();

Then we can specify it as the default value:

    void *alloc_interleaved(size_t size, size_t interleaving_step = page_size,
                            const std::vector<tbb::numa_node_id> *nodes = nullptr);
Contributor
Author
Could we do it this way instead? No named constants, and no need to decide which argument more often takes a non-default value.

    TBB_EXPORT void *__TBB_EXPORTED_FUNC alloc_interleaved(size_t size, size_t interleaving_step,
        const tbb::detail::d1::numa_node_id *nodes, size_t nodes_count);

    inline void *alloc_interleaved(size_t size) {
        return alloc_interleaved(size, 0, nullptr, 0);
    }
    inline void *alloc_interleaved(size_t size, const std::vector<tbb::detail::d1::numa_node_id> &nodes) {
        return alloc_interleaved(size, 0, nodes.data(), nodes.size());
    }
    inline void *alloc_interleaved(size_t size, size_t interleaving_step) {
        return alloc_interleaved(size, interleaving_step, nullptr, 0);
    }
    inline void *alloc_interleaved(size_t size, size_t interleaving_step,
                                   const std::vector<tbb::detail::d1::numa_node_id> &nodes) {
        return alloc_interleaved(size, interleaving_step, nodes.data(), nodes.size());
    }
Contributor
I think this is a good proposal! I like the part where we don't need to define which argument is more often used as all the variants are available. When I look at this, only two disadvantages come to my mind, which might not be important to address at all:
Contributor
Author
Thank you, good points. |
||||||||||
|
|
||||||||||
| The following functions are provided to illustrate the conceptual API, not yet as the | ||||||||||
| recommended new API. | ||||||||||
|
|
||||||||||
| ```c++ | ||||||||||
| void *alloc_interleaved(size_t size, size_t interleaving_step = 0, | ||||||||||
| const std::vector<tbb::numa_node_id> *nodes = nullptr); | ||||||||||
|
Comment on lines +47 to +48
Contributor
I saw the discussion regarding the API and have seen that the RFC now illustrates rather than mandates the interface, but I still believe we can write:
Suggested change
As far as I remember,
Contributor
Author
This version of default argument leads to malloc/free on each
Contributor
The default argument might be an empty vector; I believe (and we can check) that the major standard library implementations avoid allocation in that case. But I think two overloads would be "cleaner" and perhaps a bit more efficient than any kind of default argument.
Contributor
Author
No allocation, but check against zero after returning from the function, because So, set of overloads probably is less evil.
Comment on lines +47 to +48
Contributor
I wonder which parameter with a default argument is going to be specified more often. Could users specify different NUMA nodes more often than they change the default interleaving step? In that case, it is reasonable to make the NUMA list the second function parameter, i.e. the first parameter with a default argument, and have
Contributor
With two overloads, either parameter can be omitted. |
||||||||||
| void free_interleaved(void *ptr, size_t size); | ||||||||||
| ``` | ||||||||||
|
Alexandr-Konovalov marked this conversation as resolved.
|
||||||||||
|
|
||||||||||
| ## Implementation details | ||||||||||
|
|
||||||||||
| On Linux, only allocations with default interleaving can be supported via HWLOC. Other | ||||||||||
| interleaving steps require direct libnuma usage, which creates yet another run-time | ||||||||||
| dependency. Using `move_pages`, it is possible to implement allocation with a constant number | ||||||||||
| of system calls with respect to the allocation size. | ||||||||||
|
|
||||||||||
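
As a rough illustration of the constant-system-call idea, the following Linux-only sketch maps a block with one `mmap`, touches it so the pages exist, and then requests placement of every page with a single `move_pages` call. It invokes the raw system call through `syscall(2)` to avoid the libnuma dependency, which is a choice made for this sketch and not something the RFC prescribes. The names `plan_pages`, `alloc_interleaved_linux` and `free_interleaved_linux` are hypothetical, sizes are given in pages for brevity, and the placement is best effort: the result of `move_pages` is ignored, so the block is still usable on machines or in containers where the call is unavailable.

```c++
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

// Builds the target node for each page, assigning steps of `pages_per_step`
// pages to the entries of `nodes` in round-robin order.
inline std::vector<int> plan_pages(std::size_t page_count, std::size_t pages_per_step,
                                   const std::vector<int>& nodes) {
    assert(pages_per_step > 0 && !nodes.empty());
    std::vector<int> plan(page_count);
    for (std::size_t i = 0; i < page_count; ++i) {
        plan[i] = nodes[(i / pages_per_step) % nodes.size()];
    }
    return plan;
}

// Hypothetical Linux sketch: one mmap plus one move_pages, regardless of size.
inline void* alloc_interleaved_linux(std::size_t page_count, std::size_t pages_per_step,
                                     const std::vector<int>& nodes) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* base = mmap(nullptr, page_count * page, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return nullptr;

    // move_pages reports pages that are not present as -ENOENT, so fault them in first.
    std::memset(base, 0, page_count * page);

    std::vector<void*> pages(page_count);
    for (std::size_t i = 0; i < page_count; ++i) {
        pages[i] = static_cast<char*>(base) + i * page;
    }
    std::vector<int> plan = plan_pages(page_count, pages_per_step, nodes);
    std::vector<int> status(page_count, 0);

    const long mpol_mf_move = 2;  // value of MPOL_MF_MOVE in <linux/mempolicy.h>
    // Best effort: the return value and per-page status are deliberately ignored here.
    (void)syscall(SYS_move_pages, 0L, static_cast<unsigned long>(page_count),
                  pages.data(), plan.data(), status.data(), mpol_mf_move);
    return base;
}

inline void free_interleaved_linux(void* ptr, std::size_t page_count) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    munmap(ptr, page_count * page);
}
```

A production implementation would need to inspect the per-page `status` array and decide how to report partial placement; that policy is left open here.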
| On Windows, starting with Windows 10 and Windows Server 2016, `VirtualAlloc2(MEM_REPLACE_PLACEHOLDER)` | ||||||||||
| can be used to provide the desired interleaving, but the number of system calls is proportional to | ||||||||||
| the allocation size. For older Windows versions, either falling back to `VirtualAlloc` or manual touching | ||||||||||
| from threads pre-pinned to NUMA nodes can be used. | ||||||||||
|
|
||||||||||
| There is no NUMA memory support under macOS, so the implementation can only fall back to | ||||||||||
| `malloc`. | ||||||||||
|
|
||||||||||
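
A minimal sketch of such a fallback, for platforms without NUMA memory support, might look as follows. It keeps the shape of the conceptual API but ignores the interleaving arguments. The `_fallback` suffixes and the `round_up_to_pages` helper are hypothetical names, `numa_node_id` is a local stand-in for `tbb::numa_node_id`, and `posix_memalign` is used instead of plain `malloc` only to preserve the whole-page allocation property stated in the requirements.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>
#include <unistd.h>
#include <vector>

// Local stand-in for tbb::numa_node_id so the sketch compiles without oneTBB.
using numa_node_id = int;

// Rounds `size` up to a whole number of pages.
inline std::size_t round_up_to_pages(std::size_t size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    return (size + page - 1) / page * page;
}

// Hypothetical fallback: page-aligned memory with no NUMA placement at all.
inline void* alloc_interleaved_fallback(std::size_t size, std::size_t interleaving_step = 0,
                                        const std::vector<numa_node_id>* nodes = nullptr) {
    (void)interleaving_step;  // placement hints cannot be honored without NUMA support
    (void)nodes;
    if (size == 0) return nullptr;
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* ptr = nullptr;
    if (posix_memalign(&ptr, page, round_up_to_pages(size)) != 0) return nullptr;
    return ptr;
}

// The size is accepted only to mirror the conceptual free_interleaved signature.
inline void free_interleaved_fallback(void* ptr, std::size_t /*size*/) {
    free(ptr);
}
```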
| ## Open Questions | ||||||||||
|
|
||||||||||
| When would a non-default `interleaving step` be used? | ||||||||||
|
|
||||||||||
| The `size` argument of `free_interleaved()` appeared because what we have are wrappers over | ||||||||||
| `mmap`/`munmap` and there is no place to put the size after memory is allocated. We can | ||||||||||
| put it in, say, an internal cumap. Does that look useful? | ||||||||||
|
|
||||||||||
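
To make the trade-off concrete, an internal size registry could be sketched as below. This is only one possible shape: it uses `std::unordered_map` guarded by a `std::mutex` as a simple stand-in for whatever concurrent map the implementation would actually choose, and `allocation_registry`, `remember` and `forget` are hypothetical names.

```c++
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical registry that remembers the size of each interleaved block,
// so that a free function could take only the pointer.
class allocation_registry {
public:
    // Records the size of a freshly allocated block.
    void remember(void* ptr, std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        sizes_[ptr] = size;
    }

    // Removes the entry for `ptr` and returns its size, or 0 if `ptr` is unknown.
    std::size_t forget(void* ptr) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = sizes_.find(ptr);
        if (it == sizes_.end()) return 0;
        const std::size_t size = it->second;
        sizes_.erase(it);
        return size;
    }

private:
    std::mutex mutex_;
    std::unordered_map<void*, std::size_t> sizes_;
};
```

The cost is one lookup under a lock per allocation and per deallocation, which is likely negligible next to the system calls involved for large blocks, but it adds global state to an otherwise stateless function.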
| The semantics of an even distribution of data between NUMA nodes is straightforward: to equally | ||||||||||
| balance work between the nodes. Why might someone want to distribute data unequally? Can | ||||||||||
| it be a form of fine-tuning, such as “node 0 is already loaded with access to static data, let’s | ||||||||||
| decrease the load a little”? | ||||||||||