A high-performance C++ thread pool and parallel algorithms library
Dispenso is a modern C++ parallel computing library that provides work-stealing thread pools, parallel for loops, futures, task graphs, and concurrent containers. It serves as a powerful alternative to OpenMP and Intel TBB, offering better nested parallelism, sanitizer-clean code, and explicit thread pool control. Dispenso is used in hundreds of projects at Meta (formerly Facebook) and has been heavily tested and iterated on in production.
Key advantages over OpenMP and TBB:
- No thread explosion with nested parallel loops - dispenso's work-stealing prevents deadlocks and oversubscription
- Clean with ASAN/TSAN - fully sanitizer-compatible, unlike many TBB versions
- Thread-safe shared futures - `std::experimental::shared_future`-like API that TBB lacks, safe for multiple concurrent waiters, with much better performance than `std::future`
- Portable - C++14 compatible with no compiler-specific pragmas or extensions; C++20 builds gain concept constraints for clearer error messages
- Choose Dispenso If...
- Features
- Quick Start
- Comparison vs Other Libraries
- Migration Guides
- When Not to Use Dispenso
- Documentation and Examples
- Benchmark Results
- Installing
- Building
- Known Issues
- License
- You need nested parallelism without thread explosion
- You want sanitizer-clean (ASAN/TSAN) concurrent code
- You want explicit control over thread pools rather than implicit global state
- You need compute-bound futures, not I/O-bound async
- You want stable APIs and minimal dependencies
- You need cross-platform portability from a C++14 baseline
- You have multiple independent parallel loops that can overlap (cascading `parallel_for`)
Dispenso provides a comprehensive set of parallel programming primitives:
Core runtime:
- `ThreadPool` — work-stealing thread pool backing all dispenso parallelism
- `TaskSet`/`ConcurrentTaskSet` — task grouping with wait, cancellation, and recursive scheduling
Parallel algorithms:
- `parallel_for` — parallel loops over indices, blocking or non-blocking (cascaded); cascading `parallel_for` enables overlapping independent loops without oversubscription
- `for_each` — parallel `std::for_each`/`std::for_each_n`
- `Future` — high-performance thread-safe shared futures with `then()`, `when_all()`, and an API matching `std::experimental::shared_future`
- `Graph` — task graph execution with subgraph support and incremental re-evaluation
- `pipeline` — parallel pipelining of streaming workloads
Concurrent containers and synchronization:
- `ConcurrentVector` — concurrent growable vector, superset of the TBB `concurrent_vector` API
- `Latch` — one-shot barrier for thread synchronization
- `RWLock` — reader-writer spin lock, outperforms `std::shared_mutex` under low write contention
- `SPSCRingBuffer` — lock-free single-producer single-consumer ring buffer (1.5.0)
General-purpose utilities:
- `SmallVector` — inline-storage vector (not thread-aware; similar to `folly::small_vector`) (1.5.0)
- `OnceFunction` — lightweight move-only `void()` callable
- `PoolAllocator` — pool allocator with pluggable backing allocation (e.g. CUDA)
- `SmallBufferAllocator` — fast concurrent allocation for temporary objects
- `ResourcePool` — semaphore-like guard around pooled resources
- `CompletionEvent` — notifiable event with wait and timed wait
- `AsyncRequest` — lightweight constrained message passing
- `ConcurrentObjectArena` — fast same-type object arena
Parallel for loop - the most common use case:
```cpp
#include <dispenso/parallel_for.h>

// Sequential
for (size_t i = 0; i < N; ++i) {
  process(data[i]);
}

// Parallel with dispenso - just wrap it!
dispenso::parallel_for(0, N, [&](size_t i) {
  process(data[i]);
});
```

Install via your favorite package manager:

```shell
# Conda
conda install -c conda-forge dispenso

# Fedora/RHEL
sudo dnf install dispenso-devel

# Or build from source (see below)
```

TBB has more functionality overall, but we built dispenso for three reasons:
- Sanitizer compatibility — TBB doesn't work well with ASAN/TSAN
- Thread-safe shared futures — TBB lacks a futures interface; dispenso provides `std::experimental::shared_future`-like futures safe for multiple concurrent waiters
- Non-Intel hardware — we needed to control performance on diverse platforms
Performance: Dispenso tends to be faster for small and medium parallel loops, and on par for large ones. When many loops run independently, dispenso's cascading parallel_for avoids oversubscription and has delivered 32-50% speedups in production workloads after porting from TBB at Meta. TBB lacks an equivalent mechanism.
See Migrating from TBB for a step-by-step porting guide.
OpenMP has simple syntax for basic loops but grows complex for advanced constructs. Nested #pragma omp parallel for inside threaded code risks thread explosion and machine exhaustion. Dispenso outperforms OpenMP for medium and large loops. OpenMP has an advantage for very small loops due to direct compiler support, though dispenso's minItemsPerChunk option can close this gap by tuning the parallelism threshold for small/fast loops.
See Migrating from OpenMP for a step-by-step porting guide.
Folly excels at asynchronous I/O with coroutine support. Dispenso is designed for compute-bound work. Dispenso's futures are lighter-weight and faster for compute workloads; Folly is the better choice for I/O-heavy applications.
TaskFlow focuses on task graph execution. Dispenso has faster graph construction, faster full and partial graph execution, much lower parallel_for overhead (10-100x in benchmarks), and simpler/faster pipeline construction. TaskFlow does offer CUDA graph mappings, which dispenso does not currently provide.
GCD is Apple-specific with ports to other platforms. C++ parallel algorithms are still evolving — we are interested in enabling dispenso as a backend for std::execution and C++ coroutines. Contributions and benchmarks are welcome.
- Migrating from TBB — API mappings, thread pool differences, and common porting patterns
- Migrating from OpenMP — Replacing `#pragma omp` with dispenso equivalents, handling reductions and nested parallelism
Dispenso isn't really designed for high-latency task offload; it works best for compute-bound tasks. Using the thread pool for networking, disk I/O, or workloads with frequent TLB misses (really any scenario involving kernel context switches) may result in less-than-ideal performance.
In these kernel context switch scenarios, `dispenso::Future` can be used with `dispenso::NewThreadInvoker`, which should be roughly equivalent to `std::future` in performance.
If you need async I/O, Folly is likely a good choice (though it still doesn't fix e.g. TLB misses).
Documentation can be found here
Here are some simple examples of what you can do in dispenso. See tests and benchmarks for more examples.
A simple sequential loop can be parallelized with minimal changes:
```cpp
for (size_t j = 0; j < kLoops; ++j) {
  vec[j] = someFunction(j);
}
```

Becomes:
```cpp
dispenso::parallel_for(0, kLoops, [&vec](size_t j) {
  vec[j] = someFunction(j);
});
```

Schedule multiple tasks and wait for them to complete:
```cpp
void randomWorkConcurrently() {
  dispenso::TaskSet tasks(dispenso::globalThreadPool());
  tasks.schedule([&stateA]() { stateA = doA(); });
  tasks.schedule([]() { doB(); });
  // Do some work on current thread
  tasks.wait(); // After this, A, B done.
  tasks.schedule(doC);
  tasks.schedule([&stateD]() { doD(stateD); });
} // TaskSet's destructor waits for all scheduled tasks to finish
```

Build a tree in parallel using recursive task scheduling:
```cpp
struct Node {
  int val;
  std::unique_ptr<Node> left, right;
};

void buildTree(dispenso::ConcurrentTaskSet& tasks, std::unique_ptr<Node>& node, int depth) {
  if (depth) {
    node = std::make_unique<Node>();
    node->val = depth;
    tasks.schedule([&tasks, &left = node->left, depth]() { buildTree(tasks, left, depth - 1); });
    tasks.schedule([&tasks, &right = node->right, depth]() { buildTree(tasks, right, depth - 1); });
  }
}

void buildTreeParallel() {
  std::unique_ptr<Node> root;
  dispenso::ConcurrentTaskSet tasks(dispenso::globalThreadPool());
  buildTree(tasks, root, 20);
  tasks.wait(); // tasks would also wait here in destructor if we omitted this line
}
```

Compose asynchronous operations with futures:
```cpp
dispenso::Future<size_t> ThingProcessor::processThings() {
  auto expensiveFuture = dispenso::async([this]() {
    return processExpensiveThing(expensive_);
  });
  auto futureOfManyCheap = dispenso::async([this]() {
    size_t sum = 0;
    for (auto& thing : cheapThings_) {
      sum += processCheapThing(thing);
    }
    return sum;
  });
  return dispenso::when_all(expensiveFuture, futureOfManyCheap).then([](auto&& tuple) {
    return std::get<0>(tuple).get() + std::get<1>(tuple).get();
  });
}

auto result = thingProc->processThings();
useResult(result.get());
```

Safely grow a vector from multiple threads:
```cpp
dispenso::ConcurrentVector<std::unique_ptr<int>> values;
dispenso::parallel_for(
    dispenso::makeChunkedRange(0, length, dispenso::ParForChunking::kStatic),
    [&values](int i, int end) {
      values.grow_by_generator(end - i, [i]() mutable { return std::make_unique<int>(i++); });
    });
```

Dispenso is benchmarked across Linux (x64), macOS (ARM64), Windows (x64), and Android (ARM64),
comparing against OpenMP, TBB, TaskFlow, folly, and std::async across thread pools, parallel
loops, futures, graphs, concurrent containers, and more.
Interactive Benchmark Dashboard — explore all results with platform switching, dark/light theme, and detailed per-benchmark charts.
Binary builds of Dispenso are available through several package managers:
- Conda: `conda install -c conda-forge dispenso`
- Conan: `conan install --requires=dispenso/1.5.0`
- vcpkg: `vcpkg install dispenso`
- Homebrew: `brew install dispenso`
- MacPorts: `sudo port install dispenso`
- Fedora/RHEL: `sudo dnf install dispenso-devel`
If your platform is not on the list, see the next section for instructions to build from source.
Linux and macOS:

```shell
mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
make -j
```

Windows (from Developer Command Prompt):

```shell
mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
cmake --build . --config Release
```

For detailed instructions including CMake prerequisites, installation, testing, and benchmarking, see docs/building.md.
- A subset of dispenso tests are known to fail on 32-bit PPC Mac. If you have access to such a machine and are willing to help debug, it would be appreciated!
- Windows benchmarks are not yet enabled through CMake (actively being worked on; may be resolved soon).