Dispenso

A high-performance C++ thread pool and parallel algorithms library

Dispenso is a modern C++ parallel computing library that provides work-stealing thread pools, parallel for loops, futures, task graphs, and concurrent containers. It serves as a powerful alternative to OpenMP and Intel TBB, offering better nested parallelism, sanitizer-clean code, and explicit thread pool control. Dispenso is used in hundreds of projects at Meta (formerly Facebook) and has been heavily tested and iterated on in production.

Key advantages over OpenMP and TBB:

No thread explosion with nested parallel loops - dispenso's work-stealing prevents deadlocks and oversubscription
Clean with ASAN/TSAN - fully sanitizer-compatible, unlike many TBB versions
Thread-safe shared futures - std::experimental::shared_future-like API that TBB lacks, safe for multiple concurrent waiters, with much better performance than std::future
Portable - C++14 compatible with no compiler-specific pragmas or extensions; C++20 builds gain concept constraints for clearer error messages

Choose Dispenso If...

You need nested parallelism without thread explosion
You want sanitizer-clean (ASAN/TSAN) concurrent code
You want explicit control over thread pools rather than implicit global state
You need compute-bound futures, not I/O-bound async
You want stable APIs and minimal dependencies
You need cross-platform portability from a C++14 baseline
You have multiple independent parallel loops that can overlap (cascading parallel_for)

Features

Dispenso provides a comprehensive set of parallel programming primitives:

Core runtime:

ThreadPool — work-stealing thread pool backing all dispenso parallelism
TaskSet / ConcurrentTaskSet — task grouping with wait, cancellation, and recursive scheduling

Parallel algorithms:

parallel_for — parallel loops over indices, blocking or non-blocking (cascaded); cascading parallel_for enables overlapping independent loops without oversubscription
for_each — parallel std::for_each / std::for_each_n
Future — high-performance thread-safe shared futures with then(), when_all(), and an API matching std::experimental::shared_future
Graph — task graph execution with subgraph support and incremental re-evaluation
pipeline — parallel pipelining of streaming workloads

Concurrent containers and synchronization:

ConcurrentVector — concurrent growable vector, superset of TBB concurrent_vector API
Latch — one-shot barrier for thread synchronization
RWLock — reader-writer spin lock, outperforms std::shared_mutex under low write contention
SPSCRingBuffer — lock-free single-producer single-consumer ring buffer (1.5.0)
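
The single-producer single-consumer idea behind a ring buffer like this can be sketched in standard C++14. This is a generic illustration of the technique, not dispenso's SPSCRingBuffer or its API: the class name `SpscRing` and the `tryPush`/`tryPop` methods are made up for this example, and it assumes `T` is default-constructible and movable.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <utility>

// Generic SPSC ring buffer sketch (hypothetical; not dispenso's class).
// Exactly one thread may call tryPush and exactly one other thread may call
// tryPop; under that restriction no locks are needed.
template <typename T, std::size_t Capacity>
class SpscRing {
 public:
  // Producer side: returns false if the buffer is full.
  bool tryPush(T value) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t next = (head + 1) % (Capacity + 1);
    if (next == tail_.load(std::memory_order_acquire)) {
      return false;  // full
    }
    slots_[head] = std::move(value);
    // Release store publishes the written slot to the consumer.
    head_.store(next, std::memory_order_release);
    return true;
  }

  // Consumer side: returns false if the buffer is empty.
  bool tryPop(T& out) {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) {
      return false;  // empty
    }
    out = std::move(slots_[tail]);
    // Release store hands the slot back to the producer.
    tail_.store((tail + 1) % (Capacity + 1), std::memory_order_release);
    return true;
  }

 private:
  // One extra slot distinguishes "full" from "empty".
  std::array<T, Capacity + 1> slots_{};
  std::atomic<std::size_t> head_{0};  // written only by the producer
  std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```

Each index is written by only one side, which is why a pair of acquire/release operations is enough here; a queue with multiple producers or consumers would need a different design.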

General-purpose utilities:

SmallVector — inline-storage vector (not thread-aware; similar to folly::small_vector) (1.5.0)
OnceFunction — lightweight move-only void() callable
PoolAllocator — pool allocator with pluggable backing allocation (e.g. CUDA)
SmallBufferAllocator — fast concurrent allocation for temporary objects
ResourcePool — semaphore-like guard around pooled resources
CompletionEvent — notifiable event with wait and timed wait
AsyncRequest — lightweight constrained message passing
ConcurrentObjectArena — fast same-type object arena

Quick Start

Parallel for loop - the most common use case:

#include <dispenso/parallel_for.h>

// Sequential
for (size_t i = 0; i < N; ++i) {
    process(data[i]);
}

// Parallel with dispenso - just wrap it!
dispenso::parallel_for(0, N, [&](size_t i) {
    process(data[i]);
});

Install via your favorite package manager:

# Conda
conda install -c conda-forge dispenso

# Fedora/RHEL
sudo dnf install dispenso-devel

# Or build from source (see below)

Comparison vs Other Libraries

TBB (Intel Threading Building Blocks)

TBB has more functionality overall, but we built dispenso for three reasons:

Sanitizer compatibility — TBB doesn't work well with ASAN/TSAN
Thread-safe shared futures — TBB lacks a futures interface; dispenso provides std::experimental::shared_future-like futures safe for multiple concurrent waiters
Non-Intel hardware — we needed to control performance on diverse platforms

Performance: Dispenso tends to be faster for small and medium parallel loops, and on par for large ones. When many loops run independently, dispenso's cascading parallel_for avoids oversubscription and has delivered 32-50% speedups in production workloads after porting from TBB at Meta. TBB lacks an equivalent mechanism.

See Migrating from TBB for a step-by-step porting guide.

OpenMP

OpenMP has simple syntax for basic loops but grows complex for advanced constructs. Nested #pragma omp parallel for inside threaded code risks thread explosion and machine exhaustion. Dispenso outperforms OpenMP for medium and large loops. OpenMP has an advantage for very small loops due to direct compiler support, though dispenso's minItemsPerChunk option can close this gap by tuning the parallelism threshold for small/fast loops.
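
The effect of a per-chunk minimum can be sketched with standard C++ alone. The helper below is hypothetical (it is neither dispenso's implementation nor part of its API); it only shows how a minimum number of items per chunk caps the number of chunks, so that a small loop stays on one thread instead of paying scheduling overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of dispenso: choose how many chunks to split
// `numItems` loop iterations into, given `numThreads` workers and a minimum
// number of items that each chunk must contain.
inline std::size_t chooseNumChunks(std::size_t numItems, std::size_t numThreads,
                                   std::size_t minItemsPerChunk) {
  assert(numThreads > 0 && minItemsPerChunk > 0);
  // Largest chunk count that still gives every chunk at least the minimum.
  const std::size_t maxChunks = numItems / minItemsPerChunk;
  // Never use more chunks than workers; fall back to a single chunk (serial
  // execution) when the loop is too small to be worth splitting.
  return std::max<std::size_t>(1, std::min(maxChunks, numThreads));
}
```

With 8 workers and a minimum of 64 items per chunk, a 100-item loop yields one chunk under this scheme, while a 1000-item loop yields 8.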

See Migrating from OpenMP for a step-by-step porting guide.

Folly

Folly excels at asynchronous I/O with coroutine support. Dispenso is designed for compute-bound work. Dispenso's futures are lighter-weight and faster for compute workloads; Folly is the better choice for I/O-heavy applications.

TaskFlow

TaskFlow focuses on task graph execution. Dispenso has faster graph construction, faster full and partial graph execution, much lower parallel_for overhead (10-100x in benchmarks), and simpler/faster pipeline construction. TaskFlow does offer CUDA graph mappings, which dispenso does not currently provide.

Others (GCD, C++ std parallelism)

GCD is Apple-specific with ports to other platforms. C++ parallel algorithms are still evolving — we are interested in enabling dispenso as a backend for std::execution and C++ coroutines. Contributions and benchmarks are welcome.

Migration Guides

Migrating from TBB — API mappings, thread pool differences, and common porting patterns
Migrating from OpenMP — Replacing #pragma omp with dispenso equivalents, handling reductions and nested parallelism

When Not to Use Dispenso

Dispenso is not designed for high-latency task offload; it works best for compute-bound tasks. Using the thread pool for networking, disk I/O, or workloads with frequent TLB misses (really any scenario involving kernel context switches) may result in less-than-ideal performance.

In these kernel context switch scenarios, dispenso::Future can be used with dispenso::NewThreadInvoker, which should perform roughly on par with std::future.

If you need async I/O, Folly is likely a good choice (though it still doesn't fix e.g. TLB misses).

Documentation and Examples

Documentation can be found here

Here are some simple examples of what you can do in dispenso. See tests and benchmarks for more examples.

parallel_for

A simple sequential loop can be parallelized with minimal changes:

for(size_t j = 0; j < kLoops; ++j) {
  vec[j] = someFunction(j);
}

Becomes:

dispenso::parallel_for(0, kLoops, [&vec] (size_t j) {
  vec[j] = someFunction(j);
});

TaskSet

Schedule multiple tasks and wait for them to complete:

void randomWorkConcurrently() {
  // StateA and StateD are placeholder types. Both objects are declared before
  // the TaskSet so that they outlive every task that references them.
  StateA stateA;
  StateD stateD;
  dispenso::TaskSet tasks(dispenso::globalThreadPool());
  tasks.schedule([&stateA]() { stateA = doA(); });
  tasks.schedule([]() { doB(); });
  // Do some work on current thread
  tasks.wait(); // After this, A, B done.
  tasks.schedule(doC);
  tasks.schedule([&stateD]() { doD(stateD); });
} // TaskSet's destructor waits for all scheduled tasks to finish

ConcurrentTaskSet

Build a tree in parallel using recursive task scheduling:

struct Node {
  int val;
  std::unique_ptr<Node> left, right;
};
void buildTree(dispenso::ConcurrentTaskSet& tasks, std::unique_ptr<Node>& node, int depth) {
  if (depth) {
    node = std::make_unique<Node>();
    node->val = depth;
    tasks.schedule([&tasks, &left = node->left, depth]() { buildTree(tasks, left, depth - 1); });
    tasks.schedule([&tasks, &right = node->right, depth]() { buildTree(tasks, right, depth - 1); });
  }
}
void buildTreeParallel() {
  std::unique_ptr<Node> root;
  dispenso::ConcurrentTaskSet tasks(dispenso::globalThreadPool());
  buildTree(tasks, root, 20);
  tasks.wait();  // tasks would also wait here in destructor if we omitted this line
}

Future

Compose asynchronous operations with futures:

dispenso::Future<size_t> ThingProcessor::processThings() {
  auto expensiveFuture = dispenso::async([this]() {
    return processExpensiveThing(expensive_);
  });
  auto futureOfManyCheap = dispenso::async([this]() {
    size_t sum = 0;
    for (auto &thing : cheapThings_) {
      sum += processCheapThing(thing);
    }
    return sum;
  });
  return dispenso::when_all(expensiveFuture, futureOfManyCheap).then([](auto &&tuple) {
    return std::get<0>(tuple).get() + std::get<1>(tuple).get();
  });
}

auto result = thingProc->processThings();
useResult(result.get());
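
For comparison, the same shape of computation can be written with only the standard library. This sketch blocks on both results instead of chaining a continuation, since std::future offers no then() or when_all(). The function names and the toy workloads are placeholders chosen for this example and are not part of dispenso.

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Placeholder workloads standing in for processExpensiveThing and
// processCheapThing in the example above.
inline std::size_t expensiveWork(std::size_t x) { return x * x; }
inline std::size_t cheapWork(std::size_t x) { return x + 1; }

inline std::size_t processThingsStd(std::size_t expensive,
                                    const std::vector<std::size_t>& cheapThings) {
  // std::launch::async requests that each task run on its own thread.
  auto expensiveFuture = std::async(std::launch::async, expensiveWork, expensive);
  auto cheapFuture = std::async(std::launch::async, [&cheapThings]() {
    std::size_t sum = 0;
    for (std::size_t thing : cheapThings) {
      sum += cheapWork(thing);
    }
    return sum;
  });
  // Blocks the calling thread until both results are ready.
  return expensiveFuture.get() + cheapFuture.get();
}
```

The caller here has to wait inside the function, whereas the dispenso version returns a future immediately and lets the caller decide when to call get().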

ConcurrentVector

Safely grow a vector from multiple threads:

dispenso::ConcurrentVector<std::unique_ptr<int>> values;
dispenso::parallel_for(
  dispenso::makeChunkedRange(0, length, dispenso::ParForChunking::kStatic),
  [&values](int i, int end) {
    values.grow_by_generator(end - i, [i]() mutable { return std::make_unique<int>(i++); });
  });
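
The lambda above receives a half-open [begin, end) sub-range rather than a single index. One common way to carve a range into static chunks is sketched below in plain C++; the helper `staticChunks` is hypothetical and is not claimed to match how dispenso computes its chunks.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper, not dispenso's implementation: split [begin, end) into
// at most `numChunks` contiguous sub-ranges whose sizes differ by at most one.
inline std::vector<std::pair<int, int>> staticChunks(int begin, int end, int numChunks) {
  assert(begin <= end && numChunks > 0);
  const int total = end - begin;
  const int base = total / numChunks;   // minimum size of every chunk
  const int extra = total % numChunks;  // the first `extra` chunks get one more item
  std::vector<std::pair<int, int>> chunks;
  int chunkBegin = begin;
  // Stop early once the range is exhausted, so no empty chunks are emitted.
  for (int c = 0; c < numChunks && chunkBegin < end; ++c) {
    const int chunkEnd = chunkBegin + base + (c < extra ? 1 : 0);
    chunks.emplace_back(chunkBegin, chunkEnd);
    chunkBegin = chunkEnd;
  }
  return chunks;
}
```

Each resulting pair corresponds to one (i, end) call of a chunked loop body; because the sub-ranges are contiguous and disjoint, every index is visited exactly once.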

Benchmark Results

Dispenso is benchmarked across Linux (x64), macOS (ARM64), Windows (x64), and Android (ARM64), comparing against OpenMP, TBB, TaskFlow, folly, and std::async across thread pools, parallel loops, futures, graphs, concurrent containers, and more.

Interactive Benchmark Dashboard — explore all results with platform switching, dark/light theme, and detailed per-benchmark charts.

Installing

Binary builds of Dispenso are available through several package managers:

Conda: conda install -c conda-forge dispenso
Conan: conan install --requires=dispenso/1.5.0
vcpkg: vcpkg install dispenso
Homebrew: brew install dispenso
MacPorts: sudo port install dispenso
Fedora/RHEL: sudo dnf install dispenso-devel

If your platform is not on the list, see the next section for instructions to build from source.

Building

Linux and macOS:

mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
make -j

Windows (from Developer Command Prompt):

mkdir build && cd build
cmake PATH_TO_DISPENSO_ROOT
cmake --build . --config Release

For detailed instructions including CMake prerequisites, installation, testing, and benchmarking, see docs/building.md.

Known Issues

A subset of dispenso tests is known to fail on 32-bit PPC Mac. If you have access to such a machine and are willing to help debug, your help would be appreciated!

TODO

Enable Windows benchmarks through CMake (actively being worked on; may be resolved soon).

License

The library is released under the MIT license, but also relies on the (excellent) moodycamel concurrentqueue library, which is released under the Simplified BSD and Zlib licenses. See the top of the source at dispenso/third-party/moodycamel/*.h for details.
