
feat(tonic-xds): add OutlierDetector sweep engine + failure-percentage algorithm (gRFC A50)#2619

Open
LYZJU2019 wants to merge 18 commits into hyperium:master from LYZJU2019:lyzju2019/a50-outlier-detector

Conversation

LYZJU2019 (Contributor) commented Apr 30, 2026

Summary

Second PR in the gRFC A50: xDS Outlier Detection series. PR 1 (#2604) introduced the validated config types; this PR adds the sweep engine and the failure-percentage ejection algorithm. The success-rate algorithm, which adds float math (mean / variance / standard deviation across qualifying hosts), lands in a focused follow-up PR.

No integration yet. The data path still uses the existing P2C balancer; this module is exercised purely through unit tests. Wiring it into LoadBalancer (and the per-endpoint RPC outcome interception layer) is the subject of follow-up PRs.

What's in this PR

  • EndpointCounters — lock-free AtomicU64 success/failure counters. record_success / record_failure are called from the data path; snapshot_and_reset is called once per sweep.
  • EjectionDecision — Eject(addr) / Uneject(addr), the externally-visible output of a sweep.
  • OutlierDetector — owns the per-endpoint state map, the config, and an injectable Rng. add_endpoint(addr) -> Arc<EndpointCounters> returns the data-path handle; remove_endpoint(addr) unregisters.
  • run_sweep(now: Instant) -> Vec<EjectionDecision> — pure, no I/O; a usage sketch follows this list. Implements the A50 sweep order verbatim:
    1. Snapshot (swap counters) every endpoint.
    2. Run the failure-percentage algorithm.
    3. Apply ejections (set ejected_at, increment multiplier).
    4. Step-5 housekeeping: decrement multiplier of every non-ejected address with multiplier > 0; un-eject addresses whose backoff has elapsed.
  • Spawned sweep loop (OutlierDetector::spawn / spawn_with_rng) — owns the channel sender and forwards each decision. The loop is scoped by an AbortOnDrop; the receiver closes when the task ends.
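A rough usage sketch of how these pieces fit together, written against the API as described above (exact ownership, constructors, and signatures may differ in the actual module):

```rust
use std::time::Instant;

// Sketch against the API described above; ownership details are illustrative.
fn sketch(detector: &mut OutlierDetector, addr: EndpointAddress) {
    // Data path: the Arc<EndpointCounters> handle returned at registration
    // is what RPC outcome recording hits.
    let counters = detector.add_endpoint(addr);
    counters.record_success();
    counters.record_failure();

    // Sweep path: a pure call, no channel or I/O involved.
    for decision in detector.run_sweep(Instant::now()) {
        match decision {
            EjectionDecision::Eject(_addr) => { /* stop picking this address */ }
            EjectionDecision::Uneject(_addr) => { /* resume picking it */ }
        }
    }
}
```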

A50 compliance

  • Sweep step order matches the gRFC verbatim.
  • Failure-percentage uses strict > against the threshold (an address sitting exactly at the threshold is not ejected).
  • Multiplier decrement runs unconditionally on every non-ejected address with multiplier > 0, regardless of whether it received traffic that interval.
  • Ejection duration is min(base * multiplier, max(base, max_ejection_time)); a small sketch follows this list.
  • max_ejection_percent budgets concurrent ejections.
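That duration rule is just the following (a minimal sketch, assuming both knobs are std::time::Duration values):

```rust
use std::time::Duration;

// Hedged sketch of the A50 ejection-duration rule stated above.
fn ejection_duration(base: Duration, multiplier: u32, max_ejection_time: Duration) -> Duration {
    let uncapped = base * multiplier;
    uncapped.min(base.max(max_ejection_time))
}
```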

What's deliberately deferred

  • Success-rate algorithm. It adds float math and two dedicated tests, and lives cleanly in a follow-up PR. If OutlierDetectionConfig.success_rate is set on a cluster, this PR ignores it (documented in the module docstring).
  • Integration with LoadBalancer. The mpsc channel of EjectionDecisions polled by the LB (using EjectedChannel from #2587, "tonic-xds: Implement subchannel state machine") is its own PR.
  • RPC outcome interception. The Service wrapper that classifies HTTP/gRPC status into success/failure and increments EndpointCounters is its own PR.
  • Proto parsing into ClusterResource. Same as above — wired alongside integration.

Why run_sweep returns a Vec instead of sending on a channel

Keeping run_sweep pure has two payoffs: (a) algorithm tests don't need a channel or tokio time machinery, just detector.run_sweep(now); and (b) dropping the AbortOnDrop reliably closes the receiver, because the sender lives on the spawned task — not inside the OutlierDetector Arc that consumers may hold.
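Concretely, an algorithm test stays fully synchronous; something along these lines (the config helper, address constructor, and constructor call here are illustrative, not names from this PR):

```rust
use std::time::Instant;

#[test]
fn failing_endpoint_gets_ejected() {
    // `fp_config()` and `addr()` are hypothetical test helpers for this sketch.
    let mut detector = OutlierDetector::new(fp_config());
    let counters = detector.add_endpoint(addr("10.0.0.1:50051"));
    for _ in 0..100 {
        counters.record_failure();
    }
    // No channel, no tokio: call the pure sweep and inspect the returned Vec.
    let decisions = detector.run_sweep(Instant::now());
    assert_eq!(decisions, vec![EjectionDecision::Eject(addr("10.0.0.1:50051"))]);
}
```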

Test coverage (22 unit tests)

  • Counters: record + atomic snapshot-and-reset.
  • Endpoint registry: add_endpoint returns the same handle on second call, remove_endpoint clears state.
  • Failure-percentage algorithm: ejection above threshold; skip below threshold; boundary case — exactly at threshold does NOT eject; minimum_hosts and request_volume gates; enforcing_failure_percentage = 0 never ejects.
  • Ejection lifecycle: un-ejects after base × multiplier; same-sweep re-ejection grows the multiplier; max_ejection_time caps the duration; max_ejection_percent caps concurrent ejections; multiplier decrements on healthy intervals (with and without traffic).
  • Sweep loop: emits decisions on each tick (using tokio::time::pause()); dropping the AbortOnDrop closes the receiver.

Test plan

  • cargo test -p tonic-xds --lib outlier_detection — 22 passing tests.
  • cargo test -p tonic-xds --lib — full lib suite passes (no regressions).
  • cargo fmt -p tonic-xds --check clean.
  • cargo clippy -p tonic-xds --lib --all-features -- -D warnings clean.


Implement the core gRFC A50 outlier-detection algorithm: per-endpoint
success/failure counters, the success-rate and failure-percentage
ejection algorithms, the ejection-multiplier state machine, and a
periodic sweep task that emits ejection/un-ejection decisions on a
channel.

`run_sweep` is pure (returns a Vec<EjectionDecision>); the sweep loop
spawned by `OutlierDetector::spawn` owns the channel sender and
forwards decisions, so dropping the returned `AbortOnDrop` ends the
loop and closes the receiver. Tests drive `run_sweep` directly without
the channel or tokio time mechanics.

Algorithm coverage matches the gRFC:
  - Success-rate ejection with configurable `stdev_factor`,
    `enforcing_success_rate`, `minimum_hosts`, `request_volume`.
  - Failure-percentage ejection with `threshold`,
    `enforcing_failure_percentage`, `minimum_hosts`, `request_volume`.
  - Ejection multiplier increments on each ejection, decays on healthy
    intervals; ejection duration is `base * multiplier` capped at
    `max(base, max_ejection_time)`.
  - `max_ejection_percent` caps total concurrent ejections.

Probability rolls go through an injectable `Rng` trait (defaulting to
`fastrand`) so tests can pin enforcement decisions.
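A minimal sketch of that seam, with assumed names (the trait shape in the actual code may differ):

```rust
/// Assumed shape of the injectable RNG used for enforcement rolls.
pub(crate) trait Rng {
    /// A value in 0..100, compared against the enforcement percentage.
    fn next_percent(&mut self) -> u32;
}

/// Default implementation backed by `fastrand`.
pub(crate) struct FastRng;

impl Rng for FastRng {
    fn next_percent(&mut self) -> u32 {
        fastrand::u32(0..100)
    }
}

/// Test double: always returns the same roll, so a test can pin whether
/// enforcement happens for a given enforcement percentage.
pub(crate) struct FixedRng(pub u32);

impl Rng for FixedRng {
    fn next_percent(&mut self) -> u32 {
        self.0
    }
}
```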

Standalone in this PR — no integration with the load balancer yet.
That lands in a follow-up alongside the per-endpoint outcome
interception layer.

Refs: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md

Address two follow-up review comments from hyperium#2604 (the merged config
PR) by folding the doc updates into this PR:

- Module docstring: describe the actual integration plan (an mpsc
  channel of EjectionDecisions polled by LoadBalancer, leveraging
  EjectedChannel) instead of the original "filter on the Discover
  stream" wording. Add intra-doc links to the relevant types.

- enforcing_success_rate / enforcing_failure_percentage: clarify
  that each is the *enforcement probability* — distinct from the
  per-algorithm threshold (stdev_factor for success-rate, threshold
  for failure-percentage). Note that 0 disables enforcement while
  still computing statistics.

Also fix an unresolved intra-doc link in the algorithm module.

Three spec-compliance fixes to `run_sweep` and the failure-percentage
algorithm:

1. Reorder the sweep to match A50 step order: snapshot counters → run
   success-rate algorithm → run failure-percentage algorithm → step-5
   housekeeping (decrement non-ejected multipliers, un-eject elapsed
   ejections). The previous order (un-eject before algorithms) caused
   spurious `Uneject` decisions whenever the same sweep also re-ejected
   the address. Per spec, re-ejection refreshes `ejected_at` to `now`
   before the un-eject check runs, so no transient un-eject is emitted.

2. Drop the `total > 0` traffic gate from the multiplier-decrement
   step. A50 says a non-ejected address with multiplier > 0 has its
   multiplier decremented every sweep, regardless of whether it
   received traffic that interval.

3. Failure-percentage now uses strict `>` against the threshold (was
   `>=`). Per A50: "If the address's failure percentage is greater
   than `failure_percentage_ejection.threshold`..." — an address
   sitting exactly at the threshold is not ejected (sketched below).
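A condensed sketch of the corrected checks from items 2 and 3, with illustrative names and signatures:

```rust
/// Strict `>` per A50: an endpoint exactly at the threshold is not ejected.
fn exceeds_failure_threshold(successes: u64, failures: u64, threshold: u32, request_volume: u64) -> bool {
    let total = successes + failures;
    if total == 0 || total < request_volume {
        return false; // no traffic, or below the request-volume gate
    }
    (100 * failures / total) > u64::from(threshold)
}

/// Step-5 housekeeping: every non-ejected address with a positive multiplier
/// is decremented each sweep, with no traffic gate.
fn decay_multiplier(ejected: bool, multiplier: &mut u32) {
    if !ejected && *multiplier > 0 {
        *multiplier -= 1;
    }
}
```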

Also: drop the explicit "skip ejected hosts from candidate list" pre-
filter. Per spec the algorithms iterate every address; ejected hosts
naturally fail the `request_volume` gate since they receive no traffic
in production. Behavior on real workloads is unchanged.

Test changes:
  - `re_ejection_doubles_duration` now asserts a single `Eject`
    decision (no transient `Uneject`) under the corrected sweep order.
  - New `failure_percentage_at_threshold_does_not_eject` covers the
    strict-`>` boundary.
  - New `multiplier_decrements_even_without_traffic` covers the
    no-traffic-gate fix.

Drop the success-rate algorithm and its tests from this PR so the
outlier-detection PR is minimal and stand-alone. The scaffolding
(sweep loop, multiplier state, counters, max-ejection-percent budget)
is unchanged and still exercised by the failure-percentage algorithm
plus the multiplier / un-eject / cap tests.

If `OutlierDetectionConfig.success_rate` is set on the cluster, it is
currently ignored. Documented in the module docstring with a pointer
to the follow-up PR.

Removes:
  - `OutlierDetector::run_success_rate` (mean / variance / sqrt math).
  - `success_rate` dispatch in `run_sweep`.
  - `run_failure_percentage`'s `!out.contains` filter — dead now that
    only one algorithm runs per sweep.
  - `success_rate_ejects_outlier_below_threshold` test.
  - `success_rate_no_ejection_when_all_uniform` test.
  - The `sr_config` test helper.
  - Unused `SuccessRateConfig` import.
LYZJU2019 changed the title from "feat(tonic-xds): add OutlierDetector sweep engine (gRFC A50)" to "feat(tonic-xds): add OutlierDetector sweep engine + failure-percentage algorithm (gRFC A50)" on Apr 30, 2026.
Switch from `mpsc::unbounded_channel` to `mpsc::channel(256)` for the
ejection-decision stream that the sweep loop emits.

The decisions are edge-triggered (`Eject`/`Uneject` transitions, not
state snapshots), so the consumer must process every event in order;
we can't drop or coalesce. But we don't want unbounded memory growth
either if the consumer stalls. A bounded channel gives us:

  - Same correctness as unbounded — no events dropped, ordered delivery.
  - Bounded memory.
  - Natural backpressure: when the buffer fills, `tx.send().await`
    parks the sweep task, which (combined with `MissedTickBehavior::
    Skip`) throttles sweep cadence to whatever rate the consumer can
    drain. Computing more decisions than the consumer can apply just
    widens the desync.

Capacity is 256 — at most `2 * num_endpoints` decisions per sweep, so
this buffers several sweeps' worth of decisions for clusters of typical
size. A docstring on `DECISIONS_CHANNEL_CAPACITY` captures the
rationale for future readers.
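Sketch of the loop shape this describes (the surrounding plumbing, Arc usage, and parameter names are illustrative, not lifted from the code):

```rust
use std::{sync::Arc, time::{Duration, Instant}};
use tokio::sync::mpsc;
use tokio::time::MissedTickBehavior;

// Illustrative sweep-loop body; `detector` stands in for whatever state the
// spawned task actually owns.
async fn sweep_loop(
    detector: Arc<OutlierDetector>,
    interval: Duration,
    tx: mpsc::Sender<EjectionDecision>,
) {
    let mut ticker = tokio::time::interval(interval);
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);
    loop {
        ticker.tick().await;
        for decision in detector.run_sweep(Instant::now()) {
            // A full buffer parks this task here; combined with Skip above,
            // sweep cadence degrades to whatever rate the consumer drains at.
            if tx.send(decision).await.is_err() {
                return; // receiver dropped
            }
        }
    }
}
```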

Replace `spawn_with_rng` with `spawn_with`, taking an
`OutlierDetectorOptions` struct that bundles the RNG and the new
configurable `decisions_channel_capacity`. Defaults are unchanged
(`fastrand` RNG, capacity 256).

The hard-coded constant becomes `DEFAULT_DECISIONS_CHANNEL_CAPACITY`
and is no longer the only knob — production callers may want to bump
the bound for clusters with very large endpoint sets (worst case
`2 * num_endpoints` decisions per sweep) or unusually slow consumers.

Using a struct instead of a long argument list means future runtime
knobs (custom Tokio runtime, alternate backoff policies, observability
hooks, …) can be added without breaking call sites — callers typically
construct via `..Default::default()`.

The xDS-derived `OutlierDetectionConfig` stays separate from these
host-side runtime knobs, keeping a clean line between "what the xDS
proto specifies" and "how this binary chooses to host it."
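A sketch of the call-site ergonomics this buys (the field layout here is assumed for illustration, not copied from the PR):

```rust
// Assumed field layout for illustration only.
pub(crate) struct OutlierDetectorOptions {
    pub rng: Box<dyn Rng + Send>,
    pub decisions_channel_capacity: usize,
}

impl Default for OutlierDetectorOptions {
    fn default() -> Self {
        Self {
            rng: Box::new(FastRng),
            decisions_channel_capacity: DEFAULT_DECISIONS_CHANNEL_CAPACITY,
        }
    }
}

// Callers override only what they need; adding future knobs keeps this
// call site compiling unchanged.
fn options_for_large_cluster() -> OutlierDetectorOptions {
    OutlierDetectorOptions {
        decisions_channel_capacity: 4096,
        ..Default::default()
    }
}
```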
…tests

Both `sweep_loop_emits_decisions_on_tick` and
`dropping_abort_stops_sweep_loop` previously used `tokio::time::sleep`
in `start_paused = true` mode. That works through the runtime's
auto-advance heuristic for parked tasks, but the heuristic is sensitive
to the order of pending wake-ups across multiple tasks and can be
flaky in practice.

  - `sweep_loop_emits_decisions_on_tick`: switch to
    `tokio::time::advance(150ms)` which explicitly moves the clock and
    yields until pending wake-ups have been polled — deterministic.
  - `dropping_abort_stops_sweep_loop`: drop the artificial sleep
    altogether. Aborting the JoinHandle wakes the spawned task
    synchronously; the runtime polls it, the harness observes the
    abort, and the task ends — dropping its sender. `rx.recv().await`
    parks briefly while that happens and then returns `None`. No time
    advancement needed.

Stress-tested both tests 50× back-to-back: all pass.
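For example, the first test reduces to roughly this shape (the setup helper is hypothetical):

```rust
#[tokio::test(start_paused = true)]
async fn sweep_loop_emits_decisions_on_tick() {
    // Hypothetical helper: spawns the sweep loop over a detector that will
    // eject an endpoint, returning the decision receiver and the abort guard.
    let (mut rx, _abort) = spawn_failing_detector_for_test();

    // Explicitly move the paused clock past the sweep interval; unlike
    // sleep-based auto-advance, this is deterministic.
    tokio::time::advance(std::time::Duration::from_millis(150)).await;

    assert!(rx.recv().await.is_some());
}
```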
LYZJU2019 added 4 commits May 4, 2026 10:29
Rewrite the doc comment to be reference documentation rather than a
design narrative. Drops the editorializing ("the right behavior") and
the first-person reasoning, keeps what a developer needs: what the
constant controls, why this size, what happens at capacity (and why
decisions can't be dropped or coalesced), and how to override.

The previous design used two separate `AtomicU64`s and snapshotted via
two independent `swap` calls — the doc comment claimed this was atomic
across the pair, but it isn't: an RPC completing between the two swaps
inflates the next snapshot by one event, biasing the failure-percentage
computation slightly under contention.

Pack both counters into one `AtomicU64` (high 32 bits: successes, low
32 bits: failures). `record_*` becomes a single `fetch_add` (same hot-
path cost as before), `snapshot_and_reset` becomes a single `swap(0)`,
and the snapshot is now genuinely atomic across the pair — matching
the bucket-swap semantics the gRFC describes.

Each counter is capped at `u32::MAX` per sweep interval. Exceeding it
would carry into the other counter's bits, but the cap is unreachable
for realistic workloads (> 4 × 10⁹ RPCs to one endpoint within one
interval). Documented on the struct.
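A sketch of the packed layout (memory ordering here is illustrative; the point is the single-word swap):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of the packed counter described above:
/// high 32 bits = successes, low 32 bits = failures.
/// Overflow past u32::MAX per interval would carry into the other half;
/// documented as unreachable for realistic workloads.
pub(crate) struct EndpointCounters(AtomicU64);

impl EndpointCounters {
    pub fn record_success(&self) {
        self.0.fetch_add(1u64 << 32, Ordering::Relaxed);
    }

    pub fn record_failure(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    /// A single `swap` makes the (successes, failures) snapshot atomic as a pair.
    pub fn snapshot_and_reset(&self) -> (u64, u64) {
        let packed = self.0.swap(0, Ordering::Relaxed);
        (packed >> 32, packed & u64::from(u32::MAX))
    }
}
```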
Guard the `100 * failure / total` division against `total == 0`.
gRFC A50 doesn't forbid `request_volume == 0`, in which case the
qualifying filter `c.total >= request_volume` admits candidates with
zero traffic; the spec is silent on `0/0`, so skip those endpoints
rather than panic.
A collaborator commented on tonic-xds/src/client/loadbalance/outlier_detection.rs, on the `pub(crate) fn spawn_with(` hunk:

What about not using these spawned loops at all, and instead using RPCs as the ticks for the outlier-detection window? You can still use wallclock time; just tick on every RPC call instead of every second.

This works because tonic-xds only needs to handle ejection signals in loadbalancer.poll_ready(), which already takes the load balancer by &mut self, so the outlier-detection decision is global and free of race conditions.

LYZJU2019 and others added 6 commits May 4, 2026 15:27
…tests

Drop the test-only `sort` helper that compared `EjectionDecision`s by
their `Debug` string representation, which was fragile (any change to
the `Debug` impl would silently change ordering). Derive `PartialOrd`
and `Ord` on `EjectionDecision` (and on `EndpointAddress` /
`EndpointHost`, since the address is the inner field) and call
`Vec::sort` directly at the one test site.

When an already-ejected endpoint has in-flight RPCs that complete
during its ejection backoff, those completions accumulate on its
counter. At the next sweep the algorithm may "re-eject" the host
(refreshing its `ejected_at` timestamp and bumping the multiplier).
That action does not change the count of currently-ejected addresses,
so per A50's `max_ejection_percent` check it must not consume a slot
in the cap — but the previous code decremented the budget for it,
under-counting how many *new* ejections the cap allows.

Track the pre-sweep ejection state on each `Candidate` and only
decrement the budget for new ejections in the failure-percentage
algorithm. Add a regression test covering the specific scenario.

Replace the spawned sweep loop + mpsc channel with an on-demand model:
the detector exposes `maybe_run_sweep(&mut self, now: Instant) ->
Vec<EjectionDecision>` and the consumer (the load balancer in a
follow-up PR) calls it from its own event loop — typically
`poll_ready` — gated by wallclock time.

This eliminates a significant amount of machinery:
  - `tokio::spawn`, `sweep_loop`, `AbortOnDrop`, the mpsc channel.
  - The bounded-channel capacity option, its constant, and its docs
    (`OutlierDetectorOptions::decisions_channel_capacity`,
    `DEFAULT_DECISIONS_CHANNEL_CAPACITY`).
  - `OutlierDetectorOptions` itself — collapses to two constructors
    `new(config)` and `with_rng(config, rng)`.
  - The `Mutex` on `state` — the consumer's `&mut self` already
    serializes access.
  - Two `#[tokio::test(start_paused = true)]` tests that exercised the
    spawned task and its abort handle.

Sweep timing now depends on RPC traffic: when no RPCs flow, no sweeps
run. This matches A50's intent (sweeps happen approximately every
`interval` while traffic is flowing) and is observably equivalent
because ejection only matters during endpoint picking, which only
happens during RPCs. Suggested by the PR review.
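A sketch of the interval gate this implies (field names such as `last_sweep` are assumptions for illustration):

```rust
use std::time::Instant;

impl OutlierDetector {
    /// Runs a sweep only if the configured interval has elapsed since the
    /// previous one (or if no sweep has run yet); otherwise returns an empty Vec.
    pub(crate) fn maybe_run_sweep(&mut self, now: Instant) -> Vec<EjectionDecision> {
        match self.last_sweep {
            Some(last) if now.duration_since(last) < self.config.interval => Vec::new(),
            _ => {
                self.last_sweep = Some(now);
                self.run_sweep(now)
            }
        }
    }
}
```

The consumer would call this from its own event loop (for instance `poll_ready`) with `Instant::now()` and apply the returned decisions in place.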

Tests:
  - All algorithm-level tests rewritten to use owned `OutlierDetector`
    + `&mut self` calls, no `Mutex::lock()`, no Arc.
  - Three new `maybe_run_sweep_*` tests cover the interval gate:
    runs on first call, skips before interval elapsed, runs after.
  - Existing failure-percentage and multiplier/un-ejection tests
    unchanged in spirit; just adjusted to the new ownership model.

Pass through every doc comment and inline comment, removing rationale,
timeline language, and explanations that don't help a future reader.
Notable trims:

  - Module docstring drops "Knows nothing about the data path:" framing,
    the "lands in a follow-up PR" timeline (regression — flagged and
    removed earlier on a different doc), and the "(mean and standard
    deviation across the qualifying hosts)" parenthetical.
  - `Rng` trait drops the "Abstracted so tests can inject" rationale.
  - `OutlierDetector` struct drops "State is owned (no `Mutex`, no
    `Arc`):" framing.
  - `add_endpoint` / `remove_endpoint` / `with_rng` lose the trailing
    usage hints / explanatory parentheticals.
  - `maybe_run_sweep` / `run_sweep` tightened to facts-only.
  - Inline comments inside `run_sweep` drop "we model that" and
    "intentionally not yet dispatched in this PR" timeline.
  - Inline comment for the budget-decrement guard now points at
    `Candidate::already_ejected` instead of duplicating its doc.
  - Test `already_ejected_re_ejection_does_not_consume_budget` drops
    the "this would fail before the fix" git-history paragraph.
LYZJU2019 marked this pull request as ready for review May 6, 2026 20:50.