feat(tonic-xds): add OutlierDetector sweep engine + failure-percentage algorithm (gRFC A50) #2619
Open
LYZJU2019 wants to merge 18 commits into hyperium:master from
Conversation
Implement the core gRFC A50 outlier-detection algorithm: per-endpoint
success/failure counters, the success-rate and failure-percentage
ejection algorithms, the ejection-multiplier state machine, and a
periodic sweep task that emits ejection/un-ejection decisions on a
channel.
`run_sweep` is pure (returns a Vec<EjectionDecision>); the sweep loop
spawned by `OutlierDetector::spawn` owns the channel sender and
forwards decisions, so dropping the returned `AbortOnDrop` ends the
loop and closes the receiver. Tests drive `run_sweep` directly without
the channel or tokio time mechanics.
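For illustration, a minimal sketch of that ownership split. The real `spawn` holds the detector itself; the closure standing in for `run_sweep`, the channel flavor, and the `AbortOnDrop` internals here are placeholders, not the crate's actual API:

```rust
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::task::JoinHandle;

/// Placeholder for the PR's `EjectionDecision` enum.
#[derive(Debug)]
pub enum EjectionDecision {
    Eject(String),
    Uneject(String),
}

/// Aborts the sweep task when dropped; the sender lives inside that task,
/// so the receiver sees `None` once the task is gone.
pub struct AbortOnDrop(JoinHandle<()>);

impl Drop for AbortOnDrop {
    fn drop(&mut self) {
        self.0.abort();
    }
}

/// Spawn the periodic sweep; `run_sweep` itself stays pure and is tested
/// directly, without the channel or the clock.
pub fn spawn_sweep(
    mut run_sweep: impl FnMut(tokio::time::Instant) -> Vec<EjectionDecision> + Send + 'static,
    interval: Duration,
) -> (AbortOnDrop, mpsc::UnboundedReceiver<EjectionDecision>) {
    let (tx, rx) = mpsc::unbounded_channel();
    let handle = tokio::spawn(async move {
        let mut ticker = tokio::time::interval(interval);
        loop {
            let now = ticker.tick().await;
            for decision in run_sweep(now) {
                if tx.send(decision).is_err() {
                    return; // receiver dropped: nothing left to notify
                }
            }
        }
    });
    (AbortOnDrop(handle), rx)
}
```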
Algorithm coverage matches the gRFC:
- Success-rate ejection with configurable `stdev_factor`,
`enforcing_success_rate`, `minimum_hosts`, `request_volume`.
- Failure-percentage ejection with `threshold`, `enforcing_failure_percentage`, `minimum_hosts`, `request_volume`.
- Ejection multiplier increments on each ejection, decays on healthy
intervals; ejection duration is `base * multiplier` capped at
`max(base, max_ejection_time)`.
- `max_ejection_percent` caps total concurrent ejections.
Probability rolls go through an injectable `Rng` trait (defaulting to
`fastrand`) so tests can pin enforcement decisions.
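Two of the details above, sketched with assumed names (`Rng::roll_percent`, `enforce`, and `ejection_duration` are illustrative, not the module's real signatures):

```rust
use std::time::Duration;

/// Injectable randomness; the default would wrap `fastrand`, and tests can
/// substitute a fixed sequence to pin enforcement decisions.
trait Rng {
    /// Uniform value in `0..100`, compared against the `enforcing_*` knobs.
    fn roll_percent(&mut self) -> u32;
}

/// Enforcement is probabilistic: an address already flagged as an outlier
/// is actually ejected only if the roll lands under the configured
/// percentage (0 never enforces).
fn enforce(rng: &mut dyn Rng, enforcing_percentage: u32) -> bool {
    rng.roll_percent() < enforcing_percentage
}

/// Backoff rule from the list above: `base * multiplier`, capped at
/// `max(base, max_ejection_time)`.
fn ejection_duration(base: Duration, max_ejection_time: Duration, multiplier: u32) -> Duration {
    (base * multiplier).min(base.max(max_ejection_time))
}
```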
Standalone in this PR — no integration with the load balancer yet.
That lands in a follow-up alongside the per-endpoint outcome
interception layer.
Refs: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md
Address two follow-up review comments from hyperium#2604 (the merged config PR) by folding the doc updates into this PR:
- Module docstring: describe the actual integration plan (an mpsc channel of `EjectionDecision`s polled by `LoadBalancer`, leveraging `EjectedChannel`) instead of the original "filter on the Discover stream" wording. Add intra-doc links to the relevant types.
- `enforcing_success_rate` / `enforcing_failure_percentage`: clarify that each is the *enforcement probability* — distinct from the per-algorithm threshold (`stdev_factor` for success-rate, `threshold` for failure-percentage). Note that 0 disables enforcement while still computing statistics.

Also fix an unresolved intra-doc link in the algorithm module.
Three spec-compliance fixes to `run_sweep` and the failure-percentage
algorithm:
1. Reorder the sweep to match A50 step order: snapshot counters → run
success-rate algorithm → run failure-percentage algorithm → step-5
housekeeping (decrement non-ejected multipliers, un-eject elapsed
ejections). The previous order (un-eject before algorithms) caused
spurious `Uneject` decisions whenever the same sweep also re-ejected
the address. Per spec, re-ejection refreshes `ejected_at` to `now`
before the un-eject check runs, so no transient un-eject is emitted.
2. Drop the `total > 0` traffic gate from the multiplier-decrement
step. A50 says a non-ejected address with multiplier > 0 has its
multiplier decremented every sweep, regardless of whether it
received traffic that interval.
3. Failure-percentage now uses strict `>` against the threshold (was
`>=`). Per A50: "If the address's failure percentage is greater
than `failure_percentage_ejection.threshold`..." — an address
sitting exactly at the threshold is not ejected.
Also: drop the explicit "skip ejected hosts from candidate list" pre-
filter. Per spec the algorithms iterate every address; ejected hosts
naturally fail the `request_volume` gate since they receive no traffic
in production. Behavior on real workloads is unchanged.
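A minimal sketch of the resulting failure-percentage check, with illustrative parameter names; the `minimum_hosts` gate, the enforcement roll, and the `max_ejection_percent` cap are applied separately and omitted here:

```rust
/// Failure-percentage decision for one address.
fn failure_percentage_exceeds(
    successes: u64,
    failures: u64,
    request_volume: u64,
    threshold: u64,
) -> bool {
    let total = successes + failures;
    // Ejected hosts receive no traffic in production, so this gate already
    // excludes them; no explicit "skip ejected" pre-filter is needed. The
    // `total == 0` arm also avoids 0/0 when `request_volume` is 0.
    if total < request_volume || total == 0 {
        return false;
    }
    let failure_percentage = 100 * failures / total;
    // Strict `>`: an address sitting exactly at the threshold is not ejected.
    failure_percentage > threshold
}

// With threshold = 85 and request_volume = 50:
//   85 failures out of 100 -> 85% == threshold -> not ejected
//   86 failures out of 100 -> 86% >  threshold -> ejected (if enforced)
```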
Test changes:
- `re_ejection_doubles_duration` now asserts a single `Eject`
decision (no transient `Uneject`) under the corrected sweep order.
- New `failure_percentage_at_threshold_does_not_eject` covers the
strict-`>` boundary.
- New `multiplier_decrements_even_without_traffic` covers the
no-traffic-gate fix.
Drop the success-rate algorithm and its tests from this PR so the
outlier-detection PR is minimal and stand-alone. The scaffolding
(sweep loop, multiplier state, counters, max-ejection-percent budget)
is unchanged and still exercised by the failure-percentage algorithm
plus the multiplier / un-eject / cap tests.
If `OutlierDetectionConfig.success_rate` is set on the cluster, it is
currently ignored. Documented in the module docstring with a pointer
to the follow-up PR.
Removes:
- `OutlierDetector::run_success_rate` (mean / variance / sqrt math).
- `success_rate` dispatch in `run_sweep`.
- `run_failure_percentage`'s `!out.contains` filter — dead now that
only one algorithm runs per sweep.
- `success_rate_ejects_outlier_below_threshold` test.
- `success_rate_no_ejection_when_all_uniform` test.
- The `sr_config` test helper.
- Unused `SuccessRateConfig` import.
Switch from `mpsc::unbounded_channel` to `mpsc::channel(256)` for the
ejection-decision stream that the sweep loop emits.
The decisions are edge-triggered (`Eject`/`Uneject` transitions, not
state snapshots), so the consumer must process every event in order;
we can't drop or coalesce. But we don't want unbounded memory growth
either if the consumer stalls. A bounded channel gives us:
- Same correctness as unbounded — no events dropped, ordered delivery.
- Bounded memory.
- Natural backpressure: when the buffer fills, `tx.send().await` parks the sweep task, which (combined with `MissedTickBehavior::Skip`) throttles sweep cadence to whatever rate the consumer can drain. Computing more decisions than the consumer can apply just widens the desync.
Capacity is 256 — at most `2 * num_endpoints` decisions per sweep, so
this buffers several sweeps' worth of decisions for clusters of typical
size. A docstring on `DECISIONS_CHANNEL_CAPACITY` captures the
rationale for future readers.
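A sketch of how the bounded send and the skipped ticks interact inside the loop; apart from `DECISIONS_CHANNEL_CAPACITY` and `MissedTickBehavior::Skip`, the names and structure are illustrative:

```rust
use std::time::Duration;

use tokio::sync::mpsc;
use tokio::time::MissedTickBehavior;

const DECISIONS_CHANNEL_CAPACITY: usize = 256;

#[derive(Debug)]
enum EjectionDecision {
    Eject(String),
    Uneject(String),
}

async fn sweep_loop(
    mut run_sweep: impl FnMut(tokio::time::Instant) -> Vec<EjectionDecision>,
    interval: Duration,
    tx: mpsc::Sender<EjectionDecision>,
) {
    let mut ticker = tokio::time::interval(interval);
    // If a blocked `send` makes us miss ticks, skip them rather than burst:
    // computing decisions faster than the consumer drains them only widens
    // the desync.
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);
    loop {
        let now = ticker.tick().await;
        for decision in run_sweep(now) {
            // Parks here when the buffer is full (backpressure); errors only
            // when the receiver is gone, which ends the loop.
            if tx.send(decision).await.is_err() {
                return;
            }
        }
    }
}

// Constructed with: let (tx, rx) = mpsc::channel(DECISIONS_CHANNEL_CAPACITY);
```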
Replace `spawn_with_rng` with `spawn_with`, taking an `OutlierDetectorOptions` struct that bundles the RNG and the new configurable `decisions_channel_capacity`. Defaults are unchanged (`fastrand` RNG, capacity 256). The hard-coded constant becomes `DEFAULT_DECISIONS_CHANNEL_CAPACITY` and is no longer the only knob — production callers may want to bump the bound for clusters with very large endpoint sets (worst case `2 * num_endpoints` decisions per sweep) or unusually slow consumers.

Using a struct instead of a long argument list means future runtime knobs (custom Tokio runtime, alternate backoff policies, observability hooks, …) can be added without breaking call sites — callers typically construct via `..Default::default()`. The xDS-derived `OutlierDetectionConfig` stays separate from these host-side runtime knobs, keeping a clean line between "what the xDS proto specifies" and "how this binary chooses to host it."
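A sketch of the options-struct shape this describes; the field types are assumptions, in particular the boxed closure standing in for the crate's `Rng` trait:

```rust
const DEFAULT_DECISIONS_CHANNEL_CAPACITY: usize = 256;

/// Host-side runtime knobs, kept separate from the xDS-derived
/// `OutlierDetectionConfig`.
pub struct OutlierDetectorOptions {
    /// Randomness source for the probabilistic enforcement rolls
    /// (a boxed closure here; the crate uses its `Rng` trait).
    pub rng: Box<dyn FnMut() -> u32 + Send>,
    /// Bound on the ejection-decision channel; worst case is
    /// `2 * num_endpoints` decisions per sweep.
    pub decisions_channel_capacity: usize,
}

impl Default for OutlierDetectorOptions {
    fn default() -> Self {
        Self {
            rng: Box::new(|| fastrand::u32(0..100)),
            decisions_channel_capacity: DEFAULT_DECISIONS_CHANNEL_CAPACITY,
        }
    }
}

// Callers usually override one knob and default the rest:
// let opts = OutlierDetectorOptions {
//     decisions_channel_capacity: 1024,
//     ..Default::default()
// };
```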
…tests
Both `sweep_loop_emits_decisions_on_tick` and
`dropping_abort_stops_sweep_loop` previously used `tokio::time::sleep`
in `start_paused = true` mode. That works through the runtime's
auto-advance heuristic for parked tasks, but the heuristic is sensitive
to the order of pending wake-ups across multiple tasks and can be
flaky in practice.
- `sweep_loop_emits_decisions_on_tick`: switch to
`tokio::time::advance(150ms)` which explicitly moves the clock and
yields until pending wake-ups have been polled — deterministic.
- `dropping_abort_stops_sweep_loop`: drop the artificial sleep
altogether. Aborting the JoinHandle wakes the spawned task
synchronously; the runtime polls it, the harness observes the
abort, and the task ends — dropping its sender. `rx.recv().await`
parks briefly while that happens and then returns `None`. No time
advancement needed.
Stress-tested both tests 50× back-to-back: all pass.
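For reference, the `start_paused` + `advance` pattern on a toy interval task (not the PR's real tests):

```rust
use std::time::Duration;

use tokio::sync::mpsc;

#[tokio::test(start_paused = true)]
async fn advance_drives_ticks_deterministically() {
    let (tx, mut rx) = mpsc::unbounded_channel();
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_millis(100));
        ticker.tick().await; // first tick resolves immediately
        loop {
            ticker.tick().await;
            if tx.send(()).is_err() {
                return;
            }
        }
    });

    // Let the spawned task run once so its timer is registered.
    tokio::task::yield_now().await;

    // Explicitly move the paused clock past one interval; no reliance on
    // the sleep-based auto-advance heuristic.
    tokio::time::advance(Duration::from_millis(150)).await;

    assert!(rx.recv().await.is_some());
}
```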
YutaoMa reviewed May 4, 2026
Rewrite the doc comment to be reference documentation rather than a design narrative. Drops the editorializing ("the right behavior") and the first-person reasoning, keeps what a developer needs: what the constant controls, why this size, what happens at capacity (and why decisions can't be dropped or coalesced), and how to override.
The previous design used two separate `AtomicU64`s and snapshotted via two independent `swap` calls — the doc comment claimed this was atomic across the pair, but it isn't: an RPC completing between the two swaps inflates the next snapshot by one event, biasing the failure-percentage computation slightly under contention.

Pack both counters into one `AtomicU64` (high 32 bits: successes, low 32 bits: failures). `record_*` becomes a single `fetch_add` (same hot-path cost as before), `snapshot_and_reset` becomes a single `swap(0)`, and the snapshot is now genuinely atomic across the pair — matching the bucket-swap semantics the gRFC describes.

Each counter is capped at `u32::MAX` per sweep interval. Exceeding it would carry into the other counter's bits, but the cap is unreachable for realistic workloads (> 4 × 10⁹ RPCs to one endpoint within one interval). Documented on the struct.
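A sketch of the packing scheme; the method names mirror the ones mentioned above, the rest is illustrative:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Both counters in one word: high 32 bits = successes, low 32 bits =
/// failures, so one `swap(0)` snapshots and resets the pair atomically.
#[derive(Default)]
struct EndpointCounters {
    packed: AtomicU64,
}

impl EndpointCounters {
    /// Hot path: a single `fetch_add`, same cost as bumping one counter.
    fn record_success(&self) {
        self.packed.fetch_add(1 << 32, Ordering::Relaxed);
    }

    fn record_failure(&self) {
        self.packed.fetch_add(1, Ordering::Relaxed);
    }

    /// Sweep path: read and zero both counts together; an RPC completing
    /// concurrently lands entirely in the next interval. Each half is
    /// capped at `u32::MAX` per interval; overflowing would carry into the
    /// other half, but that takes > 4e9 RPCs to one endpoint per interval.
    fn snapshot_and_reset(&self) -> (u64, u64) {
        let packed = self.packed.swap(0, Ordering::Relaxed);
        (packed >> 32, packed & u64::from(u32::MAX))
    }
}
```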
YutaoMa reviewed May 4, 2026
Guard the `100 * failure / total` division against `total == 0`. gRFC A50 doesn't forbid `request_volume == 0`, in which case the qualifying filter `c.total >= request_volume` admits candidates with zero traffic; the spec is silent on `0/0`, so skip those endpoints rather than panic.
gu0keno0 reviewed May 4, 2026
}

/// Variant of [`Self::spawn`] that accepts custom runtime options.
pub(crate) fn spawn_with(
Collaborator
What about we just don't use these spawned loops at all, and instead use RPCs as ticks for the outlier-detection window? You can still use wallclock time; instead of ticking every second, tick with every RPC call.
This works because tonic-xds only needs to handle ejection signals in `LoadBalancer::poll_ready()`, which already takes `&mut self` of the load balancer, so the outlier-detection decision is global and free of race conditions.
…tests
Drop the test-only `sort` helper that compared `EjectionDecision`s by their `Debug` string representation, which was fragile (any change to the `Debug` impl would silently change ordering). Derive `PartialOrd` and `Ord` on `EjectionDecision` (and on `EndpointAddress` / `EndpointHost`, since the address is the inner field) and call `Vec::sort` directly at the one test site.
When an already-ejected endpoint has in-flight RPCs that complete during its ejection backoff, those completions accumulate on its counter. At the next sweep the algorithm may "re-eject" the host (refreshing its `ejected_at` timestamp and bumping the multiplier). That action does not change the count of currently-ejected addresses, so per A50's `max_ejection_percent` check it must not consume a slot in the cap — but the previous code decremented the budget for it, under-counting how many *new* ejections the cap allows.

Track the pre-sweep ejection state on each `Candidate` and only decrement the budget for new ejections in the failure-percentage algorithm. Add a regression test covering the specific scenario.
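A simplified sketch of the budget rule; only `Candidate` and the pre-sweep ejection flag come from the change, the surrounding structure is assumed:

```rust
/// Per-address snapshot taken at the start of a sweep (illustrative).
struct Candidate {
    /// Ejected before this sweep began? A re-ejection only refreshes the
    /// backoff; it does not change how many addresses are ejected, so it
    /// must not consume a slot of the `max_ejection_percent` budget.
    already_ejected: bool,
    failure_percentage: u64,
}

/// Returns the indices to eject, charging the budget only for new ejections.
fn eject_within_budget(candidates: &[Candidate], threshold: u64, mut budget: usize) -> Vec<usize> {
    let mut ejected = Vec::new();
    for (i, c) in candidates.iter().enumerate() {
        if c.failure_percentage <= threshold {
            continue;
        }
        if c.already_ejected {
            ejected.push(i); // re-ejection: budget untouched
        } else if budget > 0 {
            budget -= 1; // new ejection: consumes a slot
            ejected.push(i);
        }
    }
    ejected
}
```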
Replace the spawned sweep loop + mpsc channel with an on-demand model:
the detector exposes `maybe_run_sweep(&mut self, now: Instant) -> Vec<EjectionDecision>` and the consumer (the load balancer in a follow-up
PR) calls it from its own event loop — typically `poll_ready` —
gated by wallclock time.
This eliminates a significant amount of machinery:
- `tokio::spawn`, `sweep_loop`, `AbortOnDrop`, the mpsc channel.
- The bounded-channel capacity option, its constant, and its docs
(`OutlierDetectorOptions::decisions_channel_capacity`,
`DEFAULT_DECISIONS_CHANNEL_CAPACITY`).
- `OutlierDetectorOptions` itself — collapses to two constructors
`new(config)` and `with_rng(config, rng)`.
- The `Mutex` on `state` — the consumer's `&mut self` already
serializes access.
- Two `#[tokio::test(start_paused = true)]` tests that exercised the
spawned task and its abort handle.
Sweep timing now depends on RPC traffic: when no RPCs flow, no sweeps
run. This matches A50's intent (sweeps happen approximately every
`interval` while traffic is flowing) and is observably equivalent
because ejection only matters during endpoint picking, which only
happens during RPCs. Suggested by the PR review.
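A minimal sketch of the wallclock gate; the `interval` / `last_sweep` fields are assumed names:

```rust
use std::time::{Duration, Instant};

/// Placeholder for the real decision enum.
#[derive(Debug)]
struct EjectionDecision;

struct Detector {
    interval: Duration,
    last_sweep: Option<Instant>,
}

impl Detector {
    /// Called from the consumer's own event loop (e.g. `poll_ready`),
    /// gated by wallclock time: runs at most once per `interval`, and not
    /// at all while no RPCs (and therefore no calls) are flowing.
    fn maybe_run_sweep(&mut self, now: Instant) -> Vec<EjectionDecision> {
        match self.last_sweep {
            Some(prev) if now.duration_since(prev) < self.interval => Vec::new(),
            _ => {
                self.last_sweep = Some(now);
                self.run_sweep(now)
            }
        }
    }

    fn run_sweep(&mut self, _now: Instant) -> Vec<EjectionDecision> {
        // Stand-in for the real per-endpoint algorithms.
        Vec::new()
    }
}
```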
Tests:
- All algorithm-level tests rewritten to use owned `OutlierDetector`
+ `&mut self` calls, no `Mutex::lock()`, no Arc.
- Three new `maybe_run_sweep_*` tests cover the interval gate:
runs on first call, skips before interval elapsed, runs after.
- Existing failure-percentage and multiplier/un-ejection tests
unchanged in spirit; just adjusted to the new ownership model.
Pass through every doc comment and inline comment, removing rationale,
timeline language, and explanations that don't help a future reader.
Notable trims:
- Module docstring drops "Knows nothing about the data path:" framing,
the "lands in a follow-up PR" timeline (regression — flagged and
removed earlier on a different doc), and the "(mean and standard
deviation across the qualifying hosts)" parenthetical.
- `Rng` trait drops the "Abstracted so tests can inject" rationale.
- `OutlierDetector` struct drops "State is owned (no `Mutex`, no
`Arc`):" framing.
- `add_endpoint` / `remove_endpoint` / `with_rng` lose the trailing
usage hints / explanatory parentheticals.
- `maybe_run_sweep` / `run_sweep` tightened to facts-only.
- Inline comments inside `run_sweep` drop "we model that" and
"intentionally not yet dispatched in this PR" timeline.
- Inline comment for the budget-decrement guard now points at
`Candidate::already_ejected` instead of duplicating its doc.
- Test `already_ejected_re_ejection_does_not_consume_budget` drops
the "this would fail before the fix" git-history paragraph.
YutaoMa approved these changes May 6, 2026
ankurmittal approved these changes May 7, 2026
Summary
Second of the gRFC A50: xDS Outlier Detection series. PR 1 (#2604) introduced the validated config types; this PR adds the sweep engine and the failure-percentage ejection algorithm. The success-rate algorithm — which adds float-math (mean / variance / standard deviation across qualifying hosts) — lands in a focused follow-up PR.
No integration yet. The data path still uses the existing P2C balancer; this module is exercised purely through unit tests. Wiring it into `LoadBalancer` (and the per-endpoint RPC outcome interception layer) is the subject of follow-up PRs.

What's in this PR
- `EndpointCounters` — lock-free `AtomicU64` success/failure counters. `record_success` / `record_failure` are called from the data path; `snapshot_and_reset` is called once per sweep.
- `EjectionDecision` — `Eject(addr)` / `Uneject(addr)`, the externally-visible output of a sweep.
- `OutlierDetector` — owns the per-endpoint state map, the config, and an injectable `Rng`. `add_endpoint(addr) -> Arc<EndpointCounters>` returns the data-path handle; `remove_endpoint(addr)` unregisters.
- `run_sweep(now: Instant) -> Vec<EjectionDecision>` — pure, no I/O. Implements the A50 sweep order verbatim (re-ejection refreshes `ejected_at` and increments the multiplier).
- The sweep loop (`OutlierDetector::spawn` / `spawn_with_rng`) — owns the channel sender and forwards each decision. The loop is scoped by an `AbortOnDrop`; the receiver closes when the task ends.

A50 compliance
- Failure-percentage uses strict `>` against the threshold (an address sitting exactly at the threshold is not ejected).
- Ejection duration is `min(base * multiplier, max(base, max_ejection_time))`.
- `max_ejection_percent` budgets concurrent ejections.

What's deliberately deferred
- The success-rate algorithm: if `OutlierDetectionConfig.success_rate` is set on a cluster, this PR ignores it (documented in the module docstring).
- Integration with `LoadBalancer`. The mpsc channel of `EjectionDecision`s polled by the LB (using `EjectedChannel` from "tonic-xds: Implement subchannel state machine" #2587) is its own PR.
- The RPC outcome interception layer: a `Service` wrapper that classifies HTTP/gRPC status into success/failure and increments `EndpointCounters` is its own PR.
- Config plumbing from `ClusterResource`: same as above — wired alongside integration.

Why `run_sweep` returns a `Vec` instead of sending on a channel
Keeping `run_sweep` pure has two payoffs: (a) algorithm tests don't need a channel or tokio time machinery, just `detector.run_sweep(now)`; and (b) dropping the `AbortOnDrop` reliably closes the receiver, because the sender lives on the spawned task — not inside the `OutlierDetector` `Arc` that consumers may hold.

Test coverage (22 unit tests)
- `add_endpoint` returns the same handle on second call; `remove_endpoint` clears state.
- Failure-percentage: `minimum_hosts` and `request_volume` gates; `enforcing_failure_percentage = 0` never ejects.
- Ejection state machine: duration is `base × multiplier`; same-sweep re-ejection grows the multiplier; `max_ejection_time` caps the duration; `max_ejection_percent` caps concurrent ejections; multiplier decrements on healthy intervals (with and without traffic).
- Sweep loop under `tokio::time::pause()`; dropping the `AbortOnDrop` closes the receiver.

Test plan
- `cargo test -p tonic-xds --lib outlier_detection` — 22 passing tests.
- `cargo test -p tonic-xds --lib` — full lib suite passes (no regressions).
- `cargo fmt -p tonic-xds --check` clean.
- `cargo clippy -p tonic-xds --lib --all-features -- -D warnings` clean.