
feat(tonic-xds): add OutlierDetector sweep engine + failure-percentage algorithm (gRFC A50)#2619

Open
LYZJU2019 wants to merge 18 commits into hyperium:master from LYZJU2019:lyzju2019/a50-outlier-detector

Conversation

LYZJU2019 (Contributor) commented Apr 30, 2026

Summary

Second PR in the gRFC A50: xDS Outlier Detection series. PR 1 (#2604) introduced the validated config types; this PR adds the sweep engine and the failure-percentage ejection algorithm. The success-rate algorithm, which adds float math (mean / variance / standard deviation across qualifying hosts), lands in a focused follow-up PR.

No integration yet. The data path still uses the existing P2C balancer; this module is exercised purely through unit tests. Wiring it into LoadBalancer (and the per-endpoint RPC outcome interception layer) is the subject of follow-up PRs.

What's in this PR

  • EndpointCounters — lock-free AtomicU64 success/failure counters. record_success / record_failure are called from the data path; snapshot_and_reset is called once per sweep.
  • EjectionDecision — Eject(addr) / Uneject(addr), the externally-visible output of a sweep.
  • OutlierDetector — owns the per-endpoint state map, the config, and an injectable Rng. add_endpoint(addr) -> Arc<EndpointCounters> returns the data-path handle; remove_endpoint(addr) unregisters.
  • run_sweep(now: Instant) -> Vec<EjectionDecision> — pure, no I/O; a usage sketch follows this list. Implements the A50 sweep order verbatim:
    1. Snapshot (swap counters) every endpoint.
    2. Run the failure-percentage algorithm.
    3. Apply ejections (set ejected_at, increment multiplier).
    4. Step-5 housekeeping: decrement multiplier of every non-ejected address with multiplier > 0; un-eject addresses whose backoff has elapsed.
  • Spawned sweep loop (OutlierDetector::spawn / spawn_with_rng) — owns the channel sender and forwards each decision. The loop is scoped by an AbortOnDrop; the receiver closes when the task ends.
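A rough usage sketch of how these pieces fit together, written against the API as described above (exact ownership, constructors, and signatures may differ in the actual module):

```rust
use std::time::Instant;

// Sketch against the API described above; ownership details are illustrative.
fn sketch(detector: &mut OutlierDetector, addr: EndpointAddress) {
    // Data path: the Arc<EndpointCounters> handle returned at registration
    // is what RPC outcome recording hits.
    let counters = detector.add_endpoint(addr);
    counters.record_success();
    counters.record_failure();

    // Sweep path: a pure call, no channel or I/O involved.
    for decision in detector.run_sweep(Instant::now()) {
        match decision {
            EjectionDecision::Eject(_addr) => { /* stop picking this address */ }
            EjectionDecision::Uneject(_addr) => { /* resume picking it */ }
        }
    }
}
```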

A50 compliance

  • Sweep step order matches the gRFC verbatim.
  • Failure-percentage uses strict > against the threshold (an address sitting exactly at the threshold is not ejected).
  • Multiplier decrement runs unconditionally on every non-ejected address with multiplier > 0, regardless of whether it received traffic that interval.
  • Ejection duration is min(base * multiplier, max(base, max_ejection_time)); a small sketch follows this list.
  • max_ejection_percent budgets concurrent ejections.
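That duration rule is just the following (a minimal sketch, assuming both knobs are std::time::Duration values):

```rust
use std::time::Duration;

// Hedged sketch of the A50 ejection-duration rule stated above.
fn ejection_duration(base: Duration, multiplier: u32, max_ejection_time: Duration) -> Duration {
    let uncapped = base * multiplier;
    uncapped.min(base.max(max_ejection_time))
}
```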

What's deliberately deferred

  • Success-rate algorithm. It adds float math and two dedicated tests, and lives cleanly in a follow-up PR. If OutlierDetectionConfig.success_rate is set on a cluster, this PR ignores it (documented in the module docstring).
  • Integration with LoadBalancer. The mpsc channel of EjectionDecisions polled by the LB (using EjectedChannel from #2587, "tonic-xds: Implement subchannel state machine") is its own PR.
  • RPC outcome interception. The Service wrapper that classifies HTTP/gRPC status into success/failure and increments EndpointCounters is its own PR.
  • Proto parsing into ClusterResource. Same as above — wired alongside integration.

Why run_sweep returns a Vec instead of sending on a channel

Keeping run_sweep pure has two payoffs: (a) algorithm tests don't need a channel or tokio time machinery, just detector.run_sweep(now); and (b) dropping the AbortOnDrop reliably closes the receiver, because the sender lives on the spawned task — not inside the OutlierDetector Arc that consumers may hold.
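Concretely, an algorithm test stays fully synchronous; something along these lines (the config helper, address constructor, and constructor call here are illustrative, not names from this PR):

```rust
use std::time::Instant;

#[test]
fn failing_endpoint_gets_ejected() {
    // `fp_config()` and `addr()` are hypothetical test helpers for this sketch.
    let mut detector = OutlierDetector::new(fp_config());
    let counters = detector.add_endpoint(addr("10.0.0.1:50051"));
    for _ in 0..100 {
        counters.record_failure();
    }
    // No channel, no tokio: call the pure sweep and inspect the returned Vec.
    let decisions = detector.run_sweep(Instant::now());
    assert_eq!(decisions, vec![EjectionDecision::Eject(addr("10.0.0.1:50051"))]);
}
```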

Test coverage (22 unit tests)

  • Counters: record + atomic snapshot-and-reset.
  • Endpoint registry: add_endpoint returns the same handle on second call, remove_endpoint clears state.
  • Failure-percentage algorithm: ejection above threshold; skip below threshold; boundary case — exactly at threshold does NOT eject; minimum_hosts and request_volume gates; enforcing_failure_percentage = 0 never ejects.
  • Ejection lifecycle: un-ejects after base × multiplier; same-sweep re-ejection grows the multiplier; max_ejection_time caps the duration; max_ejection_percent caps concurrent ejections; multiplier decrements on healthy intervals (with and without traffic).
  • Sweep loop: emits decisions on each tick (using tokio::time::pause()); dropping the AbortOnDrop closes the receiver.

Test plan

  • cargo test -p tonic-xds --lib outlier_detection — 22 passing tests.
  • cargo test -p tonic-xds --lib — full lib suite passes (no regressions).
  • cargo fmt -p tonic-xds --check clean.
  • cargo clippy -p tonic-xds --lib --all-features -- -D warnings clean.


Implement the core gRFC A50 outlier-detection algorithm: per-endpoint
success/failure counters, the success-rate and failure-percentage
ejection algorithms, the ejection-multiplier state machine, and a
periodic sweep task that emits ejection/un-ejection decisions on a
channel.

`run_sweep` is pure (returns a Vec<EjectionDecision>); the sweep loop
spawned by `OutlierDetector::spawn` owns the channel sender and
forwards decisions, so dropping the returned `AbortOnDrop` ends the
loop and closes the receiver. Tests drive `run_sweep` directly without
the channel or tokio time mechanics.

Algorithm coverage matches the gRFC:
  - Success-rate ejection with configurable `stdev_factor`,
    `enforcing_success_rate`, `minimum_hosts`, `request_volume`.
  - Failure-percentage ejection with `threshold`,
    `enforcing_failure_percentage`, `minimum_hosts`, `request_volume`.
  - Ejection multiplier increments on each ejection, decays on healthy
    intervals; ejection duration is `base * multiplier` capped at
    `max(base, max_ejection_time)`.
  - `max_ejection_percent` caps total concurrent ejections.

Probability rolls go through an injectable `Rng` trait (defaulting to
`fastrand`) so tests can pin enforcement decisions.
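A minimal sketch of that seam, with assumed names (the trait shape in the actual code may differ):

```rust
/// Assumed shape of the injectable RNG used for enforcement rolls.
pub(crate) trait Rng {
    /// A value in 0..100, compared against the enforcement percentage.
    fn next_percent(&mut self) -> u32;
}

/// Default implementation backed by `fastrand`.
pub(crate) struct FastRng;

impl Rng for FastRng {
    fn next_percent(&mut self) -> u32 {
        fastrand::u32(0..100)
    }
}

/// Test double: always returns the same roll, so a test can pin whether
/// enforcement happens for a given enforcement percentage.
pub(crate) struct FixedRng(pub u32);

impl Rng for FixedRng {
    fn next_percent(&mut self) -> u32 {
        self.0
    }
}
```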

Standalone in this PR — no integration with the load balancer yet.
That lands in a follow-up alongside the per-endpoint outcome
interception layer.

Refs: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md

Address two follow-up review comments from hyperium#2604 (the merged config
PR) by folding the doc updates into this PR:

- Module docstring: describe the actual integration plan (an mpsc
  channel of EjectionDecisions polled by LoadBalancer, leveraging
  EjectedChannel) instead of the original "filter on the Discover
  stream" wording. Add intra-doc links to the relevant types.

- enforcing_success_rate / enforcing_failure_percentage: clarify
  that each is the *enforcement probability* — distinct from the
  per-algorithm threshold (stdev_factor for success-rate, threshold
  for failure-percentage). Note that 0 disables enforcement while
  still computing statistics.

Also fix an unresolved intra-doc link in the algorithm module.

Three spec-compliance fixes to `run_sweep` and the failure-percentage
algorithm:

1. Reorder the sweep to match A50 step order: snapshot counters → run
   success-rate algorithm → run failure-percentage algorithm → step-5
   housekeeping (decrement non-ejected multipliers, un-eject elapsed
   ejections). The previous order (un-eject before algorithms) caused
   spurious `Uneject` decisions whenever the same sweep also re-ejected
   the address. Per spec, re-ejection refreshes `ejected_at` to `now`
   before the un-eject check runs, so no transient un-eject is emitted.

2. Drop the `total > 0` traffic gate from the multiplier-decrement
   step. A50 says a non-ejected address with multiplier > 0 has its
   multiplier decremented every sweep, regardless of whether it
   received traffic that interval.

3. Failure-percentage now uses strict `>` against the threshold (was
   `>=`). Per A50: "If the address's failure percentage is greater
   than `failure_percentage_ejection.threshold`..." — an address
   sitting exactly at the threshold is not ejected (sketched below).
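A condensed sketch of the corrected checks from items 2 and 3, with illustrative names and signatures:

```rust
/// Strict `>` per A50: an endpoint exactly at the threshold is not ejected.
fn exceeds_failure_threshold(successes: u64, failures: u64, threshold: u32, request_volume: u64) -> bool {
    let total = successes + failures;
    if total == 0 || total < request_volume {
        return false; // no traffic, or below the request-volume gate
    }
    (100 * failures / total) > u64::from(threshold)
}

/// Step-5 housekeeping: every non-ejected address with a positive multiplier
/// is decremented each sweep, with no traffic gate.
fn decay_multiplier(ejected: bool, multiplier: &mut u32) {
    if !ejected && *multiplier > 0 {
        *multiplier -= 1;
    }
}
```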

Also: drop the explicit "skip ejected hosts from candidate list" pre-
filter. Per spec the algorithms iterate every address; ejected hosts
naturally fail the `request_volume` gate since they receive no traffic
in production. Behavior on real workloads is unchanged.

Test changes:
  - `re_ejection_doubles_duration` now asserts a single `Eject`
    decision (no transient `Uneject`) under the corrected sweep order.
  - New `failure_percentage_at_threshold_does_not_eject` covers the
    strict-`>` boundary.
  - New `multiplier_decrements_even_without_traffic` covers the
    no-traffic-gate fix.

Drop the success-rate algorithm and its tests from this PR so the
outlier-detection PR is minimal and stand-alone. The scaffolding
(sweep loop, multiplier state, counters, max-ejection-percent budget)
is unchanged and still exercised by the failure-percentage algorithm
plus the multiplier / un-eject / cap tests.

If `OutlierDetectionConfig.success_rate` is set on the cluster, it is
currently ignored. Documented in the module docstring with a pointer
to the follow-up PR.

Removes:
  - `OutlierDetector::run_success_rate` (mean / variance / sqrt math).
  - `success_rate` dispatch in `run_sweep`.
  - `run_failure_percentage`'s `!out.contains` filter — dead now that
    only one algorithm runs per sweep.
  - `success_rate_ejects_outlier_below_threshold` test.
  - `success_rate_no_ejection_when_all_uniform` test.
  - The `sr_config` test helper.
  - Unused `SuccessRateConfig` import.
LYZJU2019 changed the title from "feat(tonic-xds): add OutlierDetector sweep engine (gRFC A50)" to "feat(tonic-xds): add OutlierDetector sweep engine + failure-percentage algorithm (gRFC A50)" on Apr 30, 2026.
Switch from `mpsc::unbounded_channel` to `mpsc::channel(256)` for the
ejection-decision stream that the sweep loop emits.

The decisions are edge-triggered (`Eject`/`Uneject` transitions, not
state snapshots), so the consumer must process every event in order;
we can't drop or coalesce. But we don't want unbounded memory growth
either if the consumer stalls. A bounded channel gives us:

  - Same correctness as unbounded — no events dropped, ordered delivery.
  - Bounded memory.
  - Natural backpressure: when the buffer fills, `tx.send().await`
    parks the sweep task, which (combined with `MissedTickBehavior::
    Skip`) throttles sweep cadence to whatever rate the consumer can
    drain. Computing more decisions than the consumer can apply just
    widens the desync.

Capacity is 256 — at most `2 * num_endpoints` decisions per sweep, so
this buffers several sweeps' worth of decisions for clusters of typical
size. A docstring on `DECISIONS_CHANNEL_CAPACITY` captures the
rationale for future readers.
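Sketch of the loop shape this describes (the surrounding plumbing, Arc usage, and parameter names are illustrative, not lifted from the code):

```rust
use std::{sync::Arc, time::{Duration, Instant}};
use tokio::sync::mpsc;
use tokio::time::MissedTickBehavior;

// Illustrative sweep-loop body; `detector` stands in for whatever state the
// spawned task actually owns.
async fn sweep_loop(
    detector: Arc<OutlierDetector>,
    interval: Duration,
    tx: mpsc::Sender<EjectionDecision>,
) {
    let mut ticker = tokio::time::interval(interval);
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);
    loop {
        ticker.tick().await;
        for decision in detector.run_sweep(Instant::now()) {
            // A full buffer parks this task here; combined with Skip above,
            // sweep cadence degrades to whatever rate the consumer drains at.
            if tx.send(decision).await.is_err() {
                return; // receiver dropped
            }
        }
    }
}
```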

Replace `spawn_with_rng` with `spawn_with`, taking an
`OutlierDetectorOptions` struct that bundles the RNG and the new
configurable `decisions_channel_capacity`. Defaults are unchanged
(`fastrand` RNG, capacity 256).

The hard-coded constant becomes `DEFAULT_DECISIONS_CHANNEL_CAPACITY`
and is no longer the only knob — production callers may want to bump
the bound for clusters with very large endpoint sets (worst case
`2 * num_endpoints` decisions per sweep) or unusually slow consumers.

Using a struct instead of a long argument list means future runtime
knobs (custom Tokio runtime, alternate backoff policies, observability
hooks, …) can be added without breaking call sites — callers typically
construct via `..Default::default()`.

The xDS-derived `OutlierDetectionConfig` stays separate from these
host-side runtime knobs, keeping a clean line between "what the xDS
proto specifies" and "how this binary chooses to host it."
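A sketch of the call-site ergonomics this buys (the field layout here is assumed for illustration, not copied from the PR):

```rust
// Assumed field layout for illustration only.
pub(crate) struct OutlierDetectorOptions {
    pub rng: Box<dyn Rng + Send>,
    pub decisions_channel_capacity: usize,
}

impl Default for OutlierDetectorOptions {
    fn default() -> Self {
        Self {
            rng: Box::new(FastRng),
            decisions_channel_capacity: DEFAULT_DECISIONS_CHANNEL_CAPACITY,
        }
    }
}

// Callers override only what they need; adding future knobs keeps this
// call site compiling unchanged.
fn options_for_large_cluster() -> OutlierDetectorOptions {
    OutlierDetectorOptions {
        decisions_channel_capacity: 4096,
        ..Default::default()
    }
}
```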
…tests

Both `sweep_loop_emits_decisions_on_tick` and
`dropping_abort_stops_sweep_loop` previously used `tokio::time::sleep`
in `start_paused = true` mode. That works through the runtime's
auto-advance heuristic for parked tasks, but the heuristic is sensitive
to the order of pending wake-ups across multiple tasks and can be
flaky in practice.

  - `sweep_loop_emits_decisions_on_tick`: switch to
    `tokio::time::advance(150ms)` which explicitly moves the clock and
    yields until pending wake-ups have been polled — deterministic.
  - `dropping_abort_stops_sweep_loop`: drop the artificial sleep
    altogether. Aborting the JoinHandle wakes the spawned task
    synchronously; the runtime polls it, the harness observes the
    abort, and the task ends — dropping its sender. `rx.recv().await`
    parks briefly while that happens and then returns `None`. No time
    advancement needed.

Stress-tested both tests 50× back-to-back: all pass.
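For example, the first test reduces to roughly this shape (the setup helper is hypothetical):

```rust
#[tokio::test(start_paused = true)]
async fn sweep_loop_emits_decisions_on_tick() {
    // Hypothetical helper: spawns the sweep loop over a detector that will
    // eject an endpoint, returning the decision receiver and the abort guard.
    let (mut rx, _abort) = spawn_failing_detector_for_test();

    // Explicitly move the paused clock past the sweep interval; unlike
    // sleep-based auto-advance, this is deterministic.
    tokio::time::advance(std::time::Duration::from_millis(150)).await;

    assert!(rx.recv().await.is_some());
}
```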
LYZJU2019 added 4 commits May 4, 2026 10:29
Rewrite the doc comment to be reference documentation rather than a
design narrative. Drops the editorializing ("the right behavior") and
the first-person reasoning, keeps what a developer needs: what the
constant controls, why this size, what happens at capacity (and why
decisions can't be dropped or coalesced), and how to override.

The previous design used two separate `AtomicU64`s and snapshotted via
two independent `swap` calls — the doc comment claimed this was atomic
across the pair, but it isn't: an RPC completing between the two swaps
inflates the next snapshot by one event, biasing the failure-percentage
computation slightly under contention.

Pack both counters into one `AtomicU64` (high 32 bits: successes, low
32 bits: failures). `record_*` becomes a single `fetch_add` (same hot-
path cost as before), `snapshot_and_reset` becomes a single `swap(0)`,
and the snapshot is now genuinely atomic across the pair — matching
the bucket-swap semantics the gRFC describes.

Each counter is capped at `u32::MAX` per sweep interval. Exceeding it
would carry into the other counter's bits, but the cap is unreachable
for realistic workloads (> 4 × 10⁹ RPCs to one endpoint within one
interval). Documented on the struct.
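A sketch of the packed layout (memory ordering here is illustrative; the point is the single-word swap):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of the packed counter described above:
/// high 32 bits = successes, low 32 bits = failures.
/// Overflow past u32::MAX per interval would carry into the other half;
/// documented as unreachable for realistic workloads.
pub(crate) struct EndpointCounters(AtomicU64);

impl EndpointCounters {
    pub fn record_success(&self) {
        self.0.fetch_add(1u64 << 32, Ordering::Relaxed);
    }

    pub fn record_failure(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    /// A single `swap` makes the (successes, failures) snapshot atomic as a pair.
    pub fn snapshot_and_reset(&self) -> (u64, u64) {
        let packed = self.0.swap(0, Ordering::Relaxed);
        (packed >> 32, packed & u64::from(u32::MAX))
    }
}
```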
Guard the `100 * failure / total` division against `total == 0`.
gRFC A50 doesn't forbid `request_volume == 0`, in which case the
qualifying filter `c.total >= request_volume` admits candidates with
zero traffic; the spec is silent on `0/0`, so skip those endpoints
rather than panic.
A collaborator commented on tonic-xds/src/client/loadbalance/outlier_detection.rs, on the `pub(crate) fn spawn_with(` hunk:

What about not using these spawned loops at all, and instead using RPCs as the ticks for the outlier-detection window? You can still use wallclock time; just tick on every RPC call instead of every second.

This works because tonic-xds only needs to handle ejection signals in loadbalancer.poll_ready(), which already takes the load balancer by &mut self, so the outlier-detection decision is global and free of race conditions.

LYZJU2019 and others added 6 commits May 4, 2026 15:27
…tests

Drop the test-only `sort` helper that compared `EjectionDecision`s by
their `Debug` string representation, which was fragile (any change to
the `Debug` impl would silently change ordering). Derive `PartialOrd`
and `Ord` on `EjectionDecision` (and on `EndpointAddress` /
`EndpointHost`, since the address is the inner field) and call
`Vec::sort` directly at the one test site.

When an already-ejected endpoint has in-flight RPCs that complete
during its ejection backoff, those completions accumulate on its
counter. At the next sweep the algorithm may "re-eject" the host
(refreshing its `ejected_at` timestamp and bumping the multiplier).
That action does not change the count of currently-ejected addresses,
so per A50's `max_ejection_percent` check it must not consume a slot
in the cap — but the previous code decremented the budget for it,
under-counting how many *new* ejections the cap allows.

Track the pre-sweep ejection state on each `Candidate` and only
decrement the budget for new ejections in the failure-percentage
algorithm. Add a regression test covering the specific scenario.

Replace the spawned sweep loop + mpsc channel with an on-demand model:
the detector exposes `maybe_run_sweep(&mut self, now: Instant) ->
Vec<EjectionDecision>` and the consumer (the load balancer in a
follow-up PR) calls it from its own event loop — typically
`poll_ready` — gated by wallclock time.

This eliminates a significant amount of machinery:
  - `tokio::spawn`, `sweep_loop`, `AbortOnDrop`, the mpsc channel.
  - The bounded-channel capacity option, its constant, and its docs
    (`OutlierDetectorOptions::decisions_channel_capacity`,
    `DEFAULT_DECISIONS_CHANNEL_CAPACITY`).
  - `OutlierDetectorOptions` itself — collapses to two constructors
    `new(config)` and `with_rng(config, rng)`.
  - The `Mutex` on `state` — the consumer's `&mut self` already
    serializes access.
  - Two `#[tokio::test(start_paused = true)]` tests that exercised the
    spawned task and its abort handle.

Sweep timing now depends on RPC traffic: when no RPCs flow, no sweeps
run. This matches A50's intent (sweeps happen approximately every
`interval` while traffic is flowing) and is observably equivalent
because ejection only matters during endpoint picking, which only
happens during RPCs. Suggested by the PR review.
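A sketch of the interval gate this implies (field names such as `last_sweep` are assumptions for illustration):

```rust
use std::time::Instant;

impl OutlierDetector {
    /// Runs a sweep only if the configured interval has elapsed since the
    /// previous one (or if no sweep has run yet); otherwise returns an empty Vec.
    pub(crate) fn maybe_run_sweep(&mut self, now: Instant) -> Vec<EjectionDecision> {
        match self.last_sweep {
            Some(last) if now.duration_since(last) < self.config.interval => Vec::new(),
            _ => {
                self.last_sweep = Some(now);
                self.run_sweep(now)
            }
        }
    }
}
```

The consumer would call this from its own event loop (for instance `poll_ready`) with `Instant::now()` and apply the returned decisions in place.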

Tests:
  - All algorithm-level tests rewritten to use owned `OutlierDetector`
    + `&mut self` calls, no `Mutex::lock()`, no Arc.
  - Three new `maybe_run_sweep_*` tests cover the interval gate:
    runs on first call, skips before interval elapsed, runs after.
  - Existing failure-percentage and multiplier/un-ejection tests
    unchanged in spirit; just adjusted to the new ownership model.

Pass through every doc comment and inline comment, removing rationale,
timeline language, and explanations that don't help a future reader.
Notable trims:

  - Module docstring drops "Knows nothing about the data path:" framing,
    the "lands in a follow-up PR" timeline (regression — flagged and
    removed earlier on a different doc), and the "(mean and standard
    deviation across the qualifying hosts)" parenthetical.
  - `Rng` trait drops the "Abstracted so tests can inject" rationale.
  - `OutlierDetector` struct drops "State is owned (no `Mutex`, no
    `Arc`):" framing.
  - `add_endpoint` / `remove_endpoint` / `with_rng` lose the trailing
    usage hints / explanatory parentheticals.
  - `maybe_run_sweep` / `run_sweep` tightened to facts-only.
  - Inline comments inside `run_sweep` drop "we model that" and
    "intentionally not yet dispatched in this PR" timeline.
  - Inline comment for the budget-decrement guard now points at
    `Candidate::already_ejected` instead of duplicating its doc.
  - Test `already_ejected_re_ejection_does_not_consume_budget` drops
    the "this would fail before the fix" git-history paragraph.
LYZJU2019 marked this pull request as ready for review May 6, 2026 20:50.