Implement PickFirst load balancer #2570
nathanielford wants to merge 31 commits into hyperium:master
Conversation
Force-pushed from a442272 to 73397ff
}

#[cfg(test)]
mod test {
There are some tricky edge cases found by Go's e2e-style tests that we discovered when implementing a new dual-stack pick_first LB in Go. I added unit tests for them to ensure the balancer behaved as expected. These tests are mainly in two files:
- pickfirst_ext_test.go: These are e2e-style tests that create a gRPC channel to a fake server.
- pickfirst_test.go: These use a mock channel, similar to the tests in Rust.
It may not be possible to write e2e-style tests in Rust right now since we don't have the same test utilities as Go. However, we can still test the same scenarios using a mock channel.
I would recommend seeing if Gemini can convert these Go tests into tests for Rust, skipping the Happy Eyeballs ones. This would act as a conformance test. I did something similar to generate tests for credentials in the following way:
- Clone gRPC Go.
- Point the Gemini CLI to the test files for gRPC Go and gRPC Rust.
- Ask Gemini to create similar test cases for gRPC Rust.
If it gives decent results without much effort, that's great; otherwise, we can improve test coverage later.
I think I was able to do this in the tests, particularly the last two tests in pick_first.rs. LMK if you think something is misaligned with these!
Force-pushed from 73397ff to bdf7fc3
Force-pushed from 7ea10ad to e1bcaf4
…ss, and updated sync testing framework
…A61 endpoint handling

This commit implements the PickFirst load balancer policy for Tonic gRPC, focusing on:
- Efficient subchannel management with backoff preservation.
- "Stickiness" support: continuing to use an existing Ready subchannel if it remains in resolver updates.
- Compliance with gRFC A61: endpoints are now shuffled before being flattened into an address list, ensuring multiple addresses for a single endpoint (e.g., IPv4/IPv6) stay together.
- Clean state reset: subchannels and selected state are now cleared when receiving an empty address list.
- Alignment with the updated synchronous testing framework in master.

Includes comprehensive test coverage for basic connection, failover, stickiness, exhaustion, deterministic endpoint shuffling, de-duplication, and empty updates.
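The shuffle-then-flatten ordering the commit describes can be sketched as below. This is an illustrative std-only sketch, not tonic's actual API: `Endpoint`, the function names, and the LCG-driven shuffle are all assumptions made so the example stays self-contained and deterministic under a fixed seed.

```rust
// Sketch of the gRFC A61 ordering: shuffle at the ENDPOINT level, then
// flatten, so the addresses of one endpoint stay adjacent in the list.

#[derive(Clone, Debug)]
struct Endpoint {
    addresses: Vec<String>,
}

// Fisher-Yates shuffle driven by a simple LCG so the sketch stays
// std-only and deterministic for a given seed.
fn shuffle_endpoints(endpoints: &mut [Endpoint], mut seed: u64) {
    for i in (1..endpoints.len()).rev() {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = ((seed >> 33) as usize) % (i + 1);
        endpoints.swap(i, j);
    }
}

// Shuffle first, flatten second: a single endpoint's IPv4 and IPv6
// addresses remain next to each other in the flattened address list.
fn compile_address_list(mut endpoints: Vec<Endpoint>, seed: u64) -> Vec<String> {
    shuffle_endpoints(&mut endpoints, seed);
    endpoints.into_iter().flat_map(|e| e.addresses).collect()
}

fn main() {
    let endpoints = vec![
        Endpoint { addresses: vec!["10.0.0.1:50051".into(), "[::1]:50051".into()] },
        Endpoint { addresses: vec!["10.0.0.2:50051".into()] },
    ];
    println!("{:?}", compile_address_list(endpoints, 42));
}
```

Whatever permutation the shuffle produces, the two addresses of the dual-stack endpoint come out adjacent, which is the property the commit's test for deterministic endpoint shuffling would check.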
…active failover

This change enhances the PickFirst load balancing policy to better support gRFC A61 (Happy Eyeballs) and improve connection establishment latency.

Key changes:
- Implement IPv6/IPv4 address interleaving in `compile_address` to ensure subsequent connection attempts alternate between protocol families.
- Introduce a `subchannel_states` cache in `PickFirstPolicy` to track the connectivity status of managed subchannels.
- Refactor connection logic to use a `frontier_index` and proactively skip subchannels known to be in `TransientFailure` (e.g., during backoff).
- Update `advance_frontier` to safely maintain the index within the bounds of the address list, ensuring the policy remains reactive to recovery.
- Add deterministic unit tests for shuffling and interleaving logic.
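The family interleaving mentioned in the commit can be sketched as follows. This is a hedged illustration of the RFC 8305-style ordering (start with the family of the first address, then alternate, falling back when one family runs out); the function name and structure are assumptions, not tonic's `compile_address`.

```rust
use std::net::SocketAddr;

// Interleave addresses by protocol family so consecutive connection
// attempts alternate between IPv4 and IPv6 where possible.
fn interleave(addrs: &[SocketAddr]) -> Vec<SocketAddr> {
    let (mut v4, mut v6): (Vec<_>, Vec<_>) =
        addrs.iter().cloned().partition(|a| a.is_ipv4());
    // Use the vecs as stacks; reversing first means pop() preserves
    // the original relative order within each family.
    v4.reverse();
    v6.reverse();
    let mut want_v4 = addrs.first().map(|a| a.is_ipv4()).unwrap_or(true);
    let mut out = Vec::with_capacity(addrs.len());
    while !v4.is_empty() || !v6.is_empty() {
        let next = if want_v4 {
            v4.pop().or_else(|| v6.pop())
        } else {
            v6.pop().or_else(|| v4.pop())
        };
        out.push(next.unwrap()); // loop condition guarantees Some
        want_v4 = !want_v4;
    }
    out
}

fn main() {
    let addrs: Vec<SocketAddr> = vec![
        "10.0.0.1:1".parse().unwrap(),
        "10.0.0.2:1".parse().unwrap(),
        "[::1]:1".parse().unwrap(),
    ];
    println!("{:?}", interleave(&addrs));
}
```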
let error = self
    .last_connection_error
    .clone()
    .unwrap_or_else(|| "all addresses in transient failure".to_string());
I think if we update self.last_connection_error with an error message in the following two events, it is guaranteed to be present:
- When a subchannel starts in TRANSIENT_FAILURE.
- When a subchannel enters TRANSIENT_FAILURE.
Then we can call .expect() and avoid using a custom error message generated by the LB.
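A minimal sketch of that invariant, with hypothetical names standing in for the policy's real types: if every entry into TRANSIENT_FAILURE records an error, the aggregate report site can `expect()` instead of synthesizing a message.

```rust
// Invariant sketch: every TRANSIENT_FAILURE update records an error,
// so the read site can expect() rather than invent a fallback string.
#[derive(Clone, Copy, PartialEq, Debug)]
enum ConnectivityState { Idle, Connecting, Ready, TransientFailure }

#[derive(Default)]
struct PickFirstSketch {
    last_connection_error: Option<String>,
}

impl PickFirstSketch {
    fn on_subchannel_update(&mut self, state: ConnectivityState, err: Option<String>) {
        if state == ConnectivityState::TransientFailure {
            // Record on every entry into TRANSIENT_FAILURE (including a
            // subchannel that starts there), keeping the invariant alive.
            self.last_connection_error =
                Some(err.unwrap_or_else(|| "connection failed".to_string()));
        }
    }

    fn report_transient_failure(&self) -> String {
        // Safe by the invariant above: the policy only reports
        // TRANSIENT_FAILURE after at least one subchannel entered it.
        self.last_connection_error
            .clone()
            .expect("subchannel entered TRANSIENT_FAILURE without an error")
    }
}

fn main() {
    let mut p = PickFirstSketch::default();
    p.on_subchannel_update(
        ConnectivityState::TransientFailure,
        Some("connection refused".to_string()),
    );
    println!("{}", p.report_transient_failure());
}
```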
// Update the cache for all updates.
self.subchannel_states
    .insert(subchannel.address(), state.clone());
Before inserting, we should first ensure that the address is part of the current list of subchannels. If not, the function should return early. This can happen when a subchannel update has been queued by the channel while a resolver update arrives, removing the address.
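The guard described above might look like this. The struct and method names are illustrative, not tonic's types; the point is the early return on stale updates.

```rust
use std::collections::{HashMap, HashSet};

// Sketch of the guard: drop state updates for addresses no longer in the
// managed subchannel set, which happens when an update was queued before
// a resolver update removed the subchannel.
struct SubchannelStateCache {
    current_addresses: HashSet<String>,
    states: HashMap<String, String>,
}

impl SubchannelStateCache {
    // Returns true if the update was applied, false if it was stale.
    fn on_subchannel_update(&mut self, address: &str, state: &str) -> bool {
        if !self.current_addresses.contains(address) {
            // Stale update for a removed subchannel: return early.
            return false;
        }
        self.states.insert(address.to_string(), state.to_string());
        true
    }
}

fn main() {
    let mut cache = SubchannelStateCache {
        current_addresses: HashSet::from(["10.0.0.1:50051".to_string()]),
        states: HashMap::new(),
    };
    assert!(cache.on_subchannel_update("10.0.0.1:50051", "READY"));
    assert!(!cache.on_subchannel_update("10.0.0.2:50051", "IDLE"));
}
```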
}

// Steady State Retries: Automatically connect when a subchannel goes IDLE.
if self.steady_state.is_some() && state.connectivity_state == ConnectivityState::Idle {
nit: It would be good to group the handling of steady_state and the first pass into an if/else block, and handle common transitions (e.g., READY or READY->IDLE) before this block. This would allow readers to understand the first pass and steady-state behavior separately.
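One possible shape for that grouping, with hypothetical names standing in for the real handler: common transitions are dispatched first, and the first-pass and steady-state branches read independently.

```rust
// Sketch of the suggested structure: common transitions first, then an
// if/else separating the first connection pass from steady-state behavior.
#[derive(PartialEq, Clone, Copy)]
enum ConnectivityState { Idle, Connecting, Ready, TransientFailure }

#[derive(PartialEq)]
enum Phase { FirstPass, SteadyState }

fn handle_update(phase: &Phase, state: ConnectivityState) -> &'static str {
    // Common transitions (e.g. READY, or READY -> IDLE bookkeeping) first.
    if state == ConnectivityState::Ready {
        return "select this subchannel";
    }
    if *phase == Phase::SteadyState {
        // Steady state: reconnect automatically when a subchannel goes IDLE.
        if state == ConnectivityState::Idle {
            "reconnect"
        } else {
            "track failure"
        }
    } else {
        // First pass: advance the frontier on TRANSIENT_FAILURE.
        if state == ConnectivityState::TransientFailure {
            "advance frontier"
        } else {
            "wait"
        }
    }
}

fn main() {
    assert_eq!(handle_update(&Phase::SteadyState, ConnectivityState::Idle), "reconnect");
    assert_eq!(handle_update(&Phase::FirstPass, ConnectivityState::TransientFailure), "advance frontier");
}
```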
// Steady State Failures: Count failures across all subchannels.
if let Some(ref mut ss) = self.steady_state {
    if state.connectivity_state == ConnectivityState::TransientFailure {
        if ss.record_failure(self.subchannels.len()) {
nit: As a minor simplification, we can pass the channel controller to record_failure and let it handle requesting re-resolution itself.
I moved all the steady-state stuff into the steady state handler. I think that it makes it a bit cleaner, to keep all the state tracking there. LMK if you think something is leaky though!
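The suggested simplification could look like this sketch, where the trait and names are illustrative: `record_failure` takes the controller and triggers re-resolution itself instead of returning a flag for the caller to act on.

```rust
// Sketch: pass the channel controller into record_failure so the
// steady-state tracker requests re-resolution itself.
trait ChannelController {
    fn request_resolution(&mut self);
}

#[derive(Default)]
struct SteadyState {
    failures: usize,
}

impl SteadyState {
    // Count a TRANSIENT_FAILURE; once every subchannel has failed,
    // ask the channel to re-resolve and reset the counter.
    fn record_failure(&mut self, num_subchannels: usize, controller: &mut impl ChannelController) {
        self.failures += 1;
        if self.failures >= num_subchannels {
            controller.request_resolution();
            self.failures = 0;
        }
    }
}

#[derive(Default)]
struct FakeController {
    resolutions: usize,
}

impl ChannelController for FakeController {
    fn request_resolution(&mut self) {
        self.resolutions += 1;
    }
}

fn main() {
    let mut ss = SteadyState::default();
    let mut ctrl = FakeController::default();
    for _ in 0..3 {
        ss.record_failure(3, &mut ctrl);
    }
    assert_eq!(ctrl.resolutions, 1);
}
```

This keeps the "one re-resolution per full round of failures" bookkeeping entirely inside the steady-state tracker, which matches the author's follow-up of moving all steady-state handling into one place.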
if let Some(next_sc) = self.advance_frontier(false) {
    let next_sc = next_sc.clone(); // Clone to avoid borrow issues
    self.trigger_subchannel_connection(next_sc, channel_controller);
} else {
We need to cancel the timer in the else case as well; it can probably happen before the if/else block.
Comparing the references of timer_handle with the Go implementation, we also need to abort the timer when the LB policy is dropped and when a new resolver update arrives.
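The cancellation points listed above can be sketched std-only as below. A real implementation would wrap something like a tokio `JoinHandle` and call `abort()` on it; here a shared flag stands in for the running timer task, and all names are illustrative.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Sketch of the timer cancellation points: abort when the frontier is
// exhausted, when a new resolver update arrives, and when the LB policy
// itself is dropped.
struct TimerHandle {
    cancelled: Arc<AtomicBool>,
}

impl TimerHandle {
    fn abort(&self) {
        self.cancelled.store(true, Ordering::SeqCst);
    }
}

struct PolicySketch {
    timer: Option<TimerHandle>,
}

impl PolicySketch {
    fn on_resolver_update(&mut self) {
        // A stale timer must not fire into the new address list.
        if let Some(t) = self.timer.take() {
            t.abort();
        }
    }
}

impl Drop for PolicySketch {
    fn drop(&mut self) {
        // Nothing may fire after the policy is gone.
        if let Some(t) = self.timer.take() {
            t.abort();
        }
    }
}

fn main() {
    let cancelled = Arc::new(AtomicBool::new(false));
    let policy = PolicySketch {
        timer: Some(TimerHandle { cancelled: cancelled.clone() }),
    };
    drop(policy);
    assert!(cancelled.load(Ordering::SeqCst));
}
```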
This should all be handled - though do please audit all the possible cases!
Force-pushed from 880fb6f to 9cbb1a4
…mediate READY activation
…preserve stickiness
Force-pushed from b02264a to 6698786
…ollision unit test
…nually written hash impl
…into refactor/AddressHashing
…stLB

# Conflicts:
#	grpc/Cargo.toml
Note that this now includes #2631, which will merge first so as to keep a clearer record. However, I needed the fix to get the tests ported from Go to work.
Motivation
Full implementation of the pick-first load balancer, including 'Happy Eyeballs' features.
Solution
Load balancing implementation to pick the first available endpoint to connect to, maintaining stickiness across endpoint updates if configured. Handles accepting new LB configuration and subchannel reconstruction.
Prototype is at https://github.com/nathanielford/grpc-rust-testbed/tree/main/pick_first_lib
Notes