fix(connection): Give pong its own channel #412

darinspivey wants to merge 3 commits into streamnative:master

Conversation
The pong replies to the broker used to share the outbound channel with all other connection traffic. When the channel is full under high load, pongs were being discarded due to try_send on a bounded channel that was full. Not only does this flood the log with errors, but if the broker does not receive the pong in time, it will kill the connection and the cycle will repeat. This commit gives pong its own dedicated bounded(1) channel so that it cannot be crowded out by other outbound traffic. The sink writer drains the pong channel ahead of the main channel via select_biased!, ensuring pong responses are flushed to the socket as soon as possible. Fixes: streamnative#408
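The failure mode and the fix can be shown with a minimal stdlib sketch. This is an illustration, not the pulsar-rs code: `std::sync::mpsc::sync_channel` stands in for the crate's `async_channel::bounded`, the capacity of 4 is arbitrary, and the string messages are stand-ins for real frames.

```rust
use std::sync::mpsc::sync_channel;

fn main() {
    // The bug: pong shared one bounded outbound channel with all other
    // traffic. Once producer sends fill it, pong's try_send fails and
    // the reply is dropped.
    let (shared_tx, _shared_rx) = sync_channel::<&str>(4);
    for _ in 0..4 {
        shared_tx.try_send("send").unwrap(); // producer backlog fills the channel
    }
    assert!(shared_tx.try_send("pong").is_err()); // pong crowded out

    // The fix: a dedicated bounded(1) channel that only ever carries the
    // single in-flight pong, so outbound load cannot displace it.
    let (pong_tx, _pong_rx) = sync_channel::<&str>(1);
    assert!(pong_tx.try_send("pong").is_ok());
    println!("pong accepted on dedicated channel");
}
```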
Pull request overview
This PR fixes connection keepalive instability under high outbound load by ensuring broker ping requests can always be answered with a pong, independent of normal outbound traffic congestion (Fixes #408).
Changes:
- Introduces a dedicated bounded(1) `async_channel` for pong responses instead of sharing the main outbound channel.
- Updates the sink-writer task to prioritize draining the pong channel via `select_biased!`.
- Adjusts `Receiver` and its test wiring to use the new pong sender.
```rust
let msg = futures::select_biased! {
    msg = pong_rx.recv().fuse() => match msg {
        Ok(msg) => msg,
        Err(_) => break,
    },
```
This doesn't feel like it's worth changing. The Receiver only drops when the inbound stream is done, the shutdown signal fires, or a stream error occurred--in all cases the connection is being torn down and error is set, so cascading the sink shutdown is correct behavior. Keeping a pong_tx clone alive in the sink to "keep draining rx" would just delay the inevitable while pretending the connection is still healthy.
```rust
if self.pong_tx.try_send(messages::pong()).is_err() {
    error!("failed to send pong: pong already pending, sink may be stalled");
```
```rust
let (tx, rx) = async_channel::bounded(outbound_channel_size);
let (pong_tx, pong_rx) = async_channel::bounded(1);
let (registrations_tx, registrations_rx) = mpsc::unbounded();
let error = SharedError::new();
let (receiver_shutdown_tx, receiver_shutdown_rx) = oneshot::channel();
```
Thanks for the suggestion. I added a focused Receiver-level test (receiver_routes_ping_to_pong_channel) that asserts pings get routed to the dedicated pong_tx channel rather than the shared outbound. This guards against the most realistic regression — someone reverting the Receiver back to the shared tx, which would re-introduce #408.
I deliberately didn't add a test for the sink-writer's select_biased! priority itself. That code lives inline inside Connection::new() and isn't unit-testable without either (a) extracting the sink loop into a free function purely for testability, or (b) a TCP-loopback integration test that fills the wire and asserts pong ordering. Both felt like a more invasive change than the bot's ask warranted. select_biased! has well-known, language-level semantics, the block is short and unlikely to be silently broken without the channel separation also being undone, and the channel-separation test catches the realistic failure mode.
Happy to do the refactor if you'd prefer the stronger guarantee.
When the pong tries to send, the channel may be closed. Make sure to use an error message that makes sense for all conditions.
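One way to address this is to match on the two failure variants instead of collapsing them into a single message. A stdlib sketch of the idea (not the pulsar-rs code): `std::sync::mpsc::sync_channel` has the same Full-vs-disconnected split that `async_channel`'s `try_send` has, so it can model both conditions.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // Stand-in for the PR's bounded(1) pong channel.
    let (pong_tx, pong_rx) = sync_channel::<&'static str>(1);

    // First pong fits in the capacity-1 buffer.
    assert!(pong_tx.try_send("pong").is_ok());

    // A second pong while the first is unconsumed: the sink is stalled.
    match pong_tx.try_send("pong") {
        Err(TrySendError::Full(_)) => {
            println!("pong already pending, sink may be stalled")
        }
        other => panic!("expected Full, got {other:?}"),
    }

    // Once the receiver is gone the channel reports disconnection, which
    // deserves a different message: the connection is tearing down, and
    // "sink may be stalled" would be misleading.
    drop(pong_rx);
    match pong_tx.try_send("pong") {
        Err(TrySendError::Disconnected(_)) => {
            println!("pong channel closed, connection shutting down")
        }
        other => panic!("expected Disconnected, got {other:?}"),
    }
}
```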
Adds a focused regression test for the Receiver: when a ping arrives on the inbound stream, the pong response must land on `pong_tx` (the dedicated bounded(1) channel) rather than the shared outbound channel. This is the structural half of the streamnative#408 fix. If anyone reverts the Receiver back to using the shared outbound `tx`, this test goes red.
BewareMyPower left a comment
> When the channel is full under high load, pongs were being discarded due to try_send on a bounded channel that was full.
This behavior should be expected. Other clients also send Ping/Pong in the same channel as other requests: https://github.com/apache/pulsar/blob/cd0ab9d6ad33f9cde9bcf56177a3f6f9deb9f510/pulsar-common/src/main/java/org/apache/pulsar/common/protocol/PulsarHandler.java#L93
IMO, the ping pong RPC is not only used for simple connectivity detection. It actually detects whether the connection can process the requests in time.
If you have encountered this issue in production, it might be better to investigate which other commands have blocked the I/O thread for too long.
Thanks for pushing back — I want to make sure I'm not papering over a real "your client is unhealthy" signal. The thing that nags me is that pong is the only outbound message that silently discards under backpressure. After #319, every other control path either blocks ([...]).

The Netty comparison is also a bit different in shape: [...]

I considered just switching pong to [...]. If the worry is losing the diagnostic signal, I'm happy to keep an [...]
Oh it makes sense. I just asked an LLM to write a mermaid graph:

```mermaid
flowchart TD
    subgraph App["Application tasks"]
        P1["Producer::send_non_blocking / send"]
        K1["Keepalive task: send_ping()"]
    end
    subgraph CS["ConnectionSender"]
        S1["send_message_non_blocking()"]
        S2["send_ping()"]
    end
    subgraph Reg["registrations_tx (unbounded mpsc)"]
        R1["Register::Request { key, resolver }"]
        R2["Register::Ping { resolver }"]
    end
    subgraph Out["outbound tx (bounded async_channel)"]
        O1["Produce/BaseCommand::Send"]
        O2["Ping"]
        O3["Pong"]
        O4["Other outbound RPCs"]
    end
    subgraph Conn["Connection tasks"]
        RX["Receiver future\nreads broker frames\nand manages pending_requests"]
        TX["Socket writer loop\nwhile let Ok(msg)=rx.recv()\n  sink.send(msg).await"]
    end
    subgraph Broker["Broker socket"]
        B1["TCP / framed sink+stream"]
    end
    P1 --> S1
    K1 --> S2
    S1 --> R1
    S1 -->|tx.try_send or tx.send| O1
    S2 --> R2
    S2 -->|tx.send| O2
    R1 --> RX
    R2 --> RX
    O1 --> TX
    O2 --> TX
    O3 --> TX
    O4 --> TX
    TX --> B1
    B1 --> RX
    RX -->|inbound Ping from broker| O3
```
Actually I think #312 mixed up the back pressure on send requests and the socket writes. We should only use the bounded channel for

Line 1244 in cf67345

We should use a dedicated unbounded channel for other commands, including Pong.

```rust
loop {
    let msg = futures::select_biased! {
        msg = control_rx.recv().fuse() => msg?, // other commands are sent via control_tx
        msg = data_rx.recv().fuse() => msg?,    // send commands are sent via bounded data_rx
    };
    sink.send(msg).await?;
}
```

WDYT?
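The priority behavior of that loop can be modeled without the futures crate. Below is a single-threaded stdlib sketch (the channel names follow the comment above; `try_recv` polling stands in for what `select_biased!` does when both channels have a message ready): every iteration prefers the control channel over the data channel.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};

// Model of the proposed writer loop: drain control-plane messages
// (pings, pongs, other RPCs) ahead of the bounded data plane.
// This is a sketch of the policy, not the pulsar-rs implementation.
fn drain_in_priority_order(
    control_rx: &Receiver<&'static str>,
    data_rx: &Receiver<&'static str>,
) -> Vec<&'static str> {
    let mut written = Vec::new();
    loop {
        // Control plane first...
        match control_rx.try_recv() {
            Ok(msg) => {
                written.push(msg);
                continue;
            }
            Err(TryRecvError::Empty) => {}
            Err(TryRecvError::Disconnected) => break,
        }
        // ...then the data plane.
        match data_rx.try_recv() {
            Ok(msg) => written.push(msg),
            _ => break, // empty or disconnected: stop the demo
        }
    }
    written
}

fn main() {
    let (control_tx, control_rx) = channel();
    let (data_tx, data_rx) = channel();

    // A backlog of producer sends, plus a pong that must not wait behind them.
    for _ in 0..3 {
        data_tx.send("send").unwrap();
    }
    control_tx.send("pong").unwrap();

    let order = drain_in_priority_order(&control_rx, &data_rx);
    assert_eq!(order, vec!["pong", "send", "send", "send"]);
    println!("{order:?}"); // prints ["pong", "send", "send", "send"]
}
```

The pong-gets-its-own-channel fix in this PR is the special case of this rule where the control plane carries exactly one message kind.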
Yeah, this is the right shape--splitting the data plane from the control plane handles pong as a special case of the general rule and gets rid of the latent stall risk on [...]

Quick sanity-check on classification before I start cutting: [...]

Also, given the scope is bigger than the original PR, do you want me to push it onto this branch, or close #412 and open a fresh one against master with a more accurate title?
Yes, both are right. You can open a new PR or just edit this PR's title and description. Both are okay since the main reviewer (me) has all context.