Recover HTTP2 connection from half-open socket using pingInterval. #9444
Open
joelc wants to merge 3 commits into
A stateful firewall *CAN* silently kill a pooled HTTP2 connection at L3 (no RST, no FIN). When this happens with okhttp, the next frame write that fills the kernel TCP send buffer blocks indefinitely.

pingInterval is documented to detect exactly this, but three issues prevented it from firing:

* The ping watchdog ran on writerQueue, so it queued behind the wedged frame write and never ticked. Moved to its own pingQueue; the ping write itself is fire-and-forget onto writerQueue.
* failConnection -> close -> shutdown -> writer.goAway() deadlocked on writer.lock held by the wedged write. failConnection now cancels the socket first to unblock the writer.
* SSLSocket.close() can block for many seconds attempting to send close_notify on a half-open connection. ConnectPlan now closes the raw TCP socket before the SSL layer on cancel.

This PR adds a container-tests repro using socat and an iptables DROP to simulate such a firewall drop. It asserts the failure within 3x pingInterval.
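The first two points can be illustrated with a small standalone model. This is only a sketch using plain `java.util.concurrent` executors, not okhttp's task runner: `PingWatchdogSketch`, `detectsDeadConnection`, and the latch standing in for "cancel the socket" are all hypothetical names. It assumes the wedged frame write stays blocked until the socket is cancelled and that no pong ever arrives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of the watchdog queueing problem; not okhttp code. */
public class PingWatchdogSketch {

    /**
     * Returns true if the watchdog failed the connection within 3x pingIntervalMs.
     * separatePingQueue=false models the watchdog sharing the writer's queue.
     */
    public static boolean detectsDeadConnection(boolean separatePingQueue, long pingIntervalMs) {
        // Assumption: counting this latch down stands in for cancelling the socket.
        CountDownLatch socketCancelled = new CountDownLatch(1);
        AtomicBoolean connectionFailed = new AtomicBoolean(false);

        ExecutorService writerQueue = Executors.newSingleThreadExecutor();
        ExecutorService pingQueue =
            separatePingQueue ? Executors.newSingleThreadExecutor() : writerQueue;
        try {
            // The wedged frame write: it occupies the writer thread until the socket is cancelled.
            writerQueue.execute(() -> awaitQuietly(socketCancelled));

            // The watchdog: enqueue the ping write without waiting for it, wait one
            // interval, then fail the connection because no pong arrived.
            pingQueue.execute(() -> {
                writerQueue.execute(() -> { /* ping frame write; sits behind the wedged write */ });
                sleepQuietly(pingIntervalMs);
                connectionFailed.set(true);
                socketCancelled.countDown(); // cancel first, which unblocks the writer
            });

            boolean unblocked = socketCancelled.await(3 * pingIntervalMs, TimeUnit.MILLISECONDS);
            return unblocked && connectionFailed.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            writerQueue.shutdownNow();
            pingQueue.shutdownNow();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared queue detects:   " + detectsDeadConnection(false, 100));
        System.out.println("separate queue detects: " + detectsDeadConnection(true, 100));
    }
}
```

With a shared queue the watchdog task never starts, because the single writer thread is still inside the wedged write. With its own queue it runs on schedule, and because it cancels before doing anything that needs the writer, the blocked write is released rather than waited on.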
added 2 commits
May 15, 2026 17:15
3.19 is gone
… the connection.
Restores the contract asserted by HttpOverHttp2Test.missingPongsFailsConnection.
The earlier dead-socket fix made failConnection() cancel the socket before calling close(), to unblock a writer stuck on a half-open TCP send. That ordering also surfaced SocketException("Socket closed") to in-flight callers instead of the StreamResetException(PROTOCOL_ERROR) they previously saw, because the reader thread's socket read returns SocketException before close() -> stream.close() has a chance to set errorCode on Http2Stream.
So: mark active streams with closeLater(PROTOCOL_ERROR) up front, before cancelling the socket.
closeLater enqueues the RST_STREAM via writerQueue rather than writing it synchronously under writer.lock. That lock is held by the stuck frame write, so close()'s synchronous RST_STREAM path would itself block.
With errorCode set, Http2Stream.closeInternal returns early on subsequent calls, and a racing failConnection from the reader thread cannot overwrite the caller's StreamResetException with the SocketException it caught.
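The ordering argument can be modelled deterministically. The sketch below is not Http2Stream: `FirstErrorWinsSketch`, `StreamSketch`, and `failureSeenByCaller` are hypothetical, and a plain `IOException` stands in for StreamResetException(PROTOCOL_ERROR). It only assumes that the first failure recorded on a stream wins and later attempts return early.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of "first error recorded on a stream wins"; not okhttp code. */
public class FirstErrorWinsSketch {

    /** Stand-in for a stream whose failure can be set only once. */
    static final class StreamSketch {
        private final AtomicReference<IOException> failure = new AtomicReference<>();

        /** Records the failure if none is set yet; returns false (early return) otherwise. */
        boolean closeInternal(IOException e) {
            return failure.compareAndSet(null, e);
        }

        IOException failure() {
            return failure.get();
        }
    }

    /** Returns the message of the failure an in-flight caller would observe. */
    public static String failureSeenByCaller(boolean markStreamsFirst) {
        StreamSketch stream = new StreamSketch();
        // Stand-in for StreamResetException(PROTOCOL_ERROR).
        IOException reset = new IOException("stream was reset: PROTOCOL_ERROR");

        if (markStreamsFirst) {
            // New ordering: mark active streams before cancelling the socket.
            stream.closeInternal(reset);
        }

        // Cancelling the socket makes the reader thread's read fail, and the reader
        // then tries to fail the stream with the exception it caught.
        stream.closeInternal(new SocketException("Socket closed"));

        if (!markStreamsFirst) {
            // Earlier ordering: close() reaches the stream only after the reader did.
            stream.closeInternal(reset);
        }
        return stream.failure().getMessage();
    }

    public static void main(String[] args) {
        System.out.println("mark first:   " + failureSeenByCaller(true));
        System.out.println("cancel first: " + failureSeenByCaller(false));
    }
}
```

In the real connection the two writers race on different threads; the model fixes the interleaving to the one described in the commit message so the outcome is repeatable.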
Collaborator
Great analysis. A permanently wedged writer is likely gonna cause a bunch of liveness bugs. I'll try to address this report specifically and that issue generally.
(Apologies for additional pushes, was unable to run entire suite locally).