Skip to content

Recover HTTP2 connection from half-open socket using pingInterval.#9444

Open
joelc wants to merge 3 commits into
square:masterfrom
joelc:deadsocket-fix
Open

Recover HTTP2 connection from half-open socket using pingInterval.#9444
joelc wants to merge 3 commits into
square:masterfrom
joelc:deadsocket-fix

Conversation

@joelc
Copy link
Copy Markdown

@joelc joelc commented May 15, 2026

A stateful firewall CAN silently kill a pooled HTTP2 connection at L3 (no RST, no FIN).

When this happens with okhttp, the next frame write that fills the kernel TCP send buffer blocks indefinitely.

pingInterval is documented to detect exactly this, but three issues prevented it from firing:

  • The ping watchdog ran on writerQueue, so it queued behind the wedged frame write and never ticked. Moved to its own pingQueue; the ping write itself is fire-and-forget onto writerQueue.

  • failConnection -> close -> shutdown -> writer.goAway() deadlocked on writer.lock held by the wedged write. failConnection now cancels the socket first to unblock the writer.

  • SSLSocket.close() can block for many seconds attempting to send close_notify on a half-open connection. ConnectPlan now closes the raw TCP socket before the SSL layer on cancel.

This PR adds a container-tests repro using socat amd an iptables DROP to simulate such a firewall drop. It asserts the failure within 3x pingInterval.

(Apologies for additional pushes, was unable to run entire suite locally).

A stateful firewall *CAN* silently kill a pooled HTTP2 connection at L3 (no RST, no FIN).

When this happens with okhttp, the next frame write that fills the kernel TCP send buffer blocks indefinitely.

pingInterval is documented to detect exactly this, but three issues prevented it from firing:

* The ping watchdog ran on writerQueue, so it queued behind the wedged frame write and never ticked. Moved to its own pingQueue; the ping write itself is fire-and-forget onto writerQueue.

* failConnection -> close -> shutdown -> writer.goAway() deadlocked on writer.lock held by the wedged write. failConnection now cancels the socket first to unblock the writer.

* SSLSocket.close() can block for many seconds attempting to send close_notify on a half-open connection. ConnectPlan now closes the raw TCP socket before the SSL layer on cancel.

This PR adds a container-tests repro using socat amd an iptables DROP to simulate such a firewall drop. It asserts the failure within 3x pingInterval.
@joelc joelc force-pushed the deadsocket-fix branch from 9a751a3 to 792dcdc Compare May 15, 2026 21:13
Joel Caplin added 2 commits May 15, 2026 17:15
3.19 is gone
… the connection.

Restores the contract asserted by HttpOverHttp2Test.missingPongsFailsConnection.

The earlier dead-socket fix made failConnection() cancel the socket before calling close(), to unblock a writer stuck on a half-open TCP send. That ordering also surfaced SocketException("Socket closed") to in-flight callers instead of the StreamResetException(PROTOCOL_ERROR) they previously saw, because the reader thread's socket read returns SocketException before close() -> stream.close() has a chance to set errorCode on Http2Stream.

So: mark active streams with closeLater(PROTOCOL_ERROR) up front, before cancelling the socket.

closeLater enqueues the RST_STREAM via writerQueue rather than writing it synchronously under writer.lock — that lock is held by the stuck frame write, so close()'s synchronous RST_STREAM path would itself block.

With errorCode set then Http2Stream.closeInternal returns early on subseqent calls and a racing failConnection from the reader thread can not overwrite the callers StreamResetException with the SocketException it caught.
@swankjesse
Copy link
Copy Markdown
Collaborator

Great analysis. A permanently wedged writer is likely gonna cause a bunch of liveness bugs. I'll try to address this report specifically and that issue generally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants