CVPN-2346 Implement GSO offload on lightway-server #413
Code coverage summary for 8b7d704: ✅ Region coverage 68% passes |
2b45738 to
ed9e475
b5849fe to
c62619f
| } | ||
| impl PluginList { | ||
| #[cfg(target_os = "linux")] |
This need not be a Linux-specific method; it looks generic.
Lint failed because the function is unused on other platforms, so I needed to add back #[allow(dead_code)].
| // Expose the full slab to `recv_gso` as `&mut [u8]`. | ||
| // SAFETY: every byte of the slab was zero-initialized at | ||
| // construction; subsequent iters only ever shrunk `len` or | ||
| // overwrote bytes. We never hand out uninitialized memory. |
This does not sound safe: since pkt is mutable, we could also create a new BytesMut and replace it.
Then, if you reserve, the memory might be uninitialized.
I think what you want is https://docs.rs/bytes/latest/bytes/struct.BytesMut.html#method.spare_capacity_mut which gives access to the spare buffer, which you can then send to recv_gso.
Because spare_capacity_mut(&mut self) -> &mut [MaybeUninit<u8>], I think we ultimately still need an unsafe cast on the &mut [MaybeUninit<u8>], because tun_rs::AsyncDevice::recv() only takes &mut [u8] in the end. I will think about it.
Interestingly, there's also std::io::Read::read_buf in nightly std, which works with MaybeUninit<u8>-backed buffers (and which tun-rs could use), but it has not been stabilized.
Edit: I changed the recv signatures to accept MaybeUninit and do one unsafe cast at the end. I am hoping to drop it once tun-rs accepts MaybeUninit. Please take a look at e995ffa 👀
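To make the pattern discussed above concrete, here is a minimal sketch of reading into spare capacity without a zero-fill. It uses `Vec<u8>` from the standard library as a stand-in for `BytesMut` (both expose `spare_capacity_mut` and `set_len`), and `recv_into` and `read_packet` are hypothetical names for illustration, not functions from this PR or from tun-rs.

```rust
use std::mem::MaybeUninit;

/// Hypothetical stand-in for a `MaybeUninit`-aware recv: copies `src` into
/// `dst` without ever reading `dst`, and returns the number of bytes written.
fn recv_into(dst: &mut [MaybeUninit<u8>], src: &[u8]) -> usize {
    let n = dst.len().min(src.len());
    for (slot, byte) in dst.iter_mut().zip(&src[..n]) {
        slot.write(*byte);
    }
    n
}

/// Reads one "packet" into the spare capacity of `buf` and returns its length.
fn read_packet(buf: &mut Vec<u8>, incoming: &[u8]) -> usize {
    buf.clear();
    // The uninitialized tail of the allocation; no zero-fill is needed.
    let spare = buf.spare_capacity_mut();
    let n = recv_into(spare, incoming);
    // SAFETY: `recv_into` initialized exactly the first `n` spare bytes and
    // `n <= spare.len()`, so `n` never exceeds the capacity of `buf`.
    unsafe { buf.set_len(n) };
    n
}
```

The only `unsafe` left is the `set_len` call, and its justification is local: the callee reports how many bytes it wrote.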
45e2ed8 to
4642134
Add the `gso` module to lightway-core with VirtioNetHdr definition, checksum helpers, and segment build/count functions for splitting GSO superpackets into individual segments with correct per-segment header fixups (IP ID, TCP seq, checksums). Also add tun-rs workspace dependency to lightway-core and lightway-server Cargo.toml.
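As a rough illustration of the helpers this commit describes, the sketch below shows an RFC 1071 Internet checksum, a segment count, and the per-segment IP ID and TCP sequence fixups. The function names and signatures are assumptions made for this example and are not taken from the `gso` module itself; the fixup rule (IP ID incremented per segment, sequence number advanced by the payload already sent) is the conventional TSO behaviour and is assumed to match the PR.

```rust
/// RFC 1071 Internet checksum over `data`; an odd trailing byte is zero-padded.
/// Assumes `data` is at most 64 KiB so the 32-bit accumulator cannot overflow.
fn internet_checksum(data: &[u8]) -> u16 {
    let mut sum: u32 = 0;
    let mut chunks = data.chunks_exact(2);
    for pair in &mut chunks {
        sum += u32::from(u16::from_be_bytes([pair[0], pair[1]]));
    }
    if let [last] = chunks.remainder() {
        sum += u32::from(*last) << 8;
    }
    // Fold the carries back into the low 16 bits.
    while (sum >> 16) != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

/// Number of segments needed to carry `payload_len` bytes at `gso_size` each.
fn segment_count(payload_len: usize, gso_size: usize) -> usize {
    if gso_size == 0 {
        return 0;
    }
    payload_len.div_ceil(gso_size)
}

/// IPv4 ID and TCP sequence number for segment `index` (both wrap around).
fn segment_ids(base_ip_id: u16, base_seq: u32, gso_size: u16, index: u16) -> (u16, u32) {
    let ip_id = base_ip_id.wrapping_add(index);
    let seq = base_seq.wrapping_add(u32::from(index) * u32::from(gso_size));
    (ip_id, seq)
}
```

For the byte sequence used as the worked example in RFC 1071 (`00 01 f2 03 f4 f5 f6 f7`), `internet_checksum` returns `0x220d`.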
Add the `send_gso` method to the OutsideIOSendCallback trait for sending concatenated wire packets via kernel GSO (UDP_SEGMENT). Include todo!() stub implementations in client TCP/UDP, server TCP, and test harnesses to satisfy the trait contract.
Add gso_buf/gso_size fields to TlsIOAdapter so the wolfssl send() callback can buffer raw encrypted segments during GSO processing. Add udp_send_gso to wrap buffered segments with wire headers and send as one sendmsg via the vectored send_gso callback. The implementation uses a zero-copy fast path when no outside plugins are configured: scatter-gather via iovec with a shared header buffer and borrowed slices of the encrypted segment buffer. The plugin path builds each segment as its own BytesMut and enforces the uniform-stride requirement of UDP_SEGMENT.
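The uniform-stride requirement mentioned above can be sketched as a small check: with UDP_SEGMENT the kernel cuts the buffer into equal-sized pieces, so every wire segment except the last must have the same length, and the last may be shorter. `uniform_stride` is a hypothetical helper for illustration and is not the function the PR uses.

```rust
/// Returns the common stride if `segments` can be sent as one UDP_SEGMENT
/// batch: all segments except the last share one length, and the last is
/// non-empty and no longer than that length.
fn uniform_stride(segments: &[&[u8]]) -> Option<usize> {
    let (last, head) = segments.split_last()?;
    // With a single segment, its own length is the stride.
    let stride = match head.first() {
        Some(first) => first.len(),
        None => last.len(),
    };
    if stride == 0 {
        return None;
    }
    let head_ok = head.iter().all(|s| s.len() == stride);
    let last_ok = !last.is_empty() && last.len() <= stride;
    (head_ok && last_ok).then_some(stride)
}
```

A plugin that changes the length of only some segments would make this check fail, in which case a caller would presumably have to fall back to per-segment sends.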
Add inside_data_received_gso and send_to_outside_gso methods to Connection. These process a GSO superpacket as a single packet through plugins/encoder, then split into per-segment encrypted frames and collect into a wire buffer for batch send via UDP_SEGMENT.
Add offload config field to TunConfig to enable IFF_VNET_HDR on TUN devices. Add recv_gso for raw reads that include the virtio_net_hdr prefix, and prepend a zeroed virtio header on try_send when offload is enabled.
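For readers unfamiliar with the virtio framing, the following sketch splits a raw TUN read into the `virtio_net_hdr` prefix and the packet bytes. It assumes the 10-byte legacy header layout and little-endian fields; the struct, constants, and `split_virtio_frame` are written for this example and are not the definitions from lightway-core or tun-rs.

```rust
/// Length of the legacy `virtio_net_hdr` (assumed here to be 10 bytes).
const VIRTIO_NET_HDR_LEN: usize = 10;
/// `gso_type` value meaning "not a GSO superpacket".
const VIRTIO_NET_HDR_GSO_NONE: u8 = 0;

#[derive(Debug, PartialEq, Eq)]
struct VirtioNetHdr {
    flags: u8,
    gso_type: u8,
    hdr_len: u16,
    gso_size: u16,
    csum_start: u16,
    csum_offset: u16,
}

impl VirtioNetHdr {
    /// True when the frame is a superpacket that still needs segmentation.
    fn is_gso(&self) -> bool {
        self.gso_type != VIRTIO_NET_HDR_GSO_NONE && self.gso_size > 0
    }
}

/// Splits a raw read into (header, packet bytes); `None` if it is too short.
fn split_virtio_frame(frame: &[u8]) -> Option<(VirtioNetHdr, &[u8])> {
    if frame.len() < VIRTIO_NET_HDR_LEN {
        return None;
    }
    let (h, payload) = frame.split_at(VIRTIO_NET_HDR_LEN);
    // Assumption: fields are little-endian on the host in question.
    let le16 = |i: usize| u16::from_le_bytes([h[i], h[i + 1]]);
    let hdr = VirtioNetHdr {
        flags: h[0],
        gso_type: h[1],
        hdr_len: le16(2),
        gso_size: le16(4),
        csum_start: le16(6),
        csum_offset: le16(8),
    };
    Some((hdr, payload))
}
```

An all-zero header parses as `flags = 0` and `gso_type = GSO_NONE`, which is the prefix the commit message describes for the write path.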
Extend send_to_socket to accept an optional gso_size parameter and build UDP_SEGMENT cmsg for kernel-level segmentation. Implement the real send_gso on UdpSocket using this path.
Add enable_tun_offload config option and wire it through ServerConfig to main. Extract the default inside IO loop into its own function and add inside_io_loop_gso that reads virtio-framed superpackets from TUN, dispatches GSO vs single-packet paths, and sets gso_max_size on the TUN device.
Lets the GSO recv loop use BytesMut::spare_capacity_mut() directly, dropping the one-time 65 KB zero-init and one of the two call-site unsafe blocks. The cast back to &mut [u8] now lives only at the syscall boundary in TunDirect::recv_gso, with a comment pointing at the tun-rs upstream gap to track for cleanup.
Switch build_segment to take `&mut [MaybeUninit<u8>]` and return the initialized prefix as `&mut [u8]`. The function fully overwrites bytes 0..total_len through `<[MaybeUninit<u8>]>::write_copy_of_slice` before any read, so no `&mut [u8]` is ever constructed over uninit memory; the `&mut [u8]` reborrow at the end is a single local `unsafe` covering bytes the prior writes initialized. On the send side, this lets send_to_outside_gso drop the up-front 65 KB-equivalent zero-init (`BytesMut::zeroed(mtu)`) and the per-iter `set_len(mtu)` unsafe block. The loop now uses `clear()` + `spare_capacity_mut()`, identical in shape to `inside_io_loop_gso`'s recv buffer — one unsafe `set_len(total_len)` per segment, no cast.
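The shape described in this commit (write into `&mut [MaybeUninit<u8>]`, then hand back only the initialized prefix) might look roughly like the sketch below. It uses the stable `MaybeUninit::write` per byte instead of `write_copy_of_slice`, and `build_frame` with its header/payload parameters is a simplified, hypothetical stand-in for `build_segment`.

```rust
use std::mem::MaybeUninit;

/// Writes `header` followed by `payload` into `out` and returns the
/// initialized prefix, or `None` if `out` is too small.
fn build_frame<'a>(
    out: &'a mut [MaybeUninit<u8>],
    header: &[u8],
    payload: &[u8],
) -> Option<&'a mut [u8]> {
    let total_len = header.len() + payload.len();
    if out.len() < total_len {
        return None;
    }
    // Every byte in 0..total_len is written before anything is read.
    for (slot, byte) in out.iter_mut().zip(header.iter().chain(payload)) {
        slot.write(*byte);
    }
    // SAFETY: the loop above initialized exactly the first `total_len` bytes
    // of `out`, and `MaybeUninit<u8>` has the same layout as `u8`.
    let init = unsafe {
        std::slice::from_raw_parts_mut(out.as_mut_ptr().cast::<u8>(), total_len)
    };
    Some(init)
}
```

Because the returned slice covers only bytes that were just written, no `&mut [u8]` ever spans uninitialized memory.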
New Earthfile target `run-udp-tun-offload-test` runs the standard UDP e2e against a server started with `--enable-tun-offload`, exercising the TUN IFF_VNET_HDR + UDP_SEGMENT + GRO path end-to-end. Wired into `run-all-tests` so CI picks it up alongside the other UDP variants.
4642134 to
39cf489
kp-samuel-tam
left a comment
Some updates and comments to help with reviews.
| Tun::Direct(t) => t.recv_gso(buf).await, | ||
| #[cfg(feature = "io-uring")] | ||
| Tun::IoUring(_) => { | ||
| IOCallbackResult::Err(std::io::Error::from(std::io::ErrorKind::Unsupported)) |
I changed from todo!() to this.
| // SAFETY: `tun_rs::AsyncDevice::recv` takes `&mut [u8]` and forwards | ||
| // to `libc::read(2)`. The kernel only writes — it never dereferences | ||
| // userspace memory for reading — so handing it our uninitialized slab | ||
| // is sound at the syscall boundary. The unsoundness lives in *Rust*: | ||
| // constructing a `&mut [u8]` over uninitialized bytes is UB per strict | ||
| // aliasing rules, even if no one reads them. This cast is the only | ||
| // place we paper over that gap. Delete it (and revert this signature) | ||
| // once `tun-rs` exposes a `MaybeUninit`-aware recv. | ||
| #[allow(unsafe_code)] | ||
| let raw = | ||
| unsafe { std::slice::from_raw_parts_mut(buf.as_mut_ptr().cast::<u8>(), buf.len()) }; |
This is new: it casts MaybeUninit away for tun.recv(), which doesn't support MaybeUninit yet.
| /// Raw read from Tun, returning the full virtio frame (header + payload). | ||
| #[cfg(target_os = "linux")] | ||
| pub async fn recv_gso(&self, buf: &mut [std::mem::MaybeUninit<u8>]) -> IOCallbackResult<usize> { |
Many read/recv signatures changed from &mut [u8] to &mut [MaybeUninit<u8>].
| // IFF_VNET_HDR requires a zeroed `virtio_net_hdr` prefix | ||
| // on every write (NEEDS_CSUM=0, GSO_NONE). | ||
| let hdr_len = tun_rs::VIRTIO_NET_HDR_LEN; | ||
| let mut prefixed = bytes::BytesMut::zeroed(hdr_len); | ||
| prefixed.extend_from_slice(&buf[..]); | ||
| tun.try_send(&prefixed[..]) | ||
| .map(|n| n.saturating_sub(hdr_len)) |
This can be optimized to use writev, but I'd defer that for now.
| /// When `Some`, the wolfssl IO callback `send()` appends raw | ||
| /// encrypted segments here instead of sending to the socket. | ||
| /// After all segments are collected, `udp_send_gso` wraps each | ||
| /// with `wire::Header`, runs plugins, and sends via `send_gso`. | ||
| pub(crate) gso_buf: Option<BytesMut>, |
This is not new, but it is a buffer that "intercepts" wolfssl's send, which would otherwise fire immediately after a packet is encrypted. When this is Some(_), send() stores the packets here and returns early, so that we can do GSO instead of shooting individual packets out right away.
Edit: A future plan is to do streaming/in-place encryption on the original superpacket, so this would not be needed, but that is deferred.
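The interception described above can be sketched in a few lines. `IoAdapter`, its `sent` field (a stand-in for the real socket), and `take_gso` are hypothetical names for this illustration; the real `TlsIOAdapter` and the wolfssl callback have different signatures.

```rust
/// Simplified model of an IO adapter with an optional GSO collection buffer.
struct IoAdapter {
    /// When `Some`, encrypted records are collected here instead of sent.
    gso_buf: Option<Vec<u8>>,
    /// Stand-in for the outside socket: one entry per immediate send.
    sent: Vec<Vec<u8>>,
}

impl IoAdapter {
    /// Models the TLS library's send callback; returns the bytes accepted.
    fn send(&mut self, record: &[u8]) -> usize {
        if let Some(buf) = self.gso_buf.as_mut() {
            // GSO in progress: buffer the record and return early.
            buf.extend_from_slice(record);
            return record.len();
        }
        // Normal path: the record goes out right away.
        self.sent.push(record.to_vec());
        record.len()
    }

    /// Ends collection and hands back everything buffered so far.
    fn take_gso(&mut self) -> Option<Vec<u8>> {
        self.gso_buf.take()
    }
}
```

A caller would set `gso_buf` to `Some(Vec::new())` before encrypting the segments of one superpacket, then call `take_gso` and send the collected bytes in one batch.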
| @@ -48,32 +49,45 @@ impl std::fmt::Display for BindMode { | |||
| fn send_to_socket( | |||
Not new, but I did some additional refactoring on fn send_to_socket: .with_control() is now unconditional, and the kernel handles a cmsg_len = 0 cmsg just fine.
| send_to_socket(&self.sock, buf, &peer_addr.1, self.reply_pktinfo) | ||
| send_to_socket( | ||
| &self.sock, | ||
| &[IoSlice::new(buf)], |
Not new, but buf changed to &[IoSlice::new(buf)] because the send_to_socket signature now takes an iovec, aka scatter/gather IO.
| } | ||
| } | ||
| async fn inside_io_loop_default( |
This is pulled directly from the inside_io_loop tokio::spawn code and unchanged.
| } | ||
| #[cfg(target_os = "linux")] | ||
| async fn inside_io_loop_gso( |
This is the GSO replacement for the original inside_io_loop; it handles the vnet_hdr and the rest of the GSO path.
| # headers per segment can exceed what sendmsg(UDP_SEGMENT) accepts, causing the | ||
| # kernel to return EMSGSIZE ("Message too long") on the outside socket. | ||
| ip link set dev "${tunname}" gso_max_size 60300 | ||
| ip link set dev "${tunname}" up |
This is new: it caps gso_max_size so that sendmsg(UDP_SEGMENT) does not blow up later. We also need this in the deployment scripts because it's not in the lightway code anymore.
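A back-of-the-envelope version of the reasoning behind the cap: each inner segment carries its own copy of the inner headers plus tunnel overhead, so the bytes handed to one `sendmsg(UDP_SEGMENT)` are larger than the superpacket read from the TUN. The sketch below assumes the relevant limit is the maximum IPv4 UDP payload (65507 bytes); the MSS of 1300 and the 40-byte overhead in the usage note are made-up illustrative numbers, not values from this PR.

```rust
/// Largest UDP payload over IPv4: 65535 minus the IP and UDP headers.
const MAX_UDP_PAYLOAD_V4: usize = 65_535 - 20 - 8;

/// Total outside bytes for one superpacket after segmentation.
/// `inner_hdr_len` is the inner IP+TCP header length that is replicated per
/// segment; `per_segment_overhead` is the assumed tunnel framing per segment.
fn outer_bytes(
    superpacket_len: usize,
    inner_hdr_len: usize,
    mss: usize,
    per_segment_overhead: usize,
) -> Option<usize> {
    let payload = superpacket_len.checked_sub(inner_hdr_len)?;
    if mss == 0 {
        return None;
    }
    let segments = payload.div_ceil(mss);
    Some(payload + segments * (inner_hdr_len + per_segment_overhead))
}

/// True if the segmented superpacket fits within the assumed limit.
fn fits_one_send(total_outer_bytes: usize) -> bool {
    total_outer_bytes <= MAX_UDP_PAYLOAD_V4
}
```

With these illustrative numbers, a 60300-byte superpacket (40-byte inner headers, 47 segments) expands to 64020 bytes and fits, while a 65535-byte superpacket (51 segments) expands to 69575 bytes and does not.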
| /// `out` holds exactly the one segment's wire bytes. | ||
| /// | ||
| /// `out.capacity()` must be ≥ one segment's maximum wire length. | ||
| pub(crate) fn build_segment( |
I added some new tests below for this function. It constructs a complete, checksum-computed packet for wolfssl to encrypt. It could be written in IO scatter/gather style and avoid the copies if we use the wolfssl streaming/in-place encrypt APIs.
Description
Implement GSO on the server side for DTLS and Expresslane, specifically for bulk server->client traffic. This consistently halves the total number of syscalls used during bulk transfers, and also improves aggregate server throughput by 2x when multiple clients are doing transfers.
When `--enable-tun-offload` is set, the server reads TSO superpackets from the TUN with `IFF_VNET_HDR`, segments them in userspace, and emits each superpacket as a single `sendmsg(UDP_SEGMENT)` instead of N per-segment syscalls. On a single-flow iperf3 reverse test the kernel UDP send path collapses near-completely: `udp_sendmsg` 0.71% → ~0.05%, `sock_alloc_send_pskb` 2.61% → ~0.13%, `mlx5e_xmit` 1.88% → ~0%.
Trade-off: kernel work is replaced with userspace work (per-segment IP/TCP/UDP checksum recomputation, segment assembly). Kernel-side wins are clear and measurable; userspace cost is now the dominant factor.
Pacing: each `sendmsg(UDP_SEGMENT)` produces a NIC burst of up to N segments. This can exceed receiver socket buffer depth and increase tail drops at peak rates. We will need to revisit better TX pacing under congested links.
Future work will focus on compatibility with TUN backends like io_uring, GRO on the server side, and full GSO/GRO on client side, where single-flow workloads should see the biggest visible speedup not in this PR.
Motivation and Context
See ticket CVPN-2346.
How Has This Been Tested?
Types of changes
Checklist: