archive: vsock-based TCP port forwarding (abandoned)#244
Conversation
…t register constants
- Add `HvVm::exit_all_vcpus()`, a safe wrapper around `hv_vcpus_exit(NULL, 0)`, for clean shutdown of all blocked vCPU run loops from any thread.
- Fix a memory leak in `Gic::save_state()`: the buffer returned by `hv_gic_get_state` is malloc-allocated and must be freed by the caller. Added `libc::free()` after copying the data to a `Vec<u8>`.
- Export ARM64 register constants (`HV_REG_X0`..`HV_REG_CPSR`) and system register constants via the `arcbox_hv::reg` and `arcbox_hv::sys_reg` modules. Updated `arcbox-vmm/darwin_hv.rs` to use the canonical exports instead of duplicated local constants.

ABX-285
…(ABX-286, ABX-288) Add VmBackend/ResolvedBackend enums to replace the use_custom_vmm bool, enabling proper backend selection between Hypervisor.framework (HV) and Virtualization.framework (VZ). Wire up resolve_backend() dispatch in Vmm::initialize(), start(), and stop().
Key changes:
- Add VmBackend (Auto/Hv/Vz) and ResolvedBackend enums to VmmConfig
- Add HV-specific fields to the Vmm struct (hv_vm, hv_guest_ram, hv_gic, hv_running, hv_vcpu_threads, hv_fdt_addr, resolved_backend)
- Remove mem::forget leaks; store VM/RAM/GIC in the Vmm struct
- Implement start_darwin_hv() and stop_darwin_hv() lifecycle methods
- Add a minimal PL011 UART emulator for early kernel boot console output
- Fix VIRTIO_MMIO_BASE from 0x0900_0000 to 0x0A00_0000 (PL011 conflict)
- Advance the PC after DataAbort, HVC, and SMC exits
- Add PSCI CPU_ON/CPU_OFF/VERSION/AFFINITY_INFO handling
- Add tests for the PL011 UART and address-range non-overlap
… (ABX-287) Add a guest-memory-backed VirtQueue, wire QUEUE_NOTIFY to interrupt injection, and introduce a virtio-rng device for the Hypervisor.framework custom VMM path.
- queue_guest: GuestMemoryVirtQueue provides zero-copy descriptor chain walking, available/used ring ops, and read/write helpers over guest physical memory via direct host pointer arithmetic
- device: DeviceManager gains guest_ram_base, irq_callback, and a build_guest_queue helper; QUEUE_NOTIFY now triggers an MMIO interrupt status update and a GIC signal via the callback
- rng: minimal virtio-rng (VirtioDeviceId::Rng = 4) for /dev/hwrng
… [M1] Replace the empty DeviceManager with actual VirtIO device registration in initialize_darwin_hv(). When the guest kernel sends QUEUE_NOTIFY, the handler now reads descriptors from guest memory and calls each device's process_queue() method for real I/O processing.
Key changes:
- Add a VirtioDevice::process_queue() trait method with QueueConfig for guest-memory-based descriptor chain walking
- Implement process_queue for VirtioBlock (reads the avail ring, walks descriptor chains, calls process_descriptor_chain, updates the used ring)
- Implement process_queue for VirtioFs (delegates to internal queue processing with hiprio/request queue index translation)
- Add DeviceManager.set_guest_memory() and set_irq_callback() so the custom HV path can provide guest RAM access and interrupt injection
- Rewrite the QUEUE_NOTIFY handler to build a QueueConfig from MMIO state, call device.process_queue(), and inject a GIC interrupt on completions
- Register console, virtiofs, block, net (with TSO), and vsock devices in initialize_darwin_hv() using DeviceManager.register_virtio_device()
- Add MemoryManager::with_mmio_base() so the HV path allocates MMIO starting at 0x0900_0000 (matching the ARM64 VirtIO MMIO layout)
- Use device_manager.device_tree_entries() for FDT generation instead of the old manual slot allocation
Add VirtIO vsock packet parsing and queue-based processing needed for arcbox-agent to communicate with the host via Hypervisor.framework.
- Add VsockHeader.from_bytes/to_bytes for packet serialization
- Add process_tx_queue: pops TX descriptors, parses vsock headers, and dispatches OP_REQUEST/OP_RW/OP_SHUTDOWN/OP_CREDIT_* to the backend
- Add inject_rx_packet: writes host->guest response packets into the RX virtqueue descriptor chain
- Add a process_queue dispatcher (queue index 1 -> TX)
- Add HostVsockBackend for HV: maps guest vsock ports to host-side Unix domain sockets for arcbox-agent RPC forwarding
- Create RX/TX/Event virtqueues on device activation
…king (M3) Add multi-core guest support to the custom Hypervisor.framework VMM:
- PSCI CPU_ON (0xC4000003): secondary vCPUs are spawned in a parked state waiting on an mpsc channel. When the BSP issues CPU_ON, the target vCPU receives entry_point and context_id, creates an HvVcpu on its own thread (required by Hypervisor.framework's !Send constraint), and enters the run loop. Double-start returns ALREADY_ON per the spec.
- WFI blocking: replace yield_now() with park_timeout(1ms). The GIC IRQ callback now unparks all registered vCPU threads on interrupt assertion, giving prompt wakeup while saving CPU during idle.
- Refactor vcpu_run_loop() to accept entry_addr/x0_value parameters so the BSP and secondary vCPUs share the same run loop code.
- Extract handle_psci() for PSCI dispatch (VERSION, SYSTEM_OFF, SYSTEM_RESET, CPU_ON_64), also handling the SMC conduit identically to HVC for guest compatibility.
- Add start_darwin_hv() to Vmm for spawning the BSP and secondary threads.
…es, READDIRPLUS (ABX-289)
- Adaptive negative-cache TTL: path-pattern-based TTLs for the negative cache. Stable directories (node_modules 30s, .git 60s, target 30s) get longer TTLs, while source files default to 5s. This eliminates repeated stat() calls for known-absent files in dependency trees.
- Per-share cache profiles: a CacheProfile enum (Static/Dynamic/Custom) controls FUSE entry and attr timeout values. Static shares get 300s timeouts for container images; Dynamic shares get 1s for source dirs.
- READDIRPLUS support: negotiate FUSE_DO_READDIRPLUS during INIT and handle opcode 44. Each directory entry now includes a full FuseEntryOut with attributes and cache timeouts, eliminating separate LOOKUP calls after directory listing.
Standalone crate at tests/bench-virtiofs/ with micro-benchmarks (sequential/random I/O, metadata ops, negative lookups) and macro-benchmarks (npm install, git clone, rm -rf, find). Includes JSON report output, baseline comparison with regression detection, and performance targets as % of native macOS filesystem speed.
- Update the arcbox-hv GIC FFI bindings for the Xcode 26 SDK's config-based API: hv_gic_config_create() + hv_gic_config_set_distributor_base() instead of the old hv_gic_create(NULL) pattern
- Fix a critical GPA layout bug: move RAM_BASE_IPA from 0x0 to 0x40000000 to avoid conflicts with the GIC (0x0800_0000), PL011 (0x0900_0000), and VirtIO MMIO (0x0A00_0000) address ranges
- Adjust the kernel entry, FDT, and initrd GPAs to account for the new RAM base
- Add an E2E test example (hv_boot_test) and diagnostic probes

The GIC now initializes successfully. The kernel starts executing but hits an InstructionAbort at a low address — this needs further investigation of ARM64 boot state initialization (VBAR_EL1, SCTLR_EL1, etc.).
…tion Comprehensive plan covering:
- rust-vmm crate evaluation (20+ crates, with a macOS compatibility verdict)
- GPA memory layout design
- Boot sequence specification
- vCPU run loop and interrupt flow design
- A 4-phase adoption plan (linux-loader → vm-memory → device stack → production)
- Crate replacement vs. keep decisions with rationale
- Risk assessment and a file change map
…nux-loader, vm-fdt) Replace manual memory allocation, kernel loading, and FDT generation in initialize_darwin_hv() with standardized rust-vmm ecosystem crates:
- GuestRam (alloc_zeroed + raw pointer) → vm-memory GuestMemoryMmap with type-safe GPA access and mmap-backed regions
- load_kernel_into_ram (manual file read) → linux-loader PE::load, which handles ARM64 Image header parsing and text_offset
- FdtBuilder/generate_fdt (custom implementation) → vm-fdt FdtWriter with proper GICv3, timer, PSCI, PL011, and VirtIO MMIO nodes
- Initrd placement now dynamically follows the kernel end instead of using a fixed address

Also adds MPIDR_EL1 and SCTLR_EL1 register setup to the vCPU boot state, plus GIC address constants for the FDT. Old helpers (GuestRam, load_kernel_into_ram, build_hv_fdt_config) are retained with #[allow(dead_code)] for the VZ path and tests.
…indings
Add virtio-bindings v0.2 as a dependency and replace manually defined
magic numbers with canonical constants from the Linux kernel headers.
What changed:
- DeviceStatus flags now source values from virtio_config::VIRTIO_CONFIG_S_*
- Descriptor flags (NEXT/WRITE/INDIRECT) use virtio_ring::VRING_DESC_F_*
- VIRTIO_F_EVENT_IDX uses virtio_ring::VIRTIO_RING_F_EVENT_IDX
- VIRTIO_F_VERSION_1 uses virtio_config::VIRTIO_F_VERSION_1 across all devices
- VirtioNet feature bits use virtio_net::VIRTIO_NET_F_* bit positions
- VirtioNet header GSO types/flags use virtio_net::VIRTIO_NET_HDR_GSO_*
- VirtioBlock feature bits use virtio_blk::VIRTIO_BLK_F_*
- BlockRequestType enum values use virtio_blk::VIRTIO_BLK_T_*
- BlockStatus enum values use virtio_blk::VIRTIO_BLK_S_*
- MMIO register offsets in device.rs use virtio_mmio::VIRTIO_MMIO_*
- Interrupt reason magic `1` replaced with virtio_mmio::INT_VRING
- Magic number descriptor flag checks (0x1, 0x2) in queue_guest.rs
replaced with crate::queue::flags::{NEXT,WRITE}
Kept as-is:
- VirtioDeviceId enum (typed enum is better than raw u32 constants)
- MMIO magic value 0x74726976, version 2, vendor ID 0x554D4551
(these are runtime register *values*, not constants in virtio-bindings)
Added #[cfg(test)] assertions verifying VirtioDeviceId and DeviceStatus
values match their virtio-bindings counterparts.
Re-exported virtio_bindings from arcbox-virtio::lib.rs for downstream use.
Phase 2 infrastructure improvements:
- Add virtio-bindings 0.2: replace 100+ hand-written VirtIO constants across 10 files with authoritative Linux-header-derived values (device IDs, feature bits, descriptor flags, MMIO registers)
- Fix the PL011 UART address: move from 0x0900_0000 to 0x0B00_0000. The old address was inside the GIC redistributor region (0x080A_0000 + 32MB = 0x0A0A_0000), causing MMIO writes to be absorbed by the hardware GIC instead of generating VM exits.
- Add a vm-superio 0.8 dependency (for a future UART 16550A migration once the kernel is built with CONFIG_SERIAL_8250_EARLYCON)
- Update the MMIO region overlap tests for the new address layout

The vm-superio UART replacement is deferred: the current kernel only has the PL011 earlycon compiled in, not the 8250. We will switch when we control the kernel build.
…/init Three fixes that advance boot from "kernel starts" to "init process runs":
1. VirtIO MMIO base: 0x0A00_0000 → 0x0C00_0000. The old address was inside the GIC redistributor region (0x080A_0000 + 32MB), causing a kernel -EBUSY when requesting the MMIO resource.
2. Initrd placement: kernel_end+4K → RAM_BASE+128MB. Placing the initrd immediately after the kernel caused corruption during early boot memory setup; the fixed offset provides a safe distance.
3. Remove unnecessary gzip decompression — the kernel handles it internally.

E2E results:
- Kernel boots in <2s and reaches "Run /init as init process"
- VirtIO console device probed and activated successfully
- Initramfs unpacked (26712K freed)
- Init script runs and attempts VirtioFS mounts (expected to fail in this test config — no fs/block/vsock devices registered)
Modern VirtIO MMIO (version 2) requires all devices to advertise
VIRTIO_F_VERSION_1 (bit 32). Without it, the kernel rejects the
device with "must provide VIRTIO_F_VERSION_1 feature, probe failed -22".
Also enables VirtioFS, vsock, and console in the hv_boot_test example
with a temporary shared directory.
E2E results:
- All 3 VirtIO devices probe and activate successfully
- VirtioFs tag='arcbox' activated
- Vsock CID=3 activated
- Console 80x25 activated
- Kernel reaches "Run /init"
- Init script runs but virtiofs mount fails ("tag not found")
— config space tag delivery needs investigation
…E test
Add debug logging to QUEUE_NOTIFY handler to trace device/memory state.
Enable VirtioFS (tag=arcbox), vsock (CID=3), and console in hv_boot_test.
E2E status:
- All 3 devices probe + activate successfully
- Kernel boots to /init in 1.7s
- Console gets QUEUE_NOTIFY with correct device/memory state
- VirtioFS tag='arcbox' activated but mount fails ("tag not found")
Root cause: kernel virtiofs driver sends FUSE_INIT to queue but
process_queue uses GPA as memory offset — wrong when RAM_BASE != 0.
The guest_mem slice passed to process_queue starts at offset 0 but
descriptor addresses are GPAs (0x40000000+), causing out-of-bounds.
VirtQueue descriptor addresses are Guest Physical Addresses (GPAs) like 0x40001000, but the memory slice passed to process_queue started at the host pointer corresponding to RAM_BASE_IPA (0x40000000). Using a GPA directly as a slice index therefore caused out-of-bounds access.
Fix: offset the memory slice pointer backward by RAM_BASE_IPA so that GPA 0 maps to index 0. This lets device code use `desc.addr as usize` directly without translation. Also add a `gpa_base` field to QueueConfig and `guest_ram_gpa` to DeviceManager for proper address tracking.
E2E result: the "tag <arcbox> not found" errors are gone, and the init process runs without virtio-fs mount errors on the serial console.
…visible Implement process_queue for the VirtioConsole TX path:
- Parse descriptor chains from guest memory
- Extract TX data and emit it via tracing (guest_console target)
- Update the used ring and return completions for interrupt injection

Switch the kernel cmdline from console=ttyAMA0 to console=hvc0 so kernel output transitions from the PL011 earlycon to the VirtIO console after boot.
E2E results — full init script execution:
✅ Kernel boots in 1.7s
✅ PL011 earlycon → hvc0 console handoff
✅ VirtioFS mounted at /arcbox
✅ Init script runs to the arcbox-agent exec
⚠️ vsock modules failed to load (built-in, so modprobe returns an error)
⚠️ arcbox-agent needs a vsock connection to the host
…k device)
- Use kernel + rootfs.erofs (virtio-blk) instead of an initramfs
- Match the production cmdline: console=hvc0 root=/dev/vda ro rootfstype=erofs
- Export BlockDeviceConfig from arcbox-vmm

E2E results with rootfs.erofs:
✅ virtio_blk virtio1: [vda] 12336 512-byte logical blocks (6.32 MB)
✅ Block device I/O processes 1 completion
⚠️ Feature negotiation issue: driver_features_sel gets a corrupted value, causing acked features = 0x0. The kernel still proceeds, but I/O stalls after the first request (the interrupt is not delivered due to the features mismatch).
Two critical bugs fixed:
1. ARM64 XZR (zero register) mishandled as SP: when the guest executes STR WZR (register 31), the MMIO write handler read vcpu.get_reg(31), which returns SP, not zero. This corrupted DRIVER_FEATURES_SEL with stack pointer values, breaking VirtIO feature negotiation for all devices. Fix: treat register 31 as XZR (always 0) in MMIO write paths.
2. IRQ-to-GSI mapping wrong for the ARM64 GIC: allocate_irq() used `irq % MAX_GSIS` (MAX_GSIS=24) for GSI mapping, a scheme designed for the x86 IOAPIC. IRQ 34 mapped to GSI 10, but the FDT declares SPI 34 — the kernel never received the interrupt. Fix: map IRQ N directly to GSI N (the ARM64 GIC supports 1020 SPIs).

E2E results after the fixes:
✅ Feature negotiation: all devices ack non-zero features
✅ GIC interrupt delivery: SPI 34 correctly injected
⚠️ Block I/O: the first request is processed, but the kernel stalls waiting for the completion — the used ring update may be incorrect
Two fixes that complete block I/O and enable the rootfs mount:
1. GIC SPI number mismatch in the FDT: an FDT `interrupts = <0 N flags>` entry encodes SPI number N, which maps to hardware INTID N+32, but hv_gic_set_spi() takes the raw INTID. Our IRQ allocator starts at 32, so entry.irq=32 → set_spi(32) = INTID 32, while FDT <0,32,1> means SPI 32 = INTID 64 — the kernel never received the interrupt. Fix: the FDT now uses `entry.irq - 32` as the SPI number so the INTIDs match: FDT SPI N → INTID N+32 = entry.irq.
2. Missing write memory barrier in the virtio-blk used ring update: ARM64's weak memory ordering requires a barrier between writing the ring entries and updating the used index.

E2E result — PRODUCTION BOOT PATH WORKS:
✅ erofs: (device vda): mounted with root inode @ nid 36
✅ VFS: Mounted root (erofs filesystem) readonly on device 254:0
✅ Run /sbin/init as init process
Boot time: 1.58s kernel → rootfs mount → init
- Remove the debug-level DRIVER_FEATURES/QUEUE_NOTIFY diagnostic logs added during E2E debugging (the MMIO write trace is kept at trace level)
- Remove the virtio-blk processing chain/completion debug logs
- Remove unused PL011 register constants (IBRD, FBRD, LCR_H, CR, IMSC)
- Remove the diagnostic examples (hv_gic_probe.rs, hv_gic_ram_test.rs)
- Remove the unused flate2 dependency (the kernel handles initrd decompression)
- Update the architecture doc's GPA layout to reflect the actual addresses: PL011 at 0x0B00_0000, VirtIO MMIO at 0x0C00_0000
- Clarify the IRQ GSI mapping comment for ARM64/x86 compatibility
…ppressions Move code superseded by rust-vmm crates behind #[cfg(test)]:
- GuestRam (replaced by vm-memory GuestMemoryMmap)
- DeviceSlot, allocate_device_slot, build_device_tree_entries (replaced by DeviceManager::register_virtio_device)
- build_hv_fdt_config, choose_fdt_addr_hv (replaced by vm-fdt)
- load_kernel_into_ram, load_initrd_into_ram (replaced by linux-loader)
- PAGE_SIZE, VIRTIO_MMIO_SIZE, VIRTIO_IRQ_BASE constants

This eliminates all #[allow(dead_code)] annotations from production code.
When VIRTIO_F_EVENT_IDX is negotiated, the driver checks avail_event in the used ring before deciding whether to send QUEUE_NOTIFY. Without updating avail_event, only the first kick succeeds — subsequent kicks are suppressed because vring_need_event() returns false.
Fix: after processing completions, set avail_event = current_avail_idx in both the virtio-blk and virtio-console process_queue implementations. This tells the driver "notify me on the next request".
Impact:
- Block I/O: 201 reads complete (was stuck after 1)
- Init process executes fully — reads busybox from the EROFS rootfs
- VirtioFS receives FUSE_INIT (QUEUE_NOTIFY for the request queue)
- Console TX continues flowing after boot
Root cause: VirtioFs::new() creates a device without a FuseRequestHandler. FUSE_INIT succeeds (handled internally by FuseSession), but all subsequent filesystem operations return ENOSYS because the handler is None. The VZ backend doesn't have this issue because Apple's Virtualization.framework handles the entire FUSE protocol internally.
Fix: create and start an FsServer (arcbox-fs) for each VirtioFS share and attach it via VirtioFs::with_handler(). Add arcbox-fs as a dependency.
Status: the FsServer starts, and QUEUE_NOTIFY for FUSE_INIT arrives ~10s after boot. Response delivery needs investigation (the mount still blocks).
The VirtioFs trait process_queue was calling the inherent process_queue, which uses internal VirtQueue objects (host-side data structures). These are never populated from guest memory, so pop_avail() always returned empty — FUSE requests were never processed.
Rewrite it to read descriptors directly from guest memory via the queue_config addresses (the same pattern as VirtioBlock), then call process_request() for each FUSE request and write the responses back to the guest's write-only descriptors.
E2E result — FULL PRODUCTION BOOT PATH WORKS:
✅ Kernel boots, EROFS rootfs mounted from virtio-blk
✅ Init executes /sbin/init
✅ VirtioFS `mount -t virtiofs arcbox /arcbox` succeeds
✅ FUSE_INIT handshake: version 7.38, max_readahead=131072
✅ exec /arcbox/bin/arcbox-agent (fails: not found in the test dir)
The only remaining step is providing the real arcbox-agent binary in the VirtioFS share directory.
…x-daemon Wire the HV backend as the default for VmBackend::Auto on ARM64:
- Add `gic` feature propagation through arcbox-core and arcbox-daemon
- Fix resolve_backend: Auto selects HV (VZ only when explicitly chosen)

The daemon successfully boots with the HV backend:
✅ VM backend resolved: Hv (requested: Auto)
✅ Hypervisor.framework VMM initialized with GICv3
✅ 7 VirtIO devices: 2x VirtioFS, 2x block, console, vsock, net
✅ EROFS rootfs mounted, /sbin/init executed
✅ VirtioFS shares: tag=arcbox (data_dir), tag=users (/Users)
✅ FUSE_INIT handshake successful
✅ arcbox-agent binary found and executed
Remaining: the vsock host backend for the agent RPC connection.
Add the HV backend vsock connection:
- connect_vsock_hv() creates a Unix socketpair for the host↔guest data path
- connect_vsock() in darwin.rs dispatches to HV or VZ based on the backend
- Socketpair established: host_fd for the daemon, internal_fd for forwarding

The daemon successfully calls connect_vsock for the agent on port 1024. Data forwarding (socketpair ↔ VirtIO vsock queues) is not yet wired.
Wire vsock host connections through DeviceManager shared state:
- Add vsock_host_fds to DeviceManager (Arc<Mutex<HashMap<port, fd>>>)
- Pass the fds via QueueConfig to VirtioVsock process_queue
- VirtioVsock reads TX packets from guest memory and writes OP_RW payloads to the host fd via libc::write
- connect_vsock_hv registers the fd in the shared map

The TX direction (guest→host) is wired. The RX direction (host→guest) still needs injection into the guest RX queue — this requires polling host fds during vCPU idle (WFI exit) and injecting vsock packets.
Add the host→guest vsock data path:
- DeviceManager::poll_vsock_rx() reads from host fds (non-blocking), builds vsock OP_RW packets, and injects them into the guest RX queue
- Called during the vCPU WFI exit (guest idle)
- Triggers a GIC interrupt after injection to wake the guest
- DeviceManager::trigger_irq_callback(): a public accessor for the IRQ

Missing: the vsock OP_REQUEST/OP_RESPONSE connection handshake. Currently the host writes raw data, but the guest agent hasn't received a connection request yet. We need to inject OP_REQUEST into the RX queue when connect_vsock_hv is called.
Move vsock RX injection (host → guest) from synchronous vCPU loop polling to a dedicated kqueue-based I/O thread, matching the net_rx_worker pattern. The old poll_vsock_rx Phase 1+2 (linear MSG_PEEK per fd, single 4KB read per connection per cycle) is replaced by:
- kqueue event-driven fd monitoring
- 64KB bulk reads per connection per wakeup
- Batching of up to 64 packets with 50µs interrupt coalescing
- Dynamic fd registration for new/closed connections

poll_vsock_rx() now only processes the TX queue (guest → host).
When the vsock RX worker exhausts the credit window (256KB), it must:
1. Flush any pending interrupt so the guest processes the queued data
2. Inject a CreditRequest and fire an interrupt immediately
3. Yield 100µs to let the vCPU process the guest's CreditUpdate TX

Previously, the worker just sent a CreditRequest and broke out without flushing, so the guest never processed the queued data, which meant it never sent a CreditUpdate — a permanent credit deadlock.
Also adds diagnostic logging for credit=0 events showing rx_cnt and peer_fwd_cnt to diagnose any remaining credit-flow issues.
Root cause: the vsock RX worker was reading a payload of up to 8192 bytes but injecting header (44) + payload into descriptors that only held 8192 bytes. The 44-byte overflow caused truncated packets that the guest silently dropped, while record_rx() had already debited the credit — creating a permanent credit deadlock.
Fixes:
- peek_rx_capacity(): pre-check descriptor chain capacity before the read
- Read limit = min(credit, descriptor_capacity - HEADER_SIZE, 64KB)
- inject_packet() returns the written byte count; the used ring is only committed if the entire packet was written (no partial commits)
- record_rx() moved AFTER a successful full injection
- injected_notify fired AFTER inject_packet succeeds (not before)
- inject_vsock_connect() disabled — the worker is the sole RX queue writer, preventing used_idx conflicts between threads

Result: 1.68 Mbps → 7.86 Gbps (a 4,678x improvement)
The credit-notify pipe caused stalls in testing — revert it. The pipe woke the worker on every TX completion (not just CreditUpdate), creating excessive scheduling overhead.
Keep the safe improvements:
- BATCH_SIZE 64 → 128 (more packets per kqueue wakeup)
- DESCRIPTOR_BACKOFF 100µs → 50µs
- TX_BUFFER_SIZE stays at 1 MiB (it controls guest→host, not the host→guest bottleneck, which is the guest's peer_buf_alloc=256KB)

Current ceiling: 7.81 Gbps single stream, limited by the guest kernel's 256 KiB vsock buf_alloc / ~260µs credit RTT. Breaking past this requires increasing the guest kernel's vsock buffer size (sysctl or kernel build config), not host-side changes.
VirtioNet config used default MTU=1500 while the datapath SmoltcpDevice used ENHANCED_ETHERNET_MTU=4000. Guest negotiated MSS=1460 instead of 3960, underutilizing the channel-based RX injection path.
Eliminate the socketpair from the host→guest data path by reading from the host TCP socket directly into guest vsock RX descriptors, following the InlineConn pattern from arcbox-net-inject.
Architecture changes:
- VsockInlineConn: holds the promoted TCP stream + credit state
- write_inline_vsock_header(): writes the 44-byte header directly into the guest buffer
- poll_vsock_inline_conns(): reads TCP into guest descriptor[44..]
- VsockConnector::promote_inline(): trait method for promotion
- After the handshake, the port forwarder clones the TCP stream → worker
- The socketpair remains for the handshake + the guest→host direction

Guest-side optimizations:
- SOL_VSOCK setsockopt: buffer 256 KiB → 8 MiB
- Guest sysctl: rmem_max/wmem_max → 16 MiB
- Relay buffers: 8 KiB → 256 KiB (copy_bidirectional_with_sizes)
- OP_RESPONSE buf_alloc log added for diagnostics
When poll_vsock_inline_conns exhausted RX descriptors, it broke out of the loop without triggering an interrupt. The guest never refilled descriptors, causing a permanent busy-loop with zero-timeout kqueue. Also fix dangling `)` from debug log removal that prevented compilation.
1. Set the vsock buffer on the LISTENER fd (not just the accepted socket): Linux af_vsock clones the listener's buffer_size to child sockets, so OP_RESPONSE immediately carries buf_alloc=8MiB.
2. Fix the descriptor-starvation busy-loop: add a DESCRIPTOR_BACKOFF sleep and only use a zero-timeout kqueue when descriptors are available.
3. Remove local_rx_cnt — use the manager's rx_cnt as the single source of truth for credit calculation.

OP_RESPONSE now shows buf_alloc=8388608. Inline inject transfers ~1.73 Gbps for 3 seconds and then stalls (receiver=0). The header format needs investigation.
The into_std()/from_std() roundtrip on the TCP stream broke the tokio reactor registration, causing ALL connections to hang even when inline promotion returned false. Also properly disable promote_inline at the connector level (commenting out the implementation, not just adding if-false guards). inject_vsock_connect restored with inject_vsock_rx_raw — still has potential RX queue conflict with the worker thread. Need to add a mutex or make the worker the sole RX writer with reliable drain_control_packets.
Diagnostic logging reveals:
- An 8 MiB buf_alloc causes the guest kernel to stop sending CreditUpdates (receiver=0 despite 2.7 GB injected). Reduced to 2 MiB.
- 2 MiB works: credit cycles properly (rx_cnt tracks peer_fwd)
- Stale connections from previous sessions cause a persistent credit=0 busy-loop on dead socketpair fds

Next steps: clean up stale connections, then test 2 MiB baseline throughput.
When in-flight bytes (rx_cnt - peer_fwd_cnt) exceeded peer_buf_alloc, the Wrapping<u32> subtraction wrapped to ~4GB of phantom credit instead of returning 0, so the host sent unlimited data and overwhelmed the guest. This is the same bug as Linux kernel CVE-2026-23069 (commit 60316d7f10b17a7).
Fix: use saturating_sub for the capacity check while keeping Wrapping for the counter difference (which correctly handles u32 monotonic wrap).
Also removes stale connections on EOF/EBADF detection during the credit=0 probe, preventing a busy-loop on dead socketpair fds.
1. peer_avail_credit() u32 wrapping underflow (CVE-2026-23069 equivalent): when in-flight bytes exceeded peer_buf_alloc, the Wrapping<u32> subtraction wrapped to ~4GB of phantom credit. Fix: use saturating_sub.
2. update_peer_credit() allowed fwd_cnt regression: out-of-order guest TX packets could set peer_fwd_cnt to an older value, inflating the in-flight calculation. Fix: only accept an advancing fwd_cnt (a wrapping-safe comparison with the half-space rule).

Also adds a CREDIT OVERRUN detection log for future debugging. With both fixes, credit=0 triggers correctly at exactly buf_alloc bytes in flight, and no overrun is detected. The remaining stall is guest-internal TCP backpressure (the container iperf3 server is saturated), not a VMM credit bug.
Three credit fixes are now in place:
1. saturating_sub for peer_avail_credit (CVE-2026-23069 equivalent)
2. fwd_cnt monotonicity (prevents out-of-order regression)
3. buf_alloc max-only update (prevents regression to the default 262144)

Guest agent changes:
- A splice(vsock→pipe→tcp) relay replaces copy_bidirectional
- 4 MiB TCP SO_SNDBUF/SO_RCVBUF on the container connection
- SOL_VSOCK 2 MiB buf_alloc on the listener fd

Root cause of the persistent stall identified: a guest kernel RCU stall. The host worker injects data faster than the guest kernel can process via softirq/NAPI, starving all guest CPUs. The CreditUpdate never fires because the guest agent process never gets scheduled.
Next: rate-limit host injection or tune interrupt coalescing to prevent guest CPU starvation.
Three tests cover the u32 wrapping underflow bug (CVE-2026-23069 equivalent) that caused phantom credit:
1. peer_avail_credit_saturates_to_zero_when_in_flight_exceeds_buf_alloc — in_flight=300000 > buf_alloc=262144 → must return 0, not ~4GB
2. peer_avail_credit_is_correct_when_within_window — normal operation: credit decreases with rx and increases with fwd_cnt
3. peer_avail_credit_handles_wrapping_counters_but_not_outer_underflow — rx_cnt/peer_fwd_cnt near u32::MAX (wrapping arithmetic stays correct), but the outer subtraction from buf_alloc still saturates to 0

The saturating_sub fix at vsock_manager.rs:184 is the root fix; these tests prevent regression.
…dlocks Replace the unbounded injection loop with a paced architecture:
- CONN_WAKEUP_CAP (512KB): per-connection byte cap per kqueue cycle
- CYCLE_BYTE_CAP (256KB): global cap across all connections, matched to the guest kernel's VIRTIO_VSOCK_RX_BUDGET (64 descriptors × 4KB)
- INJECT_YIELD (100µs): unconditional sleep after each cycle so guest rx_work can drain descriptors via virtqueue_enable_cb
- Fair-share scheduling: CYCLE_BYTE_CAP / nev prevents starvation
- Queue high-water check: skip injection when the avail ring is < 25% free
- CreditRequest with an unconditional IRQ on credit=0 to solicit a CreditUpdate from the guest even when the splice relay is backpressured
- Unconditional trigger_irq on flush (not maybe_notify) to prevent EVENT_IDX suppression from causing a credit deadlock
The macOS default Unix socketpair buffer is ~8KB, which limited per-cycle reads to 8KB regardless of CONN_WAKEUP_CAP. Set SO_SNDBUF/SO_RCVBUF to 512KB on both ends so the worker can read a full batch per kqueue cycle.
Throughput improvement: 700 Mbps → 8+ Gbps (single stream).
- Set tcp_rmem/tcp_wmem to "4096 1048576 16777216" for the TCP autotuner
- Set netdev_max_backlog=5000 for high-throughput loopback
- Add periodic throughput logging to the splice-v2t relay thread
- Log splice errors with errno for stall diagnosis
- Replace tokio copy_bidirectional with a blocking thread relay to eliminate the async runtime from the data path
- Set SO_RCVBUF=16MB on the host TCP socket to delay window closure during brief socketpair backpressure events
- Add relay throughput logging and slow-syscall detection
- Upgrade error logging from debug to warn for relay failures
Split the host→guest relay into two threads connected by an unbounded crossbeam channel:
- TCP reader: continuously drains the host TCP recv buffer
- Socketpair writer: writes to the vsock socketpair at the VMM's pace

This prevents the macOS TCP window from closing during socketpair backpressure events. macOS caps socket buffers at 8MB (kern.ipc.maxsockbuf), so any blocking write longer than ~8ms at 8 Gbps causes TCP persist-mode backoff. The unbounded channel trades memory for stability — peak usage occurs only during backpressure bursts and drains once the worker catches up.
Note: the current architecture still has a throughput mismatch where the host TCP reader outpaces the vsock injection rate, causing the channel to accumulate data. This will be addressed by adaptive pacing in a follow-up.
The new vsock_muxer.rs implements a single kqueue-based event loop that handles both TX processing (guest→host) and RX injection (host→guest), replacing the split vCPU-TX / dedicated-RX-worker architecture.
Key design:
- TX kick via pipe(2): the vCPU writes 1 byte, and the muxer polls the pipe alongside the connection fds
- TX processing reads guest memory directly via GuestMemWriter
- Credit updates from TX take effect immediately in the same cycle
- No INJECT_YIELD needed (the guest NAPI budget=64 handles RCU stalls)
- No CYCLE_BYTE_CAP (credit is the natural limiter)
- Fair-share scheduling for multi-connection fairness

Not wired up yet — device.rs still spawns vsock_rx_worker. Next step: replace the rx_worker spawn with the muxer spawn in device.rs.
- Add vsock_tx_kick_fd pipe to DeviceManager for vCPU→muxer signaling
- Replace maybe_spawn_vsock_rx_worker with maybe_spawn_vsock_muxer, which captures both RX and TX queue configs
- Intercept vsock TX QUEUE_NOTIFY: write a pipe byte instead of calling process_queue inline, letting the muxer handle TX processing
- Update VsockInlineConn references from vsock_rx_worker to vsock_muxer
- Mark the vsock_rx_worker module as deprecated dead code
Uncommitted debugging state preserved for reference. Key changes:

- Fix EVENT_IDX suppression in process_tx_queue: read the live avail_idx from guest memory and write it back as avail_event (same pattern as RX). The previous (current_avail + 1) had wrong memory ordering.
- Suppress poll_vsock_rx when the muxer is active — it was being called on every vCPU MMIO exit (thousands/sec), each call writing to the kick pipe and causing a 150 Mbps throughput cap.
- Add drain_control diagnostic log (op, conn, injected, desc_cap).
- Add step 6.5 fd registration after TX processing.
- Add a trace log for TX kick pipe writes.

This branch is being archived. See the PR description for the full journey.
Pull request overview
Archives an exploration branch that prototypes a macOS Hypervisor.framework (HV) backend with guest-memory device paths and a vsock-based TCP port-forwarding relay, plus supporting utilities (virtqueue helpers, DAX mapper, blocking vsock transport, and benchmarking tooling).
Changes:
- Adds a blocking vsock connect path (VZ) and a blocking vsock transport (HV socketpair) to avoid tokio/kqueue timer stalls during rapid connect/teardown loops.
- Introduces HV-backend infrastructure: backend selection in VmmConfig, additional HV lifecycle state, direct RX injection plumbing, and a DAX mapper interface/implementation.
- Adds vsock-based TCP port forwarding components and a standalone VirtioFS benchmark crate.
Reviewed changes
Copilot reviewed 92 out of 96 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| virt/arcbox-vz/src/socket.rs | Adds connect_blocking for vsock connects without tokio. |
| virt/arcbox-vz/src/ffi/block.rs | Adds a block ABI wrapper capturing a std::sync::mpsc::Sender for blocking connect completion. |
| virt/arcbox-vmm/src/vmm/mod.rs | Replaces use_custom_vmm with backend selection and adds backend resolution plumbing/state. |
| virt/arcbox-vmm/src/vmm/darwin.rs | HV/VZ branching for vsock connect and cleanup; bridge MAC extraction tweak. |
| virt/arcbox-vmm/src/virtqueue_util.rs | Adds shared used-ring helpers and EVENT_IDX notification check. |
| virt/arcbox-vmm/src/memory.rs | Adds MemoryManager::with_mmio_base for HV MMIO layout. |
| virt/arcbox-vmm/src/lib.rs | Exposes new modules (blk worker, dax, vsock muxer, etc.) and re-exports backend enums/config. |
| virt/arcbox-vmm/src/irq.rs | Adds level-triggered IRQ allocation for VirtIO MMIO on ARM64. |
| virt/arcbox-vmm/src/dax.rs | Implements an HV-backed VirtioFS DAX mapper using hv_vm_map/unmap. |
| virt/arcbox-vmm/src/builder.rs | Updates builder defaults to use VmBackend. |
| virt/arcbox-vmm/examples/vmm_boot.rs | Updates example config to use VmBackend. |
| virt/arcbox-vmm/examples/hv_boot_test.rs | Adds an end-to-end HV backend boot test example. |
| virt/arcbox-vmm/Cargo.toml | Adds HV backend/device deps (vm-memory, linux-loader, vm-fdt, etc.) and tracing-subscriber dev dep. |
| virt/arcbox-virtio/src/rng.rs | Adds a virtio-rng device implementation. |
| virt/arcbox-virtio/src/queue.rs | Switches ring constants to virtio-bindings. |
| virt/arcbox-virtio/src/net.rs | Switches net constants/features to virtio-bindings and enables MRG_RXBUF in defaults. |
| virt/arcbox-virtio/src/lib.rs | Re-exports virtio_bindings, adds QueueConfig, and extends VirtioDevice with process_queue. |
| virt/arcbox-virtio/src/console.rs | Implements process_queue for console TX extraction on guest-memory backends. |
| virt/arcbox-virtio/Cargo.toml | Adds getrandom and virtio-bindings. |
| virt/arcbox-port-forward/src/protocol.rs | Defines a minimal vsock handshake protocol for port forwarding. |
| virt/arcbox-port-forward/src/lib.rs | Adds host-side vsock TCP port-forward crate wiring. |
| virt/arcbox-port-forward/Cargo.toml | New crate manifest for vsock port forwarding. |
| virt/arcbox-net/src/lib.rs | Exposes direct_rx module. |
| virt/arcbox-net/src/ethernet.rs | Adds partial-checksum TCP frame builder and pseudo-header checksum helper. |
| virt/arcbox-net/src/direct_rx.rs | Adds FrameSink/ConnSink traits and promoted-connection struct for RX inject fast paths. |
| virt/arcbox-net/src/darwin/vmnet.rs | Switches vmnet configuration to XPC dictionaries (not CFDictionary). |
| virt/arcbox-net/src/darwin/vmnet_ffi.rs | Updates vmnet FFI signatures/keys to XPC/const char * + adds XPC APIs. |
| virt/arcbox-net/src/darwin/smoltcp_device.rs | Refactors ARP seeding to inject two synthetic ARP replies and updates tests/logging. |
| virt/arcbox-net/src/darwin/datapath_loop.rs | Adds optional frame/conn sinks and routes guest-bound frames through them when set. |
| virt/arcbox-net/Cargo.toml | Updates smoltcp to 0.13.0 and adds crossbeam-channel. |
| virt/arcbox-net-inject/src/queue.rs | Adds virtio-net RX queue descriptor injection routine. |
| virt/arcbox-net-inject/src/notify.rs | Adds EVENT_IDX suppression helper. |
| virt/arcbox-net-inject/src/lib.rs | New crate module declarations for RX injection engine. |
| virt/arcbox-net-inject/src/irq.rs | Adds an IRQ handle abstraction for RX inject thread. |
| virt/arcbox-net-inject/src/inline_conn.rs | Adds inline (zero-copy) promoted TCP connection header/payload handling. |
| virt/arcbox-net-inject/src/inject.rs | Adds a dedicated RX injection OS thread with batching/coalescing. |
| virt/arcbox-net-inject/src/guest_mem.rs | Adds a raw guest-memory accessor for injection. |
| virt/arcbox-net-inject/Cargo.toml | New crate manifest for RX injection engine. |
| virt/arcbox-hypervisor/src/darwin/vm.rs | Switches vsock connect to connect_blocking. |
| virt/arcbox-hv/src/vm.rs | Adds exit_all_vcpus API and an ignored test. |
| virt/arcbox-hv/src/vcpu.rs | Adds raw_handle() accessor. |
| virt/arcbox-hv/src/lib.rs | Re-exports ffi and adds reg/sys_reg re-export modules; expands GIC exports. |
| virt/arcbox-hv/src/ffi.rs | Updates/extends GICv3 FFI and adds MPIDR sysreg constant. |
| virt/arcbox-hv/Cargo.toml | Adds libc dependency. |
| virt/arcbox-fs/src/server.rs | Adds DAX mapper setter; switches dispatcher timeout config to CacheProfile. |
| virt/arcbox-fs/src/passthrough.rs | Enables adaptive negative-cache TTL; adds raw fd accessor for DAX. |
| virt/arcbox-fs/src/lib.rs | Adds DaxMapper trait and CacheProfile; updates FsConfig defaults and tests. |
| virt/arcbox-fs/src/fuse.rs | Adds FUSE DAX flags and setup/remove mapping structs. |
| virt/arcbox-fs/src/error.rs | Adds macOS→Linux errno translation (ENOATTR→ENODATA). |
| tests/bench-virtiofs/src/runner.rs | Adds benchmark runner + aggregation logic. |
| tests/bench-virtiofs/src/report.rs | Adds JSON report format + comparison utilities. |
| tests/bench-virtiofs/src/main.rs | Adds CLI harness for running and comparing benchmarks. |
| tests/bench-virtiofs/src/macro_bench.rs | Adds macro-benchmarks (npm install, git clone, rm -rf, find). |
| tests/bench-virtiofs/README.md | Documents benchmark suite usage and targets. |
| tests/bench-virtiofs/Cargo.toml | New manifest for standalone bench crate (not in workspace). |
| rpc/arcbox-transport/src/vsock/transport.rs | On macOS, wraps HV socketpair fds in tokio UnixStream rather than raw AsyncFd. |
| rpc/arcbox-transport/src/vsock/stream.rs | Adds dual-mode vsock stream (AsyncFd vs tokio UnixStream). |
| rpc/arcbox-transport/src/vsock/mod.rs | Exposes blocking transport module/type. |
| rpc/arcbox-transport/src/vsock/blocking.rs | Adds a poll-based blocking transport with framed protocol + tests. |
| internal-docs/architecture/hv-backend.md | Adds internal HV backend architecture documentation. |
| guest/arcbox-agent/src/main.rs | Adds module hook for guest port-forward proxy. |
| guest/arcbox-agent/src/init.rs | Adds sysctl tuning for vsock/TCP buffers and backlog. |
| common/arcbox-logging/src/lib.rs | Notes bridging log to tracing (for smoltcp logs). |
| common/arcbox-constants/src/ports.rs | Adds port constant for guest port-forward vsock service. |
| Cargo.toml | Adds new workspace members and dependencies (arcbox-port-forward, arcbox-net-inject); enables tracing-log. |
| Cargo.lock | Locks new deps/crates (virtio-bindings, linux-loader, vm-* crates, etc.) and bumps smoltcp/heapless. |
| assets.lock | Updates boot asset version + manifest SHA. |
| app/arcbox-daemon/Cargo.toml | Enables gic by default and adds a gic feature wired to core. |
| app/arcbox-core/src/vm.rs | Updates VMM config to use VmBackend; adds vsock inline promotion API. |
| app/arcbox-core/src/vm_lifecycle/mod.rs | Adds APFS preallocation; changes agent readiness wait to blocking-probe path. |
| app/arcbox-core/src/runtime.rs | Replaces macOS inbound L2 forwarding with VsockPortForwarder and rule tracking. |
| app/arcbox-core/src/machine.rs | Changes readiness probe to blocking transport strategy; adds vsock inline promotion API. |
| app/arcbox-core/src/boot_assets.rs | Adds swiotlb=noforce to default cmdline. |
| app/arcbox-core/Cargo.toml | Adds arcbox-port-forward dependency and wires gic feature to arcbox-vmm. |
```rust
Err(std_mpsc::RecvTimeoutError::Timeout) => {
    // Do not release the block here. Virtualization.framework may
    // still invoke the completion handler later, and the block owns
    // the sender that callback will consume.
    tracing::warn!("Vsock connection timed out after {:?}", timeout);
    Err(VZError::Timeout(format!(
        "Vsock connection to port {port} timed out"
    )))
}
```
connect_blocking returns on recv_timeout without ever calling _Block_release(block). Since create_blocking_vsock_context_block() uses _Block_copy, this leaks the heap block (and the boxed Sender captured by it) whenever the connect times out; the completion handler does not release the block itself. Consider arranging for the invoke/dispose path to release the block (or spawning a cleanup waiter that releases it after the callback fires) so timeouts don’t permanently leak blocks.
```rust
/// Resolves the backend selection based on platform constraints.
///
/// When `Auto` is selected, Rosetta requires VZ (Hypervisor.framework cannot
/// translate x86_64 instructions). Otherwise, default to VZ until the HV
/// backend is fully validated.
#[cfg(target_os = "macos")]
fn resolve_backend(config: &VmmConfig) -> ResolvedBackend {
    match config.backend {
        VmBackend::Vz => ResolvedBackend::Vz,
        VmBackend::Hv => ResolvedBackend::Hv,
        // Auto: use HV for native ARM64 workloads. VZ is only needed when
        // the user explicitly requests it (VmBackend::Vz) for Rosetta x86_64
        // translation. The `enable_rosetta` flag just tells VZ to expose the
        // Rosetta share — it doesn't force VZ selection.
        VmBackend::Auto => ResolvedBackend::Hv,
    }
```
resolve_backend’s doc comment says Auto should choose VZ when Rosetta is needed, but the implementation always returns ResolvedBackend::Hv for VmBackend::Auto and ignores config.enable_rosetta. This makes Auto inconsistent with the documented behavior and can select HV even when Rosetta translation is requested/required.
```rust
// block_in_place is used here (instead of spawn_blocking) because
// MachineManager is not Clone/Arc at this call site. block_in_place
// is acceptable: the total blocking time is bounded by
// MAX_ATTEMPTS * MAX_DELAY_MS ≈ 60s, and the blocking RPC uses
// BlockingVsockTransport (libc::poll) — no tokio reactor interaction.
let probe_result: Result<String> = tokio::task::block_in_place(|| {
    let mut delay_ms = INITIAL_DELAY_MS;

    for attempt in 1..=MAX_ATTEMPTS {
        std::thread::sleep(std::time::Duration::from_millis(delay_ms));

        match self.connect_agent(name) {
            Ok(mut agent) if agent.is_blocking() => match agent.ping_blocking() {
                Ok(resp) => {
                    tracing::debug!(
                        "Machine '{}' agent reachable (version: {}, attempt {})",
                        name,
                        resp.version,
                        attempt,
                    );
                    match agent.get_system_info_blocking() {
                        Ok(info) => {
                            if let Some(ip) = select_routable_ip(&info.ip_addresses) {
                                return Ok(ip);
                            }

                            tracing::trace!(
                                "Machine '{}' system info has no routable IP yet (attempt {})",
                                name,
                                attempt,
                            );
                        }
                        Err(e) => tracing::trace!(
                            "Machine '{}' get_system_info failed (attempt {attempt}): {e}",
                            name,
                        ),
                    }
                }
                Err(e) => tracing::trace!(
                    "Machine '{}' ping failed (attempt {attempt}): {e}",
                    name,
                ),
            },
            Ok(_agent) => {
                // Async transport (VZ/Linux) — skip in blocking context.
                tracing::trace!(
                    "Machine '{}' async transport in blocking probe (attempt {})",
                    name,
                    attempt,
                );
            }
            Err(e) => tracing::trace!(
```
The readiness probe only attempts ping_blocking() when agent.is_blocking(). For async transports (VZ backend on macOS, Linux AF_VSOCK), it just logs and continues until timeout, so machines will never become “ready” on those backends. The probe should fall back to the async ping().await/get_system_info().await loop when the transport is async, and only use the blocking path when is_blocking() is true.
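The branching this comment asks for can be sketched as follows. The trait and method names are illustrative stand-ins for the real agent API, and the async arm is modeled synchronously for brevity:

```rust
// Illustrative sketch of transport-mode dispatch: use the blocking path only
// when the transport is blocking (HV AF_UNIX socketpair), and fall back to
// the async ping for VZ/Linux AF_VSOCK so those backends can become ready.
trait AgentProbe {
    fn is_blocking(&self) -> bool;
    fn ping_blocking(&mut self) -> Result<(), String>;
    // Stand-in for the real `ping().await` on async transports.
    fn ping_async(&mut self) -> Result<(), String>;
}

fn probe_once(agent: &mut dyn AgentProbe) -> Result<(), String> {
    if agent.is_blocking() {
        // HV socketpair → BlockingVsockTransport (libc::poll, no tokio).
        agent.ping_blocking()
    } else {
        // VZ / Linux AF_VSOCK → async transport; must not just log and skip.
        agent.ping_async()
    }
}
```

With this shape the retry loop stays identical for both backends; only the single probe call dispatches on the transport mode.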
```rust
// Run the entire probe loop on a blocking thread. On macOS HV backend,
// the agent transport is AF_UNIX socketpair → BlockingVsockTransport.
// Rapid connect/teardown of these fds stalls the tokio kqueue reactor's
// timer wheel, so neither tokio::time::sleep nor tokio::time::timeout
// can be used reliably inside this loop. spawn_blocking isolates the
// probe from the async runtime entirely.
let probe_result = tokio::task::spawn_blocking(move || {
    let deadline = std::time::Instant::now() + timeout;
    let poll_interval = Duration::from_millis(100);

    while std::time::Instant::now() < deadline {
        // Console output (best-effort, non-blocking).
        #[cfg(target_os = "macos")]
        match mm.read_console_output(DEFAULT_MACHINE_NAME) {
            Ok(output) => {
                let trimmed = output.trim_matches('\0');
                if !trimmed.is_empty() {
                    tracing::info!("{}", trimmed.trim_end());
                }
            }
            Err(e) => {
                tracing::debug!("Console read failed: {}", e);
            }
        }

        // connect_agent → AF_UNIX detected → BlockingVsockTransport.
        // ping_blocking uses libc::poll with 5s deadline — no tokio.
        match mm.connect_agent(DEFAULT_MACHINE_NAME) {
            Ok(mut agent) => match agent.ping_blocking() {
                Ok(_) => return Ok(()),
                Err(e) => tracing::debug!("Agent ping failed: {e}"),
            },
            Err(e) => tracing::debug!("Agent connection failed: {e}"),
        }
```
wait_for_agent’s probe loop always calls agent.ping_blocking(), but ping_blocking() errors on async transports. If the VM is running with a non-HV backend (VZ on macOS) or any async transport, this loop will never succeed and will time out. Consider branching: use blocking probe only when agent.is_blocking() (AF_UNIX/HV) and keep the original async probe for async transports.
```rust
for (host_ip_str, host_port, container_port, protocol) in bindings {
    // UDP stays on the legacy L2 injection path for now.
    if protocol.eq_ignore_ascii_case("udp") {
        tracing::debug!(
            "Skipping UDP port forward {}:{} (vsock is TCP-only)",
            host_ip_str,
            host_port,
        );
        continue;
    }
```
Bindings with protocol "udp" are currently skipped with a comment saying they “stay on the legacy L2 injection path”, but there is no fallback implementation here anymore. This effectively disables UDP port publishing on macOS without surfacing an error to the caller. Either implement the legacy UDP path (or an alternative), or return an explicit error/diagnostic when UDP bindings are requested.
```rust
fn send_to_guest(
    frame_sink: Option<&std::sync::Arc<dyn crate::direct_rx::FrameSink>>,
    guest_async: &AsyncFd<FdWrapper>,
    frame_data: &[u8],
    write_queue: &mut VecDeque<FrameBuf>,
) {
    if let Some(sink) = frame_sink {
        let _ = sink.send(frame_data.to_vec());
        return;
    }
```
send_to_guest ignores the boolean result of FrameSink::send(). When the channel is full, frames are silently dropped with no logging/metrics and no fallback, which can break protocols that aren’t self-retransmitting (e.g. DHCP/DNS) and makes diagnosing drops harder. Consider handling false (at least log/track a drop counter, and/or fall back to the socketpair path when available).
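One way to handle the `false` result, sketched with an illustrative atomic drop counter (the function and static names are hypothetical, not the arcbox code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Illustrative drop accounting: when FrameSink::send returns false (channel
// full), count the drop instead of silently discarding the frame.
static RX_SINK_DROPS: AtomicU64 = AtomicU64::new(0);

// Returns the total drop count so callers/metrics can observe it.
fn record_sink_send(sent: bool) -> u64 {
    if sent {
        return RX_SINK_DROPS.load(Ordering::Relaxed);
    }
    // Frame dropped: bump the counter. A real implementation would also
    // rate-limit a warn! log and/or fall back to the socketpair path.
    RX_SINK_DROPS.fetch_add(1, Ordering::Relaxed) + 1
}
```

Even a counter alone makes the failure mode visible: a rising drop count during a DHCP renewal window immediately explains a lease loss that silent drops would hide.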
```rust
// Phase 3: Drain channel frames (smoltcp/DHCP/DNS).
// Use the remaining coalescing timeout after inline polling.
let elapsed = loop_start.elapsed();
let remaining = COALESCE_TIMEOUT.saturating_sub(elapsed);

while (batch as usize) < BATCH_SIZE {
    // Use the remaining timeout for the first recv, then zero
    // for subsequent ones to drain without blocking.
    let timeout = if batch == 0 && inline_conns.is_empty() {
        // No inline conns and nothing batched yet — block for
        // the full coalescing timeout.
        COALESCE_TIMEOUT
    } else if remaining.is_zero() {
        // Timeout already consumed by inline polling — try_recv only.
        Duration::ZERO
    } else {
        remaining
    };
```
The coalescing timeout logic doesn’t match the comment: remaining is computed once before the loop and then reused for every recv_timeout, so after the first received frame the loop may still block up to remaining repeatedly (potentially exceeding the intended COALESCE_TIMEOUT window by a large factor). Consider computing an absolute deadline and recomputing the remaining duration each iteration, or using Duration::ZERO after the first successful recv to drain without blocking.
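The suggested absolute-deadline approach can be sketched like this. Function and parameter names are illustrative, and `std::sync::mpsc` stands in for whatever channel the loop actually drains:

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

// Illustrative fix: compute an absolute deadline once, then recompute the
// remaining budget before every blocking recv. The whole drain is then
// bounded by `window`, no matter how many frames arrive.
fn drain_with_deadline<T>(rx: &Receiver<T>, window: Duration, max_batch: usize) -> Vec<T> {
    let deadline = Instant::now() + window;
    let mut batch = Vec::new();
    while batch.len() < max_batch {
        // Remaining budget shrinks each iteration; once it hits zero,
        // recv_timeout behaves like try_recv and the loop exits on Timeout.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

Recomputing `remaining` inside the loop is the key difference from the reviewed code, which froze it before the loop and could block for `remaining` on every iteration.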
```rust
let avail_idx =
    u16::from_le_bytes([memory[avail_addr + 2], memory[avail_addr + 3]]) as usize;

let mut current = self.last_avail;
let mut completions = Vec::new();

while current != avail_idx {
    let ring_off = avail_addr + 4 + 2 * (current % q_size);
```
Virtqueue indices are u16 and wrap. Here avail_idx is read as u16 but immediately cast to usize, and current/last_avail are tracked as usize with current += 1. When avail_idx wraps (e.g. 65535→0), while current != avail_idx will never terminate. Track indices as u16 and advance with wrapping_add(1) (and do ring offsets using (idx % q_size) after casting).
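A wrap-safe version of the loop can be sketched as follows (function name and return shape are illustrative; the point is the `u16` tracking and `wrapping_add`):

```rust
// Illustrative wrap-safe index walk: track avail/last indices as u16,
// advance with wrapping_add, and cast to usize only for the ring-slot
// modulo. The loop then terminates even across the 65535 → 0 wrap.
fn pending_ring_slots(last_avail: u16, avail_idx: u16, q_size: usize) -> Vec<usize> {
    let mut current = last_avail;
    let mut slots = Vec::new();
    while current != avail_idx {
        // Ring slot for this entry: (idx % q_size) after casting.
        slots.push((current as usize) % q_size);
        current = current.wrapping_add(1); // u16 wrap is well-defined
    }
    slots
}
```

With `usize` indices and `current += 1`, the `65534 → 1` case below would spin forever; with `u16` wrapping it visits exactly three entries.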
```rust
let avail_idx =
    u16::from_le_bytes([memory[avail_addr + 2], memory[avail_addr + 3]]) as usize;

// Read used index.
if used_addr + 4 > memory.len() {
    return Ok(Vec::new());
}
let used_idx_ref = &memory[used_addr + 2..used_addr + 4];
let mut used_idx = u16::from_le_bytes([used_idx_ref[0], used_idx_ref[1]]) as usize;

let mut completions = Vec::new();

// Process available descriptors.
while used_idx != avail_idx {
    let avail_ring_off = avail_addr + 4 + (used_idx % queue_size) * 2;
    if avail_ring_off + 2 > memory.len() {
        break;
    }
    let head_idx = u16::from_le_bytes([memory[avail_ring_off], memory[avail_ring_off + 1]]);
```
Virtqueue avail_idx/used_idx wrap at u16. Converting them to usize and looping with while used_idx != avail_idx { used_idx += 1; } can become non-terminating after wraparound (e.g. used=65535, avail=0). Use u16 indices with wrapping_add(1) and only cast to usize for modulo/address calculations.
```rust
/// macOS: forward TCP ports via vsock pipe to the guest agent.
///
/// Each accepted host TCP connection opens a vsock channel to the
/// guest port-forward proxy (port 1025), which connects to the
/// target and relays bidirectionally. No virtio-net frames, no
/// smoltcp, no header construction.
```
There’s now a mix of documentation around macOS port forwarding: this function uses the vsock relay, but the public start_port_forwarding_for doc comment above still describes the old InboundListenerManager/L2 injection approach. Please update the public docs to match the new macOS implementation so callers aren’t misled.
Archive: vsock-based TCP Port Forwarding
This branch preserves the vsock port-forwarding exploration (18 commits, `7ad8cb6..1fd587d`, plus a final diagnostic wip). Abandoned in favor of returning to the virtio-net path after confirming that vsock's protocol characteristics are a fundamental mismatch for the data-plane throughput target (>50 Gbps).

The baseline we're returning to (`6075f46`, inline vhost + GSO NEEDS_CSUM) sustains 10.4 Gbps on virtio-net. Every vsock variant attempted here stalled or underperformed.
Why vsock was tried
The initial motivation (`7ad8cb6`) was to eliminate the double-TCP-stack overhead: `host TCP → smoltcp → virtio-net frame → guest TCP` becomes `host TCP → vsock → guest TCP`. On paper this is cheaper. In practice it introduced harder problems than it removed.
Architecture attempts (in order)
- `7ad8cb6` — naive TcpListener → vsock relay
- `18fa513` — split TX (vCPU thread) / RX (kqueue thread)
- `22dfc0e` — bypass the host-side virtio layer
- `1cc3dbf` — zero-copy fast path
- `36fd19e`/`1fd587d` — merged TX+RX in a single kqueue loop (OrbStack/crosvm pattern)

None delivered sustained high throughput without stalls.
Root causes discovered
1. vsock has no NAPI budget
Linux's `virtio_transport_rx_work` loops until the RX queue is empty with no cooperative yield. Under sustained injection the kthread monopolizes a CPU, starves RCU, and triggers stall warnings. Patched the in-tree kernel with a `VIRTIO_VSOCK_RX_BUDGET=64` quirk (see `arcbox-kernel` v0.0.15) — this kept the guest alive but did not fix throughput.
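The shape of a budgeted RX work item can be sketched in Rust. This is illustrative only (the real patch is C in the vsock transport); the point is the bounded per-invocation drain with a reschedule signal:

```rust
// Illustrative NAPI-style budget: process at most RX_BUDGET packets per
// work-item invocation, then reschedule, so RCU and other tasks get a
// chance to run instead of the worker looping until the queue is empty.
const RX_BUDGET: usize = 64;

/// Drains up to RX_BUDGET items; returns (processed, needs_resched).
/// `needs_resched == true` means the worker must be re-queued to finish
/// the backlog — that re-queue is the cooperative yield point.
fn rx_work(queue: &mut std::collections::VecDeque<u32>) -> (usize, bool) {
    let mut processed = 0;
    while processed < RX_BUDGET {
        match queue.pop_front() {
            Some(_pkt) => processed += 1, // deliver to the socket layer here
            None => break,
        }
    }
    (processed, !queue.is_empty())
}
```

This keeps the guest responsive under sustained injection, but (as noted above) it bounds latency, not throughput — the serialization through one virtqueue remains.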
2. TCP backpressure stall at ~13 GB
Raising `kern.ipc.maxsockbuf` to 16 MB plus an 8 MB socketpair buffer delayed the stall but did not eliminate it.
3. Credit-update propagation latency (split architecture)
The vCPU writes `peer_fwd_cnt` through `Arc<Mutex<VsockConnectionManager>>`; the RX worker reads it on the next kqueue poll (up to 1 ms later). The workaround was `INJECT_YIELD` pacing, which limited throughput. The muxer rewrite solved the latency but exposed other problems.
4. EVENT_IDX suppression traps
Wrong `avail_event` values repeatedly caused missed kicks. Tried:

- `(current_avail + 1)` — wrong memory ordering, appeared to work briefly
- `0` — WRONG: `vring_need_event(0, new, old)` suppresses after the first kick
- `current_avail.wrapping_sub(1)` — WRONG

Final fix: read the live `avail_idx` from guest memory and write it back as `avail_event` with a `Release` fence.

5. poll_vsock_rx flooding the kick pipe

Called on every vCPU MMIO exit (thousands/sec), each call wrote a byte to the muxer's kick pipe → 150 Mbps throughput cap. Fixed by making `poll_vsock_rx` early-return when the muxer thread is active.

6. kqueue rescan overwriting event array

The zero-timeout rescan after TX processing replaced `nev` with the rescan's return count, discarding pending socketpair readiness events. Removed — level-triggered EVFILT_READ fires on the next main poll anyway.
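The suppression behavior behind item 4 is the virtio spec's `vring_need_event` check. Transliterated to Rust, it makes the failed `avail_event` choices easy to reason about:

```rust
// The EVENT_IDX rule from the virtio spec: the other side should kick only
// if new_idx has moved past event_idx since old_idx. All arithmetic is
// wrapping u16, matching the ring index width.
fn vring_need_event(event_idx: u16, new_idx: u16, old_idx: u16) -> bool {
    new_idx.wrapping_sub(event_idx).wrapping_sub(1) < new_idx.wrapping_sub(old_idx)
}
```

With `avail_event` pinned at 0, any single-step progress past index 1 computes `new - 0 - 1 >= new - old`, so the kick is suppressed forever after the first one — exactly the "suppresses after first" failure above. Writing back the live index (so `event_idx == new_idx - 1` at kick time) makes the check fire.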
7. Credit arithmetic bugs
- `peer_avail_credit` used unchecked `u32` subtraction. A transient out-of-order `peer_fwd_cnt > rx_cnt` briefly underflowed to a huge value, causing absurd read sizes (`c4c01f3`, `19f7f83`).
- `OP_CREDIT_REQUEST` (`0194559`).
- `rx_cnt`/`used_idx` inconsistent (`f0ec520`).

Pitfalls / lessons learned
vsock is a control-plane protocol, not a data-plane one. It is serialized through one virtqueue with a rigid credit model and no NAPI budgeting. `virtio-net` has been optimized for high-throughput streaming for decades; don't try to reinvent it.
Magic numbers are a symptom of not understanding the invariant.
Every `q_size/4`, `INJECT_YIELD_INTERVAL`, and `CYCLE_BYTE_CAP` I added was papering over a real bug. The true invariant (`q_free == 0` for backpressure) is always more informative.
Split vCPU-thread + worker-thread + `Arc<Mutex<>>` architectures have inherent latency (= worst-case poll interval). A unified muxer (crosvm/cloud-hypervisor/OrbStack) is not a style preference — it's required for correctness when credit updates must be immediate.
Sustained tests surface what one-shot tests hide. The 13 GB stall
takes ~14 s of full-rate traffic to manifest.
`iperf3 -t 30` passed; `iperf3 -t 60` failed. Always run long tests.

macOS sysctl hard limits matter. `kern.ipc.maxsockbuf` caps SO_SNDBUF/SO_RCVBUF silently. Default 8 MB, max 16 MB. A setsockopt above the limit does not error — it clamps.
Memory barriers are load-bearing on ARM64. Missing `Release`/`Acquire` fences around `avail_idx`/`used_idx` led to sporadic missed updates that only reproduced on Apple Silicon. Hypervisor.framework does not automatically synchronize.
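The required pairing can be sketched with Rust atomics (function names and layout are illustrative; the real rings live in guest physical memory):

```rust
use std::sync::atomic::{AtomicU16, Ordering};

// Illustrative Release/Acquire pairing for a virtqueue index. The producer
// must publish ring-entry writes *before* the index bump; the consumer must
// load the index *before* reading entries. On ARM64 these orderings emit
// real barriers — omitting them reproduces the sporadic missed updates.
fn publish_ring_idx(idx: &AtomicU16, new_idx: u16) {
    // All prior writes to the ring entries happen-before this store.
    idx.store(new_idx, Ordering::Release);
}

fn observe_ring_idx(idx: &AtomicU16) -> u16 {
    // Pairs with the producer's Release store: entry reads issued after
    // this Acquire load see the contents written before the index bump.
    idx.load(Ordering::Acquire)
}
```

On x86 the stronger hardware model often hides a missing fence, which is why these bugs "only reproduced on Apple Silicon."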
kqueue level-triggered semantics don't need "rescan". Registered fds with pending data fire on the next `kevent()`. Zero-timeout rescans just cause subtle bugs when the return count overwrites the original event array.
The guest keeps pushing TX for connections the host already closed.
Credit updates, shutdown packets, stragglers — saw 304 stale TX
descriptors in one batch. Muxer must gracefully drop these.
Test harness `&&` chains hide failures. `docker rm -f container && docker run ...` — if the `rm` fails because there's no container, the `run` never executes. Use `;` or `|| true` between setup steps.

Files of interest (for future reference)
- `virt/arcbox-vmm/src/vsock_muxer.rs`
- `virt/arcbox-vmm/src/vsock_rx_worker.rs`
- `virt/arcbox-vmm/src/vsock_manager.rs`
- `virt/arcbox-port-forward/src/forwarder.rs`
- `../arcbox-kernel/patches/0001-vsock-rx-budget.sh`

Status
`feat/custom-vmm-phase2` has been reset to `6075f46` (virtio-net baseline) for continued work.