
docs(testing): scaffold Linux compatibility test plan (WIP)#540

Closed
aaddrick wants to merge 38 commits into main from docs/compat-matrix


@aaddrick
Owner

Summary

  • Staging for an eventual automated test harness, in markdown form first
  • 67 functional tests (T01–T39 cross-env, S01–S28 env-specific) + 10 UI surface inventories under `docs/testing/`
  • Stable test IDs, standardized test bodies, per-element UI tables — shape designed so Playwright / xdotool / DBus assertions can be slotted in later without rewriting the corpus

What this is

A test-plan scaffold. The structure (dashboard separated from specs, runbook for sweep mechanics, UI inventory separate from functional tests, severity tiers, smoke set) reflects how mature OSS projects organize manual testing — and is the foundation for layering in automation incrementally.

What this is not

Done. Roughed in but far from ready:

  • Most cells are `?`. The matrix is a dashboard, not a record of what's been verified. Only a handful of statuses (KDE-W daily-driver use, Hypr-N per @typedrat, captured failures from prior sweeps) reflect real testing today.
  • T15–T39 are derived from upstream docs (code.claude.com/docs/en/desktop*) — features whose Linux behavior is officially undocumented (upstream explicitly says "Linux is not supported" for the Code tab). These tests describe intended Linux-side behavior, not anything verified yet.
  • UI checklists are starter inventories. Every surface has the structure, but element-level coverage needs real walkthroughs to flesh out selectors, state expectations, and per-row known-issue notes. Expect to add rows as edge cases surface.
  • No automation wired. The structure supports it. Nothing is plugged in.
  • Severity classifications are best-guess. Should refine once real failure data lands.

Layout

```
docs/testing/
├── README.md      orientation, severity tiers, smoke set, automation roadmap
├── matrix.md      dashboard: cross-env table + env-specific status snapshots
├── runbook.md     VM setup, diagnostic capture, sweep workflow
├── cases/         67 functional tests grouped by feature surface
│   ├── launch.md
│   ├── tray-and-window-chrome.md
│   ├── shortcuts-and-input.md
│   ├── code-tab-foundations.md
│   ├── code-tab-workflow.md
│   ├── code-tab-handoff.md
│   ├── routines.md
│   ├── extensibility.md
│   ├── distribution.md
│   └── platform-integration.md
└── ui/            per-surface UI checklists (every interactive element)
    ├── window-chrome-and-tabs.md
    ├── tray.md
    ├── sidebar.md
    ├── prompt-area.md
    ├── code-tab-panes.md
    ├── settings.md
    ├── routines-page.md
    ├── connectors-and-plugins.md
    ├── quick-entry.md
    └── notifications.md
```

What's covered

  • Historical project surfaces: app launch, doctor, tray, window decorations, hybrid topbar (PR #538, "feat(linux): hybrid titlebar mode for clickable in-app topbar"), Quick Entry, autostart, hide-to-tray, multi-instance.
  • Upstream Code-tab surface: Code tab load, sign-in browser handoff, folder picker (portal/native), drag-drop, integrated terminal, file pane, preview pane, PR monitoring (`gh`), scheduled tasks, connectors OAuth, plugin browser, MCP / hooks / CLAUDE.md memory, Dispatch handoff.
  • Env-specific failures: Ubuntu/DEB, Fedora/RPM, Wayland-native (wlroots), KDE, GNOME (mutter XWayland key-grab regression, #404 "Quick Entry feature does not work properly"), Omarchy, Niri (#BindShortcuts error 5), AppImage, `.desktop` env handling, idle-sleep / suspend, Computer Use (out-of-scope per upstream; graceful unavailability check), auto-update vs apt/dnf, plugin/worktree storage.

Why this shape

  • Dashboard separated from specs. Status updates touch `matrix.md` only; spec authorship touches `cases/`. Two different workflows, two different files — reduces matrix-merge noise.
  • UI inventory separate from functional tests. Functional tests catch "the feature broke." UI checklists catch "the feature works but looks wrong." Both matter for Linux because Electron under different DEs / display servers / GTK theme combos produces visual artifacts that aren't behavioral failures.
  • Stable test IDs + standardized bodies. Sets up automation: `T01`–`T39` and `S01`–`S28` won't move. Each test has `Steps` + `Diagnostics on failure` blocks shaped for scripted runners.

What I'd do next (ordered)

  1. Smoke-set sweep on KDE-W — flip the first 10 cells from `?` to real values, pressure-test the runbook in the process.
  2. Walk one UI surface end-to-end (likely `window-chrome-and-tabs.md`) to validate the checklist format before scaling out.
  3. Prototype the first automation runner — smoke set is the natural target; the standardized bodies should let one runner cover ~10 tests with shared diagnostic capture.
  4. Refine severity once real failure data lands.

Test plan

This is a docs-only change. Suggested review:

  • Skim `docs/testing/README.md` — does the orientation make sense as a front door?
  • Skim `docs/testing/matrix.md` — is the dashboard scannable? Are status semantics clear?
  • Pick one case file (e.g. `cases/code-tab-foundations.md`) — does the standard test body have the right fields for both manual and future automation?
  • Pick one UI file (e.g. `ui/window-chrome-and-tabs.md`) — is element-level granularity right, or too fine / too coarse?
  • Spot-check anchor links from `matrix.md` to `cases/` — they should all resolve.
  • Sanity-check the severity scheme in `README.md` and `runbook.md` — are the tiers and the smoke set the right cuts?

Generated with Claude Code
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
85% AI / 15% Human
Claude: synthesized test items from existing project docs, upstream Claude Code Desktop docs, and Anthropic blog posts; designed the layout; wrote all test bodies, runbook, and UI inventories.
Human: provided the source files (`~/vms/distro-matrix*.md`); directed strategy (file naming, full restructure, add UI doc, eventual-automation framing); steered tradeoffs at each restructure decision point.

aaddrick and others added 25 commits May 3, 2026 07:55
Establish a manual test plan for the Linux fork at docs/testing/, structured
to support eventual automation.

Layout:
- README.md         orientation, severity tiers, smoke set (10 tests),
                    automation roadmap
- matrix.md         cross-env dashboard (T01-T39) + env-specific status
                    snapshots (S01-S28) + known-failures rollup
- runbook.md        VM setup, diagnostic-capture commands, sweep workflow,
                    severity guidance, how to add tests
- cases/            67 functional tests grouped by feature surface; every
                    test has standardized Severity / Steps / Expected /
                    Diagnostics on failure / References sections
- ui/               per-surface UI checklists (window chrome, tray,
                    sidebar, prompt, code-tab panes, settings, routines,
                    connectors/plugins, quick entry, notifications). Every
                    row is an interactive element with selector + expected
                    state.

Coverage:
- Historical project surfaces: app launch, doctor, tray, window
  decorations, hybrid topbar, Quick Entry, autostart, hide-to-tray,
  multi-instance.
- Upstream Claude Code Desktop surfaces (officially "Linux not supported"
  per code.claude.com/docs/en/desktop): Code tab, sign-in flow, folder
  picker, drag-drop, integrated terminal, file pane, preview pane, PR
  monitoring, scheduled tasks, connectors OAuth, plugin browser, MCP /
  hooks / CLAUDE.md memory, Dispatch handoff.
- Env-specific failure modes: Ubuntu/DEB, Fedora/RPM, Wayland-native
  (wlroots), KDE, GNOME (mutter XWayland key-grab), Omarchy, Niri,
  AppImage, .desktop env handling, idle-sleep / suspend, Computer Use
  (out-of-scope per upstream), auto-update vs apt/dnf, plugin/worktree
  storage.

Automation hooks:
- Stable T## / S## test IDs (won't move).
- Standardized test bodies — Steps and Diagnostics fields are
  scripted-runner-shaped.
- UI checklists are per-element tables — every row a candidate
  Playwright / xdotool / DBus assertion.
- Smoke set explicit in README — first 10 tests for automation.

Co-Authored-By: Claude <claude@anthropic.com>
Captures the brainstorm + research pass behind the eventual harness:
three-layer model (renderer / native / manual), why in-VM Playwright
beats orchestrator-driven CDP, toolchain choices per layer (playwright-
electron, dogtail/AT-SPI, ydotool→libei), anti-patterns to design
against from day one, and a suggested first vertical slice (KDE-W + T01).

Includes an Open questions section listing eight decisions still owed
before any of this becomes code — language split, harness location,
image-build tooling, CI execution model, data-testid injection, severity
for the Electron-Wayland-default tests, diagnostic retention, JUnit
output destination.

Sourced; not committed direction yet.

Co-Authored-By: Claude <claude@anthropic.com>
Restructures automation.md from brainstorm-with-open-questions to
direction-with-residual-decisions. Eight calls captured in a Decisions
table near the top:

1. Single language (TypeScript). dbus-next replaces gdbus shell-outs;
   child_process wraps OS-tool invocations as typed TS helpers; portal
   mocking via dbus-next handles native-dialog tests. Python only as a
   last-resort escape hatch for AT-SPI cases that resist mocking.
2. Harness lives at tools/test-harness/.
3. Packer for imperative distro images + Nix flake for Hypr-N.
4. No CI infrastructure initially; harness invokable from CI but
   sweeps run from the dev box for the first ~20 tests.
5. Semantic locators only (getByRole/getByLabel/getByText). No
   proactive data-testid injection patch; escalate per-test if a
   selector proves unstable.
6. X11-default verification is Smoke; Wayland-native characterization
   is Should. Project keeps X11 default because portal coverage for
   GlobalShortcuts is uneven across compositors.
7. Last 10 greens + all reds, on main only. Capture --doctor /
   launcher log / screenshot every run.
8. JUnit lives as workflow-run artifacts. Matrix-regen reads latest
   run's bundle and PRs the matrix update.
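
Decision 7's retention rule can be read as a small pure function. The sketch below is one interpretation, not the harness's code: `RunRecord` and `runsToKeep` are hypothetical names, and it assumes "on main only" means runs from other branches are never retained.

```typescript
// Hypothetical record shape; the harness's real bundle metadata may differ.
interface RunRecord {
  id: string;
  branch: string;
  finishedAt: number; // epoch milliseconds
  outcome: "green" | "red";
}

// Keep every red run plus the newest `greensToKeep` green runs, main only.
function runsToKeep(runs: RunRecord[], greensToKeep = 10): RunRecord[] {
  const onMain = runs.filter((r) => r.branch === "main");
  const reds = onMain.filter((r) => r.outcome === "red");
  const greens = onMain
    .filter((r) => r.outcome === "green")
    .sort((a, b) => b.finishedAt - a.finishedAt)
    .slice(0, greensToKeep);
  // Newest first, so a pruner can delete anything not in this list.
  return [...reds, ...greens].sort((a, b) => b.finishedAt - a.finishedAt);
}
```

A pruning step would then delete every stored bundle whose `id` is not in the returned list.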

T17 (folder picker) moves out of "manual forever" — portal mocking
covers the integration test cleanly. dogtail demoted to escape-hatch
status, only invoked if a specific test forces it.

Co-Authored-By: Claude <claude@anthropic.com>
Adds the in-VM TS harness at tools/test-harness/ covering the four
tests that exercise every distinct shape of harness code:

- T01 — app launch (playwright-electron)
- T03 — tray icon present (dbus-next + StatusNotifierWatcher)
- T04 — window decorations draw (xprop + xdotool shell-out helpers)
- T17 — folder picker opens (Electron-level dialog intercept; v1)

Layout:

    tools/test-harness/
    ├── package.json / tsconfig / playwright.config
    ├── src/lib/         — electron, dbus, sni, wm, env, retry, diagnostics
    ├── src/runners/     — one .spec.ts per test ID
    └── orchestrator/sweep.sh

Per Decision 1 (single-language TS): every runner is .ts; OS tools
(xprop, xdotool, claude-desktop --doctor) are shelled out via
child_process and wrapped as typed TS helpers. dbus-next handles all
DBus introspection. No bash test scripts, no Python.

T17 is the shallow v1 — intercepts dialog.showOpenDialog at the
Electron main process via Playwright's app.evaluate() rather than
mocking the portal. Mocking org.freedesktop.portal.FileChooser via
dbus-next requires displacing the running portal service or running
under dbus-run-session, both intrusive enough to defer until signal
warrants it. The test file documents this and the upgrade path.

T04 uses xprop / xdotool which work on X11 native and KDE Wayland
(via XWayland — the project default per Decision 6). Native-Wayland
window-state queries are deferred.

Wires runner: fields into the four cases/*.md test specs.

Type-check passes; npx playwright test --list discovers all four.

Run with:
    cd tools/test-harness
    npm install
    ROW=KDE-W ./orchestrator/sweep.sh

Co-Authored-By: Claude <claude@anthropic.com>
Captures four real issues surfaced by trying to run T01 against the
installed claude-desktop on Nobara KDE-W, plus the fixes that landed.

Fixes that stuck:

1. Bypass the launcher script (/usr/bin/claude-desktop). It redirects
   Electron's stdout/stderr to
   ~/.cache/claude-desktop-debian/launcher.log, which means Playwright
   can't read the CDP advertisement on stderr. launchClaude now
   resolves the Electron binary + app.asar directly and spawns through
   Playwright. Override paths via CLAUDE_DESKTOP_ELECTRON /
   CLAUDE_DESKTOP_APP_ASAR env vars.

2. Inject the launcher's flags. Decision 6 (X11 default) is enforced
   in production via --disable-features=CustomTitlebar
   --ozone-platform=x11. Without these, Electron 41 hits a fatal
   Wayland communication error ("Broken pipe") on this build. Added
   as LAUNCHER_INJECTED_FLAGS.

3. Inject the launcher's env. ELECTRON_FORCE_IS_PACKAGED=true and
   ELECTRON_USE_SYSTEM_TITLE_BAR=1 mirror setup_electron_env(). The
   former makes app.isPackaged return true so resource resolution
   uses process.resourcesPath; the latter matches hybrid/native
   titlebar modes.

4. Pre-launch cleanup. Mirrors cleanup_orphaned_cowork_daemon +
   cleanup_stale_lock + cleanup_stale_cowork_socket in
   launcher-common.sh. Without it, a previous failed run leaves an
   orphaned cowork daemon and a stale SingletonLock that poison the
   next launch.

Also: dropped the xdotool dependency. wm.ts now finds the X11 window
by walking _NET_CLIENT_LIST + _NET_WM_PID via xprop only, which is
universally installed where xdotool isn't.
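
The xprop-only window lookup could look roughly like the sketch below. It is not the contents of wm.ts: the function names are hypothetical, and the parsers assume xprop's usual text output (`_NET_CLIENT_LIST(WINDOW): window id # 0x…, 0x…` and `_NET_WM_PID(CARDINAL) = <pid>`).

```typescript
import { execFileSync } from "node:child_process";

// Extract window ids from `xprop -root _NET_CLIENT_LIST` output.
function parseClientList(output: string): string[] {
  return output.match(/0x[0-9a-fA-F]+/g) ?? [];
}

// Extract the pid from `xprop -id <win> _NET_WM_PID` output.
// Returns null when the property is missing ("not found.").
function parseWmPid(output: string): number | null {
  const match = /_NET_WM_PID\(CARDINAL\)\s*=\s*(\d+)/.exec(output);
  return match ? Number(match[1]) : null;
}

// Walk every managed window and keep the ones owned by `pid`.
function findWindowsForPid(pid: number): string[] {
  const list = execFileSync("xprop", ["-root", "_NET_CLIENT_LIST"], { encoding: "utf8" });
  return parseClientList(list).filter((win) => {
    try {
      const out = execFileSync("xprop", ["-id", win, "_NET_WM_PID"], { encoding: "utf8" });
      return parseWmPid(out) === pid;
    } catch {
      return false; // window closed between the two xprop calls
    }
  });
}
```

Keeping the parsers separate from the `execFileSync` calls lets them be unit-tested without an X server.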

Open finding documented in README "Known limitations":

  Playwright's _electron.launch() currently fails after Frame Fix
  completes — the Node-inspector ws disconnects (code 1006) before
  the renderer ever advertises its DevTools port. Standalone
  electron --inspect=0 ... app.asar runs cleanly with the same flags
  (Frame Fix → "Starting app" → window created), so the failure is
  specific to Playwright + Electron 41 + this build. Likely
  workarounds: (a) chromium.connectOverCDP() against externally-
  spawned Electron with fixed --remote-debugging-port; (b) skip L1
  entirely for T03/T04 (those don't need Playwright owning the
  process — just spawn via child_process and use dbus-next / xprop).

Type-check passes; orchestrator/sweep.sh runs cleanly. The four
.spec.ts files all discover via npx playwright test --list. The
blocker is the launch handshake, not the harness shape.

Co-Authored-By: Claude <claude@anthropic.com>
Discovered the real blocker behind every failed Playwright launch: the
shipped index.pre.js has an authenticated-CDP gate.

  uF(process.argv) && !qL() && process.exit(1);

uF matches --remote-debugging-port / --remote-debugging-pipe on argv;
qL validates an ed25519-signed token in CLAUDE_CDP_AUTH (signed payload
${timestamp_ms}.${base64(userDataDir)}, 5-minute TTL) against a hardcoded
public key. Without a valid signature the app exits with code 1 right
after frame-fix-wrapper completes.
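
To make the gate's shape concrete, here is a minimal sketch of an equivalent check. It is not upstream's code: the function names are hypothetical, the key pair is generated locally (the real public key is hardcoded upstream), and the token layout `<payload>.<base64 signature>` is an assumption, since only the signed payload format is described above.

```typescript
import { generateKeyPairSync, sign, verify, type KeyObject } from "node:crypto";

const TOKEN_TTL_MS = 5 * 60 * 1000; // 5-minute TTL, per the description above

// True when argv carries one of the two flags the gate is said to match.
function hasDebugPortFlag(argv: string[]): boolean {
  return argv.some((arg) => /^--remote-debugging-(port|pipe)(=|$)/.test(arg));
}

// Signed payload: `${timestamp_ms}.${base64(userDataDir)}`.
function buildPayload(timestampMs: number, userDataDir: string): string {
  return `${timestampMs}.${Buffer.from(userDataDir, "utf8").toString("base64")}`;
}

// ASSUMPTION: token = "<payload>.<base64 signature>".
function issueToken(privateKey: KeyObject, userDataDir: string, nowMs: number): string {
  const payload = buildPayload(nowMs, userDataDir);
  const signature = sign(null, Buffer.from(payload, "utf8"), privateKey);
  return `${payload}.${signature.toString("base64")}`;
}

function tokenIsValid(
  token: string | undefined,
  publicKey: KeyObject,
  userDataDir: string,
  nowMs: number,
): boolean {
  if (!token) return false;
  const parts = token.split("."); // standard base64 never contains "."
  if (parts.length !== 3) return false;
  const [stamp, dirB64, sigB64] = parts as [string, string, string];
  const age = nowMs - Number(stamp);
  if (!Number.isFinite(age) || age < 0 || age > TOKEN_TTL_MS) return false;
  if (dirB64 !== Buffer.from(userDataDir, "utf8").toString("base64")) return false;
  const signed = Buffer.from(`${stamp}.${dirB64}`, "utf8");
  return verify(null, signed, publicKey, Buffer.from(sigB64, "base64"));
}

// Same boolean shape as `uF(process.argv) && !qL() && process.exit(1)`.
function gateWouldExit(
  argv: string[],
  token: string | undefined,
  publicKey: KeyObject,
  userDataDir: string,
  nowMs: number,
): boolean {
  return hasDebugPortFlag(argv) && !tokenIsValid(token, publicKey, userDataDir, nowMs);
}

// Usage with a throwaway key pair.
const demoKeys = generateKeyPairSync("ed25519");
const demoToken = issueToken(demoKeys.privateKey, "/tmp/profile", Date.now());
```

The sketch also shows why the harness cannot mint its own token: verification needs a signature from the private half of the key, which only upstream holds.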

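Both launch paths put a debug-port flag on argv and trigger the gate:
_electron.launch() injects --remote-debugging-port=0 itself, and
chromium.connectOverCDP() needs an Electron that was spawned with
--remote-debugging-port. The signing key is held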
upstream; we can't forge tokens. CDP-driven L1 testing is blocked until
one of: (a) upstream issues a test/CI token, (b) we carry an
app-asar.sh patch that neutralizes the gate, or (c) we drive the
renderer via accessibility (dogtail / AT-SPI). All three are real
options; none belong in this commit.

What ships here, working today:

  T01 — App launch                 ✓ on KDE-W
  T03 — Tray icon present          ✓ on KDE-W (already was)
  T04 — Window decorations draw    ✓ on KDE-W (already was)
  T17 — Folder picker opens        - (skipped, awaits portal mock v2)

The harness now spawns Electron without any debug-port flags and
probes the running app externally — xprop for window state, dbus-next
for tray. T01 verifies "an X11 window with our pid appears within 15s
and its title matches /claude/i" rather than reading navigator.userAgent;
T03/T04 were external-probe tests already.

Sweep output:

  $ ROW=KDE-W ./orchestrator/sweep.sh
  Running 4 tests using 1 worker
    ✓  1 T01 — App launch (7.2s)
    ✓  2 T03 — Tray icon present (7.2s)
    ✓  3 T04 — Window decorations draw (7.1s)
    -  4 T17 — Folder picker opens
    1 skipped
    3 passed (22.9s)
  summary: tests=4 failures=0 errors=0 skipped=1

JUnit XML written, .tar.zst bundle created, exit 0.

The CDP auth gate finding is documented at docs/testing/automation.md
"The CDP auth gate" with the three escape hatches enumerated. Decision 1
and Decision 5 reopen for L1 once the project picks a path.

Co-Authored-By: Claude <claude@anthropic.com>
The CDP gate (lib/electron.ts) only matches --remote-debugging-port /
-pipe on argv. It doesn't check --inspect or runtime SIGUSR1 — which is
the same code path as the in-app Developer → Enable Main Process
Debugger menu item. Spotted by aaddrick.

So we spawn Electron clean (gate stays asleep), wait for the X11
window, then send SIGUSR1 to attach the Node inspector at runtime.
From there we get main-process JS evaluation, which reaches the
renderer via webContents.executeJavaScript() and supports main-process
mocks (dialog.showOpenDialog for T17).
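
A rough sketch of the runtime-attach step, under stated assumptions: `attachInspector` and `pickDebuggerUrl` here are illustrative names rather than the harness's API, and the code relies on the Node inspector's HTTP discovery endpoint (`/json/list`) answering on the default port 9229 after SIGUSR1.

```typescript
// One entry of the inspector's /json/list response (only the field we need).
interface InspectorTarget {
  webSocketDebuggerUrl?: string;
}

// Pick the first target that advertises a WebSocket URL.
function pickDebuggerUrl(targets: InspectorTarget[]): string | null {
  for (const target of targets) {
    if (target.webSocketDebuggerUrl) return target.webSocketDebuggerUrl;
  }
  return null;
}

// Send SIGUSR1, then poll the discovery endpoint until it answers.
async function attachInspector(pid: number, timeoutMs = 10_000, port = 9229): Promise<string> {
  process.kill(pid, "SIGUSR1"); // asks the main process to start its inspector
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(`http://127.0.0.1:${port}/json/list`);
      const url = pickDebuggerUrl((await res.json()) as InspectorTarget[]);
      if (url) return url;
    } catch {
      // inspector not listening yet; retry below
    }
    await new Promise((resolve) => setTimeout(resolve, 200));
  }
  throw new Error(`inspector did not answer on port ${port} within ${timeoutMs}ms`);
}
```

The returned URL is what a WebSocket client would connect to in order to send `Runtime.evaluate` requests to the main process.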

What landed:

  src/lib/inspector.ts   — new. WebSocket Node-inspector client with
                           evalInMain<T>() and evalInRenderer<T>()
                           wrappers. Node 22+ built-in WebSocket; no
                           extra deps.
  src/lib/electron.ts    — adds app.attachInspector(timeoutMs) which
                           SIGUSR1's the pid and waits for port 9229
                           to answer.
  src/runners/T17        — re-enabled. Inspector attaches, dialog mock
                           installs, claude.ai webContents found,
                           Code-tab navigation click succeeds. Skips
                           with rich diagnostic if the folder-picker
                           click chain doesn't land — selector tuning
                           is iterate-as-needed work, not a blocker.

Two implementation gotchas captured in code comments:

  - BrowserWindow.getAllWindows() returns 0 because frame-fix-wrapper
    substitutes the class and breaks the static registry. Use
    webContents.getAllWebContents() instead — works correctly.
  - Runtime.evaluate's awaitPromise + returnByValue returns empty
    objects for awaited Promise resolutions. Workaround: IIFE returns
    JSON.stringify(value) and caller JSON.parses.
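
The JSON round-trip workaround might be wrapped like this. The helper names are hypothetical; `expression`, `awaitPromise`, and `returnByValue` are the actual `Runtime.evaluate` parameters.

```typescript
// Parameters for the inspector protocol's Runtime.evaluate method.
interface EvaluateParams {
  expression: string;
  awaitPromise: boolean;
  returnByValue: boolean;
}

// Wrap `body` (a JS expression, possibly a Promise) in an async IIFE so the
// remote side resolves to a JSON string rather than a structured object.
function buildEvaluateParams(body: string): EvaluateParams {
  return {
    expression: `(async () => JSON.stringify(await (${body})))()`,
    awaitPromise: true,
    returnByValue: true,
  };
}

// Decode the string that came back in the evaluation result's value.
function decodeEvaluateResult<T>(value: unknown): T {
  if (typeof value !== "string") {
    throw new Error(`expected a JSON string, got ${typeof value}`);
  }
  return JSON.parse(value) as T;
}
```

One caveat of this shape: `JSON.stringify(undefined)` is `undefined`, so a body that evaluates to `undefined` would hit the error branch and needs its own handling.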

Sweep output:

  $ ./orchestrator/sweep.sh
  ✓  T01 — App launch (7.2s)
  ✓  T03 — Tray icon present (7.2s)
  ✓  T04 — Window decorations draw (7.1s)
  -  T17 — Folder picker opens
  3 passed, 1 skipped (44s)

Decision 1's escape-hatch reasoning (dogtail / AT-SPI) is no longer the
fallback for L1; it's only relevant for native dialogs the inspector
pattern can't reach. The three documented escape hatches under "The CDP
auth gate" can be retired — option (4), runtime-attach, is what we
actually use.

Co-Authored-By: Claude <claude@anthropic.com>
The README's "Automation roadmap" section was written when the harness
didn't exist; it described automation in the future tense. Same for the
runbook's "Eventual automation" section ("runner: fields are
aspirational"). Both lied as of last week.

  README "Automation status" — points at tools/test-harness/, lists the
                               four wired runners (T01/T03/T04/T17),
                               links automation.md for architecture,
                               links runbook for invocation.
  runbook "Automated runs"   — sweep.sh invocation, output paths,
                               JUnit-to-matrix mapping, coexistence
                               with manual tests, brief on the
                               SIGUSR1 / runtime-attach path through
                               the CDP gate (with link to the long
                               writeup in automation.md).

Co-Authored-By: Claude <claude@anthropic.com>
Focused sweep plan for closing #393 / #404 / #370, anchored in upstream
design intent rather than user expectation (validated against
build-reference/.vite/build/index.js).

Adds nine functional test specs (S29-S37) covering Quick Entry popup
lifecycle, submit-flow reachability across main-window states, the
fullscreen edge case, position memory across restart, multi-monitor
fallback, and popup-survives-main-destroy behaviour. Each spec cites
specific upstream file:line evidence.

Refines ui/quick-entry.md rows with the same upstream evidence and adds
rows for popup lifecycle and main-window-destroy persistence. Submit
transition row now reflects "always a new chat session, never appended
to current" per index.js:515546.

Co-Authored-By: Claude <claude@anthropic.com>
Three prerequisites built before adding the closeout sweep runners:

- Per-test isolation default in launchClaude(). Fresh
  XDG_CONFIG_HOME / CLAUDE_CONFIG_DIR per launch via mkdtemp,
  cleaned up on close. Three modes: default (fresh), shared
  (pass an Isolation handle for restart-style tests like S35),
  null (host config — opt-in for tests that need real claude.ai
  auth via CLAUDE_TEST_USE_HOST_CONFIG).
- Row-skipping primitive (skipUnlessRow) so spec files declare
  applicability once and the orchestrator routes correctly. Maps
  to JUnit <skipped> → matrix `-`.
- Layered Critical/Should assertion pattern. Local signals stay
  local (popup-closed = isVisible() === false), network-coupled
  signals (chat URL nav) are tracked separately so a claude.ai
  hiccup doesn't fail a regression cell.
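
The three isolation modes can be sketched as follows. This is an illustration, not isolation.ts itself: `planLaunch`, `closeLaunch`, and `LaunchPlan` are hypothetical names, and pointing both XDG_CONFIG_HOME and CLAUDE_CONFIG_DIR at the same temp directory is an assumption.

```typescript
import { existsSync, mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface Isolation {
  configDir: string;
  cleanup(): void;
}

type Env = Record<string, string | undefined>;

interface LaunchPlan {
  env: Env;
  isolation: Isolation | null;
  ownsIsolation: boolean; // true only for the default (fresh) mode
}

// Fresh per-test sandbox under the system temp directory.
function createIsolation(): Isolation {
  const configDir = mkdtempSync(join(tmpdir(), "claude-harness-"));
  return { configDir, cleanup: () => rmSync(configDir, { recursive: true, force: true }) };
}

// undefined -> fresh sandbox (default); Isolation -> shared handle; null -> host config.
function planLaunch(mode: Isolation | null | undefined, baseEnv: Env): LaunchPlan {
  if (mode === null) return { env: { ...baseEnv }, isolation: null, ownsIsolation: false };
  const isolation = mode ?? createIsolation();
  return {
    // ASSUMPTION: both variables point at the same directory.
    env: { ...baseEnv, XDG_CONFIG_HOME: isolation.configDir, CLAUDE_CONFIG_DIR: isolation.configDir },
    isolation,
    ownsIsolation: mode === undefined,
  };
}

// On close, remove the sandbox only if this launch created it.
function closeLaunch(plan: LaunchPlan): void {
  if (plan.ownsIsolation && plan.isolation) plan.isolation.cleanup();
}

// Usage: a default launch owns its sandbox and removes it on close.
const demoPlan = planLaunch(undefined, { HOME: "/home/tester" });
const demoDirExisted = demoPlan.isolation !== null && existsSync(demoPlan.isolation.configDir);
closeLaunch(demoPlan);
```

A restart-style test such as S35 would call `createIsolation()` once, pass the handle to both launches, and call `cleanup()` itself at the end.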

New libs:
- isolation.ts — per-test sandbox
- row.ts — skipUnlessRow / skipOnRow
- argv.ts — /proc/$pid/cmdline + flag-presence check (QE-6, S07,
  S12, future Wayland-default Smoke)
- asar.ts — in-place app.asar reads via @electron/asar (QE-19,
  future patch sanity for tray.sh / cowork.sh / etc.)
- quickentry.ts — domain wrapper. Single point of coupling to
  upstream's main-process structure for QE-* tests. Anchors on
  stable strings (loadFile path '.vite/renderer/quick_window/
  quick-window.html', IPC channel names, settings keys), not
  minified vars.
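
For the argv.ts entry, a flag-presence check over /proc/$pid/cmdline could be as small as the sketch below. The function names are illustrative; the only format assumption is the Linux one, that cmdline is NUL-separated with a trailing NUL.

```typescript
import { readFileSync } from "node:fs";

// /proc/<pid>/cmdline holds argv joined by NUL bytes, with a trailing NUL.
function parseCmdline(raw: string): string[] {
  if (raw.length === 0) return []; // kernel threads and zombies have no argv
  return raw.replace(/\0$/, "").split("\0");
}

function readArgv(pid: number): string[] {
  return parseCmdline(readFileSync(`/proc/${pid}/cmdline`, "utf8"));
}

// True for "--flag" and "--flag=value", but not for "--flag-suffix".
function hasFlag(argv: string[], flag: string): boolean {
  return argv.some((arg) => arg === flag || arg.startsWith(`${flag}=`));
}

// Chromium-style feature lists are comma-separated: --enable-features=A,B
function hasEnabledFeature(argv: string[], feature: string): boolean {
  const prefix = "--enable-features=";
  return argv.some(
    (arg) => arg.startsWith(prefix) && arg.slice(prefix.length).split(",").includes(feature),
  );
}
```

An S12-style check would then reduce to `hasEnabledFeature(readArgv(pid), "GlobalShortcutsPortal")`.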

S31 — Quick Entry submit reaches new chat from any main-window
state. Backs QE-7/8/9; passes on KDE-W in ~28s.

The interceptor pivot worth noting: scripts/frame-fix-wrapper.js
returns the electron module wrapped in a Proxy whose `get` trap
returns a closure-captured PatchedBrowserWindow. Constructor-level
wraps (`electron.BrowserWindow = Wrapped`) are silently bypassed —
writes succeed but reads ignore them. The reliable hook is at the
prototype-method level (loadFile / loadURL); captures every
instance regardless of subclass identity. Documented in
docs/learnings/test-harness-electron-hooks.md so the next
contributor doesn't re-discover the trap.
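
The Proxy trap is easy to reproduce in isolation. The sketch below uses stand-in classes rather than Electron's real ones (`OriginalWindow`, `PatchedWindow`, and `wrapModule` are hypothetical); it shows a constructor-level write being ignored on read, and a prototype-method hook still capturing the call.

```typescript
class OriginalWindow {
  loadFile(path: string): string {
    return `loaded ${path}`;
  }
}
class PatchedWindow extends OriginalWindow {}

// Stand-in for the wrapper: the `get` trap always returns the closure-captured class.
function wrapModule(mod: { BrowserWindow: typeof OriginalWindow }) {
  return new Proxy(mod, {
    get(target, prop, receiver) {
      if (prop === "BrowserWindow") return PatchedWindow;
      return Reflect.get(target, prop, receiver);
    },
  });
}

const electronLike = wrapModule({ BrowserWindow: OriginalWindow });

// Constructor-level wrap: the write lands on the target, but reads ignore it.
class Wrapped extends OriginalWindow {}
electronLike.BrowserWindow = Wrapped;
const readAfterWrite = electronLike.BrowserWindow; // still PatchedWindow

// Prototype-method hook: applies to every instance, whatever subclass built it.
const seenPaths: string[] = [];
const originalLoadFile = OriginalWindow.prototype.loadFile;
OriginalWindow.prototype.loadFile = function (this: OriginalWindow, path: string): string {
  seenPaths.push(path);
  return originalLoadFile.call(this, path);
};

const loadResult = new electronLike.BrowserWindow().loadFile("quick-window.html");
```

Because `PatchedWindow` inherits `loadFile` from the shared prototype, the hook fires even though the test never gets to choose which constructor runs.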

ydotool is a hard prerequisite for QE-* shortcut injection.
README's "Quick Entry runners" section walks through one-time
host setup (install + ydotoold systemd override for a
world-writable socket). sweep.sh fast-fails with a clear
diagnostic when the daemon isn't reachable.

What's left: ten more runners (S29/S30/S32/S33/S34/S35/S36/S37,
QE-6/19 patch sanity, QE-15/17/21 popup chrome). Each is a
~30-60-line recombination over the existing libs — see plan in
the closing message of this PR thread.

---
Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <claude@anthropic.com>
40% AI / 60% Human
Claude: drafted libs + runner, debugged the frame-fix-wrapper Proxy trap, wrote the learnings entry, ran S31 on bare-metal KDE-W
Human: scoped the prerequisites split, ran ydotool/ydotoold setup, validated the output, drove design tradeoffs (per-test isolation default, layered Critical/Should assertion, prototype-hook over constructor wrap)
Wires up the remaining QE-* sweep runners from
docs/testing/quick-entry-closeout.md. Full sweep on KDE-W now runs
16 specs in ~2.2 min; 10 pass, 5 cleanly skip per spec intent
(S12/S32 row-gated to GNOME-W, S36 single-monitor, S37 unreachable
on Linux, T17 mid-air on selector tuning).

Specs landed:

- S09 — patch sanity (asar grep for the KDE-gate string). Pure file
  probe, no app launch, ~75ms.
- S12 — `--enable-features=GlobalShortcutsPortal` argv check.
  GNOME-W only. Currently a known-failing regression detector
  until the launcher patch lands; greens once #404 is closed.
- S29 — popup lazy-create from closed-to-tray. Verifies the popup
  webContents is null before the first shortcut, then opens.
- S30 — shortcut becomes a no-op after full app exit. Switched
  from "no leftover process" to a pgrep-pid-delta assertion; the
  spec's regression target is "no NEW pid spawned by the
  shortcut," not "zero leftovers" (renderer/zygote teardown is
  asynchronous, not what S30 is testing).
- S31 — pre-existing; updated to use openAndWaitReady().
- S32 — GNOME-W/Ubu-W variant of S31 with a main-reappears
  assertion that S31 explicitly avoids. Skips on KDE rows; will
  fail on GNOME-W until the stale-isFocused() patch is widened
  beyond the current KDE-only #406 gate.
- S33 — bundled Electron version. Reads from
  `electron/package.json` rather than running `electron --version`
  (the bundled binary auto-loads `resources/app.asar` so `--version`
  gets passed through as argv to Claude Desktop instead of being
  intercepted by Electron's flag parser).
- S34 — fullscreen main suppresses popup. Inverse-shape test:
  popup must NOT be visible within 3s of the shortcut.
- S35 — position memory across app restart. Two-launch test
  using a shared isolation handle so XDG_CONFIG_HOME persists
  across the restart. Heaviest runner (~30s).
- S36 — multi-monitor fallback. Skips with `-` on single-monitor
  hosts per the closeout spec; uses test.fixme() on multi-monitor
  hosts to surface the missing libvirt-detach orchestration as
  `?` (untested) rather than a misleading green.
- S37 — main-window destroy. Documented skip — unreachable on
  Linux per the close-to-tray override. Marked `-` on every
  Linux row in the matrix.
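
S30's pid-delta assertion comes down to a set difference over two process listings. A minimal sketch, assuming pgrep's one-pid-per-line output; both function names are hypothetical:

```typescript
// pgrep prints one pid per line; ignore blank or malformed lines.
function parsePgrepOutput(output: string): number[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^\d+$/.test(line))
    .map((line) => Number(line));
}

// Pids present after the shortcut press that were not present before it.
function spawnedSince(before: number[], after: number[]): number[] {
  const known = new Set(before);
  return after.filter((pid) => !known.has(pid));
}
```

Pids that linger from asynchronous renderer/zygote teardown appear in both snapshots, so they drop out of the delta and only a genuinely new spawn fails the assertion.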

Two race conditions surfaced and fixed during the bring-up:

1. **lHn() user-loaded race.** Upstream's shortcut handler
   (build-reference index.js:515604) checks `!user.isLoggedOut`
   AFTER ready-to-show and silently skips Ko.show() if the
   main-process user object hasn't populated yet. URL-changes-past-
   /login (visible in the renderer) precedes user-object population
   (in the main process). Mitigation: a new `openAndWaitReady()`
   helper that retries the shortcut up to 3 times with a
   per-attempt timeout. Used by S29-S32, S35.
2. **Main-visible-then-trigger race.** Triggering the shortcut
   immediately after the X11 window appears races the popup show()
   flow on first invocation. Mitigation: wait for
   `mainWin.getState().visible === true` before the first shortcut
   call. The same wait fixes the in-process case where lHn() was a
   non-issue.

New harness primitive:

- `waitForUserLoaded(inspector, timeoutMs)` in lib/quickentry.ts —
  polls the claude.ai webContents URL until it's no longer on a
  /login or /auth path. The signal is necessary but not sufficient
  for the lHn() race (auth state has its own timeline), so the
  retry-loop in `openAndWaitReady()` does the actual heavy lifting.

README's Status table updated to list all 16 specs, layout
section adds the 10 new runner files.

---
Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <claude@anthropic.com>
35% AI / 65% Human
Claude: drafted runners + helpers, traced lHn() race through build-reference, debugged race conditions iteratively against the local install
Human: scoped batches, validated each runner outcome, drove the diagnostic-attachment + retry-vs-sleep tradeoff decisions
Six QE specs (S29-S32, S34, S35) hand-rolled six different shapes of "wait
until the app is ready" — some polled mainWin.getState().visible,
some additionally polled for any claude.ai webContents, some
chained waitForUserLoaded for the URL-past-/login signal. Each
spec started with a 10-20 line block of polling boilerplate.

Replaces those with a tiered helper on the ClaudeApp interface:

  app.waitForReady(level, opts?) → ReadyResultFor<level>

with four levels:
  - 'window'      — X11 window mapped (no inspector)
  - 'mainVisible' — main shell BrowserWindow.isVisible()
  - 'claudeAi'    — any claude.ai webContents reachable
  - 'userLoaded'  — claude.ai URL past /login (lHn() precondition)

Higher levels include all lower-level checks. Returns a
conditionally-typed shape per level so the inspector handle is
non-optional at 'mainVisible' or higher (no `inspector!` casts at
call sites). Single overall timeout (default 90_000ms) flows
across steps — slow startup eats from later steps' budget rather
than tripping a per-step deadline.
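
A simplified sketch of how the level ordering, the conditional return type, and the shared budget could fit together. It is not the code in lib/electron.ts: `Probe`, `SharedBudget`, `stepsFor`, and `InspectorHandle` are hypothetical, the probes are injected, and the soft-fail path for 'claudeAi' / 'userLoaded' is left out for brevity.

```typescript
type ReadyLevel = "window" | "mainVisible" | "claudeAi" | "userLoaded";
const LEVEL_ORDER: ReadyLevel[] = ["window", "mainVisible", "claudeAi", "userLoaded"];

interface InspectorHandle {
  port: number; // hypothetical shape
}

// 'window' needs no inspector; every higher level carries a non-optional handle.
type ReadyResultFor<L extends ReadyLevel> = L extends "window"
  ? { level: L }
  : { level: L; inspector: InspectorHandle };

// Higher levels include all lower-level checks.
function stepsFor(level: ReadyLevel): ReadyLevel[] {
  return LEVEL_ORDER.slice(0, LEVEL_ORDER.indexOf(level) + 1);
}

// One deadline for the whole call; each step sees what is left of it.
class SharedBudget {
  private readonly deadline: number;
  private readonly now: () => number;
  constructor(totalMs: number, now: () => number = Date.now) {
    this.now = now;
    this.deadline = now() + totalMs;
  }
  remaining(): number {
    return Math.max(0, this.deadline - this.now());
  }
}

// A probe resolves true when its condition is met within the given timeout.
type Probe = (timeoutMs: number) => Promise<boolean>;

async function waitForReady<L extends ReadyLevel>(
  level: L,
  probes: Record<ReadyLevel, Probe>,
  inspector: InspectorHandle,
  totalMs = 90_000,
): Promise<ReadyResultFor<L>> {
  const budget = new SharedBudget(totalMs);
  for (const step of stepsFor(level)) {
    const ok = await probes[step](budget.remaining());
    if (!ok) throw new Error(`timed out waiting for '${step}'`);
  }
  const result = level === "window" ? { level } : { level, inspector };
  return result as unknown as ReadyResultFor<L>;
}
```

With this typing, `(await waitForReady("mainVisible", probes, handle)).inspector` type-checks without a non-null assertion, while the same property access on a `"window"` result is a compile error.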

Hard-fail vs soft-fail split mirrors what the specs already did:

  - 'window' / 'mainVisible' throw on timeout — no spec today
    has a skip path for these, treat as hard regression.
  - 'claudeAi' / 'userLoaded' return with claudeAiUrl /
    postLoginUrl absent on timeout. Caller checks the field and
    testInfo.skip()s — the existing not-signed-in skip pattern
    in S31, S32, S35.

Migrations:

  S29, S30, S34   → 'mainVisible'
  S31, S32        → 'claudeAi'  (preserves the not-signed-in skip)
  S35 (×2 launch) → 'userLoaded' (preserves the skip on both)

Net -64 lines across the six specs (boilerplate gone) and +130
lines in lib/electron.ts (the helper + types). The trade is
worth it for the next QE-* runner — readiness becomes a single
named call instead of another bespoke poll.

Deliberately preserved:

  - openAndWaitReady's retry loop in lib/quickentry.ts. The
    lHn() race (build-reference index.js:515604) lives on a
    different timeline than the renderer URL — main-process
    user state can lag the URL change past /login. 'userLoaded'
    is necessary but not sufficient; the retry-on-shortcut path
    is the cheapest mitigation and stays.
  - S35's first-launch 3s sleep between userLoaded and the
    first openAndWaitReady. openAndWaitReady's retry would
    catch the race too, but eating one full attempt +
    retryDelayMs is slower than the upfront sleep on a test
    that already runs ~30s.

waitForUserLoaded stays exported from lib/quickentry.ts (lHn()
race domain knowledge belongs there) and is consumed by
electron.ts. No re-export to keep one canonical import path.

Validated on KDE-W: 10 passed, 5 cleanly skipped (S12/S32 row,
S36 single-monitor, S37 Linux-unreachable, T17 on /login),
2.1 minutes total. npm run typecheck clean.

---
Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <claude@anthropic.com>
60% AI / 40% Human
Claude: drafted the helper API, sorted out the conditional-type vs overload tradeoff, migrated the six specs, ran the validation sweep
Human: scoped which specs to migrate, defined the level semantics, called out openAndWaitReady's retry as untouchable, validated outcome
… lib

Adds eighteen pieces of work across the harness, partitioned by file
so they don't conflict, dispatched in parallel and merged together.

== Negative validations on existing runners ==

T03 — assert exactly one SNI item is registered (not just presence),
plus toggle nativeTheme.themeSource and re-assert. Catches the
tray-rebuild-race regression where the destroy+recreate path would
briefly register a duplicate item before deregistering the old one
(see docs/learnings/tray-rebuild-race.md).

S29 — assert the popup BrowserWindow is reused across shortcut
presses, not re-constructed. Counts entries in __qeWindows matching
the popup selector after the first press AND after a second press —
both must equal 1. Catches a regression where lazy-create runs every
press instead of show()/hide() on a persisted Ko ref.

S30 — broadens the "no ghost respawn" delta into a full closeout-
leak panel. Three additional checks BEFORE the post-exit shortcut
press: no `cowork-vm-service` pids remain, the SNI item is
deregistered (connection gone), no leftover `SingletonLock`
symlink under the isolation's configDir. Existing post-shortcut
delta assertion preserved.

S32 — replaces the silent `.catch(() => {})` on waitForPopupClosed
with explicit popup-state-after-submit assertion. The stale-
isFocused short-circuit can also leave the popup visible (since
popup.hide() lives downstream of the skipped show()) — independent
regression detector from the main-window-visibility check.

S34 — adds focus-side assertion to what was a suppression-only
test. Upstream contract is `if (ut.isFullScreen()) { ut.focus();
ide(); }` — verify main is still fullscreen AND focused after the
shortcut. KDE-W/KDE-X hard-fail (focus is reliable on Plasma);
GNOME-W/Ubu-W soft-fixme (mutter routinely no-ops focus on
fullscreen surfaces).

S35 — three-launch shape: the existing two-launch position-memory
check plus an on-disk round-trip (read parsed config.json between
launches to confirm the save handler reached disk) plus a clear-
and-default check (delete the saved key, launch a third time,
assert the popup lands somewhere other than the cleared TARGET —
proves the test is reading the real store). Bumped per-test
timeout from 180_000 to 240_000.

== New harness self-tests (H-prefix) ==

Introduces an H-prefix convention for runners that validate the
harness's preconditions and the build pipeline's invariants —
distinct from T-tests (upstream test cases) and S-tests (doc-
spec entries). Cheap, fast, ground-truth what the other tests
assume.

H01 — CDP gate canary. Spawns bundled Electron with
`--remote-debugging-port=0` and no CLAUDE_CDP_AUTH; asserts exit
code 1 within 10s. If the gate is ever accidentally removed, this
fires before the rest of the L1 strategy silently weakens.

H02 — frame-fix-wrapper presence. Asserts both
`frame-fix-wrapper.js` and `frame-fix-entry.js` exist in app.asar,
the wrapper contains `Proxy(`, and `package.json#main` references
the entry. File probe — sub-second.

H03 — patch fingerprints. Manifest-based check for every
build-pipeline patch (KDE gate, frame-fix inject, tray
nativeTheme guard, cowork Linux daemon shutdown, claude-code
linux-arm64 branch). Catches silent build-orchestrator drift.
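
A manifest-based check of the kind H03 describes could look roughly like the sketch below. The manifest shape, the patch ids, and the marker strings are hypothetical placeholders, not the harness's actual manifest.

```typescript
// Hypothetical manifest shape: one entry per build-pipeline patch.
interface PatchFingerprint {
  id: string;          // illustrative patch id
  mustContain: string; // literal the patched bundle is expected to contain
}

// Returns the ids of patches whose fingerprint is absent from the bundle text.
function missingPatches(bundle: string, manifest: PatchFingerprint[]): string[] {
  return manifest
    .filter((patch) => !bundle.includes(patch.mustContain))
    .map((patch) => patch.id);
}

// Illustrative data only; real markers would come from the patch scripts.
const manifest: PatchFingerprint[] = [
  { id: "frame-fix-inject", mustContain: "frame-fix-entry.js" },
  { id: "example-missing", mustContain: "__marker_that_is_absent__" },
];
const bundle = 'require("./frame-fix-entry.js");';
const missing = missingPatches(bundle, manifest);
```

A runner built this way would fail with the list of missing ids, so a silently dropped patch names itself in the error.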

H04 — cowork daemon lifecycle. Baseline pgrep, launchClaude,
wait for daemon to spawn, app.close(), assert daemon is gone.
Soft-skips on rows where the daemon isn't gated to spawn (most
default builds today).

== claude.ai renderer UI domain wrapper ==

New `lib/claudeai.ts` centralizes renderer-DOM discovery for
claude.ai UI patterns. Same shape as `lib/quickentry.ts` —
domain class with discovery-by-shape, atom helpers, idempotent
mocks. Exports:

  - activateTab(name) — clicks Chat/Cowork/Code df-pill
  - installOpenDialogMock + getOpenDialogCalls — idempotent
    dialog.showOpenDialog mock + recorded calls
  - findCompactPills, openPill, clickMenuItem, pressEscape —
    atoms shared by future page objects
  - class CodeTab — activate(), openEnvPill(), selectLocal(),
    openFolderPicker() (full chain)

Discovery is by structural fingerprint, not Tailwind classes
(those rebuild). Probed against a live debugger to confirm:
df-pill has exactly 3 instances (Chat/Cowork/Code), and compact-pill
distinguishes the env pill (max-w-[200px]) from the Select-folder
pill (max-w-[160px]); same component shape, different label widths.

T17 refactored to use the new lib — went from ~470 lines of
inline DOM walking to ~70 lines of intent. When claude.ai
re-renders the Code tab, the fix is one file over, not per-spec.

== Library brittleness fixes ==

`lib/quickentry.ts`:
  - getStoredPosition rewritten to read configDir/Claude/config.json
    directly via electron-store's known JSON shape. Replaces a
    fragile globalThis-walk that matched any object with .get/.set
    returning a quickWindowPosition value.
  - LOGIN_URL_RE anchored: `^https?://[^/]+/(login|auth|sign[-_]?in)
    (?:[/?#]|$)`. Previous unanchored form would match
    /oauth/callback as still-on-login.
  - Dropped dead `skipTaskbar: false` field from
    getPopupRuntimeProps return shape (no caller used it; the
    hardcoded false was misleading).
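
The effect of anchoring LOGIN_URL_RE can be sketched as follows. The anchored pattern is the one quoted above; the unanchored pattern and the sample URLs are assumptions for illustration, since the previous form is not shown here.

```typescript
// Anchored form quoted in the commit message above.
const LOGIN_URL_RE = /^https?:\/\/[^\/]+\/(login|auth|sign[-_]?in)(?:[\/?#]|$)/;

// Hypothetical stand-in for the previous unanchored pattern.
const UNANCHORED_RE = /(login|auth|sign[-_]?in)/;

// Illustrative URLs, not captured from a real session.
const callbackUrl = "https://claude.ai/oauth/callback";
const loginUrl = "https://claude.ai/login?returnTo=%2F";

const results = {
  // "oauth" contains "auth", so the unanchored form reports still-on-login.
  unanchoredOnCallback: UNANCHORED_RE.test(callbackUrl),
  // The anchored form requires the first path segment to be the login word.
  anchoredOnCallback: LOGIN_URL_RE.test(callbackUrl),
  anchoredOnLogin: LOGIN_URL_RE.test(loginUrl),
};
```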

`lib/inspector.ts`:
  - InspectorClient.close() is now idempotent — second close is a
    no-op. Both the runners and the electron.ts auto-close path can
    safely invoke it.
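
A minimal sketch of the idempotent-close pattern, assuming a boolean guard. ClosableClient and teardownCount are hypothetical stand-ins; the real InspectorClient wraps a CDP connection.

```typescript
// Hypothetical stand-in for a client that owns a connection.
class ClosableClient {
  private closed = false;
  teardownCount = 0; // exposed only so the sketch can be checked

  close(): void {
    if (this.closed) return; // second and later calls are no-ops
    this.closed = true;
    this.teardownCount += 1; // real code would close the socket here
  }
}

const client = new ClosableClient();
client.close(); // explicit close inside a runner
client.close(); // auto-close from the app teardown path
```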

`lib/electron.ts`:
  - ClaudeApp tracks the attached inspector internally; app.close()
    auto-closes it (existing inline inspector.close() calls in
    runners stay working idempotently).
  - Module-level activeLaunches set + signal handlers ensure
    Ctrl-C during a sweep kills tracked Electron pids and rms
    isolation tmpdirs before re-emitting the signal.
  - app.lastExitInfo: { code, signal } | null exposes non-zero
    exit info post-close. Runners can attach when nonzero;
    nothing breaks when ignored.

== Config + orchestrator ==

`playwright.config.ts`:
  - retries: process.env.CI ? 1 : 0 (one retry in CI to absorb
    compositor flake; local stays at 0 so flakes surface).
  - forbidOnly: !!process.env.CI prevents stray test.only from
    sneaking through CI.
  - /// <reference types="node" /> for `process.env` access (the
    file isn't covered by tsconfig.json's `src/**/*` include).

`orchestrator/sweep.sh`:
  - Replaces the four `grep -oP ... | head -1` lines (which read
    only the first <testsuite> element) with a Node-based summary
    that sums tests/failures/errors/skipped across every suite.
  - Wrapped in `command -v node` guard with the legacy grep
    fallback retained inline.
  - Output line is byte-identical for downstream consumers.
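
The summing step could be sketched as below. This is a regex-based illustration in TypeScript, not the script embedded in sweep.sh; sumJUnit and the sample XML are hypothetical.

```typescript
interface Totals {
  tests: number;
  failures: number;
  errors: number;
  skipped: number;
}

// Sums the four counters across every <testsuite> element.
function sumJUnit(xml: string): Totals {
  const totals: Totals = { tests: 0, failures: 0, errors: 0, skipped: 0 };
  const suiteRe = /<testsuite\b([^>]*)>/g; // \b keeps <testsuites> out
  let suite: RegExpExecArray | null;
  while ((suite = suiteRe.exec(xml)) !== null) {
    const attrs = suite[1];
    for (const key of Object.keys(totals) as (keyof Totals)[]) {
      const hit = new RegExp(`\\s${key}="(\\d+)"`).exec(attrs);
      if (hit) totals[key] += Number(hit[1]);
    }
  }
  return totals;
}

// Illustrative report with two suites; the root element's count is ignored.
const sample = `<testsuites tests="999">
  <testsuite name="a" tests="3" failures="1" errors="0" skipped="1"></testsuite>
  <testsuite name="b" tests="4" failures="0" errors="1" skipped="2"></testsuite>
</testsuites>`;
const totals = sumJUnit(sample);
```

Reading only the first suite would report 3 tests here; summing across suites reports 7.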

== Cleanup + docs ==

  - README.md status table updated: 20 specs, 13 pass on KDE-W,
    six skip cleanly per spec intent. T17 row reflects the new
    end-to-end click chain.
  - lib/claudeai.ts and probe.ts added to the Layout section.
  - Deleted _investigate_t17_urls.spec.ts (one-off diagnostic
    that confirmed T17's /login was a fresh-isolation auth
    miss, not a webContents race).
  - Kept probe.ts as the seed for the explore CLI in the
    upcoming UI-mapping plan.

== UI mapping plan ==

`docs/testing/claudeai-ui-mapping-plan.md` — executable plan
for systematically mapping claude.ai's renderer UI into reusable
test-harness abstractions. Three layers: shape-based atoms,
page objects per major surface, discovery tooling. Phase 1
(explore CLI with snapshot/diff) and Phase 2 (UI map markdown)
are independent and can run in parallel; Phase 5 (drift
detection H05) depends on Phase 1.

== Validation ==

KDE-W sweep: 13 pass, 6 cleanly skip, 0 fail. 2.7 min total.
T17 verified end-to-end via the env-pill chain after refactor.
npx tsc --noEmit clean across all changes.

---
Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
70% AI / 30% Human
Claude: dispatched five parallel agents per file partition (libs / runners batch 1 / runners batch 2 / new H-tests / config), wrote the claudeai.ts extraction agent brief informed by live-debugger probe evidence, drafted the UI mapping plan
Human: scoped which improvements to make, called out skip vs fail edges (S34 KDE-strict / GNOME-fixme), shared live-renderer DOM dumps that ground-truthed T17's click chain (Code df-pill → env pill → Local → Select folder → Open folder), validated each step
Switches the inventory walker from a renderer-side
document.querySelectorAll IIFE to Chromium's accessibility tree
(Accessibility.getFullAXTree over CDP). Account-portable element
identification via ariaPath + role + AX-computed name; click path
moves to backendDOMNodeId via DOM.resolveNode + Runtime.callFunctionOn.

Walker (explore/walker.ts):
- snapshotSurface consumes AX nodes via axTreeToSnapshot
- waitForAxTreeStable gates seed snapshot, post-navigation snapshot,
  and every snapshotSurface call (Accessibility.enable lag is async;
  first read on a cold load returns 4 nodes vs 800+ when settled)
- redrivePath uses location.reload() instead of navigateTo to discard
  any state prior drills left in the SPA (open dialog, expanded
  sidebar, scrolled focus)
- captureFingerprint's isListRowChild extended: button + group
  ancestors, plus a sibling-count fallback (>=15 same-role siblings)
  for claude.ai's flat marketplace dialogs and complementary sidebar
- step 3 (positional) skipped for list-row children so they collapse
  via step 4's instance shape
- MAX_CONSECUTIVE_LOOKUP_FAILURES bumped 25 -> 75 for sidebar
  virtualization noise (timeout counter still gates real wedges)
- RawElement / RawAncestor reshaped: tagName / role / ariaLabel /
  textContent / dataState / parentChainSignature / ancestorAriaLabel
  dropped; backendDOMNodeId added; accessibleName is sole name source
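
The stability gate can be pictured as "poll the node count until it stops changing". The sketch below is an assumption about the general technique, not the walker's waitForAxTreeStable; StabilityGate, waitForStableCount, and the default thresholds are hypothetical.

```typescript
// Pure detector: reports stable once `required` consecutive reads agree.
class StabilityGate {
  private required: number;
  private last = -1;
  private streak = 0;

  constructor(required: number) {
    this.required = required;
  }

  observe(count: number): boolean {
    this.streak = count === this.last ? this.streak + 1 : 1;
    this.last = count;
    return this.streak >= this.required;
  }
}

// Polls `readCount` until the gate reports stable or the deadline passes.
async function waitForStableCount(
  readCount: () => Promise<number>,
  opts = { required: 3, intervalMs: 100, timeoutMs: 5000 },
): Promise<number> {
  const gate = new StabilityGate(opts.required);
  const deadline = Date.now() + opts.timeoutMs;
  while (Date.now() < deadline) {
    const count = await readCount();
    if (gate.observe(count)) return count;
    await new Promise((resolve) => setTimeout(resolve, opts.intervalMs));
  }
  throw new Error("AX tree did not settle before the timeout");
}
```

In this sketch `readCount` would wrap a call that fetches the AX tree and returns its node count. A tree that sits at 4 nodes for several reads would also pass, so a real gate may want a minimum-size floor as well.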

Inspector (src/lib/inspector.ts):
- AxNode interface published
- clickByBackendNodeId: DOM.resolveNode + Runtime.callFunctionOn
  (replaces selector-based click reconstruction)

Name classifier (src/lib/name-classifier.ts):
- cowork-session shape regex (Idle|Ready|Awaiting input|...)
- row-more-options shape regex (^More options for )

Isolation (src/lib/isolation.ts):
- seedFromHost option: kill host Claude, copy auth-relevant subset of
  ~/.config/Claude into per-launch tmpdir for U01 / H05

Driver (explore/walk-isolated.ts):
- Replaces explore walk for safe walks: launches Claude inside the
  test-harness isolation rather than mutating the host profile

Runners:
- H05_ui_drift_check.spec.ts (claude.ai UI drift detection)
- U01_ui_visibility.spec.ts (placeholder stub; regenerated post-walk)

Self-test fixtures rewritten as synthetic AxNode trees fed through
axTreeToSnapshot; existing 7 plan-example traces produce identical
idTailFromFingerprint outputs.

Co-Authored-By: Claude <claude@anthropic.com>
Plan (docs/testing/fingerprint-v7-plan.md):
- Adds "Live-walk shakedown (post-Phase 2)" subsection enumerating
  the five real bugs the first end-to-end walks surfaced and their
  fixes (AX-stable gate, reload vs navigate, sibling-count list
  heuristic, two new instance shapes, threshold bump)
- Resolves three open questions with first-clean-walk data: CDP cost
  is not a bottleneck (817-node tree settles <1s), role overrides
  work as intended (Skip to content captured as link), no
  account-bound kind needed (existing pattern + heuristic + collapse
  cover the observed cases)
- Cross-references for walk-isolated.ts and clickByBackendNodeId

Learnings (docs/learnings/test-harness-ax-tree-walker.md):
- Five non-obvious AX-tree traps with symptoms + fixes:
  Accessibility.enable async lag, navigateTo no-op carrying state,
  claude.ai's flat dialog/complementary lists, per-row "More options
  for X" trigger needing its own shape, sidebar virtualization vs
  the lookup-failure threshold
- Closing note on driver choice (walk-isolated.ts over explore walk)

Prompts (docs/testing/fingerprint-v7-*-prompt.md):
- implementation-prompt: original v7 walker rewrite prompt
- ax-migration-prompt: DOM-walk -> AX-tree substrate migration prompt
- runners-prompt: NEW. Self-contained prompt for next session to wire
  U01 against the fresh inventory and iterate autonomously to a
  clean pass/drift/fail baseline

CLAUDE.md: link the new learnings doc

Inventory artifacts:
- ui-inventory.json + ui-inventory.meta.json: 90-entry inventory
  captured against claude.ai/epitaxy on app 1.5354.0 via
  walk-isolated.ts seedFromHost path. Marketplace dialog folded to
  single button-instance+704; cowork sidebar to button-instance+72;
  search history to option-instance+25
- ui-vocabulary.json: stable/suspect name corpus derived from prior
  walk
- ui-inventory-reconciliation.md: v6-era reconciliation notes
- ui-snapshots/{README.md,.gitkeep}: snapshots dir scaffold (JSON
  contents gitignored to avoid diff churn)

claudeai-ui-map.md: human-readable map of the inventory's reachable
surfaces

Matrix (docs/testing/matrix.md): U01 row added; entry-count phrasing
generalized so it doesn't go stale on each re-walk

Co-Authored-By: Claude <claude@anthropic.com>
U01 was a placeholder skipping with "v7 cutover — re-walk required";
the v7 walker has shipped a fresh inventory, so regenerate the spec
and land two resolver fixes the live sweep surfaced.

`findByFingerprint`: the strictness gate only consulted `kind`, so
entries with `kind: persistent` + `classification: instance` (the
post-walk persistent-collapse promotes degenerate-shaped fingerprints
when they appear on ≥3 surfaces) failed with "expected exactly one
match, got N". The fingerprint's own degenerate-shape claim should
win — defer to `classification === 'instance'` too.
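
A sketch of the widened gate, under the assumption that the old check relaxed uniqueness only for `kind === 'instance'`. InventoryEntry, its value unions, and resolveMatch are hypothetical; only `kind: persistent` and `classification: instance` come from the description above.

```typescript
// Hypothetical entry shape; only the two fields the gate reads are modeled.
interface InventoryEntry {
  kind: "persistent" | "instance" | "other";
  classification: "unique" | "instance";
}

// Strict (exactly one match) unless either field marks the entry as an instance shape.
function expectsExactlyOneMatch(entry: InventoryEntry): boolean {
  return entry.kind !== "instance" && entry.classification !== "instance";
}

function resolveMatch<T>(entry: InventoryEntry, matches: T[]): T {
  if (matches.length === 0) throw new Error("no match");
  if (expectsExactlyOneMatch(entry) && matches.length !== 1) {
    throw new Error(`expected exactly one match, got ${matches.length}`);
  }
  return matches[0]; // instance shapes accept the first of several matches
}

const collapsed: InventoryEntry = { kind: "persistent", classification: "instance" };
const picked = resolveMatch(collapsed, ["row-1", "row-2", "row-3"]);
```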

`redrivePath`: the dangling `startUrl` parameter was the smoking
gun. After a prior test drilled into a deeper URL (e.g.
/settings/customize), `location.reload()` reloaded the deep URL
instead of returning to startUrl, and the next test's first
`clickById` saw a contaminated surface. Navigate to startUrl when
currentUrl has drifted; reload only when already at startUrl.
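
The navigate-or-reload decision can be sketched as a pure function. chooseRedriveAction, the URL normalization, and the sample URLs are assumptions for illustration; the real redrivePath may compare URLs differently.

```typescript
type RedriveAction = { type: "navigate"; url: string } | { type: "reload" };

// Compares origin + path + query, ignoring trailing slashes and the fragment.
function normalizeUrl(raw: string): string {
  const url = new URL(raw);
  const path = url.pathname.replace(/\/+$/, "") || "/";
  return url.origin + path + url.search;
}

function chooseRedriveAction(currentUrl: string, startUrl: string): RedriveAction {
  return normalizeUrl(currentUrl) === normalizeUrl(startUrl)
    ? { type: "reload" } // already at the start surface: reload discards SPA state
    : { type: "navigate", url: startUrl }; // drifted: go back to the start first
}

// Illustrative URLs; only the /settings/customize path appears in the text above.
const startUrl = "https://claude.ai/new";
const drifted = chooseRedriveAction("https://claude.ai/settings/customize", startUrl);
const inPlace = chooseRedriveAction("https://claude.ai/new/", startUrl);
```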

Sweep results across three runs: 73/17 → 89/1 → 89/1, with the
single failure being non-deterministic (different test each sweep,
both consistent with focus-management transients and sidebar
virtualization documented in docs/learnings/test-harness-ax-tree-walker.md).

Generator gate inverted to make the safe-by-default path
(seedFromHost: true) trigger when the env var is unset, mirroring
H05's pattern but with the seed lifted from the host config.

Co-Authored-By: Claude <claude@anthropic.com>
…migration

The three v7 handoff prompts (vocabulary scaffold, AX-tree
substrate migration, U-prefix runner wire-up) have all been
implemented and merged. Retire them — the design contract still
lives in fingerprint-v7-plan.md; the per-iteration prompts were
single-use scaffolding for fresh sessions.

Add claudeai-lib-ax-migration-prompt.md as the next-iteration
handoff: tools/test-harness/src/lib/claudeai.ts is still on the
old substrate (document.querySelector against minified-tailwind
shapes) and is the highest-payoff target for the v7 plan's "design
goal §2: Resilient to cosmetic drift". The prompt mirrors the
prior handoffs' structure (authoritative refs, code anchors,
phases, self-correction loop, termination conditions, final report
format) and scopes the spike at openPill before fanning out to
the rest of the file.

Co-Authored-By: Claude <claude@anthropic.com>
Replace every CSS-shape walk in lib/claudeai.ts with AX-tree queries
sourced from Chromium's Accessibility.getFullAXTree. Discovery now
reads role + accessibleName + hasPopup from the same substrate the v7
walker uses, dropping the brittle button[aria-haspopup=menu] +
span.truncate.max-w-[Npx] coupling that was the recurring break point
on every upstream tailwind regen.

Substrate change:
- inspector.ts: surface AxValue + AxProperty types; explicit
  properties? on AxNode so consumers can read state tokens.
- walker.ts: export RawElement, add hasPopup field, populate via
  readHasPopup() reading node.properties[].name === 'hasPopup'.
- selfTest Case 10 covers menu / 'false' / absent values.

Page-object migration (lib/claudeai.ts):
- snapshotAx() helper gates on waitForAxTreeStable by default
  (post-userLoaded the first AX read can return ~4 nodes — see
  docs/learnings/test-harness-ax-tree-walker.md §1).
- Polling loops in openPill (post-click) + clickMenuItem gate once
  upfront, then poll with { fast: true } so per-iteration stability
  re-checks don't fight the menuitem-appear poll.
- activateTab matches role:'button' + literal accessibleName.
- findCompactPills filters by role:'button' + hasPopup === 'menu',
  drops cowork sidebar via /^More options for / exclusion. Drops
  CompactPill.maxW field (tailwind artifact, only ever in error
  messages).
- openPill / clickMenuItem use clickByBackendNodeId for the click
  path — same backend-id flow the walker uses.
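
The discrimination described above could be sketched like this. The interfaces follow the general shape of CDP AX nodes, trimmed to the fields used; the function bodies and sample nodes are assumptions, not the code in lib/claudeai.ts.

```typescript
interface AxValue { type: string; value?: unknown }
interface AxProperty { name: string; value: AxValue }
interface AxNode {
  role?: AxValue;
  name?: AxValue;
  properties?: AxProperty[];
  backendDOMNodeId?: number;
}

// CDP exposes the property as 'hasPopup' (camelCase), not 'haspopup'.
function readHasPopup(node: AxNode): string | undefined {
  const prop = node.properties?.find((p) => p.name === "hasPopup");
  const value = prop?.value.value;
  return typeof value === "string" ? value : undefined;
}

const ROW_MORE_OPTIONS_RE = /^More options for /;

// Buttons that open a menu, minus the per-row sidebar triggers.
function findCompactPills(nodes: AxNode[]): AxNode[] {
  return nodes.filter(
    (node) =>
      node.role?.value === "button" &&
      readHasPopup(node) === "menu" &&
      !ROW_MORE_OPTIONS_RE.test(String(node.name?.value ?? "")),
  );
}

// Small helper to build illustrative nodes.
function button(name: string, popup?: string): AxNode {
  return {
    role: { type: "role", value: "button" },
    name: { type: "computedString", value: name },
    properties: popup ? [{ name: "hasPopup", value: { type: "token", value: popup } }] : [],
  };
}

const pills = findCompactPills([
  button("Local", "menu"),
  button("More options for Example session", "menu"),
  button("Open settings", "dialog"),
  button("Send"),
]);
```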

Live probe (explore/probe-claudeai-ax.ts) confirmed the discrimination
shapes against the host renderer — found 49 buttons with hasPopup
(48 menu, 1 dialog), env pill 'Local' resolves under main >
region[Primary pane], 37 cowork sidebar triggers correctly excluded
by the row-more-options filter. Caught one bug along the way: CDP
exposes the property as 'hasPopup' (camelCase), not 'haspopup' — the
synthetic selfTest fixture used the wrong casing too, so both sides
agreed on the wrong contract until the live probe surfaced it.

T17_folder_picker passes on KDE-W with CLAUDE_TEST_USE_HOST_CONFIG=1.

Co-Authored-By: Claude <claude@anthropic.com>
The 90-test U01 sweep was wired against an account-specific v7
inventory snapshot; running it during routine sweeps fired noise
against unrelated drift. The spec is auto-generated from the v7
inventory via npm run gen:render-specs, so this is a soft delete —
regenerate any time a fresh inventory walk lands.

Co-Authored-By: Claude <claude@anthropic.com>
Adds the implementation prompt for the next session: spawn one
subagent per file in docs/testing/cases/, have each one cross-check
its tests against the extracted Claude Desktop source under
build-reference/app-extracted/, and edit in place to add code
anchors / mark drift / flag missing features. Mirrors the
structure of the already-retired claudeai-lib-ax-migration-prompt.md
so the workflow is consistent.

Triggered by the AX migration validation surfacing how easily case
docs drift from upstream — the test author's "click X menu" can
silently diverge from upstream's actual labels two versions later,
and the failure looks like a Linux compat issue when it's really a
doc-vs-source drift.

Co-Authored-By: Claude <claude@anthropic.com>
Static anchor sweep: each test in docs/testing/cases/*.md now points at
the upstream code (or wrapper script) backing its load-bearing claim,
so the next sweep can tell "Linux compat regression" apart from "case
doc drifted while we weren't looking."

- 75 tests across 10 files reviewed
- 63 grounded with code anchors (index.js:N, scripts/*.sh:N)
- 9 drifted Steps/Expected corrected against actual upstream behavior
- 2 marked Missing in build (S12 Wayland portal flag, S26 auto-update)
- 1 flagged Ambiguous (T39 /desktop is a CLI surface, not Electron asar)

Notable corrections:
- T05: scheme is claude://, not https:// (project never registers
  x-scheme-handler/https; old spec was always going to fail on Linux)
- T15: sign-in is in-app loadURL into mainView, not xdg-open handoff
- T18: drag-attach uses webUtils.getPathForFile, not text/uri-list MIME
- T20: file conflict check is sha256-based, not mtime-based
- T22: gh-install path is macOS/brew-only on Linux/Windows
- T30: PR-close auto-archive wait is ~5-6 min (5m setInterval + 30s
  startup + 1h non-terminal cooldown), not "~1 minute"
- T14: PR #536 is closed/docs-only — no in-tree multi-instance flag

Inventory anchors added for renderer-side surfaces present in the
idle-state v7 capture (T16 Code tab, T17 select-folder, T26 Routines,
T11/T33 plugin nav). Surfaces inside modals/popups (T22 toolbar, T25
Show-in-Files context menu, T31 side chat, T32 slash menu) are flagged
for re-capture with the surface open.

S26 finding worth follow-up: autoUpdater gate is structurally open on
Linux when packaged (lii() at index.js:508761-508774 returns true with
ELECTRON_FORCE_IS_PACKAGED=true from launcher-common.sh:249) — saved
from real download attempts only by Electron's Linux autoUpdater being
unimplemented.

T07/S13 reference WCO-shim files that exist on main (PR #538 merged
2026-05-01) but not on this branch (docs/compat-matrix forked earlier);
anchors point at main: with explicit caveats.

Co-Authored-By: Claude <claude@anthropic.com>
Static greps against the 546k-line beautified bundle have known blind
spots — lazy require()s, dynamic handler tables, conditional wiring.
This probe connects to a running Claude Desktop via the existing
InspectorClient (port 9229, opened by launchClaude's SIGUSR1 path) and
dumps runtime state keyed by test-ID into a JSON the next grounding
sweep can diff across upstream versions.

Captures:
- App metadata (version, isPackaged, ready state)
- Full IPC handler registry (invoke + on channels)
- WebContents inventory (URLs, types)
- globalShortcut.isRegistered() for known accelerators
- app.getLoginItemSettings() (autostart resolution)
- safeStorage availability + backend (libsecret on Linux)
- autoUpdater.getFeedURL() — empirical answer to the S26 structural-
  open claim that static analysis couldn't resolve
- Notification.isSupported()

Read-only / non-destructive; observes API state, never clicks UI or
fires shortcuts. Records explicit gaps[] for surfaces it can't reach
from idle (S20 powerSaveBlocker enumeration; T22/T31/T32 contextual
renderer surfaces; T39 CLI binary).

Run: cd tools/test-harness && npm run grounding-probe
Output: /tmp/grounding-probe.json (override with --out PATH)

Co-Authored-By: Claude <claude@anthropic.com>
Two extensions to the grounding probe, each closing a gap I flagged on
the first cut:

- --launch: spins up a fresh isolated instance via launchClaude(),
  waits for 'mainVisible' (cheapest level that returns the inspector),
  captures, tears down. Default still attaches to an already-running
  app on port 9229; --launch is the self-contained / CI-usable path.

- --include-synthetic + S20 powerSaveBlocker probe: starts a blocker,
  reads isStarted, stops immediately. Brief inhibit (~ms). Read-only by
  default — synthetic state changes are opt-in. Doesn't verify the
  case-doc claim that keepAwakeEnabled toggles trigger this; that needs
  correlating settings IO with the `PhA` Set at index.js:241897, which
  depends on minified-name stability. Left to the next sweep.

Argv parser rewritten to handle bare flags (--launch, --include-synthetic)
alongside key/value pairs (--port 9229, --out PATH).
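
A parser handling both forms could look roughly like the sketch below. parseArgv and its error messages are hypothetical; only the flag names and the `--port` / `--out` examples come from the text above.

```typescript
// Flags that take no value; everything else starting with -- expects one.
const BARE_FLAGS = new Set(["--launch", "--include-synthetic"]);

interface ParsedArgs {
  flags: Set<string>;
  values: Record<string, string>;
}

function parseArgv(argv: string[]): ParsedArgs {
  const flags = new Set<string>();
  const values: Record<string, string> = {};
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (!arg.startsWith("--")) throw new Error(`unexpected argument: ${arg}`);
    if (BARE_FLAGS.has(arg)) {
      flags.add(arg.slice(2));
      continue;
    }
    const value = argv[i + 1];
    if (value === undefined || value.startsWith("--")) {
      throw new Error(`missing value for ${arg}`);
    }
    values[arg.slice(2)] = value;
    i++; // skip the value that was just consumed
  }
  return { flags, values };
}

const parsed = parseArgv(["--launch", "--port", "9229", "--out", "/tmp/grounding-probe.json"]);
```

Keeping the bare flags in an explicit set means a key that lost its value fails loudly instead of swallowing the next flag as its value.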

Co-Authored-By: Claude <claude@anthropic.com>
…els, SNI

Closes the bulk of the remaining gaps from the last cut:

- AX fingerprint of the current claude.ai webContents (role+name+
  hasPopup, reduced form). Stored once at the top level; per-test
  entries for T22/T26/T31/T32 reference it via { axFingerprintRef }.
  Captures whatever surface is on screen at probe time, so the user
  opens the slash menu / side chat / routines modal / PR toolbar
  before re-running to anchor those surfaces.

- Editor handoff IPC channels (T24/T38). Static anchor is `Mtt` at
  index.js:463902 — variable name is minified, so we match handlers
  by /external|editor|openIn/i name pattern instead. Sufficient to
  diff across upstream versions (renames will surface as removed
  channels with similar replacements).

- SNI / tray registration (T03). `findItemByPid()` from sni.ts
  attributes a registered StatusNotifierItem to our pid. dbus-next is loaded
  via dynamic import so non-DBus environments (CI containers without a
  session bus) still get a partial probe rather than a hard fail.

Reduced gaps[] to just T39 (CLI surface, out-of-scope) and the
optional opt-outs (powerSaveBlocker without --include-synthetic;
empty AX fingerprint when claude.ai isn't loaded yet).

Co-Authored-By: Claude <claude@anthropic.com>
Branch was rebased onto main; scripts/wco-shim.js + scripts/patches/
wco-shim.sh are now on this branch via PR #538. The "lives on main, not
yet on docs/compat-matrix" notes the grounding subagent added are no
longer accurate — anchors point at files that exist locally.

Co-Authored-By: Claude <claude@anthropic.com>
@aaddrick force-pushed the docs/compat-matrix branch from 2f2134d to ade75d7 on May 3, 2026 11:57
aaddrick and others added 4 commits May 3, 2026 08:00
Folds the conventions the grounding sweep landed into the README so
future authors and sweeps work from the same shape. Adds:

- **Code anchors:** field — `<file>:<line>` pointers to where the
  load-bearing claim is implemented.
- **Inventory anchor:** field — optional, for surfaces present in
  the v7 walker's idle capture.
- "Anchor scope" section codifying the four buckets (upstream code,
  wrapper, server-rendered SPA, CLI binary) and where to anchor each.
- "Drift markers" section codifying the Drifted / Missing / Ambiguous
  classifications the sweep already uses.

No content changes to existing case files — they already follow these
conventions in practice; the README now documents them.

Co-Authored-By: Claude <claude@anthropic.com>
…nd runs

Adds a top-level harness flag that flips every launchClaude() spawn from
the default X11-via-XWayland backend to native Wayland, so the full
suite can run under Wayland with a single env var instead of per-spec
plumbing.

Implementation mirrors scripts/launcher-common.sh:132-139:
- Renames LAUNCHER_INJECTED_FLAGS to LAUNCHER_INJECTED_FLAGS_X11 and
  adds LAUNCHER_INJECTED_FLAGS_WAYLAND with the launcher's Wayland
  flag set (UseOzonePlatform, WaylandWindowDecorations, ozone-platform,
  wayland-ime, wayland-text-input-version=3).
- harnessUseWayland() reads CLAUDE_HARNESS_USE_WAYLAND.
- launchClaude() picks the flag set, adds CLAUDE_USE_WAYLAND=1 and
  GDK_BACKEND=wayland to the spawn env. Spread order keeps caller-
  supplied extraEnv winning, so a single test can still opt back to X11
  inside a Wayland-mode sweep.
- sweep.sh advertises the mode on stderr.
- README documents the var + the npm-test recipe.
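
The spread-order point can be sketched as follows. buildSpawnEnv is a hypothetical helper, and treating `"1"` as the only enabling value is an assumption; the variable names are the ones listed above.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: the flag counts as set only when the value is exactly "1".
function harnessUseWayland(env: Env): boolean {
  return env.CLAUDE_HARNESS_USE_WAYLAND === "1";
}

// Later spreads win, so caller-supplied extraEnv overrides the Wayland defaults.
function buildSpawnEnv(base: Env, extraEnv: Env = {}): Env {
  const waylandEnv: Env = harnessUseWayland(base)
    ? { CLAUDE_USE_WAYLAND: "1", GDK_BACKEND: "wayland" }
    : {};
  return { ...base, ...waylandEnv, ...extraEnv };
}

const sweepEnv: Env = { CLAUDE_HARNESS_USE_WAYLAND: "1" };
const waylandRun = buildSpawnEnv(sweepEnv);
const optedBackToX11 = buildSpawnEnv(sweepEnv, { GDK_BACKEND: "x11" });
const defaultRun = buildSpawnEnv({});
```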

Default unchanged: every runner still gets X11. The flag opts in.

Verification (live): CLAUDE_HARNESS_USE_WAYLAND=1 npx playwright test
src/runners/T17_folder_picker.spec.ts, then while the app is up confirm
--ozone-platform=wayland is on argv via /proc/<pid>/cmdline. The
harness spawns Electron directly (CDP-gate workaround at electron.ts:
102), so launcher-common.sh isn't sourced and ~/.cache/claude-desktop-
debian/launcher.log is not written by harness runs.

Co-Authored-By: Claude <claude@anthropic.com>
The action items from the last few sessions (case-doc grounding,
runtime probe, autoUpdater issue, Wayland-mode runs) needed pointers
across the testing docs so the next contributor isn't reverse-
engineering them from git log.

- docs/testing/README.md — bump date, surface grounding sweep + probe
  in the automation-status section, fix the test corpus snapshot
  (S-tests went from 28 to 37 since this was last counted).
- docs/testing/runbook.md — add "Grounding sweep" section (static
  pass + runtime pass) alongside the existing test sweep, document
  the Wayland-mode sweep recipe, link upstream-bump trigger to it.
- tools/test-harness/README.md — add grounding-probe.ts to the
  layout, a Run-section recipe, and a dedicated "Grounding probe"
  section explaining when to reach for it vs the static grep.
- docs/testing/cases/distribution.md — link S26 to issue #567
  (autoUpdater no-op tracking), now that the bug is filed.

Co-Authored-By: Claude <claude@anthropic.com>
Counterpart to docs/testing/cases-grounding-prompt.md — a fan-out
prompt for the workstream of wiring runners against the 61 of 76
tests that don't have one yet.

Structured the same way as the grounding prompt: Phase 0 calibration,
Phase 1 triage subagent producing a tiered plan
(docs/testing/runner-implementation-plan.md), Phase 2/3 fan-out per
test in Tiers 1-2, Phase 4 synthesis. Tier 3 (renderer-heavy /
login-required) deferred to follow-up sessions; Tier 4 (CLI binary,
issue-gated, env-blocked) marked out of scope with reasons.

Constraints flag the known landmines: CDP gate workaround, the
BrowserWindow Proxy gotcha, default isolation + escape hatches,
ydotool prereqs, skipUnlessRow as the first line of every spec.
"Don't ship stubs" called out explicitly so a session that hits a
blocker reports it instead of leaving placeholder runners that pass
trivially.

Realistic next-session goal: 13-16 new runners (Tier 1 + as much
Tier 2 as fits), bumping coverage from 15/76 (20%) to ~30/76 (40%).
Future sessions handle the renderer-heavy Tier 3 once they have a
session-time budget and host claude.ai login.

Co-Authored-By: Claude <claude@anthropic.com>
aaddrick and others added 9 commits May 3, 2026 14:41
- runDoctor() now returns {output, exitCode} so T02/T13/S05 can
  assert against the doctor exit code (was string-only, swallowed
  the code).
- MainWindow.setState() accepts 'close' and calls win.close() so T08
  exercises frame-fix-wrapper.js:178-185 (the close-to-tray
  interceptor) — distinct from 'hide' which would bypass the
  wrapper.
- Add docs/testing/runner-implementation-plan.md: tiered triage of
  the 61 missing runners with execution-time reclassifications
  (T05 → Tier 3 delivery, T07 → Tier 2 via seedFromHost, T14 split
  into a/b, S20 deferred via #569).
- Refresh T13/S05 case-doc anchors: scripts/doctor.sh:290-299 →
  :353-362 (file edited since the anchor was written).
- Update test-harness README status to reflect the post-batch spec
  inventory and link to the plan doc.

Co-Authored-By: Claude <claude@anthropic.com>
Each runner is independent of the others and matches one case-doc
test ID. Pure file probes (asar fingerprints, source-tree grep) and
short-lived spawn probes; no app launch needed.

Specs landed:

- T02 — claude-desktop --doctor exit code is 0
- T11 — plugin install code path fingerprints (installPlugin log,
  installed_plugins.json) present in bundled index.js
- T13 — --doctor does not false-flag rpm/deb installs as
  missing-dpkg AppImage
- T14a — requestSingleInstanceLock + 'second-instance' strings in
  bundle (T14b runtime probe lands separately)
- S01 — AppImage launches without libfuse.so.2 complaint (skips
  cleanly on non-AppImage rows)
- S02 — no strict == equality against XDG_CURRENT_DESKTOP in
  launcher / patches (regression detector)
- S03 — dpkg-query Depends: field non-empty (currently fails as
  upstream-contract regression detector — deb.sh:185-197 emits no
  Depends: line)
- S04 — rpm -qR has at least one non-rpmlib(...) requirement
  (currently fails — rpm.sh:188 has AutoReqProv: no, no manual
  Requires:)
- S05 — doctor does not false-flag rpm-installed package
- S08 — KDE tray-rebuild fast-path (.setImage(...createFromPath...))
  injected by tray.sh:212-217
- S15 — AppImage --appimage-extract fallback exits 0; squashfs-root/
  AppRun --version runs without FUSE error
- S16 — AppImage mount(8) entry appears post-launch and clears
  within ~10s of close
- S21 — no handle-lid-switch / HandleLidSwitch strings in bundle
  (lid policy deferred to OS)
- S22 — new Set(["darwin","win32"]) computer-use platform gate
  present, no 2-element Set pairing linux (file-probe form)
- S26 — setFeedURL present + project suppression marker absent
  (currently fails — gated on #567 auto-update suppression patch)
- S27 — installed_plugins.json + homedir resolver present, no
  */plugins system paths in bundle

Three specs are intentional regression detectors — they ship "red"
today (S03, S04, S26) because the upstream contract isn't yet met.
Each error message names the upstream defect or issue so matrix-regen
surfaces them as actionable cells.

Co-Authored-By: Claude <claude@anthropic.com>
Single launchClaude() + inspector + Electron-API or window-state
assertion. Each runner asserts a contract that requires the app to
actually be running.

Specs landed:

- T05 — claude:// URL delivers via app.on('second-instance')
  (Tier 3 delivery probe: xdg-open fires the URL, the running app's
  hook captures it). Uses isolation: null because the SingletonLock
  collision must route to the same user-data dir.
- T06 — globalShortcut.isRegistered('Ctrl+Alt+Space') returns true
  after waitForReady('mainVisible')
- T07 — five topbar buttons render with non-zero rects. First spec
  to exercise createIsolation({ seedFromHost: true }) — kills host
  Claude, copies auth allowlist (Cookies, Local State, Local Storage,
  IndexedDB, etc.) into per-test tmpdir, runs hermetically against
  signed-in account, tmpdir destroyed on close.
- T08 — MainWindow.setState('close') fires the wrapper's close
  interceptor; window hidden, proc still alive
- T09 — setLoginItemSettings({ openAtLogin }) writes/removes
  $XDG_CONFIG_HOME/autostart/claude-desktop.desktop
- T12 — app.getGPUFeatureStatus() returns populated object;
  reaching mainVisible proves the renderer didn't crash
- T14b — second invocation under same isolation exits cleanly via
  requestSingleInstanceLock early-return; primary pid stays alive
- S07 — under CLAUDE_HARNESS_USE_WAYLAND=1, spawned Electron has
  --ozone-platform=wayland on argv (skips when env unset)
- S17 — shell-path-worker overlays the user's login-shell PATH onto
  a deliberately-scrubbed env. Re-forks shellPathWorker.js via
  utilityProcess.fork + MessageChannelMain to observe the worker
  output directly (the main-process FX() merger only fills undefined
  keys, so reading process.env.PATH after a non-undefined override
  wouldn't observe the effect).
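
The S17 rationale is easier to see with a small model of a merger that only fills undefined keys. This is an assumed reading of that behaviour, not the bundle's minified FX(); the function name and sample values are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Assumed semantics of an "only fills undefined keys" merger.
// The real FX() is minified and may differ in detail.
function fillUndefinedKeys(current: Env, fromShell: Env): Env {
  const result: Env = { ...current };
  for (const [key, value] of Object.entries(fromShell)) {
    if (result[key] === undefined) result[key] = value;
  }
  return result;
}

// PATH is deliberately scrubbed but still defined, so the merger skips it.
const scrubbedEnv: Env = { PATH: "/usr/bin" };
const workerOutput: Env = { PATH: "/home/u/.local/bin:/usr/bin", EDITOR: "vim" };
const afterMerge = fillUndefinedKeys(scrubbedEnv, workerOutput);
// afterMerge.PATH stays "/usr/bin"; only EDITOR is filled in.
```

Under this model, the worker's PATH never shows up in the merged environment, which is why the spec reads the worker output directly.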

T05 was originally planned as a Tier 2 isDefaultProtocolClient probe
but was reshaped: that runtime call is a no-op in the harness because
ELECTRON_FORCE_IS_PACKAGED=true makes app.getName() resolve to
"Claude" (not "claude-desktop"), so the xdg-mime shellout fails
silently. Real registration happens at install time via the .desktop
file's MimeType= line, so T05 ships as the delivery probe instead.
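
Since registration lives in the .desktop file, a static check on its MimeType= line is one way to verify it without the runtime call. The sketch below is a simplified parser under stated assumptions: the function name is hypothetical, it ignores desktop-entry groups, and it is not part of the harness.

```typescript
// Returns true when any MimeType= line in a .desktop file lists
// x-scheme-handler/<scheme>. Simplification: group headers such as
// [Desktop Entry] are not tracked.
function registersSchemeHandler(desktopFile: string, scheme: string): boolean {
  const wanted = `x-scheme-handler/${scheme}`;
  for (const rawLine of desktopFile.split("\n")) {
    const line = rawLine.trim();
    if (!line.startsWith("MimeType=")) continue;
    const types = line
      .slice("MimeType=".length)
      .split(";")
      .map((t) => t.trim());
    if (types.includes(wanted)) return true;
  }
  return false;
}
```

A Tier 1 runner could feed this the installed desktop file's contents and assert on the `claude` scheme.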

T07 was originally deferred to Tier 3 ("topbar is React-rendered SPA"),
but the harness's seedFromHost primitive (isolation.ts:37-44, never
exercised before this commit) lifts it back to Tier 2.

Co-Authored-By: Claude <claude@anthropic.com>
Mirrors lib/claudeai.ts:installOpenDialogMock (used by T17). Replaces
electron.shell.showItemInFolder with a recording mock so Tier 2
reframe specs can assert "the IPC layer reaches the egress with the
right path" without firing the real DBus FileManager1 / xdg-open
dispatch on the host.

Idempotent (guarded by globalThis.__claudeAiShowItemMockInstalled),
matches the existing mock helper's call-recording shape, exports a
companion getShowItemInFolderCalls reader. Used by the rewritten T25
runner in the next commit.
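
A minimal sketch of the idempotent recording-mock pattern described above. It runs against a structural stand-in rather than the real electron.shell, and the storage key `__showItemCalls` is a hypothetical name; only the guard flag and the two function names come from the commit text.

```typescript
// Structural stand-in for the part of electron.shell being patched, so the
// sketch runs without Electron. In the harness this would be the real module.
interface ShellLike {
  showItemInFolder(fullPath: string): void;
}

const guardHost = globalThis as unknown as {
  __claudeAiShowItemMockInstalled?: boolean;
  __showItemCalls?: string[]; // hypothetical storage key
};

function installShowItemInFolderMock(shell: ShellLike): void {
  // Idempotent: a second install keeps the first mock and its recorded calls.
  if (guardHost.__claudeAiShowItemMockInstalled) return;
  guardHost.__showItemCalls = [];
  shell.showItemInFolder = (fullPath: string): void => {
    // Record the path; never dispatch to the host file manager.
    guardHost.__showItemCalls!.push(fullPath);
  };
  guardHost.__claudeAiShowItemMockInstalled = true;
}

function getShowItemInFolderCalls(): string[] {
  // Return a copy so callers cannot mutate the recording.
  return [...(guardHost.__showItemCalls ?? [])];
}
```

Keeping state on globalThis rather than in module scope lets the guard survive repeated evaluation of the install snippet inside the same process.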

Co-Authored-By: Claude <claude@anthropic.com>
Categories landed:
- B (seedFromHost-unlocked): T16 (Code tab loads), T26 (Routines page
  renders) — both promote Tier 3 → Tier 2 via the seedFromHost
  primitive shipped in session 1.
- A (Tier 2 single-launch deferred from session 1): T10 (Cowork daemon
  respawn after SIGKILL), S10 (KDE-W Quick Entry popup transparent),
  S25 (safeStorage round-trip across two launches with shared
  isolation handle).
- C (Tier 2 reframes): T23 (Notification reaches DBus via dbus-monitor
  subprocess), T25 (shell.showItemInFolder via mock-then-call —
  mirrors T17's installOpenDialogMock), T38 (openInEditor IPC handler
  registered probe via ipcMain._invokeHandlers), S19
  (CLAUDE_CONFIG_DIR extraEnv reaches main process).
- Tier 1 reclass: S28 (worktree permission classifier asar fingerprint
  — Sbn() is closure-local, not inspector-reachable).
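
For the T23 reframe, the assertion side can be sketched as a scan of captured dbus-monitor text for a Notify method call. The function name is hypothetical, and the line layout assumed here is dbus-monitor's default text output; how the subprocess is spawned and collected is left out.

```typescript
// Returns true when captured dbus-monitor output contains a Notify method
// call on the org.freedesktop.Notifications interface. Signals, replies,
// and other members are ignored.
function sawNotifyMethodCall(monitorOutput: string): boolean {
  return monitorOutput.split("\n").some(
    (line) =>
      line.startsWith("method call") &&
      line.includes("interface=org.freedesktop.Notifications") &&
      /\bmember=Notify\b/.test(line),
  );
}
```

Matching on interface and member rather than destination avoids depending on whether the sender addressed the well-known name or a unique connection name.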

Mechanism notes — see plan doc status section for full rationale:
- T23 uses dbus-monitor, not gdbus monitor (the latter only reports
  signals emitted by a destination, not method calls sent to it).
- T38 inspects ipcMain._invokeHandlers for handler registration; the
  channel ends in $eipc_message$_<UUID>_$_claude.web_$_<name> with a
  build-stable UUID prefix — anchors on the suffix.
- T25 mock-then-call beats invoke-then-cleanup (no host file manager
  pop-up, stronger assertion).
- S25 compares decrypted plaintexts not ciphertexts (safeStorage on
  Linux uses random IVs).
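
The suffix anchoring mentioned for T38 can be sketched as a filter over a list of channel names. Where that list comes from is left open here; the function name and the placeholder UUID in the usage are hypothetical.

```typescript
// Matches channels shaped like $eipc_message$_<UUID>_$_claude.web_$_<name>
// by prefix and suffix only, so the check does not depend on the UUID.
function findChannelsBySuffix(channels: string[], handlerName: string): string[] {
  const suffix = `_$_claude.web_$_${handlerName}`;
  return channels.filter(
    (channel) => channel.startsWith("$eipc_message$_") && channel.endsWith(suffix),
  );
}

// Usage with a placeholder UUID (not a real build value):
const sampleChannels = [
  "$eipc_message$_00000000-0000-0000-0000-000000000000_$_claude.web_$_openInEditor",
  "list-mcp-servers",
];
const openInEditorMatches = findChannelsBySuffix(sampleChannels, "openInEditor");
```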

Co-Authored-By: Claude <claude@anthropic.com>
- runner-implementation-plan.md: new "Status (post-execution)" sub-
  section for session 2 listing the 10 new specs and the four
  reclassification notes (S28 → Tier 1, T38 framing, T23 tool choice,
  S19 honest-stub note). Session 1 sub-section preserved verbatim
  below for comparison.
- README.md: 50-spec inventory (was 40), new T-rows (T10, T16, T23,
  T25, T26, T38) and S-rows (S10, S19, S25, S28) interleaved into
  the existing tables. Substrate-primitives paragraph extended with
  dbus-monitor, mock-then-call, ipcMain registry introspection,
  safeStorage round-trip, extraEnv precedence.
- runner-implementation-followup-prompt.md: rewritten for session 3
  — deferred items (T31, T32, S06, S11, S14), Tier 3 → Tier 2
  reframes (T22, T35, T37), asar fingerprint cleanups (T24, T30,
  T33), the focus-shifter primitive build, and the mock-then-call
  extension for T24 as an alternative to its asar form. Includes
  the "known mechanism-recipe table" cumulating sessions 1+2.
- runner-implementation-prompt.md: deleted (session 1's prompt,
  superseded by the followup that's been the rolling document
  since session 1 ended).

Co-Authored-By: Claude <claude@anthropic.com>
… helper

Session 3 brings the third mock-then-call helper online
(installOpenExternalMock for shell.openExternal, mirroring
installShowItemInFolderMock and installOpenDialogMock). That meets the
threshold from the session prompt, so the three install/get pairs move
out of lib/claudeai.ts into a dedicated lib/electron-mocks.ts. The mocks are
generic Electron module patches (dialog, shell), not claude.ai-domain,
so the new home keeps claudeai.ts focused on AX-tree page-objects.

T17, T25 imports updated to point at the new module. T24 (added in the
follow-up commit) imports from electron-mocks.ts directly.

Co-Authored-By: Claude <claude@anthropic.com>
Coverage 50/76 → 57/76. Seven new specs land + one session-2 carryover
(T38) reclassified after the eipc-registry finding below.

New specs:

- T22 (PR monitoring) — Tier 1 fingerprint: LocalSessions_$_getPrChecks
  eipc channel name + "gh CLI not found in PATH" Linux-fallthrough
  throw site (case-doc anchors :464281 / :464964 / :464368).
- T24 (Open in editor) — Tier 2 mock-then-call: installOpenExternalMock
  patches shell.openExternal from main, evalInMain calls it with a
  vscode://file/... URL, assert recorded call lists URL verbatim. No
  real editor launch (mock returns Promise<boolean>).
- T30 (Auto-archive cadence) — Tier 1 fingerprint: single regex
  anchoring 300*1e3 ≤ 3600*1e3 ≤ AutoArchiveEngine in colocation
  (≤200 / ≤3000 char proximity windows tuned to current bundle), plus
  ccAutoArchiveOnPrClose .includes() inside the captured window.
- T31 (Side chat) — Tier 1 fingerprint: side-chat eipc trio
  (startSideChat / sendSideChatMessage / stopSideChat).
- T32 (Slash menu) — Tier 1 fingerprint:
  LocalSessions_$_getSupportedCommands + slashCommands schema.
- T33 (Plugin browser) — Tier 1 fingerprint:
  CustomPlugins_$_listMarketplaces + listAvailablePlugins.
- T37 (CLAUDE.md memory) — Tier 1 fingerprint: high-signal
  "[GlobalMemory] Copied CLAUDE.md" log line + CLAUDE.md filename +
  CLAUDE_CONFIG_DIR env-var token. Fixture-readback form deferred —
  parsed-memory state is closure-local.
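
The T30 colocation fingerprint can be sketched as one proximity regex plus a containment check on the matched window. Which proximity window (200 or 3000 characters) belongs to which gap is an assumption here, and the constant and function names are hypothetical; the real regex may differ.

```typescript
// Assumed reading: 300*1e3, then 3600*1e3 within 200 chars, then
// AutoArchiveEngine within a further 3000 chars.
const AUTO_ARCHIVE_FINGERPRINT =
  /300\*1e3[\s\S]{0,200}?3600\*1e3[\s\S]{0,3000}?AutoArchiveEngine/;

function matchesAutoArchiveFingerprint(bundleText: string): boolean {
  const hit = AUTO_ARCHIVE_FINGERPRINT.exec(bundleText);
  // The secondary token must sit inside the matched window,
  // not merely somewhere else in the bundle.
  return hit !== null && hit[0].includes("ccAutoArchiveOnPrClose");
}
```

Bounding the gaps keeps the match tied to one code region, so unrelated occurrences of the same constants elsewhere in a large minified bundle do not satisfy it.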

eipc-registry finding (T38 reclassification):

Session 2's T38 used ipcMain._invokeHandlers introspection. A KDE-W run
revealed that the registry holds only three chat-tab MCP-bridge handlers
(list-mcp-servers, connect-to-mcp-server, request-open-mcp-settings)
regardless of ready level (mainVisible / claudeAi / userLoaded) and
regardless of authentication state (default isolation vs.
seedFromHost: true verified via probe). The
$eipc_message$_<UUID>_$_claude.web_$_<name> protocol uses a closure-
local message-port registry not reachable from globalThis — same
gotcha as session 2's Sbn() (S28) and cE()/Tce() (S19).
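
The closure-local gotcha can be reproduced in miniature. The sketch below is a toy model, not the app's eipc implementation: `createChannelBus` and `reachableFromGlobal` are hypothetical names, and the scan only inspects direct properties of globalThis.

```typescript
// Toy registry that lives only inside a closure.
function createChannelBus() {
  const handlers = new Map<string, () => string>(); // never attached to globalThis
  return {
    register(channel: string, fn: () => string): void {
      handlers.set(channel, fn);
    },
    dispatch(channel: string): string | undefined {
      return handlers.get(channel)?.();
    },
  };
}

const channelBus = createChannelBus();
channelBus.register("LocalSessions_$_openInEditor", () => "opened");

// Looks for any direct globalThis property that is a Map holding the channel.
function reachableFromGlobal(channel: string): boolean {
  const root = globalThis as unknown as Record<string, unknown>;
  return Object.getOwnPropertyNames(root).some((key) => {
    let value: unknown;
    try {
      value = root[key];
    } catch {
      return false; // some global getters may throw
    }
    return value instanceof Map && value.has(channel);
  });
}
```

The handler dispatches normally through the bus, yet an inspector that starts from globalThis has no path to the Map, which is the same shape of problem the finding describes.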

T38 rewritten as a Tier 1 asar fingerprint anchoring on the
LocalSessions_$_openInEditor channel-name string in the bundle. T22,
T31, T33 (originally drafted with the same broken pattern) ship as
Tier 1 fingerprints from the start. T24 is unaffected: it patches
Electron's own shell module from main, not the eipc layer.

KDE-W: 9/9 pass in 18.2s (7 new + T25 verifying the lib import-extract
didn't break it + T38 reclassified).

Co-Authored-By: Claude <claude@anthropic.com>
Updates the post-execution status section with session 3's seven
shipped specs, the eipc-registry finding (corrects session 2's T38
assumption), and the four reclassifications (T22/T31/T33/T38 from
Tier 2 IPC probes to Tier 1 fingerprints). Captures the
authentication-state lesson too — launches that depend on
authenticated renderer state need createIsolation({ seedFromHost:
true }), even if the case-doc-shaped Tier 2 form looks hermetic on
paper.

README inventory grows from 50 to 57 specs and adds a note that
LocalSessions_$_* / CustomPlugins_$_* channels use a custom eipc
protocol, not Electron's standard ipcMain.handle() — so future
runners should anchor on channel-name strings (Tier 1) rather than
introspect _invokeHandlers (broken).
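
A channel-name anchor of the kind recommended above reduces to a substring check over the extracted bundle text. This is a minimal sketch; the function name is hypothetical, and reading the asar contents into a string is assumed to happen elsewhere.

```typescript
// Returns the channel names that do NOT appear in the bundle text.
// An empty result means every expected channel string is present.
function missingChannelNames(bundleText: string, channelNames: string[]): string[] {
  return channelNames.filter((name) => !bundleText.includes(name));
}

// Channel names taken from the specs listed earlier in this PR.
const expectedChannels = [
  "LocalSessions_$_openInEditor",
  "LocalSessions_$_getPrChecks",
  "CustomPlugins_$_listMarketplaces",
];
```

Returning the missing names rather than a boolean lets the failure message say which channel disappeared from the bundle.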

Followup prompt rewritten for session 4: focus-shifter primitive +
S11/S14, T35 MCP separation fingerprints (Phase 1) and optional
fixture-readback (Phase 2, may abort), and the eipc-registry
exposer as a flagged primitive gap.

Co-Authored-By: Claude <claude@anthropic.com>
@aaddrick
Owner Author

aaddrick commented May 3, 2026

Closing this WIP — will redraft once the test-plan + harness work is finished. Branch stays for ongoing iteration.

@aaddrick aaddrick closed this May 3, 2026