
feat(gui): native GUI tools with builtin tool call observability#3

Open
s-JoL wants to merge 17 commits into main from codex/gui-native-tools-v2

Conversation

@s-JoL
Collaborator

@s-JoL s-JoL commented Apr 2, 2026

Summary

  • Add native macOS GUI tools (observe, click, drag, scroll, type, key, move, wait) with a semantic grounding engine
  • Promote GUI tool calls to first-class BuiltinToolCall turn items for TUI/CLI observability (like exec commands)
  • Add experimental coordinate targeting placeholder (disabled by default, [tools.gui] coordinate_targeting = true)
  • Add grounding benchmark infrastructure comparing separate grounding vs main-model coordinate strategies

Demo

GUI Demo

Commit breakdown

Feature (9 commits):

  1. Protocol: BuiltinToolCallItem + BuiltinToolCallStatus types (+125 lines)
  2. Tool schemas: gui_tool.rs with all tool schema definitions (+976 lines)
  3. Platform runtime: Swift helpers, platform abstraction, readiness checks (+3,172 lines)
  4. Grounding engine: Semantic target resolution with validation and retry (+2,155 lines)
  5. Handler: Main GUI tool handler with observe-act-verify loop (+3,053 lines)
  6. Config: System prompt instructions, gui_coordinate_targeting option (+95 lines)
  7. Observability: ItemStarted/ItemCompleted events for GUI tool calls (+342 lines)
  8. UI rendering: TUI BuiltinToolCallCell, CLI human/JSONL output (+494 lines)
  9. Tests: Unit tests, grounding benchmark, schema fixtures (+4,696 lines)

Fixes & improvements (6 commits):

  10. Fix: Do not re-activate app during post-action evidence and wait probes (+9/-2)
  11. Fix: Async safety: switch std::sync::Mutex to tokio::sync::Mutex, cache readiness snapshot with 30s TTL, replace hardcoded zsh with /bin/sh, remove dead code in TUI replay (+51/-16)
  12. Fix: Screenshot redaction crash, TOCTOU race in helper binary check, edge grounding bounds (+97/-24)
  13. Refactor: Move blocking platform calls to spawn_blocking, add CaptureMode enum, improve robustness (+3,685/-3,416)
  14. Fix: Use modern activate() API on macOS 14+ (old activate(options:) is deprecated), guard releaseMouseButtons with button-state checks to avoid spurious events (+12/-4)
  15. Feature: Hide non-target apps during GUI actions and restore on completion, preventing accidental interaction with unrelated windows and ensuring clean screenshots. Add demo GIF. (+147/-6)

Test plan

  • All codex-core, codex-exec, codex-tui, codex-tools, codex-protocol, codex-app-server-protocol tests pass
  • Schema fixtures regenerated and verified
  • Manual: enable gui_tools feature, verify GUI tool calls show in TUI like exec commands
  • Manual: run grounding benchmark on macOS with CODEX_GUI_GROUNDING_CASE_IDS

🤖 Generated with Claude Code

s-JoL and others added 10 commits April 2, 2026 18:00
Add protocol-level type definitions for built-in tool calls (e.g. native
GUI tools). Includes:
- BuiltinToolCallItem and BuiltinToolCallStatus in codex-protocol items
- BuiltinToolCall variant in ThreadItem with From<CoreTurnItem> conversion
- BuiltinToolCallStatus in app-server-protocol v2 with From impl
- ThreadHistoryBuilder support for BuiltinToolCall turn items
- Test coverage for BuiltinToolCall round-trip conversion

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
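
For readers without the diff at hand, the protocol commit above can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the field names, the status variants, and the reduced `CoreTurnItem` and `ThreadItem` enums are assumptions made for illustration. Only the type names and the existence of a `From<CoreTurnItem>` conversion come from the commit message.

```rust
/// Lifecycle of a built-in tool call. Variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BuiltinToolCallStatus {
    InProgress,
    Completed,
    Failed,
}

/// Core-side item. Field names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BuiltinToolCallItem {
    pub id: String,
    pub tool_name: String,
    pub arguments: String,
    pub output: Option<String>,
    pub status: BuiltinToolCallStatus,
}

/// Reduced stand-in for the core turn-item enum (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CoreTurnItem {
    AgentMessage { text: String },
    BuiltinToolCall(BuiltinToolCallItem),
}

/// Reduced stand-in for the app-server thread item enum (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ThreadItem {
    AgentMessage { text: String },
    BuiltinToolCall(BuiltinToolCallItem),
}

impl From<CoreTurnItem> for ThreadItem {
    /// Map each core item onto the matching thread item without losing fields.
    fn from(item: CoreTurnItem) -> Self {
        match item {
            CoreTurnItem::AgentMessage { text } => ThreadItem::AgentMessage { text },
            CoreTurnItem::BuiltinToolCall(call) => ThreadItem::BuiltinToolCall(call),
        }
    }
}
```

Converting a `CoreTurnItem::BuiltinToolCall` and comparing the payload with the original item is the kind of round-trip check the commit message mentions.
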
Add Responses API tool definitions for native GUI interaction tools:
- gui_observe: capture screenshots with optional app/window targeting
- gui_click: click at semantic target or coordinates with modifiers
- gui_type: type text or key sequences
- gui_key: send keyboard shortcuts
- gui_scroll: scroll in a direction at a location
- gui_drag: drag from one point to another
- gui_move: move cursor to a position
- gui_wait: pause for a specified duration

Each tool supports both semantic targeting (app name, window title)
and optional coordinate-based targeting via GuiToolSchemaOptions.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
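
To illustrate how a schema option might gate coordinate fields, here is a small sketch. `GuiToolSchemaOptions` is named in the commit message, but its `coordinate_targeting` field, the helper `gui_click_parameter_names`, and every parameter name listed are assumptions, not the actual contents of gui_tool.rs.

```rust
/// Options controlling schema generation. The field name is an assumption.
#[derive(Debug, Clone, Copy, Default)]
pub struct GuiToolSchemaOptions {
    pub coordinate_targeting: bool,
}

/// Hypothetical helper: list the parameter names a `gui_click` schema exposes.
/// Semantic fields are always present; coordinate fields are opt-in.
pub fn gui_click_parameter_names(options: GuiToolSchemaOptions) -> Vec<&'static str> {
    // Parameter names here are illustrative, not taken from the PR.
    let mut names = vec!["app", "window", "target", "modifiers"];
    if options.coordinate_targeting {
        names.extend(["x", "y"]);
    }
    names
}
```

With the default options only the semantic fields are returned; enabling `coordinate_targeting` appends the two coordinate fields, which is the behaviour the spec tests in a later commit are described as checking.
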
Add the platform abstraction layer for native GUI tool execution:

- platform.rs: trait-based platform abstraction for screen capture,
  mouse/keyboard control, accessibility queries
- platform_macos.rs: macOS implementation via Swift helper process
- platform_windows.rs: Windows stub (not yet implemented)
- native_helper.swift: Swift helper for macOS screen capture, mouse
  events, keyboard simulation, and accessibility tree queries
- type_system_events.applescript: AppleScript helper for typing via
  System Events (fallback for apps that block CGEvents)
- provider.rs: semantic target resolution via grounding engine
- readiness.rs: environment capability detection and tool filtering
- session.rs: per-session GUI state management with emergency stop

These files are not yet wired into the build (mod.rs not updated)
and will be activated in a subsequent commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
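
The trait-based abstraction described above might look something like the following sketch. The trait name `GuiPlatform`, its two methods, and both implementations are hypothetical. The real platform.rs covers screen capture, keyboard control, and accessibility queries as well.

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
pub enum PlatformError {
    /// The operation is not available on this platform.
    Unsupported(&'static str),
}

/// Hypothetical, heavily reduced platform trait.
pub trait GuiPlatform {
    fn name(&self) -> &'static str;
    fn click(&self, x: f64, y: f64) -> Result<(), PlatformError>;
}

/// Stub in the spirit of a not-yet-implemented backend.
pub struct StubPlatform;

impl GuiPlatform for StubPlatform {
    fn name(&self) -> &'static str {
        "stub"
    }
    fn click(&self, _x: f64, _y: f64) -> Result<(), PlatformError> {
        Err(PlatformError::Unsupported("click"))
    }
}

/// Test double that records clicks instead of posting real events.
#[derive(Default)]
pub struct RecordingPlatform {
    clicks: Mutex<Vec<(f64, f64)>>,
}

impl RecordingPlatform {
    pub fn clicks(&self) -> Vec<(f64, f64)> {
        self.clicks.lock().unwrap().clone()
    }
}

impl GuiPlatform for RecordingPlatform {
    fn name(&self) -> &'static str {
        "recording"
    }
    fn click(&self, x: f64, y: f64) -> Result<(), PlatformError> {
        self.clicks.lock().unwrap().push((x, y));
        Ok(())
    }
}

/// Handler-side code depends only on the trait, not on a concrete backend.
pub fn click_center(
    platform: &dyn GuiPlatform,
    x: f64,
    y: f64,
    width: f64,
    height: f64,
) -> Result<(), PlatformError> {
    platform.click(x + width / 2.0, y + height / 2.0)
}
```

Keeping the handler behind a trait object is what lets a Windows stub and a macOS implementation coexist, and it makes handler tests possible without touching the real screen.
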
Add the grounding module that resolves natural-language GUI targets
(e.g. "the Search button") to screen coordinates using the platform
accessibility tree and optional vision-based fallback:

- Accessibility-first resolution via AX tree traversal
- Multi-strategy matching: exact, prefix, suffix, containment, fuzzy
- Coordinate deduplication and bounding-box merging
- Support for grounding mode selection (accessibility, vision, hybrid)
- Rich diagnostic output for observability

This file is not yet wired into the build and will be activated in
the next commit along with the main GUI handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
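
The multi-strategy matching listed above (exact, prefix, suffix, containment, fuzzy) can be sketched as a ranked classifier over accessibility labels. Everything here is an assumption for illustration: the `MatchKind` enum, the case-insensitive comparison, the strategy priority order, and the edit-distance threshold of 2 for the fuzzy tier. The grounding module may rank and score candidates quite differently.

```rust
/// Match strategies, ordered from strongest to weakest (derive(Ord) uses this order).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MatchKind {
    Exact,
    Prefix,
    Suffix,
    Contains,
    Fuzzy,
}

/// Levenshtein distance over chars, using a rolling row.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Classify how `query` matches `label`, trying the strongest strategy first.
pub fn classify(query: &str, label: &str) -> Option<MatchKind> {
    let q = query.trim().to_lowercase();
    let l = label.trim().to_lowercase();
    if q.is_empty() || l.is_empty() {
        return None;
    }
    if l == q {
        Some(MatchKind::Exact)
    } else if l.starts_with(&q) {
        Some(MatchKind::Prefix)
    } else if l.ends_with(&q) {
        Some(MatchKind::Suffix)
    } else if l.contains(&q) {
        Some(MatchKind::Contains)
    } else if edit_distance(&q, &l) <= 2 {
        // Threshold of 2 is an arbitrary choice for this sketch.
        Some(MatchKind::Fuzzy)
    } else {
        None
    }
}

/// Pick the label with the strongest match; ties keep the earliest label.
pub fn best_match<'a>(query: &str, labels: &[&'a str]) -> Option<(&'a str, MatchKind)> {
    labels
        .iter()
        .copied()
        .filter_map(|label| classify(query, label).map(|kind| (label, kind)))
        .min_by_key(|(_, kind)| *kind)
}
```

For example, `best_match("search", &["Search field", "Search"])` prefers the exact label over the prefix match.
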
Add the main GUI tool handler (gui.rs) implementing the full
observe-act-verify execution loop for native GUI tools:

- GuiHandler implementing ToolHandler trait for all gui_* tools
- Observe: screen capture with app/window targeting and AX tree
- Act: click, type, key, scroll, drag, move via platform abstraction
- Verify: post-action observation with wait-until-stable polling
- Semantic target resolution via grounding engine
- Coordinate-space normalization (image pixels vs display points)
- Emergency stop monitoring during long-running operations
- Rich structured output with capture details and diagnostics

Also adds gui_tools and gui_coordinate_targeting fields to ToolsConfig
to control GUI tool availability and coordinate targeting mode.

Wire the gui module into core/src/tools/handlers/mod.rs and export
GuiHandler.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
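
Coordinate-space normalization is the part of this commit most easily shown in code. The sketch below assumes a capture is described by its origin and size in display points plus the pixel size of the screenshot. The `CaptureGeometry` struct and its method names are hypothetical, and the image dimensions are assumed to be non-zero.

```rust
/// Hypothetical description of one capture: where it sits on the display
/// (in points) and how large the resulting image is (in pixels).
#[derive(Debug, Clone, Copy)]
pub struct CaptureGeometry {
    pub origin_x: f64,
    pub origin_y: f64,
    pub width_points: f64,
    pub height_points: f64,
    pub image_width_px: u32,
    pub image_height_px: u32,
}

impl CaptureGeometry {
    /// Points per pixel along each axis (0.5 on a typical 2x Retina capture).
    fn scale(&self) -> (f64, f64) {
        (
            self.width_points / f64::from(self.image_width_px),
            self.height_points / f64::from(self.image_height_px),
        )
    }

    /// Map a pixel position in the screenshot to global display points.
    pub fn image_to_display(&self, px: f64, py: f64) -> (f64, f64) {
        let (sx, sy) = self.scale();
        (self.origin_x + px * sx, self.origin_y + py * sy)
    }

    /// Inverse mapping: global display points back to screenshot pixels.
    pub fn display_to_image(&self, x: f64, y: f64) -> (f64, f64) {
        let (sx, sy) = self.scale();
        ((x - self.origin_x) / sx, (y - self.origin_y) / sy)
    }
}
```

A model that answers in image pixels needs the first mapping before an event is posted; evidence overlays drawn on the screenshot need the second.
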
Add GUI-specific system prompt instructions and configuration:

- gui_instructions.rs: render_gui_tools_section() produces the
  "Native GUI Tools" system prompt block with semantic targeting
  guidance, observe-act-verify loop instructions, and optional
  coordinate targeting notes
- config/mod.rs: add gui_coordinate_targeting field to Config,
  GuiToolsToml struct for [tools.gui] TOML section, and
  resolve_gui_coordinate_targeting() resolution function
- lib.rs: register gui_instructions module

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
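
The config resolution described above presumably reduces to "use the TOML value if present, otherwise stay disabled". A minimal sketch under that assumption follows. `GuiToolsToml` and `resolve_gui_coordinate_targeting` are named in the commit, but the `ToolsToml` wrapper, the field layout, and the function signature are guesses for illustration (TOML deserialization is omitted).

```rust
/// Mirrors the `[tools.gui]` table. The field layout is an assumption.
#[derive(Debug, Clone, Default)]
pub struct GuiToolsToml {
    pub coordinate_targeting: Option<bool>,
}

/// Hypothetical stand-in for the parent `[tools]` table.
#[derive(Debug, Clone, Default)]
pub struct ToolsToml {
    pub gui: Option<GuiToolsToml>,
}

/// Resolve the effective setting. Coordinate targeting is disabled by default,
/// so every missing layer falls through to `false`.
pub fn resolve_gui_coordinate_targeting(tools: Option<&ToolsToml>) -> bool {
    tools
        .and_then(|tools| tools.gui.as_ref())
        .and_then(|gui| gui.coordinate_targeting)
        .unwrap_or(false)
}
```

Modelling each layer as `Option` keeps "not configured" distinct from "explicitly false", which matters if profiles or overrides are merged before resolution.
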
… items

Wire GUI tool calls into the turn-item lifecycle so they appear as
BuiltinToolCall events in the event stream:

- codex.rs: add is_visible_builtin_tool_call() predicate, emit
  BuiltinToolCall turn items on FunctionCall start and
  FunctionCallOutput completion, handle drain_in_flight conversion
- state/session.rs: add visible_builtin_tool_calls HashMap for
  tracking in-progress builtin tool calls
- stream_events_utils.rs: add response_input_to_response_item()
  conversion, hook into handle_output_item_done for builtin calls
- tools/spec.rs: register GUI tool specs and GuiHandler when
  gui_tools is enabled in ToolsConfig
- tools/tool_config.rs: add with_gui_coordinate_targeting builder
  method
- features: add Feature::GuiTools as experimental feature with
  toggle in /experimental menu
- app-server: pass gui_coordinate_targeting through to ToolsConfig

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
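
The start/completion pairing described above can be sketched with a map keyed by call ID. The names `is_visible_builtin_tool_call` and `visible_builtin_tool_calls` come from the commit message. The `gui_` prefix test, the `TurnEvent` enum, and the method signatures are assumptions, not the code in codex.rs.

```rust
use std::collections::HashMap;

/// Reduced event type. The real events carry full turn items.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TurnEvent {
    ItemStarted { call_id: String, tool_name: String },
    ItemCompleted { call_id: String, tool_name: String, output: String },
}

/// Assumption: GUI tools are recognised by their `gui_` name prefix.
pub fn is_visible_builtin_tool_call(tool_name: &str) -> bool {
    tool_name.starts_with("gui_")
}

#[derive(Default)]
pub struct SessionState {
    /// In-progress visible calls: call ID -> tool name.
    visible_builtin_tool_calls: HashMap<String, String>,
}

impl SessionState {
    /// Called when a FunctionCall item arrives; emits a start event for visible tools.
    pub fn on_function_call(&mut self, call_id: &str, tool_name: &str) -> Option<TurnEvent> {
        if !is_visible_builtin_tool_call(tool_name) {
            return None;
        }
        self.visible_builtin_tool_calls
            .insert(call_id.to_string(), tool_name.to_string());
        Some(TurnEvent::ItemStarted {
            call_id: call_id.to_string(),
            tool_name: tool_name.to_string(),
        })
    }

    /// Called when the FunctionCallOutput arrives; emits completion only for tracked calls.
    pub fn on_function_call_output(&mut self, call_id: &str, output: &str) -> Option<TurnEvent> {
        let tool_name = self.visible_builtin_tool_calls.remove(call_id)?;
        Some(TurnEvent::ItemCompleted {
            call_id: call_id.to_string(),
            tool_name,
            output: output.to_string(),
        })
    }
}
```

Removing the entry on completion means a duplicate or unknown output produces no event, so the UI never sees a completion without a matching start.
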
Add rendering support for BuiltinToolCall turn items across all
output surfaces:

- history_cell.rs: add BuiltinToolCallCell with tool name, arguments,
  output, and status rendering for the TUI history view
- chatwidget.rs: add on_builtin_tool_call_begin/end handlers to
  create and update BuiltinToolCallCell entries in the chat widget
- app_server_adapter.rs: map ThreadItem::BuiltinToolCall to core
  TurnItem::BuiltinToolCall for thread history reconstruction
- event_processor_with_human_output.rs: render BuiltinToolCall events
  as colored terminal output with tool name, arguments, and result
- event_processor_with_jsonl_output.rs: serialize BuiltinToolCall
  events as structured JSONL output with full detail
- exec_events.rs: add BuiltinToolCallItem and BuiltinToolCallStatus
  types for the exec event stream

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add comprehensive test coverage for the GUI tools feature:

- gui/tests.rs: unit tests for argument parsing, coordinate space
  normalization, scroll planning, target resolution, grounding
  diagnostics, and full handler integration tests including observe,
  click, type, scroll, drag, key, and wait operations
- gui/benchmark_renderer.swift: Swift helper that renders a
  deterministic GUI benchmark surface for grounding accuracy tests
- spec_tests.rs: verify GUI tools appear when Feature::GuiTools is
  enabled and coordinate targeting adds coordinate fields
- codex_tests.rs: test BuiltinToolCall turn item emission for
  visible builtin tool calls
- config_tests.rs: test gui_coordinate_targeting config resolution
  from TOML
- chatwidget/tests/app_server.rs: test TUI BuiltinToolCall thread
  item mapping

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
… probes

After a GUI action (click, drag, type, etc.), the post-action evidence
capture was re-activating the target app via observe_platform(..., true, ...).
This caused transient UI states like selected chess pieces or hover
highlights to be reset before the evidence screenshot was taken.

Similarly, gui_wait polling probes were re-activating on every iteration,
which could dismiss popups or reset selections being monitored.

Fix: set activate_app=false in capture_post_action_evidence and
probe_semantic_target_before_deadline. The app is already active from
the initial grounding step or user-initiated gui_observe.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@s-JoL s-JoL force-pushed the codex/gui-native-tools-v2 branch from bb828ae to 5137794 Compare April 2, 2026 11:27
s-JoL and others added 2 commits April 2, 2026 20:23
… cleanup

- Switch GuiHandler.observe_state from std::sync::Mutex to tokio::sync::Mutex
  to prevent potential tokio worker thread blocking
- Cache readiness_snapshot with 30s TTL to avoid per-call swiftc invocations
- Replace hardcoded "zsh" with "/bin/sh" in secret_command_env_var execution
- Document clipboard clobbering limitation in native_helper.swift pasteText
- Reformat gui_instructions.rs system prompt as multi-line string for readability
- Remove unreachable dead-code BuiltinToolCall match arm in chatwidget.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
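
The readiness cache mentioned above is a plain time-to-live cache. A minimal sketch, assuming a single cached value and an injected clock so the expiry can be tested. `TtlCache` and `get_or_refresh` are hypothetical names, and the real code presumably sits behind the async mutex this commit introduces.

```rust
use std::time::{Duration, Instant};

/// Single-slot cache whose value expires after `ttl`.
pub struct TtlCache<T> {
    ttl: Duration,
    entry: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None }
    }

    /// Return the cached value if it is younger than `ttl`; otherwise run
    /// `compute` (the expensive probe) and store its result stamped with `now`.
    pub fn get_or_refresh(&mut self, now: Instant, compute: impl FnOnce() -> T) -> T {
        if let Some((stored_at, value)) = &self.entry {
            if now.duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let value = compute();
        self.entry = Some((now, value.clone()));
        value
    }
}
```

With `TtlCache::new(Duration::from_secs(30))`, a second lookup ten seconds after the first reuses the snapshot, and one at 31 seconds triggers a fresh probe.
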
…ding

- Replace NSImage.lockFocus()/TIFF pipeline with CGContext/CGImageDestination
  in native_helper.swift to fix CGImageDestinationFinalize failure in CLI
  processes without a window server connection
- Also replace NSBitmapImageRep in handleRedactHostWindows with CGImage for
  the same reason, and handle images without alpha channel
- Add OnceLock cache and atomic rename for helper binary compilation to fix
  TOCTOU race when concurrent GUI tool calls resolve simultaneously
- Clamp grounding bounding boxes to image bounds instead of rejecting them,
  so controls at window edges are still clickable

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
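
Clamping instead of rejecting, as described in the last bullet above, amounts to intersecting the grounded box with the image rectangle. A sketch under that reading. The `Rect` type and `clamp_to_image` are hypothetical names, and returning `None` for boxes entirely outside the image is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

/// Intersect `rect` with the image area `[0, image_width) x [0, image_height)`.
/// Boxes that only partially overlap are trimmed; boxes with no overlap yield `None`.
pub fn clamp_to_image(rect: Rect, image_width: f64, image_height: f64) -> Option<Rect> {
    let left = rect.x.max(0.0);
    let top = rect.y.max(0.0);
    let right = (rect.x + rect.width).min(image_width);
    let bottom = (rect.y + rect.height).min(image_height);
    if right <= left || bottom <= top {
        return None;
    }
    Some(Rect {
        x: left,
        y: top,
        width: right - left,
        height: bottom - top,
    })
}
```

A control whose box pokes 10 px past the left edge keeps its visible part, so its clamped centre is still a valid click target.
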
- Use single-pass box-fit in constrain_model_dimensions to eliminate
  sub-pixel aspect ratio skew on extreme aspect ratios
- Use exclusive upper bound (<) in point-within-capture checks to
  prevent off-by-one at image edges
- Drop wasteful quality-85 JPEG step (initial encode uses quality-80)
- Move blocking platform calls to spawn_blocking
- Add CaptureMode enum and improve robustness

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
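
The first two bullets above can be illustrated together. The sketch computes one scale factor from both limits (a single-pass box fit), so both dimensions are scaled identically, and it treats the image extent as a half-open range. `fit_within_box` and `point_within_capture` are hypothetical names. The actual `constrain_model_dimensions` signature and limits are not shown in this PR page.

```rust
/// Scale (width, height) down to fit inside (max_width, max_height) using one
/// shared factor, so the aspect ratio is only affected by final rounding.
/// Images already inside the box are returned unchanged (no upscaling).
pub fn fit_within_box(width: u32, height: u32, max_width: u32, max_height: u32) -> (u32, u32) {
    if width == 0 || height == 0 {
        return (0, 0);
    }
    let scale = (f64::from(max_width) / f64::from(width))
        .min(f64::from(max_height) / f64::from(height))
        .min(1.0);
    let scaled = |value: u32| ((f64::from(value) * scale).round() as u32).max(1);
    (scaled(width), scaled(height))
}

/// Valid pixel coordinates run from 0 up to, but not including, the size.
pub fn point_within_capture(x: f64, y: f64, width: u32, height: u32) -> bool {
    x >= 0.0 && y >= 0.0 && x < f64::from(width) && y < f64::from(height)
}
```

Constraining width and height in separate passes can round each axis against a different intermediate size. Deriving both from one factor avoids that, which matters most for very wide or very tall captures.
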
@s-JoL s-JoL force-pushed the codex/gui-native-tools-v2 branch from af76733 to d8157b5 Compare April 2, 2026 18:41
s-JoL and others added 2 commits April 3, 2026 13:24
- activateApplication: use the non-deprecated `activate()` on macOS 14+,
  falling back to `activate(options:)` on older versions.
- releaseMouseButtons: check button state before posting mouse-up events, so no spurious events are sent.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
… demo

- Hide non-target apps when a GUI action starts and restore them on
  completion (or on drop), keeping the screen clean for screenshots
  and preventing accidental interaction with unrelated windows.
- Guard releaseMouseButtons() with CGEventSource.buttonState so that
  cleanup only sends mouseUp events when buttons are actually pressed,
  fixing chess-piece deselection caused by spurious mouseUp events.
- Use the modern NSRunningApplication.activate() API on macOS 14+
  to silence the deprecation warning for activateIgnoringOtherApps.
- Add gui-demo.gif showcasing CUA capabilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
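
"Restore them on completion (or on drop)" suggests an RAII guard. The following sketch shows that pattern in isolation. The `AppVisibility` trait, `HiddenAppsGuard`, and the in-memory `FakeDesktop` are all hypothetical. The real code hides applications through the Swift helper.

```rust
use std::sync::Mutex;

/// Hypothetical slice of the platform layer that can hide and unhide apps.
pub trait AppVisibility {
    /// Hide every app except `target`; return the names that were hidden.
    fn hide_others(&self, target: &str) -> Vec<String>;
    fn unhide(&self, apps: &[String]);
}

/// Guard that restores hidden apps when it goes out of scope,
/// including on early return or panic unwinding.
pub struct HiddenAppsGuard<'a> {
    platform: &'a dyn AppVisibility,
    hidden: Vec<String>,
}

impl<'a> HiddenAppsGuard<'a> {
    pub fn hide_all_except(platform: &'a dyn AppVisibility, target: &str) -> Self {
        let hidden = platform.hide_others(target);
        Self { platform, hidden }
    }
}

impl Drop for HiddenAppsGuard<'_> {
    fn drop(&mut self) {
        if !self.hidden.is_empty() {
            self.platform.unhide(&self.hidden);
        }
    }
}

/// In-memory fake used to demonstrate the guard without touching real windows.
pub struct FakeDesktop {
    visible: Mutex<Vec<String>>,
}

impl FakeDesktop {
    pub fn new(apps: &[&str]) -> Self {
        Self { visible: Mutex::new(apps.iter().map(|a| a.to_string()).collect()) }
    }
    pub fn visible(&self) -> Vec<String> {
        self.visible.lock().unwrap().clone()
    }
}

impl AppVisibility for FakeDesktop {
    fn hide_others(&self, target: &str) -> Vec<String> {
        let mut visible = self.visible.lock().unwrap();
        let (keep, hidden): (Vec<String>, Vec<String>) =
            visible.drain(..).partition(|app| app == target);
        *visible = keep;
        hidden
    }
    fn unhide(&self, apps: &[String]) {
        self.visible.lock().unwrap().extend(apps.iter().cloned());
    }
}
```

Tying restoration to `Drop` means a failed or cancelled GUI action cannot leave the user's other windows hidden.
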
s-JoL and others added 2 commits April 3, 2026 22:46
…g support

Add a new `gui_batch` tool that executes multiple independent GUI actions
in a single call for faster task completion. Key features:

- Batch actions: click, type, key, scroll, drag in one tool call
- Parallel grounding: N targets grounded concurrently (2.5x speedup)
- Unified grounding: multi-target predictor+validator rounds (experimental)
- Parallel drag: single-step gui_drag now grounds both endpoints in parallel
- Configurable via `[tools.gui] batch_grounding_strategy` or per-call override
- One screenshot + one evidence capture per batch instead of N

Measured: 10-step batch with 5 click targets: 77s individual → 29s parallel.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
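
Parallel grounding, as described above, means resolving N independent targets at the same time and collecting results in the original order. The sketch below uses scoped OS threads from the standard library as a stand-in. The PR's implementation is not shown on this page and, given the tokio usage elsewhere in the PR, is likely async. `ground_all` and the closure signature are hypothetical.

```rust
use std::thread;

/// Ground every target concurrently. `ground` stands in for one grounding
/// round-trip (target description -> optional screen point).
/// Results are returned in the same order as `targets`.
pub fn ground_all<F>(targets: &[&str], ground: F) -> Vec<Option<(f64, f64)>>
where
    F: Fn(&str) -> Option<(f64, f64)> + Sync,
{
    let ground = &ground;
    thread::scope(|scope| {
        // Spawn one worker per target; all workers start before any join.
        let handles: Vec<_> = targets
            .iter()
            .copied()
            .map(|target| scope.spawn(move || ground(target)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("grounding worker panicked"))
            .collect()
    })
}
```

Because each target is independent, wall-clock time approaches that of the slowest single grounding call rather than the sum of all calls.
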
…and batch config

- Convert Swift native_helper to a persistent `serve` mode subprocess
  using JSON-line protocol over stdin/stdout, eliminating per-call spawn
  overhead. Falls back to one-shot mode on connection failure.
- Compress screenshots from PNG to JPEG (quality 75) for 3-5x smaller
  payloads and reduced token consumption.
- Fix gui_key("Escape") triggering false emergency stops by adding
  suppress_next flag to GuiEmergencyStopMonitor.
- Make gui_batch configurable: add batch_enabled, batch_action_delay_ms
  config options; move grounding_strategy from tool param to config.
- Add local-release Cargo profile for fast local builds (no LTO).
- Fix stdout restoration bug in Swift serve error path.
- Fix doc comments for batch_action_delay_ms default value (0, not 3000).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
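
The false emergency stop fix can be pictured as a one-shot flag: before the tool itself sends Escape, it arms the flag, and the monitor swallows exactly one Escape observation. `GuiEmergencyStopMonitor` and `suppress_next` are named in the commit. The method names, the atomics, and the overall shape below are assumptions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Reduced stand-in for the emergency stop monitor.
#[derive(Default)]
pub struct GuiEmergencyStopMonitor {
    suppress_next: AtomicBool,
    stopped: AtomicBool,
}

impl GuiEmergencyStopMonitor {
    /// Arm the one-shot suppression before synthesising an Escape key press.
    pub fn suppress_next(&self) {
        self.suppress_next.store(true, Ordering::SeqCst);
    }

    /// Called when an Escape press is observed. Returns `true` if it was
    /// treated as a user-initiated emergency stop.
    pub fn on_escape_observed(&self) -> bool {
        // `swap` reads and clears the flag atomically, so it suppresses once.
        if self.suppress_next.swap(false, Ordering::SeqCst) {
            return false;
        }
        self.stopped.store(true, Ordering::SeqCst);
        true
    }

    pub fn is_stopped(&self) -> bool {
        self.stopped.load(Ordering::SeqCst)
    }
}
```

The flag clears itself on first use, so a real Escape pressed by the user right after the synthetic one still stops the session.
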