feat(gui): native GUI tools with builtin tool call observability #3
Open
s-JoL wants to merge 17 commits into
Add protocol-level type definitions for built-in tool calls (e.g. native GUI tools). Includes:
- BuiltinToolCallItem and BuiltinToolCallStatus in codex-protocol items
- BuiltinToolCall variant in ThreadItem with From<CoreTurnItem> conversion
- BuiltinToolCallStatus in app-server-protocol v2 with From impl
- ThreadHistoryBuilder support for BuiltinToolCall turn items
- Test coverage for BuiltinToolCall round-trip conversion
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
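The core-to-API conversion described in this commit could look roughly like the sketch below. All type, field, and variant names here (`CoreBuiltinToolCall`, `CoreStatus`, `ApiStatus`, and the fields of the `ThreadItem` variant) are assumptions made for illustration; the commit message only confirms that a status type exists on both sides and that a `From` conversion connects them.

```rust
// Hypothetical shapes for illustration; names and fields are assumptions,
// not the definitions from this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CoreStatus {
    InProgress,
    Completed,
    Failed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CoreBuiltinToolCall {
    pub id: String,
    pub tool: String,
    pub arguments: String,
    pub output: Option<String>,
    pub status: CoreStatus,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ApiStatus {
    InProgress,
    Completed,
    Failed,
}

impl From<CoreStatus> for ApiStatus {
    fn from(status: CoreStatus) -> Self {
        match status {
            CoreStatus::InProgress => ApiStatus::InProgress,
            CoreStatus::Completed => ApiStatus::Completed,
            CoreStatus::Failed => ApiStatus::Failed,
        }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ThreadItem {
    BuiltinToolCall {
        id: String,
        tool: String,
        arguments: String,
        output: Option<String>,
        status: ApiStatus,
    },
}

impl From<CoreBuiltinToolCall> for ThreadItem {
    fn from(item: CoreBuiltinToolCall) -> Self {
        // Field-by-field move; only the status needs a type conversion.
        ThreadItem::BuiltinToolCall {
            id: item.id,
            tool: item.tool,
            arguments: item.arguments,
            output: item.output,
            status: item.status.into(),
        }
    }
}
```

Keeping the two status enums separate lets the wire protocol evolve independently of the core type, at the cost of one exhaustive `match` that the compiler checks whenever a variant is added.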
Add Responses API tool definitions for native GUI interaction tools:
- gui_observe: capture screenshots with optional app/window targeting
- gui_click: click at semantic target or coordinates with modifiers
- gui_type: type text or key sequences
- gui_key: send keyboard shortcuts
- gui_scroll: scroll in a direction at a location
- gui_drag: drag from one point to another
- gui_move: move cursor to a position
- gui_wait: pause for a specified duration
Each tool supports both semantic targeting (app name, window title) and optional coordinate-based targeting via GuiToolSchemaOptions.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add the platform abstraction layer for native GUI tool execution:
- platform.rs: trait-based platform abstraction for screen capture, mouse/keyboard control, accessibility queries
- platform_macos.rs: macOS implementation via Swift helper process
- platform_windows.rs: Windows stub (not yet implemented)
- native_helper.swift: Swift helper for macOS screen capture, mouse events, keyboard simulation, and accessibility tree queries
- type_system_events.applescript: AppleScript helper for typing via System Events (fallback for apps that block CGEvents)
- provider.rs: semantic target resolution via grounding engine
- readiness.rs: environment capability detection and tool filtering
- session.rs: per-session GUI state management with emergency stop
These files are not yet wired into the build (mod.rs not updated) and will be activated in a subsequent commit.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
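A trait-based platform layer of this kind might be shaped like the sketch below. The trait name `GuiPlatform`, its three methods, and both implementations are hypothetical stand-ins, not the contents of platform.rs; the recording implementation only illustrates why a trait boundary makes handler logic testable without a real display.

```rust
// Hypothetical trait; the real platform.rs surface is not shown in this PR text.
pub trait GuiPlatform {
    fn capture_screen(&mut self) -> Result<Vec<u8>, String>;
    fn click(&mut self, x: f64, y: f64) -> Result<(), String>;
    fn type_text(&mut self, text: &str) -> Result<(), String>;
}

/// Test double that records calls instead of touching the OS.
#[derive(Default)]
pub struct RecordingPlatform {
    pub events: Vec<String>,
}

impl GuiPlatform for RecordingPlatform {
    fn capture_screen(&mut self) -> Result<Vec<u8>, String> {
        self.events.push("capture".to_string());
        Ok(Vec::new())
    }
    fn click(&mut self, x: f64, y: f64) -> Result<(), String> {
        self.events.push(format!("click {x},{y}"));
        Ok(())
    }
    fn type_text(&mut self, text: &str) -> Result<(), String> {
        self.events.push(format!("type {text}"));
        Ok(())
    }
}

/// Stub in the spirit of a not-yet-implemented backend: every call fails.
pub struct UnsupportedPlatform;

impl GuiPlatform for UnsupportedPlatform {
    fn capture_screen(&mut self) -> Result<Vec<u8>, String> {
        Err("not yet implemented".to_string())
    }
    fn click(&mut self, _x: f64, _y: f64) -> Result<(), String> {
        Err("not yet implemented".to_string())
    }
    fn type_text(&mut self, _text: &str) -> Result<(), String> {
        Err("not yet implemented".to_string())
    }
}

/// Handler-style code depends only on the trait, not on a concrete OS backend.
pub fn click_and_type(
    platform: &mut dyn GuiPlatform,
    x: f64,
    y: f64,
    text: &str,
) -> Result<(), String> {
    platform.click(x, y)?;
    platform.type_text(text)
}
```

Calling `click_and_type` with a `RecordingPlatform` leaves an ordered event log that a unit test can assert on, while the same call against `UnsupportedPlatform` surfaces the stub error unchanged.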
Add the grounding module that resolves natural-language GUI targets (e.g. "the Search button") to screen coordinates using the platform accessibility tree and optional vision-based fallback:
- Accessibility-first resolution via AX tree traversal
- Multi-strategy matching: exact, prefix, suffix, containment, fuzzy
- Coordinate deduplication and bounding-box merging
- Support for grounding mode selection (accessibility, vision, hybrid)
- Rich diagnostic output for observability
This file is not yet wired into the build and will be activated in the next commit along with the main GUI handler.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
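The multi-strategy matching listed above can be illustrated with a small ranking function. This is a sketch under stated assumptions: the names `MatchKind`, `match_label`, and `best_match` are invented here, matching is case-insensitive, and "fuzzy" is approximated by an in-order subsequence test, which may differ from whatever the grounding module actually uses.

```rust
/// Strategies ordered from strongest to weakest; the derived `Ord` follows
/// declaration order, so `Exact < Prefix < ... < Fuzzy`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum MatchKind {
    Exact,
    Prefix,
    Suffix,
    Contains,
    Fuzzy,
}

/// True when every char of `needle` appears in `haystack` in the same order.
fn is_subsequence(needle: &str, haystack: &str) -> bool {
    let mut rest = haystack.chars();
    needle.chars().all(|n| rest.any(|h| h == n))
}

/// Classify how `query` matches an accessibility label, if at all.
pub fn match_label(query: &str, label: &str) -> Option<MatchKind> {
    let q = query.trim().to_lowercase();
    let l = label.trim().to_lowercase();
    if q.is_empty() {
        return None;
    }
    if l == q {
        Some(MatchKind::Exact)
    } else if l.starts_with(&q) {
        Some(MatchKind::Prefix)
    } else if l.ends_with(&q) {
        Some(MatchKind::Suffix)
    } else if l.contains(&q) {
        Some(MatchKind::Contains)
    } else if is_subsequence(&q, &l) {
        Some(MatchKind::Fuzzy)
    } else {
        None
    }
}

/// Pick the label with the strongest match kind; ties keep the earliest label.
pub fn best_match<'a>(query: &str, labels: &[&'a str]) -> Option<(&'a str, MatchKind)> {
    labels
        .iter()
        .filter_map(|label| match_label(query, label).map(|kind| (*label, kind)))
        .min_by_key(|(_, kind)| *kind)
}
```

Ranking by strategy rather than returning the first hit means an exact label wins even when a weaker prefix match appears earlier in the tree traversal.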
Add the main GUI tool handler (gui.rs) implementing the full observe-act-verify execution loop for native GUI tools:
- GuiHandler implementing ToolHandler trait for all gui_* tools
- Observe: screen capture with app/window targeting and AX tree
- Act: click, type, key, scroll, drag, move via platform abstraction
- Verify: post-action observation with wait-until-stable polling
- Semantic target resolution via grounding engine
- Coordinate-space normalization (image pixels vs display points)
- Emergency stop monitoring during long-running operations
- Rich structured output with capture details and diagnostics
Also adds gui_tools and gui_coordinate_targeting fields to ToolsConfig to control GUI tool availability and coordinate targeting mode.
Wire the gui module into core/src/tools/handlers/mod.rs and export GuiHandler.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
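Coordinate-space normalization between screenshot pixels and display points can be sketched as a linear mapping. The `Capture` struct and `image_to_display` function below are hypothetical; they assume the capture covers an axis-aligned rectangle of the display and that a pixel coordinate is valid only when it lies strictly inside the image.

```rust
/// Hypothetical description of one screenshot: where it sits on the display
/// (in points) and how large the encoded image is (in pixels).
#[derive(Debug, Clone, Copy)]
pub struct Capture {
    pub origin_x: f64,
    pub origin_y: f64,
    pub width_pts: f64,
    pub height_pts: f64,
    pub image_w: u32,
    pub image_h: u32,
}

/// Map a pixel coordinate in the screenshot to a display point.
/// Returns `None` for empty images or coordinates outside the image.
pub fn image_to_display(capture: &Capture, px: f64, py: f64) -> Option<(f64, f64)> {
    if capture.image_w == 0 || capture.image_h == 0 {
        return None;
    }
    let (w, h) = (capture.image_w as f64, capture.image_h as f64);
    // Exclusive upper bound: pixel `w` is one past the last column.
    if px < 0.0 || py < 0.0 || px >= w || py >= h {
        return None;
    }
    // On a 2x display the image has twice as many pixels as points,
    // so each scale factor would be 0.5.
    let scale_x = capture.width_pts / w;
    let scale_y = capture.height_pts / h;
    Some((
        capture.origin_x + px * scale_x,
        capture.origin_y + py * scale_y,
    ))
}
```

For a window at (100, 50) that is 800x600 points and captured as a 1600x1200 image, the image centre (800, 600) maps to the display point (500, 350).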
Add GUI-specific system prompt instructions and configuration:
- gui_instructions.rs: render_gui_tools_section() produces the "Native GUI Tools" system prompt block with semantic targeting guidance, observe-act-verify loop instructions, and optional coordinate targeting notes
- config/mod.rs: add gui_coordinate_targeting field to Config, GuiToolsToml struct for [tools.gui] TOML section, and resolve_gui_coordinate_targeting() resolution function
- lib.rs: register gui_instructions module
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
… items
Wire GUI tool calls into the turn-item lifecycle so they appear as BuiltinToolCall events in the event stream:
- codex.rs: add is_visible_builtin_tool_call() predicate, emit BuiltinToolCall turn items on FunctionCall start and FunctionCallOutput completion, handle drain_in_flight conversion
- state/session.rs: add visible_builtin_tool_calls HashMap for tracking in-progress builtin tool calls
- stream_events_utils.rs: add response_input_to_response_item() conversion, hook into handle_output_item_done for builtin calls
- tools/spec.rs: register GUI tool specs and GuiHandler when gui_tools is enabled in ToolsConfig
- tools/tool_config.rs: add with_gui_coordinate_targeting builder method
- features: add Feature::GuiTools as experimental feature with toggle in /experimental menu
- app-server: pass gui_coordinate_targeting through to ToolsConfig
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add rendering support for BuiltinToolCall turn items across all output surfaces:
- history_cell.rs: add BuiltinToolCallCell with tool name, arguments, output, and status rendering for the TUI history view
- chatwidget.rs: add on_builtin_tool_call_begin/end handlers to create and update BuiltinToolCallCell entries in the chat widget
- app_server_adapter.rs: map ThreadItem::BuiltinToolCall to core TurnItem::BuiltinToolCall for thread history reconstruction
- event_processor_with_human_output.rs: render BuiltinToolCall events as colored terminal output with tool name, arguments, and result
- event_processor_with_jsonl_output.rs: serialize BuiltinToolCall events as structured JSONL output with full detail
- exec_events.rs: add BuiltinToolCallItem and BuiltinToolCallStatus types for the exec event stream
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Add comprehensive test coverage for the GUI tools feature:
- gui/tests.rs: unit tests for argument parsing, coordinate space normalization, scroll planning, target resolution, grounding diagnostics, and full handler integration tests including observe, click, type, scroll, drag, key, and wait operations
- gui/benchmark_renderer.swift: Swift helper that renders a deterministic GUI benchmark surface for grounding accuracy tests
- spec_tests.rs: verify GUI tools appear when Feature::GuiTools is enabled and coordinate targeting adds coordinate fields
- codex_tests.rs: test BuiltinToolCall turn item emission for visible builtin tool calls
- config_tests.rs: test gui_coordinate_targeting config resolution from TOML
- chatwidget/tests/app_server.rs: test TUI BuiltinToolCall thread item mapping
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
… probes
After a GUI action (click, drag, type, etc.), the post-action evidence capture was re-activating the target app via observe_platform(..., true, ...). This caused transient UI states like selected chess pieces or hover highlights to be reset before the evidence screenshot was taken.
Similarly, gui_wait polling probes were re-activating on every iteration, which could dismiss popups or reset selections being monitored.
Fix: set activate_app=false in capture_post_action_evidence and probe_semantic_target_before_deadline. The app is already active from the initial grounding step or user-initiated gui_observe.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
bb828ae to 5137794
… cleanup
- Switch GuiHandler.observe_state from std::sync::Mutex to tokio::sync::Mutex to prevent potential tokio worker thread blocking
- Cache readiness_snapshot with 30s TTL to avoid per-call swiftc invocations
- Replace hardcoded "zsh" with "/bin/sh" in secret_command_env_var execution
- Document clipboard clobbering limitation in native_helper.swift pasteText
- Reformat gui_instructions.rs system prompt as multi-line string for readability
- Remove unreachable dead-code BuiltinToolCall match arm in chatwidget.rs
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
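The 30s TTL cache for the readiness snapshot can be sketched with a single timestamped slot. `TtlCache` and `get_or_refresh` are hypothetical names, and the current time is passed in explicitly so the expiry logic can be tested without sleeping; the real code may read the clock internally.

```rust
use std::time::{Duration, Instant};

/// Single-value cache that recomputes its contents once they are older than `ttl`.
pub struct TtlCache<T> {
    ttl: Duration,
    slot: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, slot: None }
    }

    /// Return the cached value if it is still fresh at `now`; otherwise run
    /// `compute`, store its result with the new timestamp, and return it.
    pub fn get_or_refresh(&mut self, now: Instant, compute: impl FnOnce() -> T) -> T {
        if let Some((stored_at, value)) = &self.slot {
            if now.duration_since(*stored_at) < self.ttl {
                return value.clone();
            }
        }
        let value = compute();
        self.slot = Some((now, value.clone()));
        value
    }
}
```

With `TtlCache::new(Duration::from_secs(30))`, an expensive probe such as a compiler invocation would run at most once per 30-second window, however many tool calls arrive in between.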
…ding
- Replace NSImage.lockFocus()/TIFF pipeline with CGContext/CGImageDestination in native_helper.swift to fix CGImageDestinationFinalize failure in CLI processes without a window server connection
- Also replace NSBitmapImageRep in handleRedactHostWindows with CGImage for the same reason, and handle images without alpha channel
- Add OnceLock cache and atomic rename for helper binary compilation to fix TOCTOU race when concurrent GUI tool calls resolve simultaneously
- Clamp grounding bounding boxes to image bounds instead of rejecting them, so controls at window edges are still clickable
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
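Clamping a grounding box to the image, rather than rejecting it, amounts to intersecting two rectangles. The `Rect` type and `clamp_to_image` function below are an illustrative sketch with invented names; the only behaviour taken from the commit is that partially visible boxes survive.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub w: f64,
    pub h: f64,
}

/// Intersect `bbox` with the image rectangle `(0, 0, image_w, image_h)`.
/// Boxes that overlap the image are trimmed; boxes with no overlap yield `None`.
pub fn clamp_to_image(bbox: Rect, image_w: f64, image_h: f64) -> Option<Rect> {
    let left = bbox.x.max(0.0);
    let top = bbox.y.max(0.0);
    let right = (bbox.x + bbox.w).min(image_w);
    let bottom = (bbox.y + bbox.h).min(image_h);
    if right <= left || bottom <= top {
        // Nothing of the box is visible, so there is nothing to click.
        return None;
    }
    Some(Rect {
        x: left,
        y: top,
        w: right - left,
        h: bottom - top,
    })
}
```

A control that hangs 10 pixels off the left edge keeps its visible 40-pixel-wide portion, so its centre is still a valid click target.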
- Use single-pass box-fit in constrain_model_dimensions to eliminate sub-pixel aspect ratio skew on extreme aspect ratios
- Use exclusive upper bound (<) in point-within-capture checks to prevent off-by-one at image edges
- Drop wasteful quality-85 JPEG step (initial encode uses quality-80)
- Move blocking platform calls to spawn_blocking
- Add CaptureMode enum and improve robustness
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
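A single-pass box-fit computes one scale factor from both limits and applies it to both dimensions once, instead of constraining width and height in separate steps that each round. The function below is a sketch with the hypothetical name `fit_within`; it is not the body of constrain_model_dimensions, and it assumes images are never upscaled.

```rust
/// Scale `(w, h)` down so it fits inside `(max_w, max_h)`, using one shared
/// scale factor so the aspect ratio is only rounded once.
pub fn fit_within(w: u32, h: u32, max_w: u32, max_h: u32) -> (u32, u32) {
    if w == 0 || h == 0 || (w <= max_w && h <= max_h) {
        // Empty or already small enough: leave unchanged (no upscaling).
        return (w, h);
    }
    // The tighter of the two constraints decides the scale.
    let scale = (max_w as f64 / w as f64).min(max_h as f64 / h as f64);
    let new_w = ((w as f64 * scale).round() as u32).clamp(1, max_w.max(1));
    let new_h = ((h as f64 * scale).round() as u32).clamp(1, max_h.max(1));
    (new_w, new_h)
}
```

The `clamp(1, ...)` keeps very thin images from collapsing to a zero-sized dimension, which matters for the extreme aspect ratios the commit mentions.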
af76733 to d8157b5
- activateApplication: use the non-deprecated `activate()` on macOS 14+, falling back to `activate(options:)` on older versions.
- releaseMouseButtons: check button state before posting spurious up events.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
… demo
- Hide non-target apps when a GUI action starts and restore them on completion (or on drop), keeping the screen clean for screenshots and preventing accidental interaction with unrelated windows.
- Guard releaseMouseButtons() with CGEventSource.buttonState so that cleanup only sends mouseUp events when buttons are actually pressed, fixing chess-piece deselection caused by spurious mouseUp events.
- Use the modern NSRunningApplication.activate() API on macOS 14+ to silence the deprecation warning for activateIgnoringOtherApps.
- Add gui-demo.gif showcasing CUA capabilities.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
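Restoring hidden apps "on completion (or on drop)" suggests an RAII guard. The sketch below uses an invented `HiddenAppsGuard` that writes to a shared log instead of calling any OS API, purely to show that the restore step runs on every exit path, including early returns.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical guard: hides every app except the target when created and
/// restores them when dropped. OS calls are replaced by log entries here.
pub struct HiddenAppsGuard {
    hidden: Vec<String>,
    log: Rc<RefCell<Vec<String>>>,
}

impl HiddenAppsGuard {
    pub fn hide(running: &[&str], target: &str, log: Rc<RefCell<Vec<String>>>) -> Self {
        let hidden: Vec<String> = running
            .iter()
            .filter(|app| **app != target)
            .map(|app| app.to_string())
            .collect();
        for app in &hidden {
            log.borrow_mut().push(format!("hide {app}"));
        }
        Self { hidden, log }
    }
}

impl Drop for HiddenAppsGuard {
    fn drop(&mut self) {
        // Runs on normal completion, early return, and unwinding alike.
        for app in self.hidden.drain(..) {
            self.log.borrow_mut().push(format!("restore {app}"));
        }
    }
}
```

Binding the guard to a local variable for the duration of a GUI action is enough; no explicit cleanup call is needed at each return site.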
…g support
Add a new `gui_batch` tool that executes multiple independent GUI actions in a single call for faster task completion.
Key features:
- Batch actions: click, type, key, scroll, drag in one tool call
- Parallel grounding: N targets grounded concurrently (2.5x speedup)
- Unified grounding: multi-target predictor+validator rounds (experimental)
- Parallel drag: single-step gui_drag now grounds both endpoints in parallel
- Configurable via `[tools.gui] batch_grounding_strategy` or per-call override
- One screenshot + one evidence capture per batch instead of N
Measured: 10-step batch with 5 click targets: 77s individual → 29s parallel.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
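Parallel grounding of N independent targets can be sketched with scoped threads from the standard library. This is an assumption-laden illustration: `ground_all` is an invented name, the grounding function is passed in as a closure, and the PR itself most likely uses async tasks rather than OS threads.

```rust
use std::thread;

/// Ground every target concurrently and return results in input order.
/// `ground` stands in for a (hypothetical) per-target grounding call.
pub fn ground_all<F>(targets: &[&str], ground: F) -> Vec<Option<(f64, f64)>>
where
    F: Fn(&str) -> Option<(f64, f64)> + Sync,
{
    let ground = &ground;
    thread::scope(|scope| {
        // Spawn one worker per target; all workers start before any is joined.
        let handles: Vec<_> = targets
            .iter()
            .map(|&target| scope.spawn(move || ground(target)))
            .collect();
        // Joining in spawn order preserves the caller's ordering.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("grounding worker panicked"))
            .collect()
    })
}
```

Because each target is resolved against the same screenshot, the total latency approaches that of the slowest single grounding call rather than the sum of all of them.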
…and batch config
- Convert Swift native_helper to a persistent `serve` mode subprocess
using JSON-line protocol over stdin/stdout, eliminating per-call spawn
overhead. Falls back to one-shot mode on connection failure.
- Compress screenshots from PNG to JPEG (quality 75) for 3-5x smaller
payloads and reduced token consumption.
- Fix gui_key("Escape") triggering false emergency stops by adding
suppress_next flag to GuiEmergencyStopMonitor.
- Make gui_batch configurable: add batch_enabled, batch_action_delay_ms
config options; move grounding_strategy from tool param to config.
- Add local-release Cargo profile for fast local builds (no LTO).
- Fix stdout restoration bug in Swift serve error path.
- Fix doc comments for batch_action_delay_ms default value (0, not 3000).
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
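The `suppress_next` fix for tool-generated Escape presses can be sketched as a one-shot flag that the monitor consumes. Everything below except the flag name is hypothetical, including the `EmergencyStopMonitor` type and its `on_escape` method; the sketch also assumes the synthetic key event reaches the monitor before any real user Escape.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical monitor: a user pressing Escape requests an emergency stop,
/// but an Escape the tool itself is about to send must not.
#[derive(Default)]
pub struct EmergencyStopMonitor {
    suppress_next: AtomicBool,
    stopped: AtomicBool,
}

impl EmergencyStopMonitor {
    /// Call right before the tool synthesizes its own Escape key event.
    pub fn suppress_next(&self) {
        self.suppress_next.store(true, Ordering::SeqCst);
    }

    /// Called for every observed Escape. Returns true if a stop was triggered.
    pub fn on_escape(&self) -> bool {
        // `swap` reads and clears the flag atomically, so it suppresses
        // exactly one event.
        if self.suppress_next.swap(false, Ordering::SeqCst) {
            return false;
        }
        self.stopped.store(true, Ordering::SeqCst);
        true
    }

    pub fn is_stopped(&self) -> bool {
        self.stopped.load(Ordering::SeqCst)
    }
}
```

Making the flag one-shot keeps the safety mechanism intact: a second Escape, which can only come from the user, still stops the session.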
Summary
- … `BuiltinToolCall` turn items for TUI/CLI observability (like exec commands)
- … (`[tools.gui] coordinate_targeting = true`)
Demo
Commit breakdown
Feature (9 commits):
- … `BuiltinToolCallItem` + `BuiltinToolCallStatus` types (+125 lines)
- … `gui_tool.rs` with all tool schema definitions (+976 lines)
- … `gui_coordinate_targeting` option (+95 lines)
- … `ItemStarted`/`ItemCompleted` events for GUI tool calls (+342 lines)
- … `BuiltinToolCallCell`, CLI human/JSONL output (+494 lines)
Fixes & improvements (6 commits):
10. Fix: Do not re-activate app during post-action evidence and wait probes (+9/-2)
11. Fix: Async safety. Switch `std::sync::Mutex` to `tokio::sync::Mutex`, cache readiness snapshot with 30s TTL, replace hardcoded `zsh` with `/bin/sh`, remove dead code in TUI replay (+51/-16)
12. Fix: Screenshot redaction crash, TOCTOU race in helper binary check, edge grounding bounds (+97/-24)
13. Refactor: Move blocking platform calls to `spawn_blocking`, add `CaptureMode` enum, improve robustness (+3,685/-3,416)
14. Fix: Use modern `activate()` API on macOS 14+ (old `activate(options:)` is deprecated), guard `releaseMouseButtons` with button-state checks to avoid spurious events (+12/-4)
15. Feature: Hide non-target apps during GUI actions and restore on completion, preventing accidental interaction with unrelated windows and ensuring clean screenshots. Add demo GIF. (+147/-6)
Test plan
- … `gui_tools` feature, verify GUI tool calls show in TUI like exec commands
- … `CODEX_GUI_GROUNDING_CASE_IDS`
🤖 Generated with Claude Code