Video recording, live detection, and video playback were being blocked by TTS text responses. The image generation path works correctly, but video paths don't.
Image generation flow (works):
- Tool generates image → `setLatestGenImg(path)` → Returns success
- LLM generates TTS response
- TTS plays (display shows text)
- AFTER TTS completes, `ChatFlow.getPlayEndPromise()` triggers
- Checks `getLatestGenImg()` → displays image → switches to "image" flow
- Image stays on screen, no TTS can interrupt
Video/detection flow (broken):
- Tool starts process → Starts display interval immediately → Returns success
- LLM generates TTS response
- TTS display updates OVERRIDE video frames ← CONFLICT!
- Video frames keep trying to update but are overwritten by TTS text
Video/detection tools should follow the same pattern as image generation:
- Tool prepares operation (starts process if needed)
- Tool sets "pending" marker (like `setPendingDetection("marker")`)
- Tool returns success WITHOUT starting display
- LLM generates (brief!) TTS response
- TTS plays
- AFTER TTS, ChatFlow checks pending markers
- ChatFlow switches to visual flow and starts display updates
- Visual display runs uninterrupted until stopped
Added pending marker functions:
- `setPendingDetection(marker)` / `getPendingDetection()`
- `setPendingRecording(marker)` / `getPendingRecording()`
- `getVideoPlaybackMarker()` (modified to be retrieval-only)
- `hasVisualPending()` - checks if any visual mode is pending
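As a rough illustration of how these marker functions could fit together, here is a minimal sketch of an in-memory marker store. The function names come from the list above. The module-level variables and the choice to have each getter clear its marker after reading are assumptions, not the actual contents of `src/utils/image.ts`.

```typescript
// Hypothetical in-memory marker store; the real src/utils/image.ts may differ.
type Marker = string | null;

let pendingDetection: Marker = null;
let pendingRecording: Marker = null;
let videoPlaybackMarker: Marker = null;

export function setPendingDetection(marker: string): void {
  pendingDetection = marker;
}

export function setPendingRecording(marker: string): void {
  pendingRecording = marker;
}

export function setVideoPlaybackMarker(framePath: string): void {
  videoPlaybackMarker = framePath;
}

// Assumption: each getter consumes its marker, so a flow is triggered only once.
export function getPendingDetection(): Marker {
  const marker = pendingDetection;
  pendingDetection = null;
  return marker;
}

export function getPendingRecording(): Marker {
  const marker = pendingRecording;
  pendingRecording = null;
  return marker;
}

export function getVideoPlaybackMarker(): Marker {
  const marker = videoPlaybackMarker;
  videoPlaybackMarker = null;
  return marker;
}

// True while any visual mode is waiting for TTS to finish.
export function hasVisualPending(): boolean {
  return (
    pendingDetection !== null ||
    pendingRecording !== null ||
    videoPlaybackMarker !== null
  );
}
```

If the real getters do not clear the marker, something else has to reset it once the flow starts; otherwise `hasVisualPending()` would stay true indefinitely.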
Central coordinator for all visual display intervals:
- `startDetectionDisplay()` - Starts detection frame updates
- `startRecordingDisplay()` - Starts recording preview updates
- `startPlaybackDisplay()` - Starts video playback updates
- `stopDetectionDisplay()` / `stopRecordingDisplay()` / `stopPlaybackDisplay()`
- `stopAllDisplays()` - Cleanup helper
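One way such a coordinator could be structured is a single map of interval timers keyed by mode, sketched below. The start/stop function names follow the list above. The `show` callback, the frame-path argument, the 100 ms period, and the `isDisplayActive` helper are assumptions added so the example is self-contained; the real module presumably calls the project's own display function directly.

```typescript
// Hypothetical coordinator sketch; arguments and helper names are assumptions.
type VisualMode = "detection" | "recording" | "playback";
type ShowFrame = (framePath: string) => void;

const MODES: VisualMode[] = ["detection", "recording", "playback"];
const timers = new Map<VisualMode, ReturnType<typeof setInterval>>();

function stopDisplay(mode: VisualMode): void {
  const timer = timers.get(mode);
  if (timer !== undefined) {
    clearInterval(timer);
    timers.delete(mode);
  }
}

function startDisplay(
  mode: VisualMode,
  framePath: string,
  show: ShowFrame,
  periodMs = 100
): void {
  stopDisplay(mode); // never run two intervals for the same mode
  timers.set(mode, setInterval(() => show(framePath), periodMs));
}

export const startDetectionDisplay = (framePath: string, show: ShowFrame) =>
  startDisplay("detection", framePath, show);
export const startRecordingDisplay = (framePath: string, show: ShowFrame) =>
  startDisplay("recording", framePath, show);
export const startPlaybackDisplay = (framePath: string, show: ShowFrame) =>
  startDisplay("playback", framePath, show);

export const stopDetectionDisplay = () => stopDisplay("detection");
export const stopRecordingDisplay = () => stopDisplay("recording");
export const stopPlaybackDisplay = () => stopDisplay("playback");

// Cleanup helper: clears every running interval.
export function stopAllDisplays(): void {
  MODES.forEach(stopDisplay);
}

export const isDisplayActive = (mode: VisualMode): boolean => timers.has(mode);
```

Keeping every timer in one map means a stop call is always safe, even when the corresponding display was never started.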
ChatFlow changes:
- Checks `hasVisualPending()` in TTS callbacks to prevent text display
- After TTS completes, checks for pending visual markers
- New flows: "detection", "recording", "videoPlayback"
- Each flow calls coordinator to start display, sets up button handlers
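The post-TTS decision described above can be sketched as a pure function that maps the pending markers to the next flow. Only the flow names "detection", "recording", "videoPlayback" and "image" come from this document. The `PendingMarkers` shape, the "listening" fallback name, and the priority order when several markers are set are assumptions for illustration.

```typescript
// Hypothetical flow selection; not the actual ChatFlow implementation.
export type FlowName =
  | "detection"
  | "recording"
  | "videoPlayback"
  | "image"
  | "listening";

export interface PendingMarkers {
  detection: string | null;
  recording: string | null;
  videoPlayback: string | null;
  latestGenImg: string | null;
}

// Assumed priority if several markers are set: detection, recording,
// playback, then generated image. With nothing pending, keep listening.
export function nextFlowAfterTts(pending: PendingMarkers): FlowName {
  if (pending.detection) return "detection";
  if (pending.recording) return "recording";
  if (pending.videoPlayback) return "videoPlayback";
  if (pending.latestGenImg) return "image";
  return "listening";
}

// Mirrors the hasVisualPending() check: TTS text is suppressed while a
// video-style visual mode is waiting to start.
export function shouldShowTtsText(pending: PendingMarkers): boolean {
  return !(pending.detection || pending.recording || pending.videoPlayback);
}
```

Separating the decision from the side effects (switching flows, starting intervals) makes the ordering easy to check without a display attached.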
Tools need to follow this pattern:
// OLD PATTERN (DON'T DO THIS)
export const startDetectionTool = {
  func: async (params) => {
    startDetectionProcess();
    // ❌ DON'T start interval immediately
    detectionInterval = setInterval(() => {
      display({ image: FRAME_PATH });
    }, 100);
    return "[success]Detection started";
  }
};
// NEW PATTERN (DO THIS)
import { setPendingDetection } from "../../utils/image";
export const startDetectionTool = {
  func: async (params) => {
    // Start the process (it will write frames to temp files)
    startDetectionProcess();
    // ✅ Set pending marker - DON'T start display
    setPendingDetection("detection_active");
    // Return success - LLM will generate brief TTS
    // After TTS, ChatFlow will check marker and start display
    return "[success]Starting detection";
  }
};
Changes needed:
- Import `setPendingDetection` from `../../utils/image`
- In `startLiveDetection` function:
  - Keep process spawning logic
  - REMOVE the `detectionUpdateInterval = setInterval(...)` block
  - REMOVE the `setLiveDetectionActive(true)` call (coordinator handles this)
  - ADD `setPendingDetection("active")` before returning
- In `stopLiveDetection` function:
  - Import and call `stopDetectionDisplay()` from visualCoordinator
  - Keep process cleanup logic
For `recordVideoForDuration`:
- Import `setPendingRecording` from `../../utils/image`
- Keep recording process/command execution
- REMOVE the `previewUpdateInterval = setInterval(...)` block
- REMOVE the `setVideoRecordingActive(true)` call
- ADD `setPendingRecording("recording")` before returning

For `startVideoRecording`:
- Same pattern as above

For `stopVideoRecording`:
- Import and call `stopRecordingDisplay()` from visualCoordinator

For `playVideo`:
- Import `setVideoPlaybackMarker` (already exists)
- Keep video player process spawn logic
- REMOVE the `playbackUpdateInterval = setInterval(...)` block
- Call `setVideoPlaybackMarker(VIDEO_FRAME_PATH)`
- Return success

For `stopVideo`:
- Import and call `stopPlaybackDisplay()` from visualCoordinator
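The stop-side changes above all share one shape: stop the display through the coordinator, then clean up the process. A minimal sketch of that shape follows. `makeStopTool`, the `StopDeps` interface, and the return string are hypothetical names used only for illustration; the real tools would import `stopDetectionDisplay()`, `stopRecordingDisplay()` or `stopPlaybackDisplay()` from visualCoordinator and keep their existing cleanup code.

```typescript
// Hypothetical stop-tool factory; dependencies are injected so the sketch
// runs without the real coordinator or a spawned process.
export interface StopDeps {
  stopDisplay: () => void; // e.g. stopDetectionDisplay from visualCoordinator
  stopProcess: () => void; // e.g. existing process cleanup logic
}

export function makeStopTool(label: string, deps: StopDeps) {
  return {
    func: async (): Promise<string> => {
      // Stop frame updates first so no stale frame is drawn after cleanup.
      deps.stopDisplay();
      // Then run the tool's existing process cleanup.
      deps.stopProcess();
      return `[success]Stopped ${label}`;
    },
  };
}
```

Stopping the display before the process is a design choice in this sketch, not something the document prescribes; it avoids the interval reading a frame file that the process is no longer updating.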
User says: "Start detecting person"
↓
[startLiveDetection tool called]
↓
1. Spawn Python detection process (writes frames to /tmp/)
2. setPendingDetection("active")
3. Return "[success]Starting detection"
↓
[LLM receives success, generates response]
↓
4. LLM: "I'll start detecting people for you"
5. TTS plays (brief, text might show on display)
↓
[TTS completes, getPlayEndPromise() triggers]
↓
6. ChatFlow checks getPendingDetection() → "active"
7. ChatFlow.setCurrentFlow("detection")
↓
[Detection flow activated]
↓
8. startDetectionDisplay() called
9. setInterval starts updating display with frames
10. Frames display continuously, no TTS can interrupt!
↓
User presses button OR says "stop"
↓
11. stopDetectionDisplay() called
12. Clear interval, stop process
13. Return to listening mode
1. Live Detection:
# Say: "Start detecting person"
# Expected: Brief TTS, then detection frames display continuously
# Say anything else while detecting
# Expected: Detection continues, no text overlay

2. Video Recording:
# Say: "Record video for 10 seconds"
# Expected: Brief TTS, then preview frames display
# Expected: Recording completes, shows success message briefly

3. Video Playback:
# Say: "Play the video"
# Expected: Brief TTS, then video frames play
# Expected: Video plays smoothly without text overlay

4. Image Generation (should still work):
# Say: "Draw a cat"
# Expected: TTS plays, then image displays (same as before)

- Consistent Pattern: All visual modes follow the same architecture as image generation
- Clean Separation: Tools handle process management, ChatFlow handles display flow
- No Race Conditions: Display only starts after TTS completes
- Centralized Control: visualCoordinator manages all display intervals in one place
- Easy to Debug: Clear flow from tool → pending marker → ChatFlow → coordinator
- Button Handling: Each visual flow properly handles button presses to exit
- ✅ `src/utils/image.ts` - Added pending marker functions
- ✅ `src/utils/visualCoordinator.ts` - NEW coordinator module
- ✅ `src/core/ChatFlow.ts` - Integrated pending markers and visual flows
- ⚠️ `src/config/custom-tools/live-detection.ts` - NEEDS UPDATE (remove intervals, add markers)
- ⚠️ `src/config/custom-tools/video.ts` - NEEDS UPDATE (remove intervals, add markers)
- Update `live-detection.ts` following the pattern above
- Update `video.ts` following the pattern above
- Test each visual mode
- Verify TTS doesn't block video display
- Verify button press exits visual modes cleanly