Vision Capabilities - Complete Guide

Overview

Your whisplay chatbot now has computer vision capabilities! It can analyze photos and videos using:

  • Online: GPT-4o (OpenAI Vision API)
  • Offline: llama3.2-vision (Ollama local model)

Quick Start

1. Setup (One-Time)

./setup-vision.sh

This will:

  • Check if Ollama is installed
  • Offer to install llama3.2-vision:11b (~7GB)
  • Configure vision tools

2. Use It!

Take and analyze a photo:

You: "Take a picture"
Bot: "I've taken a picture!"

You: "What do you see in the picture?"
Bot: [Analyzes with GPT-4o or llama3.2-vision]
     "I see a person sitting at a desk with a laptop..."

Record and analyze video:

You: "Record a video for 5 seconds"
Bot: [Records video with live preview]

You: "What's in the video?"
Bot: [Extracts frame and analyzes]
     "I can see..."

Available Tools

1. analyzeImage

Analyzes the most recent photo (camera or AI-generated).

Voice Commands:

  • "What do you see in the picture?"
  • "Describe the image"
  • "What objects are in the photo?"
  • "Is there any text in the image?"
  • "What colors are in the picture?"

Parameters:

  • question (string): What to analyze about the image

Example Questions:

  • "What do you see?"
  • "Count how many people are in the image"
  • "Is there a cat or dog in the picture?"
  • "What is the person doing?"
  • "Read any text visible in the image"
  • "What's the main color scheme?"

2. analyzeVideoFrame

Extracts a frame from the most recent video and analyzes it.

Voice Commands:

  • "What's in the video?"
  • "Describe what happened in the video"
  • "What did you see in the recording?"

Parameters:

  • question (string): What to analyze about the video

Note: Currently analyzes frame 15 (about 0.5 seconds into the video at 30 fps); multi-frame analysis is a planned enhancement (see Future Enhancements).
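
A minimal sketch of that extraction step, assuming ffmpeg is on the PATH; videoPath and framePath are illustrative names:

import { execSync } from "child_process";

// Pull frame 15 out of the recording, then pass it to the image analyzer
const framePath = "/tmp/video-frame.jpg";
execSync(`ffmpeg -y -i ${videoPath} -vf "select=eq(n\\,15)" -vframes 1 ${framePath}`);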

Technical Details

Online Mode (GPT-4o)

When: WiFi connected
Model: gpt-4o
Features:

  • Excellent object recognition
  • Text reading (OCR)
  • Scene understanding
  • People/face detection
  • Color analysis
  • Spatial relationships

How it works:

  1. Image converted to base64
  2. Sent to OpenAI API with prompt
  3. GPT-4o vision model analyzes
  4. Response returned
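
A minimal sketch of those four steps using the openai npm package; the function name and error handling are illustrative:

import OpenAI from "openai";
import { readFileSync } from "fs";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function analyzeWithGpt4o(imagePath: string, question: string): Promise<string> {
  // 1. Convert the image to base64
  const base64 = readFileSync(imagePath).toString("base64");
  // 2-3. Send it to the OpenAI API and let the vision model analyze it
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          { type: "image_url", image_url: { url: `data:image/jpeg;base64,${base64}` } },
        ],
      },
    ],
  });
  // 4. Return the response text
  return response.choices[0].message.content ?? "";
}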

Offline Mode (llama3.2-vision)

When: No WiFi
Model: llama3.2-vision:11b (default)
Size: ~7GB
Features:

  • Good object recognition
  • Scene description
  • Basic OCR
  • Color detection

Alternative models:

  • llama3.2-vision:90b - Better quality, needs GPU, ~55GB

How it works:

  1. Image converted to base64
  2. Sent to local Ollama API
  3. Local vision model analyzes
  4. Response returned
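
A minimal sketch of the same flow against the local Ollama API, using the env vars from the Environment Variables section; the function name is illustrative:

import { readFileSync } from "fs";

async function analyzeWithOllama(imagePath: string, question: string): Promise<string> {
  // 1. Convert the image to base64
  const base64 = readFileSync(imagePath).toString("base64");
  // 2-3. Send it to the local Ollama endpoint for analysis
  const endpoint = process.env.OLLAMA_ENDPOINT ?? "http://localhost:11434";
  const res = await fetch(`${endpoint}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: process.env.OLLAMA_VISION_MODEL ?? "llama3.2-vision:11b",
      prompt: question,
      images: [base64],
      stream: false,
    }),
  });
  // 4. Return the response text
  const data = await res.json();
  return data.response;
}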

Installation

Install Ollama Vision Model

# Recommended: 11B model (7GB, CPU-friendly)
ollama pull llama3.2-vision:11b

# Or larger: 90B model (55GB, needs GPU)
ollama pull llama3.2-vision:90b

Set Custom Model

Edit .env:

OLLAMA_VISION_MODEL=llama3.2-vision:11b
# or
OLLAMA_VISION_MODEL=llama3.2-vision:90b

Usage Examples

Example 1: Simple Analysis

You: "Take a picture"
Bot: "I've taken a picture!"

You: "What do you see?"
Bot: "I can see a workspace with a computer monitor displaying code,
     a keyboard, a mouse, and a coffee mug on a wooden desk."

Example 2: Specific Questions

You: "Take a picture of my hand"
Bot: "I've taken a picture!"

You: "How many fingers am I holding up?"
Bot: "You're holding up three fingers."

Example 3: Text Reading

You: "Take a picture of that sign"
Bot: "I've taken a picture!"

You: "Read the text in the image"
Bot: "The sign says 'Welcome to our store - Open 9am to 5pm'"

Example 4: Video Analysis

You: "Record a 5 second video of me waving"
Bot: [Records video]

You: "What am I doing in the video?"
Bot: "In the video frame, you appear to be waving your hand
     in a friendly greeting gesture."

Example 5: Object Detection

You: "Take a picture"
Bot: "I've taken a picture!"

You: "Are there any animals in the picture?"
Bot: "Yes, I can see a cat sitting on the couch."

File Structure

Vision Tool:

  • src/config/custom-tools/vision.ts - Vision tools implementation

Dependencies:

  • GPT-4o: already available through the OpenAI API (no separate install)
  • Ollama: Install vision model separately

Image Sources:

  • Camera photos: data/images/camera-*.jpg
  • AI generated: data/images/*-image-*.jpg
  • Video frames: Extracted temporarily to /tmp/

Conversation Flow

1. User takes photo/video
   ↓
2. User asks about it
   ↓
3. LLM detects vision tool needed
   ↓
4. Tool finds latest image/video
   ↓
5. Image sent to vision model
   (GPT-4o online or Ollama offline)
   ↓
6. Analysis result returned
   ↓
7. LLM incorporates in response
   ↓
8. TTS speaks answer
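
Step 5's model selection might look like the following sketch; isOnline is an assumed connectivity helper, and the two analyzers are sketched in the Technical Details section above:

// Hypothetical dispatcher: GPT-4o when online, Ollama fallback when offline
async function analyze(imagePath: string, question: string): Promise<string> {
  const online = await isOnline(); // e.g. a quick HTTP reachability check
  return online
    ? analyzeWithGpt4o(imagePath, question)
    : analyzeWithOllama(imagePath, question);
}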

Performance

GPT-4o (Online)

  • Speed: ~2-3 seconds
  • Quality: Excellent
  • Cost: ~$0.01 per image
  • Limits: API rate limits apply

llama3.2-vision:11b (Offline)

  • Speed: ~10-15 seconds (CPU)
  • Speed: ~3-5 seconds (GPU)
  • Quality: Good
  • Cost: Free (local)
  • Limits: RAM/CPU dependent

Advanced Usage

Analyze Specific Image

By default, tools analyze the most recent image. To analyze a specific image, modify the tool to accept an image path parameter.
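
One possible shape for that change, assuming the data/images layout from the File Structure section; the helper name is illustrative:

import { readdirSync, statSync } from "fs";
import { join } from "path";

// Hypothetical helper: use an explicit path if given, else fall back to the newest image
function resolveImage(imagePath?: string): string {
  if (imagePath) return imagePath;
  const dir = "data/images";
  const newest = readdirSync(dir)
    .filter((f) => f.endsWith(".jpg"))
    .map((f) => join(dir, f))
    .sort((a, b) => statSync(b).mtimeMs - statSync(a).mtimeMs)[0];
  if (!newest) throw new Error("No images found. Take a photo first.");
  return newest;
}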

Analyze Multiple Video Frames

Currently analyzes frame 15. To analyze multiple frames:

// Extract frames at several timestamps and analyze each one
import { execSync } from "child_process";

for (const frameNum of [15, 45, 75, 105]) {
  const framePath = `/tmp/frame_${frameNum}.jpg`;
  execSync(`ffmpeg -y -i ${video} -vf "select=eq(n\\,${frameNum})" -vframes 1 ${framePath}`);
  const result = await analyzeImage(framePath, question); // reuse the single-frame analysis
}

Batch Analysis

Analyze multiple images in sequence:

const allImages = getRecentImages(5); // Last 5 images
for (const img of allImages) {
  const result = await analyzeImage(img, question);
  // Process results
}
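
Keep the Performance numbers in mind when batching: at roughly 10-15 seconds per image on CPU in offline mode, five images can take over a minute.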

Troubleshooting

Vision Model Not Found

Problem: Ollama vision model not available

Solution:

ollama pull llama3.2-vision:11b

Slow Analysis (Offline)

Problem: Takes 15+ seconds to analyze

Solutions:

  1. Use smaller model: llama3.2-vision:11b instead of :90b
  2. Use GPU if available
  3. Close other applications
  4. Ensure sufficient RAM (8GB+ recommended)

OpenAI Rate Limit

Problem: Rate limit exceeded

Solution:

  • Wait a moment and try again
  • Use offline mode (disconnect WiFi temporarily)

No Images Found

Problem: No images found. Take a photo first.

Solution:

  • Take a photo: "Take a picture"
  • Or generate one: "Draw me a sunset"
  • Check data/images/ directory has files

Environment Variables

# .env file

# Vision model for offline mode
OLLAMA_VISION_MODEL=llama3.2-vision:11b

# Ollama endpoint (default: localhost)
OLLAMA_ENDPOINT=http://localhost:11434

# OpenAI model (GPT-4o has vision)
OPENAI_LLM_MODEL=gpt-4o

Future Enhancements

Planned improvements:

  1. Multi-frame video analysis - Analyze entire video timeline
  2. Real-time analysis - Analyze preview frames during recording
  3. Object tracking - Track objects across video frames
  4. Face recognition - Identify specific people
  5. Scene change detection - Detect cuts/transitions in video
  6. Text extraction - Full OCR with position data
  7. Image comparison - Compare before/after photos
  8. Batch processing - Analyze all photos in directory

Model Comparison

| Feature  | GPT-4o   | llama3.2-vision:11b | llama3.2-vision:90b |
| -------- | -------- | ------------------- | ------------------- |
| Speed    | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Quality  | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| OCR      | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| Objects  | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Scenes   | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Size     | API      | 7GB  | 55GB |
| Cost     | Paid     | Free | Free |
| Internet | Required | No   | No   |

Summary

What's Added:

  • ✅ 2 vision tools (analyzeImage, analyzeVideoFrame)
  • ✅ GPT-4o vision support (online)
  • ✅ Ollama vision support (offline)
  • ✅ Automatic model selection based on WiFi
  • ✅ Photo and video analysis
  • ✅ Setup script for easy installation

Usage:

# Setup
./setup-vision.sh

# Restart
systemctl --user restart whisplay.service

# Try it!
"Take a picture"
"What do you see?"

Commands:

  • Take photo → Analyze
  • Record video → Analyze frame
  • Ask specific questions about visual content

Enjoy your new vision capabilities! 👁️📸🎥