Your whisplay chatbot now has computer vision capabilities! It can analyze photos and videos using:
- Online: GPT-4o (OpenAI Vision API)
- Offline: llama3.2-vision (Ollama local model)
```bash
./setup-vision.sh
```

This will:
- Check if Ollama is installed
- Offer to install llama3.2-vision:11b (~7GB)
- Configure vision tools
Take and analyze a photo:
You: "Take a picture"
Bot: "I've taken a picture!"
You: "What do you see in the picture?"
Bot: [Analyzes with GPT-4o or llama3.2-vision]
"I see a person sitting at a desk with a laptop..."
Record and analyze video:
You: "Record a video for 5 seconds"
Bot: [Records video with live preview]
You: "What's in the video?"
Bot: [Extracts frame and analyzes]
"I can see..."
`analyzeImage` - Analyzes the most recent photo (camera or AI-generated).
Voice Commands:
- "What do you see in the picture?"
- "Describe the image"
- "What objects are in the photo?"
- "Is there any text in the image?"
- "What colors are in the picture?"
Parameters:
- `question` (string): What to analyze about the image
Example Questions:
- "What do you see?"
- "Count how many people are in the image"
- "Is there a cat or dog in the picture?"
- "What is the person doing?"
- "Read any text visible in the image"
- "What's the main color scheme?"
`analyzeVideoFrame` - Extracts a frame from the most recent video and analyzes it.
Voice Commands:
- "What's in the video?"
- "Describe what happened in the video"
- "What did you see in the recording?"
Parameters:
- `question` (string): What to analyze about the video
Note: the tool currently analyzes frame 15 (about 0.5 seconds into the video at 30 fps). A future version may analyze multiple frames.
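For reference, a minimal sketch of how such a frame grab could look from Node.js; the helper name and output path are illustrative, not the actual tool code:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Hypothetical helper: extract one frame by index into a temporary JPEG.
async function extractFrame(videoPath: string, frameNum = 15): Promise<string> {
  const outPath = `/tmp/frame_${frameNum}.jpg`;
  // select=eq(n\,N) keeps only the frame whose index equals N;
  // -vframes 1 stops after writing that single frame.
  await run("ffmpeg", [
    "-i", videoPath,
    "-vf", `select=eq(n\\,${frameNum})`,
    "-vframes", "1",
    "-y", outPath,
  ]);
  return outPath;
}
```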
When: WiFi connected
Model: gpt-4o
Features:
- Excellent object recognition
- Text reading (OCR)
- Scene understanding
- People/face detection
- Color analysis
- Spatial relationships
How it works:
- Image converted to base64
- Sent to OpenAI API with prompt
- GPT-4o vision model analyzes
- Response returned
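For illustration, a minimal sketch of that online path using the official `openai` npm package; the helper name is an assumption, and the real implementation lives in `src/config/custom-tools/vision.ts`:

```typescript
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper: ask GPT-4o a question about a local image.
async function analyzeWithGPT4o(imagePath: string, question: string): Promise<string> {
  const base64 = fs.readFileSync(imagePath).toString("base64");
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          // The image is passed inline as a base64 data URL
          { type: "image_url", image_url: { url: `data:image/jpeg;base64,${base64}` } },
        ],
      },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```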
When: No WiFi
Model: llama3.2-vision:11b (default)
Size: ~7GB
Features:
- Good object recognition
- Scene description
- Basic OCR
- Color detection
Alternative models:
- `llama3.2-vision:90b` - Better quality, needs GPU, ~55GB
How it works:
- Image converted to base64
- Sent to local Ollama API
- Local vision model analyzes
- Response returned
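A comparable sketch for the offline path against Ollama's `/api/generate` endpoint, which accepts base64 images alongside the prompt; the helper name is again an assumption:

```typescript
import fs from "node:fs";

// Hypothetical helper: ask the local Ollama vision model about an image.
async function analyzeWithOllama(imagePath: string, question: string): Promise<string> {
  const base64 = fs.readFileSync(imagePath).toString("base64");
  const endpoint = process.env.OLLAMA_ENDPOINT ?? "http://localhost:11434";
  const res = await fetch(`${endpoint}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: process.env.OLLAMA_VISION_MODEL ?? "llama3.2-vision:11b",
      prompt: question,
      images: [base64], // Ollama takes base64-encoded images alongside the prompt
      stream: false,    // return one complete response instead of a stream
    }),
  });
  const data = await res.json();
  return data.response;
}
```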
```bash
# Recommended: 11B model (7GB, CPU-friendly)
ollama pull llama3.2-vision:11b

# Or larger: 90B model (55GB, needs GPU)
ollama pull llama3.2-vision:90b
```

Edit `.env`:

```bash
OLLAMA_VISION_MODEL=llama3.2-vision:11b
# or
OLLAMA_VISION_MODEL=llama3.2-vision:90b
```

You: "Take a picture"
Bot: "I've taken a picture!"
You: "What do you see?"
Bot: "I can see a workspace with a computer monitor displaying code,
a keyboard, a mouse, and a coffee mug on a wooden desk."
You: "Take a picture of my hand"
Bot: "I've taken a picture!"
You: "How many fingers am I holding up?"
Bot: "You're holding up three fingers."
You: "Take a picture of that sign"
Bot: "I've taken a picture!"
You: "Read the text in the image"
Bot: "The sign says 'Welcome to our store - Open 9am to 5pm'"
You: "Record a 5 second video of me waving"
Bot: [Records video]
You: "What am I doing in the video?"
Bot: "In the video frame, you appear to be waving your hand
in a friendly greeting gesture."
You: "Take a picture"
Bot: "I've taken a picture!"
You: "Are there any animals in the picture?"
Bot: "Yes, I can see a cat sitting on the couch."
Vision Tool:
- `src/config/custom-tools/vision.ts` - Vision tools implementation
Dependencies:
- GPT-4o: already included with the OpenAI API
- Ollama: Install vision model separately
Image Sources:
- Camera photos: `data/images/camera-*.jpg`
- AI-generated: `data/images/*-image-*.jpg`
- Video frames: extracted temporarily to `/tmp/`
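A sketch of how "most recent image" could be resolved, assuming a simple sort by modification time (helper name illustrative):

```typescript
import fs from "node:fs";
import path from "node:path";

// Hypothetical helper: newest .jpg in data/images by modification time.
function getLatestImage(dir = "data/images"): string | undefined {
  return fs.readdirSync(dir)
    .filter((f) => f.endsWith(".jpg"))
    .map((f) => path.join(dir, f))
    .sort((a, b) => fs.statSync(b).mtimeMs - fs.statSync(a).mtimeMs)[0];
}
```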
```
1. User takes photo/video
        ↓
2. User asks about it
        ↓
3. LLM detects vision tool needed
        ↓
4. Tool finds latest image/video
        ↓
5. Image sent to vision model
   (GPT-4o online or Ollama offline)
        ↓
6. Analysis result returned
        ↓
7. LLM incorporates in response
        ↓
8. TTS speaks answer
```
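The online/offline switch in step 5 can be as simple as a quick connectivity probe; a sketch, where the probe target and timeout are assumptions rather than the actual whisplay check:

```typescript
// Hypothetical helper: pick the vision backend with a quick connectivity probe.
async function pickVisionBackend(): Promise<"gpt-4o" | "ollama"> {
  try {
    await fetch("https://api.openai.com", {
      method: "HEAD",
      signal: AbortSignal.timeout(2000), // give up after 2 seconds
    });
    return "gpt-4o"; // online: use the OpenAI Vision API
  } catch {
    return "ollama"; // offline: fall back to the local model
  }
}
```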
GPT-4o (online):
- Speed: ~2-3 seconds
- Quality: Excellent
- Cost: ~$0.01 per image
- Limits: API rate limits apply

llama3.2-vision:11b (offline):
- Speed: ~10-15 seconds (CPU), ~3-5 seconds (GPU)
- Quality: Good
- Cost: Free (local)
- Limits: RAM/CPU dependent
By default, tools analyze the most recent image. To analyze a specific image, modify the tool to accept an image path parameter.
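For example, the tool's parameter schema could gain an optional `imagePath` field; the shape below is illustrative, not the actual definition in `vision.ts`:

```typescript
// Illustrative parameter schema with an optional imagePath
// (not the actual tool definition in vision.ts).
const analyzeImageParams = {
  type: "object",
  properties: {
    question: { type: "string", description: "What to analyze about the image" },
    imagePath: { type: "string", description: "Optional path; defaults to the most recent image" },
  },
  required: ["question"],
} as const;
```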
For video, the tool currently analyzes frame 15 only. To analyze multiple frames:

```typescript
// Extract frames at different timestamps
for (const frameNum of [15, 45, 75, 105]) {
  const extractCmd = `ffmpeg -i ${video} -vf "select=eq(n\\,${frameNum})" -vframes 1 frame_${frameNum}.jpg`;
  // Run extractCmd, then analyze each frame
}
```

Analyze multiple images in sequence:
```typescript
const allImages = getRecentImages(5); // Last 5 images
for (const img of allImages) {
  const result = await analyzeImage(img, question);
  // Process results
}
```

Problem: Ollama vision model not available
Solution:

```bash
ollama pull llama3.2-vision:11b
```

Problem: Takes 15+ seconds to analyze
Solutions:
- Use the smaller model: `llama3.2-vision:11b` instead of `:90b`
- Use a GPU if available
- Close other applications
- Ensure sufficient RAM (8GB+ recommended)
Problem: Rate limit exceeded
Solutions:
- Wait a moment and try again
- Use offline mode (disconnect WiFi temporarily)
Problem: No images found. Take a photo first.
Solution:
- Take a photo: "Take a picture"
- Or generate one: "Draw me a sunset"
- Check that the `data/images/` directory has files
```bash
# .env file

# Vision model for offline mode
OLLAMA_VISION_MODEL=llama3.2-vision:11b

# Ollama endpoint (default: localhost)
OLLAMA_ENDPOINT=http://localhost:11434

# OpenAI model (GPT-4o has vision)
OPENAI_LLM_MODEL=gpt-4o
```

Planned improvements:
- Multi-frame video analysis - Analyze entire video timeline
- Real-time analysis - Analyze preview frames during recording
- Object tracking - Track objects across video frames
- Face recognition - Identify specific people
- Scene change detection - Detect cuts/transitions in video
- Text extraction - Full OCR with position data
- Image comparison - Compare before/after photos
- Batch processing - Analyze all photos in directory
| Feature | GPT-4o | llama3.2-vision:11b | llama3.2-vision:90b |
|---|---|---|---|
| Speed | ⭐⭐⭐ | ⭐⭐ | ⭐ |
| Quality | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| OCR | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| Objects | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Scenes | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Size | API | 7GB | 55GB |
| Cost | Paid | Free | Free |
| Internet | Required | No | No |
What's Added:
- ✅ 2 vision tools (`analyzeImage`, `analyzeVideoFrame`)
- ✅ GPT-4o vision support (online)
- ✅ Ollama vision support (offline)
- ✅ Automatic model selection based on WiFi
- ✅ Photo and video analysis
- ✅ Setup script for easy installation
Usage:
```bash
# Setup
./setup-vision.sh

# Restart
systemctl --user restart whisplay.service

# Try it!
"Take a picture"
"What do you see?"
```

Commands:
- Take photo → Analyze
- Record video → Analyze frame
- Ask specific questions about visual content
Enjoy your new vision capabilities! 👁️📸🎥