VTrim is a lightweight, efficient video analysis and trimming tool. It automatically finds segments containing people or speech and can output a trimmed video instantly—without re-encoding, preserving original quality at blazing speed.
• ⚡ Lossless • 🎥 Professional edit-ready XML • 🔍 AI-powered detection • 🎤 Voice Activity Detection
- 🚀 Fast Analysis: Model caching and batch inference for 50-80% faster processing
- ✂️ Lossless Trimming: FFmpeg stream copy (-c copy) - no quality degradation
- 🎬 Professional XML: Export FCP7 XML for DaVinci Resolve/Premiere Pro
- 🤖 AI Detection: YOLOv8 human detection with configurable sensitivity
- 🎤 Voice Activity Detection: Silero VAD for detecting speech segments
- ⚙️ Flexible Configuration: Centralized config for easy customization
- 📊 JSON Output: Machine-readable results for automation
Install via pip:
```bash
pip install vtrim
```

Analyze a video:

```bash
vtrim --input video.mp4
# or use the short form:
vtrim -i video.mp4
```

Output:

```json
{"segments": [{"start": 2.3, "end": 5.8}, {"start": 10.1, "end": 14.7}]}
```

By default, VTrim runs both human detection (YOLOv8) and voice activity detection (Silero VAD). This ensures comprehensive coverage:
- Segments where people are visible on camera
- Segments where someone is speaking, even if not visible
- Perfect for lectures, meetings, interviews, and podcasts
To disable VAD and use only human detection:
```bash
vtrim -i video.mp4 --no-vad
```

To trim a video directly:

```bash
vtrim --input your_video.mp4 --output output.mp4
# or use short forms:
vtrim -i your_video.mp4 -o output.mp4
```

- Uses FFmpeg stream copy (`-c copy`) → no re-encoding, no quality loss.
- Automatically merges nearby detections and adds padding for smooth transitions.
```bash
vtrim --input lecture.mp4 --output complete_trim.mp4
```

By default, this keeps segments where either:
- A person is visible on camera (human detection), OR
- Someone is speaking (VAD detection)
Ideal for lectures, meetings, or any content where important audio might occur without visual presence.
To use only human detection (disable VAD):
```bash
vtrim --input video.mp4 --no-vad --output human_only.mp4
```

Preserve the full timeline (including gaps) as an FCP7 XML for professional editing:

```bash
vtrim --input your_video.mp4 --export-xml timeline.xml
# or:
vtrim -i your_video.mp4 --export-xml timeline.xml
```

Audio and video are perfectly synchronized and split per segment.
```bash
vtrim --input video.mp4 \
  --conf-threshold 0.15 \
  --output sensitive_trim.mp4
```

A lower threshold yields more detections (including more false positives).
```bash
vtrim --input video.mp4 \
  --conf-threshold 0.4 \
  --padding 3.0 \
  --output conservative_trim.mp4
```

A higher threshold plus more padding yields fewer, longer segments.
```bash
vtrim --input video.mp4 \
  --gap-tolerance 10.0 \
  --output merged_trim.mp4
```

A large gap tolerance merges nearby segments into continuous blocks.
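The gap-tolerance behavior can be sketched in plain Python. This is illustrative only, not VTrim's actual `merge_segments` implementation:

```python
# Illustrative sketch of gap-tolerance merging (not VTrim's actual code)
def merge_with_gap_tolerance(segments, gap_tolerance):
    """Merge (start, end) segments whose gap is <= gap_tolerance seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= gap_tolerance:
            merged[-1][1] = max(merged[-1][1], end)  # bridge the gap
        else:
            merged.append([start, end])
    return [tuple(seg) for seg in merged]

# The 4.3 s gap between the two detections is within the 10.0 s tolerance
print(merge_with_gap_tolerance([(2.3, 5.8), (10.1, 14.7)], 10.0))
# [(2.3, 14.7)]
```

With the default `--gap-tolerance 4.0`, the same two segments would stay separate, since the 4.3 s gap exceeds the tolerance.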
```bash
vtrim --input podcast.mp4 \
  --vad-threshold 0.3 \
  --output complete_podcast.mp4
```

A lower VAD threshold captures quieter speech, combined with human detection for comprehensive coverage.
```bash
vtrim --input interview.mp4 \
  --vad-threshold 0.7 \
  --padding 0.5 \
  --output focused_interview.mp4
```

A higher VAD threshold ensures only clear speech is added to the human-detected segments.
```bash
vtrim --input video.mp4 --no-vad --output human_only.mp4
```

Use this when you only want to detect the visual presence of people, without audio detection.
```bash
vtrim --input your_video.mp4 \
  --output output.mp4 \
  --export-xml timeline.xml
# or use short forms for faster typing:
vtrim -i your_video.mp4 -o output.mp4 --export-xml timeline.xml
```

Print detected time segments to stdout for scripting or integration:

```bash
vtrim --input meeting.mp4
```

Output:
```json
{
  "segments": [
    { "start": 2.3, "end": 5.8 },
    { "start": 10.1, "end": 14.7 }
  ]
}
```

You can also use the Python API directly:

```python
from vtrim.analyzer import detect_human
from vtrim.vad_analyzer import detect_speech
from vtrim.segment_utils import merge_segments, apply_padding
from vtrim.ffmpeg_utils import cut_video_with_ffmpeg
from vtrim.xml_export import export_fcp7_xml
from vtrim import Config

# Detect humans
raw_segments = detect_human("video.mp4", conf_threshold=0.25)

# Detect speech using VAD (can be combined with human detection)
speech_segments = detect_speech("video.mp4", vad_threshold=0.5)

# Combine both detection results
all_segments = raw_segments + speech_segments

# Process segments
merged = merge_segments(all_segments, gap_tolerance=4.0)
padded = apply_padding(merged, padding=1.0)

# Cut video
cut_video_with_ffmpeg("video.mp4", padded, "output.mp4")

# Export XML
export_fcp7_xml("video.mp4", padded, "timeline.xml", video_duration=120.5)

# Access configuration
print(f"Default threshold: {Config.CONF_THRESHOLD}")
print(f"Default VAD threshold: {Config.VAD_THRESHOLD}")
print(f"Default padding: {Config.PADDING}")
```

| Option | Type | Default | Description |
|---|---|---|---|
| `--input`, `-i` | Required | - | Path to input video file |
| `--output`, `-o` | Optional | - | Path to save trimmed video |
| `--export-xml` | String | - | Path to export FCP7 XML |
| `--no-vad` | Flag | Off | Disable Voice Activity Detection (VAD is enabled by default) |
| `--conf-threshold` | Float | 0.25 | Human detection confidence (0.0-1.0) |
| `--vad-threshold` | Float | 0.5 | Speech detection confidence (0.0-1.0) |
| `--padding` | Float | 1.0 | Seconds added before/after segments |
| `--gap-tolerance` | Float | 4.0 | Max gap (seconds) to merge segments |
| `--verbose` | Flag | Off | Show detailed progress |
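The JSON segment output is easy to consume from a wrapper script. A minimal sketch that totals the kept duration (the JSON is inlined here for illustration; in practice you would capture vtrim's stdout):

```python
import json

# Example of VTrim's JSON segment output, inlined for illustration;
# in practice, capture stdout from the vtrim CLI.
raw = '{"segments": [{"start": 2.3, "end": 5.8}, {"start": 10.1, "end": 14.7}]}'

data = json.loads(raw)
kept = sum(seg["end"] - seg["start"] for seg in data["segments"])
print(f"{len(data['segments'])} segments, {kept:.1f} s kept")
# 2 segments, 8.1 s kept
```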
📌 Note: By default, VTrim runs both human detection (YOLOv8) and voice activity detection (Silero VAD). Use `--no-vad` to disable speech detection if you only want visual presence.
- Model Caching: 50-80% faster on subsequent runs (singleton pattern)
- Batch Inference: 20-30% faster processing (batch size = 4)
- Dynamic Resolution: Automatic video metadata detection
- Enhanced Error Handling: Better validation and error messages
For a 10-minute video at 30 FPS:
- Before: ~3-4 minutes total
- After: ~2-2.5 minutes (with cached model)
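The caching speedup comes from loading model weights once per process and reusing the instance. A minimal singleton sketch (illustrative; `load_weights` is a hypothetical stand-in, and VTrim's real loader lives in vtrim/model.py):

```python
# Illustrative singleton-style model cache; load_weights is a hypothetical
# stand-in for the expensive YOLO weight loading done in vtrim/model.py.
_model = None

def load_weights(path):
    return {"weights": path}  # placeholder for the real, slow load

def get_model(weights="yolov8n.pt"):
    """Return a cached model, loading it only on the first call."""
    global _model
    if _model is None:
        _model = load_weights(weights)
    return _model

assert get_model() is get_model()  # second call reuses the cached instance
```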
All defaults are defined in vtrim/config.py:
```python
from vtrim import Config

# Customize settings
Config.CONF_THRESHOLD = 0.15  # Lower threshold = higher sensitivity
Config.PADDING = 2.0          # More padding
Config.GAP_TOLERANCE = 10.0   # Merge nearby detections
Config.SAMPLE_FPS = 2.0       # Analysis sample rate (2 FPS)
Config.BATCH_SIZE = 4         # Inference batch size
```

Machine-readable format for scripting:
```json
{
  "segments": [
    {"start": 2.3, "end": 5.8},
    {"start": 10.1, "end": 14.7}
  ]
}
```

Compatible with:
- DaVinci Resolve
- Adobe Premiere Pro
- Final Cut Pro 7
Features:
- Full timeline (valid + invalid segments)
- Color-coded clips (blue=keep, gray=skip)
- Synchronized audio/video
- Frame-accurate timing
- Format: MP4 (same as input)
- Codec: Unchanged (stream copy)
- Quality: Lossless (no re-encoding)
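Conceptually, each lossless cut corresponds to an FFmpeg stream-copy invocation. A rough sketch of building such a command (the helper name and exact flag layout are assumptions; VTrim's real command construction lives in vtrim/ffmpeg_utils.py):

```python
# Hypothetical helper showing the shape of a stream-copy cut command;
# VTrim's actual FFmpeg invocation is built in vtrim/ffmpeg_utils.py.
def stream_copy_cmd(src, start, end, dst):
    return [
        "ffmpeg", "-y",
        "-ss", str(start),       # seek to the segment start
        "-i", src,
        "-t", str(end - start),  # keep only the segment duration
        "-c", "copy",            # stream copy: no re-encoding
        dst,
    ]

print(" ".join(stream_copy_cmd("video.mp4", 2.3, 5.8, "part1.mp4")))
```

Note that stream-copy cuts can only begin cleanly at keyframes, which is part of why padding around detections is useful.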
Solution: Install FFmpeg:
```bash
# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html
```

Solution: Check that the file path is correct (absolute or relative to the current directory).
Solutions:
- Lower `--conf-threshold` (e.g., 0.15 for higher sensitivity)
- Verify the video actually contains people
- Check that the `vtrim/yolov8n.pt` model file exists
Tips:
- First run downloads the model (one-time delay)
- Subsequent runs are 50-80% faster (model cached)
- Reduce `Config.SAMPLE_FPS` for faster but less accurate analysis
- Test with short videos first: Verify settings before processing long videos
- Keep original backups: Always preserve source files until satisfied
- Use verbose mode for debugging: `vtrim --input video.mp4 --verbose`
- Combine outputs for flexibility: Generate both the trimmed video AND the XML timeline
- Python 3.7+
- FFmpeg (must be in PATH)
- Dependencies:
- opencv-python
- ultralytics
- silero-vad
- torch
- torchaudio
- onnxruntime
- setuptools
| Variable | Values | Effect |
|---|---|---|
| `ANALYZER_PROGRESS_JSON` | `"0"` (default), `"1"` | Output progress as JSON to stderr |

Example:

```bash
ANALYZER_PROGRESS_JSON=1 vtrim --input video.mp4
```

Project layout:

```
vtrim/
├── __init__.py       # Package initialization, exports Config
├── analyzer.py       # Human detection logic
├── vad_analyzer.py   # Voice Activity Detection logic
├── cli.py            # Command-line interface
├── config.py         # Configuration settings
├── ffmpeg_utils.py   # FFmpeg video processing
├── model.py          # YOLO model loading
├── segment_utils.py  # Segment merging/padding
├── xml_export.py     # FCP7 XML export
└── yolov8n.pt        # Pre-trained YOLO model
```
- The underlying model is YOLOv8n (PyTorch format), optimized for CPU inference.
- By default, VTrim uses both YOLOv8 (human detection) and Silero VAD (speech detection) for comprehensive coverage.
- Use `--no-vad` to disable speech detection if you only need visual presence detection.
- Video trimming uses FFmpeg stream copy (`-c copy`), so it's fast and lossless with no quality degradation.
- Progress updates are printed to `stderr` during analysis (every 5% for known-length videos).
- For automation, set the environment variable `ANALYZER_PROGRESS_JSON=1` to receive machine-readable progress messages on `stderr`.
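A wrapper script can consume those machine-readable progress messages. A minimal sketch, assuming each progress message arrives as one JSON object per stderr line (the exact message schema is not documented here, so treat the fields as assumptions):

```python
import json

# Hypothetical parser for JSON progress lines read from vtrim's stderr.
# Assumes one JSON object per line; non-JSON diagnostics are skipped.
def parse_progress_lines(lines):
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            pass
    return events

sample = ['{"progress": 5}', "loading model...", '{"progress": 10}']
print(parse_progress_lines(sample))
# [{'progress': 5}, {'progress': 10}]
```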
- README.md: This file - comprehensive overview and quick start guide
- CHANGELOG.md: Detailed version history, optimizations, and upgrade notes
For more detailed usage examples and advanced configurations, see the inline documentation in vtrim/config.py and individual module docstrings.
- GitHub: https://github.com/chiaweilee/vtrim
- Issues: https://github.com/chiaweilee/vtrim/issues
- License: Apache License v2
Current version: 0.3.0
See CHANGELOG.md for the latest updates and migration notes.