The Audio Wave Analyzer is a sophisticated waveform visualization and timing extraction tool designed primarily for F5-TTS speech editing workflows. It provides precise timing data for speech regions, making it ideal for preparing audio segments for F5-TTS voice cloning and editing.
The F5-TTS Edit Node requires precise timing data to know which parts of audio to replace. The Audio Wave Analyzer excels at:
- Speech Region Detection: Automatically finds where speech occurs
- Precise Timing: Provides exact start/end times for each speech segment
- Visual Verification: Interactive waveform lets you verify and adjust regions
- Clean Output: Generates timing data in the exact format F5-TTS expects
- Quick Start
- Node Parameters
- Audio Analyzer Options Node
- Interactive Interface
- Interactive Buttons
- Analysis Methods Breakdown
- Region Management
- Advanced Features
- Outputs Reference
1. Load Audio: Drag an audio file to the interface, set the `audio_file` path, or connect the audio input
2. Choose Method: Select an analysis method (`silence`, `energy`, `peaks`, or `manual`)
3. Click Analyze: Process the audio to detect timing regions
4. Refine Regions: Add/delete manual regions as needed
5. Export: Use the timing data output for F5-TTS or other applications
`audio_file`
- Purpose: Path to audio file for analysis
- Format: File path, or just the filename if the file is in the ComfyUI input directory
- Supported Formats: WAV, MP3, OGG, FLAC, M4A, AAC

Examples:
- "speech_sample.wav"
- "C:/Audio/my_voice.mp3"
- "voices/character_01.flac"
Important: If both `audio_file` and the `audio` input are provided, the `audio` input takes priority.
`analysis_method` (DROPDOWN)
- `silence`: Detects pauses between speech (best for clean speech) ⭐ Recommended for F5-TTS
- `energy`: Analyzes volume changes (good for music/noisy audio)
- `peaks`: Finds sharp audio spikes (useful for percussion/effects)
- `manual`: Uses only user-defined regions
`precision_level` & `visualization_points`

`precision_level`: Output timing format
- `milliseconds`: 1.234s ⭐ Recommended
- `seconds`: 1.23s (rounded)
- `samples`: 27225 smp (exact)
`visualization_points`: Waveform detail (500-10000)
- 2000-3000: ⭐ Recommended balance
- 500-1000: Faster, less detail
- 5000-10000: Slower, more detail
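As a concrete illustration of the three `precision_level` formats, here is a small sketch. The `format_time` helper and the 22050 Hz default sample rate are hypothetical, not part of the node's API:

```python
def format_time(t_seconds, sample_rate=22050, precision_level="milliseconds"):
    """Render a timestamp in the analyzer's three precision formats.
    Hypothetical helper for illustration only."""
    if precision_level == "milliseconds":
        return f"{t_seconds:.3f}s"                           # e.g. "1.234s"
    if precision_level == "seconds":
        return f"{round(t_seconds, 2)}s"                     # e.g. "1.23s"
    if precision_level == "samples":
        return f"{int(round(t_seconds * sample_rate))} smp"  # exact sample index
    raise ValueError(f"unknown precision_level: {precision_level}")

print(format_time(1.2341))                             # 1.234s
print(format_time(1.2341, precision_level="samples"))  # 27212 smp
```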
`audio`
- Purpose: Connect audio from other nodes (takes priority over `audio_file`)
- Format: Audio connection from upstream nodes
- Use Case: Processing generated or processed audio in workflows

Examples:
- Audio from TTS generation nodes
- Processed audio from effects chains
- Real-time audio input streams
`options`
- Purpose: Connect the Audio Wave Analyzer Options node for advanced settings and custom threshold values
- Default Behavior: Uses sensible defaults if not connected
`manual_regions`
- Purpose: Define custom timing regions for analysis
- Format: `start,end` (one per line)
- Features: Bidirectional sync, auto-sorting, works with auto-detection

Examples:
1.5,3.2
4.0,6.8
8.1,10.5
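The widget format is simple enough to parse in a few lines. This sketch (a hypothetical helper, not the node's internals) shows the expected parsing and auto-sorting behaviour:

```python
def parse_manual_regions(text):
    """Parse "start,end" lines into (start, end) tuples, sorted by start time.
    Illustrative sketch of the manual_regions widget format."""
    regions = []
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        start_str, end_str = line.split(",")
        start, end = float(start_str), float(end_str)
        if end <= start:
            raise ValueError(f"region end must be after start: {line!r}")
        regions.append((start, end))
    return sorted(regions)                # mirrors the widget's auto-sorting

print(parse_manual_regions("4.0,6.8\n1.5,3.2\n8.1,10.5"))
# [(1.5, 3.2), (4.0, 6.8), (8.1, 10.5)]
```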
`region_labels`
- Purpose: Custom labels for manual regions
- Format: One label per line (must match the number of manual regions)
- Behavior: Custom labels are preserved during sorting; auto-generated labels get renumbered

Examples:
Intro
Verse 1
Chorus
Bridge
`export_format`
- `f5tts`: Simple format for F5-TTS (`start,end` per line) ⭐ Recommended for F5-TTS
- `json`: Full data with confidence, labels, metadata
- `csv`: Spreadsheet-compatible format
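A sketch of how the three formats could be produced from region data. The `export_timing` function is hypothetical; the field names follow the output examples documented later in this guide:

```python
import csv
import io
import json

def export_timing(regions, export_format="f5tts", precision=3):
    """Render region dicts in the three documented export formats.
    Hypothetical helper, not the node's actual code."""
    if export_format == "f5tts":
        # one "start,end" pair per line
        return "\n".join(
            f"{r['start']:.{precision}f},{r['end']:.{precision}f}" for r in regions
        )
    if export_format == "json":
        return json.dumps(regions, indent=2)
    if export_format == "csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["start", "end", "label", "confidence", "duration"])
        for r in regions:
            writer.writerow([r["start"], r["end"], r["label"], r["confidence"],
                             round(r["end"] - r["start"], precision)])
        return buf.getvalue()
    raise ValueError(f"unknown export_format: {export_format}")

regions = [{"start": 1.5, "end": 3.2, "label": "speech", "confidence": 1.0}]
print(export_timing(regions))  # 1.500,3.200
```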
For advanced control over analysis parameters, use the Audio Analyzer Options node.
`silence_threshold` (0.001-1.000, step 0.001)
- Low values (0.001-0.01): Detect very quiet passages
- Medium values (0.01-0.1): Standard speech pauses
- High values (0.1-1.0): Only detect significant silences
`silence_min_duration` (0.01-5.0s, step 0.01s)
Minimum silence length to detect:
- 0.01-0.05s: Detect brief pauses (word boundaries)
- 0.1-0.5s: Standard sentence breaks
- 0.5s+: Only long pauses (paragraph breaks)
`invert_silence_regions` (BOOLEAN)
- False: Returns silence regions (pauses)
- True: Returns speech regions (inverted detection)
- Use Case: F5-TTS workflows where you need speech segments
`energy_sensitivity` (0.1-2.0, step 0.1)
- Low (0.1-0.5): Conservative, fewer boundaries
- Medium (0.5-1.0): Balanced detection
- High (1.0-2.0): Aggressive, more boundaries
`peak_threshold` (0.001-1.0, step 0.001)
Minimum amplitude for peak detection

`peak_min_distance` (0.01-1.0s, step 0.01s)
Minimum time between detected peaks

`peak_region_size` (0.01-1.0s, step 0.01s)
Size of the region around each detected peak
`group_regions_threshold` (0.000-3.000s, step 0.001s)
Merge nearby regions within the threshold:
- 0.000: No grouping (default)
- 0.1-0.5s: Merge very close regions
- 0.5-3.0s: Aggressive merging
The Audio Analyzer provides a rich interactive interface for precise audio editing.
- Blue waveform: Audio amplitude over time
- Red RMS line: Root Mean Square energy
- Grid lines: Time markers for navigation
- Colored regions: Detected/manual timing regions
- Left click + drag: Select audio region
- Right click: Clear selection
- Double click: Seek to position
- Mouse wheel: Zoom in/out
- Middle mouse + drag: Pan waveform
- CTRL + left/right drag: Pan waveform
- Left click on region: Highlight region (green, persistent)
- Alt + click region: Multi-select for deletion (orange, toggle)
- Alt + click empty: Clear all multi-selections
- Shift + left click: Extend selection
- Drag amplitude labels (±0.8): Scale waveform vertically
- Drag loop markers: Move start/end loop points
- Space: Play/pause
- Arrow keys: Move playhead (±1s)
- Shift + Arrow keys: Move playhead (±10s)
- Home/End: Go to start/end
- Enter: Add selected region
- Delete: Delete highlighted/selected regions
- Shift + Delete: Clear all regions
- Escape: Clear selection
- +/-: Zoom in/out
- 0: Reset zoom and amplitude scale
- L: Set loop from selection
- Shift + L: Toggle looping on/off
- Shift + C: Clear loop markers
The floating speed slider provides advanced playback control:
- Drag within slider for standard speed control
- Real-time audio playback with speed adjustment
- Drag beyond edges: Access extreme speeds (-8x to +8x)
- Acceleration: The further you drag, the faster the speed increases
- Negative speeds: Silent backwards playhead movement
- Speed display shows actual value (e.g., "4.25x", "-2.50x")
- Thin gray track line for visual reference
- White vertical bar thumb for precise control
- 📁 Upload Audio: Browse and upload files
- 🔍 Analyze: Process audio with current settings
- ➕ Add Region: Add current selection as region
- 🗑️ Delete Region: Remove highlighted/selected regions
- 🗑️ Clear All: Remove all manual regions (keeps auto-detected)
- 🔻 Set Loop: Set loop markers from selection
- 🔄 Loop ON/OFF: Toggle loop playback mode
- 🚫 Clear Loop: Remove loop markers
- 🔍+ / 🔍-: Zoom in/out
- 🔄 Reset: Reset zoom, amplitude, and speed to defaults
- 📋 Export Timings: Copy timing data to clipboard
🔇 Silence Detection
Best for: Clean speech recordings, voice-overs, podcasts
- Analyzes amplitude levels across the audio
- Identifies regions below silence threshold
- Filters by minimum duration requirement
- Optionally inverts to get speech regions
- Lower threshold: Detects quieter silences
- Shorter min duration: Finds brief pauses
- Invert enabled: Returns speech instead of silence
- F5-TTS preparation (with invert enabled)
- Podcast chapter detection
- Speech segment isolation
- Automatic transcription alignment
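The core of threshold-plus-duration silence detection fits in a short loop. The sketch below illustrates the idea in plain Python; it is not the node's implementation:

```python
def detect_silence(samples, sr, threshold=0.02, min_duration=0.1):
    """Return (start, end) silence regions: runs of |amplitude| < threshold
    lasting at least min_duration seconds. Illustrative sketch only."""
    regions, run_start = [], None
    for i, s in enumerate(list(samples) + [threshold]):  # sentinel flushes the last run
        if abs(s) < threshold:
            if run_start is None:
                run_start = i                            # quiet run begins
        elif run_start is not None:
            if (i - run_start) / sr >= min_duration:     # filter by minimum duration
                regions.append((run_start / sr, i / sr))
            run_start = None
    return regions

sr = 1000
samples = [0.5] * 500 + [0.0] * 300 + [0.5] * 200        # loud, quiet, loud
print(detect_silence(samples, sr))                       # [(0.5, 0.8)]
```

With `invert_silence_regions` enabled, the node would then return the complement of these regions (the speech) instead.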
⚡ Energy Detection
Best for: Music, noisy audio, variable volume content
- Calculates RMS energy over time windows
- Detects significant energy changes
- Creates regions around transition points
- Higher sensitivity: More word boundaries detected
- Lower sensitivity: Only major transitions
- Music beat detection
- Noisy speech processing
- Dynamic content analysis
- Volume-based segmentation
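One way to sketch energy-based boundary detection: compute RMS per window, then flag boundaries where the energy change is large relative to the average change, scaled by sensitivity. This is illustrative only; the node's exact math may differ:

```python
def rms_energy(samples, sr, window=0.05):
    """RMS energy per non-overlapping window of `window` seconds. Sketch only."""
    n = max(1, int(window * sr))
    return [(sum(s * s for s in samples[i:i + n]) / n) ** 0.5
            for i in range(0, len(samples) - n + 1, n)]

def energy_boundaries(samples, sr, window=0.05, sensitivity=1.0):
    """Window boundaries where RMS changes by more than mean_change / sensitivity,
    so a higher sensitivity flags more boundaries."""
    rms = rms_energy(samples, sr, window)
    changes = [abs(b - a) for a, b in zip(rms, rms[1:])]
    if not changes:
        return []
    threshold = (sum(changes) / len(changes)) / sensitivity
    return [(i + 1) * window for i, c in enumerate(changes) if c > threshold]

samples = [0.5] * 1000 + [0.0] * 1000        # loud half, silent half
print(energy_boundaries(samples, 1000, window=0.1))  # [1.0]
```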
🏔️ Peak Detection
Best for: Percussion, sound effects, transient-rich audio
- Identifies sharp amplitude peaks
- Creates regions around each peak
- Filters by threshold and minimum distance
- Lower threshold: Detects smaller peaks
- Smaller min distance: Allows closer peaks
- Larger region size: Bigger regions around peaks
- Drum hit isolation
- Sound effect extraction
- Transient analysis
- Rhythmic pattern detection
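The three peak settings map naturally onto a simple scan. This greedy sketch keeps the first sample above the threshold in each window (a real implementation would pick the local maximum); it is not the node's actual code:

```python
def peak_regions(samples, sr, threshold=0.5, min_distance=0.05, region_size=0.1):
    """Fixed-size (start, end) regions centred on amplitude peaks above
    `threshold`, spaced at least `min_distance` seconds apart. Sketch only."""
    min_gap = max(1, int(min_distance * sr))
    peaks, last = [], -min_gap
    for i, s in enumerate(samples):
        if abs(s) >= threshold and i - last >= min_gap:
            peaks.append(i)                  # accept peak, enforce spacing
            last = i
    half = region_size / 2
    duration = len(samples) / sr
    return [(max(0.0, p / sr - half), min(duration, p / sr + half)) for p in peaks]

samples = [0.0] * 1000
samples[200], samples[600] = 1.0, 0.8        # two transients
print(peak_regions(samples, 1000))           # regions around 0.2s and 0.6s
```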
🖐️ Manual Mode
Best for: Precise custom timing, complex audio structures
- Uses only user-defined regions
- No automatic detection performed
- Full manual control over timing
- Text widget input for precise timing
- Interactive region creation
- Custom labeling support
- Bidirectional sync between interface and text
- Precise speech editing
- Custom audio segmentation
- Music arrangement timing
- Specific interval extraction
➕ Creating Regions
- Choose an analysis method (`silence`, `energy`, or `peaks`)
- Adjust settings via the Options node (optional)
- Click Analyze button
- Regions appear automatically
- Method 1: Drag to select an area → press Enter or click Add Region
- Method 2: Type into the `manual_regions` widget (e.g. `1.5,3.2` and `4.0,6.8` on separate lines)
- Method 3: Use manual mode exclusively
- Use any auto-detection method
- Add manual regions on top
- Both types included in output
- Manual regions persist across analyses
🎨 Region Types & Colors

Manual regions:
- Created by user interaction
- Editable and persistent
- Always included in output
- Numbered sequentially (Region 1, Region 2, etc.)

Auto-detected regions:
- Gray: Silence regions
- Forest Green: Speech regions (inverted silence)
- Yellow: Energy/word boundaries
- Blue: Peak regions
- Color indicates detection method

Grouped regions:
- Maintain original type color
- Show grouping information in analysis report
- Created when group threshold > 0
✏️ Editing Regions
- Green highlight: Single region selected (click)
- Orange highlight: Multiple regions selected (Alt+click)
- Yellow selection: Current area selection
- Single deletion: Click region → press Delete
- Multi-deletion: Alt+click multiple → press Delete
- Clear all: Shift+Delete or Clear All button
- Move regions: Edit the `manual_regions` text widget
- Rename regions: Edit the `region_labels` text widget
- Re-analyze: Adjust settings → click Analyze
🏷️ Region Properties
- Start time: Region beginning
- End time: Region ending
- Duration: Calculated length
- Confidence: Detection certainty (auto-regions)
- Type: manual, silence, speech, energy, peaks
- Source: Detection method used
- Grouping info: If region was merged
Label types:
- Auto-generated: Region 1, Region 2, etc.
- Custom: User-defined names
- Detection-based: silence, speech, peak_1, etc.
🔗 Region Grouping
Automatically merge nearby regions to reduce fragmentation.
- Set `group_regions_threshold` > 0.000s in the Options node
- Regions within the threshold distance get merged
- Overlapping regions are combined
- Metadata preserved from source regions
- Reduces over-segmentation
- Creates cleaner timing data
- Maintains original region information
- Improves F5-TTS results
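The merging behaviour described above can be sketched as a single pass over sorted regions. A hypothetical helper working on `(start, end)` tuples, not the node's code:

```python
def group_regions(regions, threshold=0.25):
    """Merge sorted regions whose gap is <= threshold seconds
    (overlapping regions always merge). Sketch of the documented behaviour."""
    if threshold <= 0 or not regions:
        return sorted(regions)               # 0.000 means no grouping
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= threshold:
            # close enough: extend the previous region
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(group_regions([(0.0, 1.0), (1.1, 2.0), (5.0, 6.0)], threshold=0.25))
# [(0.0, 2.0), (5.0, 6.0)]
```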
🔇 Silence Inversion
Convert silence detection to speech detection for F5-TTS workflows.
- Normal silence detection finds pauses
- Inversion calculates speech regions between pauses
- Output contains only speech segments
- Ideal for voice cloning preparation
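The inversion step is the complement of the silence regions within the audio's duration. A sketch of the idea (hypothetical helper), using the numbers from the example report later in this guide:

```python
def invert_to_speech(silence_regions, duration):
    """Compute the speech regions between silence regions, as the
    invert_silence_regions option is described. Illustrative sketch."""
    speech, cursor = [], 0.0
    for start, end in sorted(silence_regions):
        if start > cursor:
            speech.append((cursor, start))   # gap before this silence is speech
        cursor = max(cursor, end)
    if cursor < duration:
        speech.append((cursor, duration))    # trailing speech after last silence
    return speech

print(invert_to_speech([(6.244, 6.847)], duration=10.789))
# [(0.0, 6.244), (6.847, 10.789)]
```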
🔁 Loop Functionality
Precise playback control for detailed editing.
- Select region → press L or click Set Loop
- Drag purple loop markers to adjust
- Use Shift+L to toggle looping on/off
- Purple markers: Loop start/end points
- Loop status: Shown in interface
- Automatic repeat: When looping enabled
🔀 Bidirectional Sync
Seamless integration between interface and text widgets.
- Type regions in the `manual_regions` widget
- Click back to the interface
- Regions automatically appear
- Add regions via interface
- Text widgets update automatically
- Labels and timing stay synchronized
💾 Caching System
Intelligent performance optimization.
- Analysis results cached based on audio + settings
- Instant results for repeated analyses
- Cache invalidated when parameters change
- Manual regions included in cache key
- Faster repeated processing
- Smooth parameter experimentation
- Reduced computation overhead
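A cache keyed on "audio + settings" can be sketched by hashing the audio bytes together with every parameter that affects the result. The helper below illustrates the idea; the node's real cache key may be built differently:

```python
import hashlib
import json

def analysis_cache_key(audio_bytes, settings, manual_regions_text):
    """Derive a cache key from audio content plus everything that can change
    the analysis result. Hypothetical helper for illustration."""
    h = hashlib.sha256()
    h.update(audio_bytes)
    h.update(json.dumps(settings, sort_keys=True).encode())  # stable across dict ordering
    h.update(manual_regions_text.encode())                   # manual regions affect output
    return h.hexdigest()

cache = {}
key = analysis_cache_key(b"...raw samples...", {"method": "silence", "threshold": 0.02}, "1.5,3.2")
if key not in cache:
    cache[key] = "expensive analysis goes here"  # placeholder: runs once per unique key
```

Changing any setting (or the manual regions) produces a new key, which is exactly the "cache invalidated when parameters change" behaviour described above.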
The Audio Analyzer provides four outputs for different use cases:
🔊 `processed_audio` (AUDIO)
- Purpose: Passthrough of original audio
- Use Case: Continue audio processing pipeline
- Format: Standard ComfyUI audio tensor
- Notes: Always first output for easy chaining
🕒 `timing_data` (STRING)
- Purpose: Main timing export for external use
- Format: Depends on the `export_format` setting
- Precision: Respects the `precision_level` setting
`f5tts` example:

```
1.500,3.200
4.000,6.800
8.100,10.500
```

`json` example:

```json
[
  {
    "start": 1.500,
    "end": 3.200,
    "label": "speech",
    "confidence": 1.00,
    "metadata": {"type": "speech"}
  }
]
```

`csv` example:

```csv
start,end,label,confidence,duration
1.500,3.200,speech,1.00,1.700
4.000,6.800,speech,1.00,2.800
```
📄 `analysis_info` (STRING)
- Purpose: Detailed analysis report
- Content: Statistics, settings, visualization summary
- Use Case: Documentation, debugging, analysis review
Example report:

```
Audio Analysis Results
Duration: 10.789s
Sample Rate: 22050 Hz
Analysis Method: silence (inverted to speech regions)
Regions Found: 2

Region Grouping:
  Grouping Threshold: 0.250s
  Original Regions: 4
  Final Regions: 2 (1 grouped, 1 individual)
  Regions Merged: 2

Timing Regions:
  1. speech: 0.000s - 6.244s (duration: 6.244s, confidence: 1.00)
  2. speech: 6.847s - 10.789s (duration: 3.942s, confidence: 1.00) [grouped from 2 regions: speech, speech]

Visualization Summary:
  Waveform Points: 2000
  Duration: 10.789s
  Sample Rate: 22050 Hz
  RMS Data Points: 202
```
✂️ `segmented_audio` (AUDIO)
- Purpose: Audio containing only detected regions
- Process: Extracts and concatenates region audio
- Use Case: F5-TTS training, isolated speech extraction
- Format: Standard ComfyUI audio tensor
Process:
1. Sort regions by start time
2. Extract audio for each region
3. Concatenate segments sequentially
4. Return as a single audio tensor
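The four steps above can be sketched over a plain list of samples. The node itself operates on ComfyUI audio tensors; this hypothetical helper only illustrates the extraction logic:

```python
def segment_audio(samples, sr, regions):
    """Extract each region's samples and concatenate them in start-time order,
    mirroring the segmented_audio steps described above. Sketch only."""
    out = []
    for start, end in sorted(regions):            # 1. sort by start time
        a = int(start * sr)
        b = min(int(end * sr), len(samples))      # clamp to the audio's length
        out.extend(samples[a:b])                  # 2-3. extract and append
    return out                                    # 4. single contiguous signal

samples = list(range(10))                         # stand-in for audio samples
print(segment_audio(samples, sr=1, regions=[(6, 8), (1, 3)]))  # [1, 2, 6, 7]
```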
This comprehensive guide covers all aspects of the Audio Analyzer node. For additional support or feature requests, please refer to the main project documentation or community resources.