Know who said what. Automatically.
voicetag is a Python library for speaker diarization and named speaker identification. It combines pyannote.audio for diarization with resemblyzer for speaker embeddings, giving you a single interface to answer: who is speaking, and when?
Enroll speakers once with a few audio samples, then identify them in any recording -- meetings, podcasts, interviews, phone calls.
- ⚡ Dead-simple API -- enroll speakers and identify them in three lines of code
- 🌐 Language agnostic -- works with Hebrew, English, Mandarin, or any spoken language
- 👥 Built-in overlap detection -- flags regions where multiple speakers talk simultaneously
- 🚀 Fast parallel processing -- concurrent embedding computation with configurable thread pools
- ⌨️ CLI tool included -- enroll, identify, and manage profiles from the terminal
- 💾 Save/load speaker profiles -- persist enrolled speakers to disk and reuse across sessions
- ✅ Pydantic result models -- fully typed, validated, immutable result objects
- 💬 Built-in transcription -- plug in OpenAI, Groq, Fireworks, Whisper, or Deepgram to get "who said what"
```python
from voicetag import VoiceTag

vt = VoiceTag()
vt.enroll("Christie", ["christie1.flac", "christie2.flac", "christie3.flac"])
vt.enroll("Mark", ["mark1.flac", "mark2.flac"])

# Identify who spoke when
result = vt.identify("audiobook.flac")
for seg in result.segments:
    print(f"{seg.speaker}: {seg.start:.1f}s - {seg.end:.1f}s (confidence: {seg.confidence:.2f})")

# Transcribe: who said what
transcript = vt.transcribe("audiobook.flac", provider="whisper")
print(transcript.full_transcript)
```

Output:

```
Christie: 0.0s - 2.6s (confidence: 0.85)
Christie: 2.6s - 6.7s (confidence: 0.88)
Christie: 7.0s - 8.1s (confidence: 0.78)

[Christie] Gentlemen, he sat in a hoarse voice. Give me your
[Christie] word of honor that this horrible secret shall forever remain buried amongst ourselves.
[Christie] The two men drew back.
```
```
pip install voicetag
```

For transcription support, install with a provider:

```
pip install voicetag[openai]    # OpenAI Whisper API
pip install voicetag[groq]      # Groq (fast Whisper)
pip install voicetag[whisper]   # Local Whisper (no API key needed)
pip install voicetag[deepgram]  # Deepgram
pip install voicetag[all-stt]   # All providers
```

voicetag requires access to the pyannote.audio speaker diarization model, which is gated behind a HuggingFace license agreement.
- Accept the pyannote model licenses on HuggingFace
- Create a HuggingFace token at huggingface.co/settings/tokens
- Set the token via environment variable or config:

```
export HF_TOKEN="hf_your_token_here"
```

Or pass it directly:

```python
from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(hf_token="hf_your_token_here"))
```

For faster processing on CUDA or Apple Silicon:

```python
vt = VoiceTag(config=VoiceTagConfig(device="cuda"))  # NVIDIA GPU
vt = VoiceTag(config=VoiceTagConfig(device="mps"))   # Apple Silicon
```

voicetag ships with a full-featured command-line interface.
```
voicetag enroll "Christie" christie1.flac christie2.flac christie3.flac
voicetag enroll "Mark" mark1.flac mark2.flac
```

```
voicetag identify audiobook.flac
```

```
Speaker Timeline — audiobook.flac
+-----------+----------+----------+----------+------------+
| Speaker   | Start    | End      | Duration | Confidence |
+-----------+----------+----------+----------+------------+
| Christie  | 00:00.00 | 00:02.60 | 00:02.60 | 0.85       |
| Christie  | 00:02.60 | 00:06.70 | 00:04.10 | 0.88       |
| Christie  | 00:07.00 | 00:08.10 | 00:01.10 | 0.78       |
+-----------+----------+----------+----------+------------+

Summary
  Total duration: 8.4s
  Speakers: 1
  Segments: 3
```
```
voicetag transcribe audiobook.flac --provider whisper --language en
```

```
Transcript — audiobook.flac
+-----------+----------+----------+--------------------------------------------------------------+
| Speaker   | Start    | End      | Text                                                         |
+-----------+----------+----------+--------------------------------------------------------------+
| Christie  | 00:00.00 | 00:02.60 | Gentlemen, he sat in a hoarse voice. Give me your            |
| Christie  | 00:02.60 | 00:06.70 | word of honor that this horrible secret shall forever remain |
|           |          |          | buried amongst ourselves.                                    |
| Christie  | 00:07.00 | 00:08.10 | The two men drew back.                                       |
+-----------+----------+----------+--------------------------------------------------------------+
```

Other providers:

```
voicetag transcribe call.wav --provider openai --language en
voicetag transcribe interview.wav --provider groq --language he
voicetag transcribe meeting.wav --provider deepgram
```

```
voicetag profiles list
voicetag profiles remove "Christie"
voicetag providers    # list available STT providers
```

```
voicetag --help
voicetag identify --help
```

| Option | Description |
|---|---|
| `--profiles PATH` | Path to speaker profiles file (default: `voicetag_profiles.json`) |
| `--output, -o PATH` | Save results as JSON |
| `--threshold FLOAT` | Similarity threshold override (0.0-1.0) |
| `--hf-token TEXT` | HuggingFace API token |
| `--device TEXT` | Torch device: `cpu`, `cuda`, `mps` |
| `--unknown-only` | Skip speaker matching, just diarize |
The main entry point. Wraps the full diarization + identification pipeline.
```python
from voicetag import VoiceTag, VoiceTagConfig

vt = VoiceTag(config=VoiceTagConfig(...))
```

| Method | Returns | Description |
|---|---|---|
| `enroll(name, audio_paths)` | `SpeakerProfile` | Register a speaker from one or more audio files |
| `identify(audio_path)` | `DiarizationResult` | Run full identification pipeline on an audio file |
| `save(path)` | `None` | Save enrolled speaker profiles to disk |
| `load(path)` | `None` | Load speaker profiles from disk |
| `remove_speaker(name)` | `None` | Remove an enrolled speaker by name |
| `enrolled_speakers` | `list[str]` | Property: list of enrolled speaker names |
| `transcribe(audio_path, provider, ...)` | `TranscriptResult` | Identify speakers and transcribe what they said |
```python
result = vt.transcribe("meeting.wav", provider="openai", language="en")

for seg in result.segments:
    print(f"[{seg.speaker}] {seg.text}")

# Full transcript
print(result.full_transcript)

# Group by speaker
for speaker, segments in result.by_speaker.items():
    print(f"\n{speaker}:")
    for seg in segments:
        print(f"  {seg.text}")
```

Supported providers: `openai`, `groq`, `fireworks`, `whisper` (local), `deepgram`
Configuration model (Pydantic v2, frozen/immutable).
```python
config = VoiceTagConfig(
    hf_token="hf_...",           # HuggingFace token (or set HF_TOKEN env var)
    similarity_threshold=0.75,   # min cosine similarity for a match
    overlap_threshold=0.5,       # min overlap ratio to flag
    max_workers=4,               # parallel embedding threads
    min_segment_duration=0.5,    # discard segments shorter than this (seconds)
    device="cpu",                # "cpu", "cuda", or "mps"
)
```

`DiarizationResult` -- returned by `identify()`:
| Field | Type | Description |
|---|---|---|
| `segments` | `list[SpeakerSegment \| OverlapSegment]` | Ordered timeline of speaker segments |
| `audio_duration` | `float` | Total audio length in seconds |
| `num_speakers` | `int` | Number of distinct speakers detected |
| `processing_time` | `float` | Wall-clock pipeline time in seconds |
`SpeakerSegment`:

| Field | Type | Description |
|---|---|---|
| `speaker` | `str` | Identified speaker name or `"UNKNOWN"` |
| `start` | `float` | Start time in seconds |
| `end` | `float` | End time in seconds |
| `confidence` | `float` | Cosine similarity score (0.0-1.0) |
| `duration` | `float` | Property: `end - start` |
`OverlapSegment`:

| Field | Type | Description |
|---|---|---|
| `speakers` | `list[str]` | Names of overlapping speakers |
| `start` | `float` | Start time in seconds |
| `end` | `float` | End time in seconds |
| `speaker` | `Literal["OVERLAP"]` | Always `"OVERLAP"` |
| `duration` | `float` | Property: `end - start` |
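To build intuition for the `overlap_threshold` setting, here is a small sketch of how a region might qualify as overlap. This is a hypothetical helper, not voicetag's actual detector: it computes the fraction of the shorter of two segments covered by their intersection and flags it against the threshold.

```python
def overlap_ratio(a_start, a_end, b_start, b_end):
    """Fraction of the shorter segment covered by the two segments' intersection."""
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    shorter = min(a_end - a_start, b_end - b_start)
    return intersection / shorter if shorter > 0 else 0.0

# Segments 2.0-6.0s and 4.0-10.0s intersect for 2.0s; the shorter lasts 4.0s
ratio = overlap_ratio(2.0, 6.0, 4.0, 10.0)   # -> 0.5
flagged = ratio >= 0.5                        # meets the default overlap_threshold
```

With the default `overlap_threshold=0.5`, this pair would be flagged; raising the threshold demands that a larger share of the shorter segment be shared before an `OverlapSegment` is reported.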
`SpeakerProfile`:

| Field | Type | Description |
|---|---|---|
| `name` | `str` | Speaker name |
| `embedding` | `list[float]` | 256-dimensional mean embedding vector |
| `num_samples` | `int` | Number of audio files used for enrollment |
| `created_at` | `datetime` | UTC timestamp of enrollment |
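Since `embedding` is described as a mean embedding over the enrollment samples, the aggregation step can be sketched in plain Python. The helper below is hypothetical and only illustrates the averaging; voicetag computes the per-file embeddings with resemblyzer.

```python
def mean_embedding(embeddings):
    """Element-wise mean of one embedding vector per enrollment file."""
    n = len(embeddings)
    return [sum(dim) / n for dim in zip(*embeddings)]

# Three toy 2-dim embeddings averaged into one profile vector
profile = mean_embedding([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
# profile -> [1.0, 1.0]  (real profiles are 256-dimensional)
```

Averaging over several samples is what makes enrolling with more than one file worthwhile: it smooths out per-recording noise in the voice embedding.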
All exceptions inherit from VoiceTagError:
```python
from voicetag import VoiceTagError

try:
    result = vt.identify("audio.wav")
except VoiceTagError as e:
    print(f"Error: {e}")
```

| Exception | When |
|---|---|
| `VoiceTagConfigError` | Invalid config or missing HuggingFace token |
| `EnrollmentError` | Enrollment fails (no audio, bad format) |
| `DiarizationError` | Pyannote processing failure |
| `AudioLoadError` | Audio file not found or unsupported format |
- Podcasts -- automatically label host vs. guest segments for transcription
- Interviews -- separate interviewer and interviewee speech for analysis
- Meeting recordings -- identify who said what in team meetings, generate per-speaker summaries
- Court recordings -- tag judge, attorney, and witness speech segments
- Call centers -- distinguish agent from customer in call recordings for QA
- Media monitoring -- track specific speakers across broadcast recordings
voicetag runs a three-stage pipeline:
```
Audio File
     |
     v
1. DIARIZE (pyannote.audio)
   "When does each speaker talk?"
   -> segments: [(0.0-4.2, SPEAKER_00), (4.5-8.1, SPEAKER_01), ...]
     |
     v
2. EMBED (resemblyzer)
   "What does each speaker sound like?"
   -> 256-dim embedding vector per segment (computed in parallel)
     |
     v
3. MATCH (cosine similarity)
   "Which enrolled speaker does this sound like?"
   -> Alice (0.92), Bob (0.87), UNKNOWN (below threshold)
     |
     v
DiarizationResult with named speaker timeline
```
1. Diarize -- pyannote.audio segments the audio into speaker turns with anonymous labels (`SPEAKER_00`, `SPEAKER_01`, etc.)
2. Embed -- resemblyzer computes a 256-dimensional voice embedding for each segment, running in parallel via a thread pool
3. Match -- each embedding is compared against enrolled speaker profiles using cosine similarity. Matches above the threshold are assigned the speaker's name; others are labeled `"UNKNOWN"`

Overlap detection runs in parallel with matching, identifying regions where two or more speakers talk simultaneously.
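The matching stage can be sketched in a few lines of plain Python. This illustrates cosine-similarity thresholding in general, not voicetag's internal code; the profile vectors and names are toy values.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def match_speaker(embedding, profiles, threshold=0.75):
    """Return (name, score) for the best profile, or ("UNKNOWN", score) below threshold."""
    best_name, best_score = "UNKNOWN", 0.0
    for name, profile in profiles.items():
        score = cosine_similarity(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else ("UNKNOWN", best_score)

# Toy 3-dim profiles (real embeddings are 256-dimensional)
profiles = {"Alice": [1.0, 0.0, 0.0], "Bob": [0.0, 1.0, 0.0]}
name, score = match_speaker([0.9, 0.1, 0.0], profiles)  # name -> "Alice"
```

A segment whose best score falls below the threshold keeps the `"UNKNOWN"` label, which is why lowering `similarity_threshold` trades fewer unknowns for more risk of misattribution.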
| Feature | voicetag | pyannote alone | WhisperX | Manual labeling |
|---|---|---|---|---|
| Speaker diarization | Yes | Yes | Yes | N/A |
| Named speaker identification | Yes | No | No | Yes |
| Overlap detection | Yes | Yes | No | Varies |
| CLI tool | Yes | No | Yes | N/A |
| Save/load speaker profiles | Yes | N/A | N/A | N/A |
| Language agnostic | Yes | Yes | Yes | Yes |
| Typed result models | Yes (Pydantic) | No | No | N/A |
| Lines of code to identify | 3 | ~30 | ~20 | N/A |
VoiceTagConfig controls all tunable parameters:
| Field | Type | Default | Description |
|---|---|---|---|
| `hf_token` | `Optional[str]` | `None` | HuggingFace token. Falls back to the `HF_TOKEN` env var. |
| `similarity_threshold` | `float` | `0.75` | Minimum cosine similarity for a match. Range: (0.0, 1.0). |
| `overlap_threshold` | `float` | `0.5` | Minimum overlap ratio to flag as overlapping speech. |
| `max_workers` | `int` | `4` | Thread count for parallel embedding computation. |
| `min_segment_duration` | `float` | `0.5` | Segments shorter than this (seconds) are discarded. |
| `device` | `str` | `"cpu"` | Torch device: `"cpu"`, `"cuda"`, or `"mps"`. |
Token resolution order:
1. `config.hf_token` (explicit)
2. `HF_TOKEN` environment variable
3. Raise `VoiceTagConfigError` with a link to huggingface.co/settings/tokens
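That precedence can be sketched as follows. The helper is an illustration, not voicetag's own code, so it raises a plain `RuntimeError` where the library would raise `VoiceTagConfigError`.

```python
import os

def resolve_hf_token(explicit_token=None):
    """Resolve a HuggingFace token: explicit config value first, then HF_TOKEN."""
    if explicit_token:
        return explicit_token
    env_token = os.environ.get("HF_TOKEN")
    if env_token:
        return env_token
    raise RuntimeError(
        "No HuggingFace token found; create one at huggingface.co/settings/tokens"
    )
```

Note that an explicit token wins even when `HF_TOKEN` is also set, which lets one process use a different token from the shell environment.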
Contributions are welcome! See CONTRIBUTING.md for guidelines on setting up the development environment, running tests, and submitting pull requests.
MIT -- Copyright (c) 2026 voicetag contributors
