Real-time speech recognition and translation overlay for macOS.
Captures system audio, transcribes speech using Apple's Speech framework, and displays translated subtitles in a floating overlay window. Works with any audio source — YouTube, podcasts, Zoom/Teams meetings, and more.
This project was created and maintained through AI-assisted development. The code, build scripts, documentation, and CI/CD configuration should be reviewed and tested carefully before production use.
- Real-time system audio capture via ScreenCaptureKit (16kHz mono PCM)
- Speech-to-text using SFSpeechRecognizer (on-device or server-based)
- Live translation via Apple Translation framework — translates text as it's being recognized, not just after finalization
- Dual display modes:
  - Combined: single overlay with both recognized and translated text
  - Split: separate recognition and translation windows, independently positionable
- Floating overlay — resizable, movable, always-on-top window with customizable appearance
- Lock/Unlock — locked = click-through, unlocked = move/resize/scroll
- Scrollable subtitle history with auto-scroll
- Customizable appearance — separate font size/color for original and translated text, background color/opacity
- Automatic language detection (English, Korean, Japanese, Chinese)
- Smart text processing — sentence-based segmentation, pause detection, duplicate filtering, punctuation cleanup
- Session history recording with export
- Menu bar app — no Dock icon, minimal footprint
- macOS 15.0 (Sequoia) or later
- Apple Silicon (arm64)
- Download `OST.zip` from the latest release
- Unzip and move `OST.app` to your Applications folder
- If macOS blocks the app on first run:

xattr -dr com.apple.quarantine /Applications/OST.app
Requires Xcode Command Line Tools:
xcode-select --install

See the Build section below for full instructions.
On first launch, macOS may prompt for the following permissions. If not prompted, enable them manually:
| Permission | Purpose | How to Enable |
|---|---|---|
| Screen Recording | System audio capture via ScreenCaptureKit | System Settings > Privacy & Security > Screen & System Audio Recording > Enable OST |
| System Audio Recording | System audio capture permission on macOS 15+ | System Settings > Privacy & Security > Screen & System Audio Recording > Enable OST |
| Speech Recognition | SFSpeechRecognizer access | System Settings > Privacy & Security > Speech Recognition > Enable OST |
If you enable permissions manually in System Settings, restart OST for changes to take effect.
Speech recognition (especially server-based) requires Siri & Dictation to be enabled:
- Open System Settings > Siri (the pane may be labeled "Apple Intelligence & Siri" or "Siri & Spotlight", depending on your macOS version)
- Turn on Siri (or "Listen for...")
- If using on-device recognition only, Siri does not need to be active — but the speech model must be downloaded (see Step 3)
For faster, offline, and more reliable recognition:
- Open System Settings > Keyboard > Dictation
- Under Languages, download the speech model for your source language (e.g., English, Korean, Japanese)
- After download, confirm "On-device recognition" remains enabled in OST Settings > Languages tab
Without the on-device model, server-based recognition is used. This requires internet and may have higher latency.
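
The choice between on-device and server-based recognition can be sketched with the Speech framework as follows. This is a minimal illustration, not OST's actual code: the function name `makeRecognitionRequest` and its parameters are hypothetical, and it assumes the app falls back to server-based recognition whenever the recognizer reports no on-device support.

```swift
import Foundation
import Speech

/// Hypothetical helper: builds a streaming recognition request that uses
/// on-device recognition only when the user prefers it AND the recognizer
/// for the given locale reports support for it.
func makeRecognitionRequest(localeID: String,
                            preferOnDevice: Bool) -> SFSpeechAudioBufferRecognitionRequest? {
    // The initializer returns nil when the locale is not supported at all.
    guard let recognizer = SFSpeechRecognizer(locale: Locale(identifier: localeID)) else {
        return nil
    }
    let request = SFSpeechAudioBufferRecognitionRequest()
    // Partial results are needed for live subtitles.
    request.shouldReportPartialResults = true
    // When this is false, the request may be sent to Apple's servers,
    // which requires an internet connection.
    request.requiresOnDeviceRecognition =
        preferOnDevice && recognizer.supportsOnDeviceRecognition
    return request
}
```

Setting `requiresOnDeviceRecognition` to `true` without checking `supportsOnDeviceRecognition` first can make recognition fail outright when the model is missing, which is why the sketch combines both conditions.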
For offline translation using Apple Translation framework:
- Open System Settings > General > Language & Region > Translation Languages
- Download the language pair you need (e.g., English ↔ Korean)
Without the translation pack, translation will not work offline.
# Clone the repository
git clone https://github.com/9bow/OST.git
cd OST
# Full build → produces build/OST.app
./build.sh
# Type-check only (no binary)
./build.sh --typecheck
# Run project checks
./test.sh
# Clean build
./build.sh --clean
# Run
open build/OST.app

No Xcode project is required. The build script compiles all Swift sources via xcrun swiftc.
./test.sh uses system command-line tools only and runs documentation, workflow, regression, behavioral, and type-check gates.
For release checks that require real macOS permissions, audio capture, Apple Translation language packs, or online fallback network behavior, use docs/manual-qa.md.
If macOS blocks the app on first run, execute:
xattr -dr com.apple.quarantine build/OST.app
- Click the captions bubble icon in the menu bar
- Select source and target languages (or use "Auto" for automatic detection)
- Click Start Capture to begin capturing system audio
- The overlay window(s) will appear with live transcription and translation
| Action | How |
|---|---|
| Lock/Unlock | Menu bar > Lock Overlay, or Settings > Display > Overlay Window |
| Move | Unlock, then drag the overlay window |
| Resize | Unlock, then drag the window edges |
| Scroll | Unlock, then scroll through subtitle history |
| Reset position | Settings > Display > "Reset All Overlay Windows" |
- Locked mode: The overlay is click-through — interact with windows behind it normally
- Unlocked mode: Drag to move, resize edges, scroll through subtitle history. Auto-scrolls to the latest text
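
The locked and unlocked modes described above map naturally onto standard AppKit window properties. The following is a minimal sketch under that assumption; the function name `setOverlayLocked` is hypothetical and OST's own `OverlayWindow` may do this differently.

```swift
import AppKit

/// Hypothetical helper: toggles an overlay panel between locked
/// (click-through) and unlocked (draggable) behavior.
@MainActor
func setOverlayLocked(_ locked: Bool, panel: NSPanel) {
    // A window that ignores mouse events passes clicks to whatever is behind it.
    panel.ignoresMouseEvents = locked
    // Allow dragging from anywhere in the window only while unlocked.
    panel.isMovableByWindowBackground = !locked
}
```

Because a click-through window receives no mouse events at all, resizing and scrolling are disabled implicitly while locked; no separate flags are needed for them in this sketch.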
Configure in Settings > Display > Mode:
- Combined: Single window showing both original and translated text
- Split: Default mode with two separate windows — recognition (original text) and translation. Each window can be independently positioned and resized. Menu bar Lock/Unlock applies to both windows simultaneously; Settings can lock each window independently
- Speech Pause: Adjust in Settings > Display > "Speech Pause" slider (default 3s). Shorter values finalize text faster; longer values wait for natural sentence endings
- Subtitle Expiry: Old subtitles automatically fade after the configured time (default 20s)
- Max Lines: Control how many subtitle entries are visible at once (default 3)
- Session History: Enabled by default. View past transcription sessions via menu bar > Session History, export them for reference, or disable saving in Settings > Debug
- On-device recognition: Enabled by default. If the selected language model is unavailable or you prefer server-based recognition, disable it in Settings > Languages
- Online fallback translation: Disabled by default. Enable it in Settings > Languages only if you want OST to send text to Google Translate when Apple Translation is unavailable
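
To make the Speech Pause, Subtitle Expiry, and Max Lines settings concrete, here is a small sketch of how they could interact. All type and function names (`SubtitleEntry`, `DisplaySettings`, `shouldFinalize`, `visibleEntries`) are hypothetical, the defaults are the ones listed above, and the finalization rule is a simplifying assumption: it only looks at the pause and ignores sentence boundaries.

```swift
import Foundation

/// One finalized subtitle line and the time (in seconds) it was shown.
struct SubtitleEntry: Equatable {
    let text: String
    let shownAt: TimeInterval
}

/// Hypothetical settings holder; defaults mirror the values listed above.
struct DisplaySettings {
    var speechPause: TimeInterval = 3      // "Speech Pause" slider
    var subtitleExpiry: TimeInterval = 20  // "Subtitle Expiry"
    var maxLines: Int = 3                  // "Max Lines"
}

/// Assumed rule: the current text is finalized once the silence since the
/// last recognized word reaches the Speech Pause value.
func shouldFinalize(silence: TimeInterval, settings: DisplaySettings) -> Bool {
    return silence >= settings.speechPause
}

/// Drops entries older than the expiry time, then keeps only the newest
/// `maxLines` of the remaining ones (oldest first).
func visibleEntries(_ entries: [SubtitleEntry],
                    now: TimeInterval,
                    settings: DisplaySettings) -> [SubtitleEntry] {
    let fresh = entries.filter { now - $0.shownAt < settings.subtitleExpiry }
    return Array(fresh.suffix(max(0, settings.maxLines)))
}
```

With the defaults, entries shown at 0, 5, 10, 15, and 18 seconds and `now` at 22 seconds leave only the last three visible: the first entry has expired, and Max Lines trims one more.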
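ScreenCaptureKit (16kHz mono) → SpeechRecognizer → AppState → TranslationService → Overlay Views

| Pipeline stage | Backed by |
|---|---|
| ScreenCaptureKit (16kHz mono) | SystemAudioCapture |
| SpeechRecognizer | SFSpeech |
| AppState | entries |
| TranslationService | Translation.framework |
| Overlay Views | NSPanel |
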
OST/Sources/
├── App/ AppState, OSTApp, WindowManager, Logger, SessionRecorder
├── Audio/ SystemAudioCapture (ScreenCaptureKit)
├── Speech/ SpeechRecognizer, SupportedLanguages
├── Translation/ TranslationService, TranslationConfig
├── Settings/ UserSettings
├── UI/ SubtitleView, RecognitionOverlayView, TranslationOverlayView,
│ OverlayWindow, MenuBarView, SettingsView, FontSettingsView, etc.
└── Accessibility/ AccessibilityManager
| Problem | Solution |
|---|---|
| No audio captured | Grant Screen Recording and System Audio Recording permissions. If you changed them in System Settings, restart OST |
| Speech recognition not working | Grant Speech Recognition permission; ensure Siri & Dictation is enabled |
| Translation not appearing | Download the translation language pack, or enable online fallback in Settings > Languages if sending text to Google Translate is acceptable |
| Overlay invisible but blocking clicks | Use Settings > Display > "Reset All Overlay Windows" to restore default position |
| macOS blocks the app | Run xattr -dr com.apple.quarantine /Applications/OST.app for an installed app, or xattr -dr com.apple.quarantine build/OST.app for a local build |
| On-device recognition produces no results | Download the speech model for your language in System Settings > Keyboard > Dictation |
- Endpoint detection (EPD) — Speech segmentation uses a pause timer combined with sentence boundary detection, not proper endpoint detection. Subtitle boundaries may sometimes split mid-sentence or merge unrelated phrases.
- Automatic language detection — Auto-detect uses NLLanguageRecognizer on the first ~15 characters, which may misidentify the language from short or ambiguous input. Detection only runs once per session.
- Translation consistency — Translation is triggered per speech segment. Short or fragmented segments may produce less coherent translations.
- Speech recognition restart gap — SFSpeechRecognizer's recognition task expires after ~60 seconds and auto-restarts. Overlap detection minimizes duplicate text, but a brief gap in recognition may still occur.
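
The overlap detection mentioned in the last item can be illustrated with a word-level suffix/prefix match. This is one plausible approach and not necessarily what OST does; the function name `trimOverlap` is hypothetical.

```swift
import Foundation

/// Hypothetical helper: returns `next` with any leading words removed that
/// repeat the trailing words of `previous` (case-insensitive comparison).
func trimOverlap(previous: String, next: String) -> String {
    let prevWords = previous.split(separator: " ").map(String.init)
    let nextWords = next.split(separator: " ").map(String.init)
    let maxLength = min(prevWords.count, nextWords.count)

    // Try the longest candidate overlap first so the most text is removed.
    for length in stride(from: maxLength, to: 0, by: -1) {
        let tail = prevWords.suffix(length).map { $0.lowercased() }
        let head = nextWords.prefix(length).map { $0.lowercased() }
        if tail == head {
            return nextWords.dropFirst(length).joined(separator: " ")
        }
    }
    // No overlap found: keep the new text unchanged.
    return next
}
```

For example, if the previous task ended with "the quick brown fox" and the restarted task begins with "brown fox jumps over", only "jumps over" is kept. Splitting on spaces does not work for languages written without word separators, such as Chinese and Japanese, so a character-level comparison would be needed there.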






