reswhis 🎙️

Important

This project is licensed under the MIT License.

Video

Here you can find an explainer video: Explainer Video on YouTube

Background

After countless hours of testing various packages, libraries, and frameworks, we realized there was no remote, robust, language-agnostic solution for real-time audio transcription using OpenAI’s Whisper. Existing solutions were usually error-prone, restricted to local use, or hard to install and integrate.

Inspired by the stream-socket server implementation of whisper-streaming (in a nutshell, direct communication over TCP), we decided to develop our own WebSocket server for Whisper-based streaming transcription.

Main characteristics of our implementation:

🔰 Simple: Built by an undergraduate student with simplicity in mind.
🚀 Fast: Thanks to FastAPI.
🌐 WebSocket-based: Broader client support (it can also be integrated into web apps, which have no native socket support).
🔀 Parallel server: Capable of handling multiple clients simultaneously.

We all know the struggle of naming a project—it’s almost as hard as the project itself. But every creation deserves a name, and this one is no exception. The name reswhis is a blend of Remote Streaming Whisper.

Requirements

Important: It is worth scrolling down to the end of this page if you run into trouble installing these requirements.

General and independent requirements:

uv for managing the project, its packages, and its dependencies
FFmpeg (2024-12-19-git-494c961379-full_build-www.gyan.dev was tested)

Requirements for the faster-whisper backend (recommended for systems with NVIDIA GPUs):

NVIDIA CUDA Toolkit (version 12.6 Update 3 was tested)
NVIDIA cuDNN Library (version 9.6.0 was tested)

Requirements for the whisper-timestamped backend: Nothing! We took care of everything for you.

Requirements for the openai-whisper backend:

An API key from OpenAI

Requirements for using our web client for testing (can be ignored if you develop your own client):

A browser of your choice
A working microphone

Requirements for using our test client on a machine running Microsoft Windows (can be ignored if you use the web client or prefer your own client):

websocat (v1.14.0 was tested)
A working microphone

Usage

Clone the repository

git clone https://github.com/Masihtabaei/reswhis.git

Change the directory

cd reswhis

Install the dependencies with uv

uv sync

Open the following file in the code editor of your choice:

run.bat

Change the configuration as needed and save the file (see the Configurations section for details).
Double-click the batch file to run it.

Important: You can also run the server on a machine running Linux or macOS without the batch file. First set the following environment variables (the exact commands depend on the operating system, and the values depend on your use case; see the Configurations section for details):

BACKEND=<value>
MODEL_SIZE=<value>
LANGUAGE=<value>
SAMPLING_RATE=16000 # Fixed value (DO NOT CHANGE)
MINIMUM_CHUNK_SIZE=<value>
USE_VOICE_ACTIVITY_CONTROLLER=<value>
USE_VOICE_ACTIVITY_DETECTION=<value>

Then you can run the server directly as follows:

uv run uvicorn main:app
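For readers who prefer a single cross-platform script over per-shell export commands, the steps above can be sketched in Python. This is only an illustration under assumptions: the helper names build_env and launch are hypothetical and not part of the repository, and the sample values in the usage note are placeholders rather than recommended settings.

```python
import os
import subprocess


def build_env(backend, model_size, language, minimum_chunk_size,
              use_vac, use_vad, base_env=None):
    """Return a copy of the environment with the server settings added.

    All values are stored as strings, because environment variables
    are always strings.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "BACKEND": backend,
        "MODEL_SIZE": model_size,
        "LANGUAGE": language,
        "SAMPLING_RATE": "16000",  # fixed value, as stated in the list above
        "MINIMUM_CHUNK_SIZE": str(minimum_chunk_size),
        "USE_VOICE_ACTIVITY_CONTROLLER": str(use_vac),  # "True" or "False"
        "USE_VOICE_ACTIVITY_DETECTION": str(use_vad),   # "True" or "False"
    })
    return env


def launch(env):
    """Run the documented start command with the given environment.

    Blocks until the server process exits.
    """
    return subprocess.run(["uv", "run", "uvicorn", "main:app"],
                          env=env, check=True)
```

A call such as launch(build_env("faster-whisper", "medium", "de", 1, False, True)), made from the repository root, would then start the server with those (illustrative) settings.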

Open the browser of your choice and go to the following address, or send an HTTP GET request to this endpoint using, e.g., curl, Wget, or Postman:

protocol://ip:port/info

Important: This REST endpoint can be used to ping the server and to check that the configuration in use is compatible with the one you specified. If you run the server locally with the default configuration and port 8000 is not otherwise bound, you can use this address:

http://localhost:8000/info
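The same check can be scripted. The sketch below uses only the Python standard library; info_url and fetch_info are hypothetical helper names, and because the format of the response body is not described here, the body is simply returned as text (assumed to be UTF-8).

```python
import urllib.request


def info_url(host="localhost", port=8000, protocol="http"):
    """Build the protocol://ip:port/info address described above."""
    return f"{protocol}://{host}:{port}/info"


def fetch_info(url, timeout=5.0):
    """Send a GET request and return (status_code, body_text).

    Raises urllib.error.URLError if the server cannot be reached,
    which makes this usable as a simple ping.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Assumption: the body is UTF-8 text; its structure is not parsed.
        return response.status, response.read().decode("utf-8")
```

With the server running locally on the default port, fetch_info(info_url()) returns the status code and whatever the endpoint reports.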

Hurray 🔥! Now you are officially done! You have three options for using this server:

Web-based client
Console-based client
Custom client

For the web-based client:

Change the directory

cd clients

Open the web page by opening the following HTML file:

web_client.html

For the console-based client:

Run the following command to find out the name of the microphone you want to use:

ffmpeg -list_devices true -f dshow -i dummy

Use the following command, with <microphone-name> replaced by your microphone's name, to start the transcription:

ffmpeg -loglevel debug -f dshow -i audio="<microphone-name>" -ac 1 -ar 16000 -f s16le - | websocat.x86_64-pc-windows-gnu --binary -n ws://localhost:8000/transcribe

For your custom client:

Feel free to use the language, framework, or library of your choice. However, the following points must be considered:

Default sampling rate is 16000 Hz (16 kHz).
Audio must be mono (single channel).
Data must be transferred as signed 16-bit little-endian integers.
/info is a REST endpoint and /transcribe is a WebSocket one.
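As a starting point for a custom client, the points above can be sketched in Python. The conversion and chunking helpers use only the standard library. The stream function additionally assumes that the third-party websockets package is installed, and that the server replies with messages that can simply be printed; the reply format is not specified here, so that part is an assumption. All function names, the 0.5 second chunk length, and the 2 second grace period are illustrative choices.

```python
import asyncio
import struct

SAMPLING_RATE = 16000  # Hz, mono
BYTES_PER_SAMPLE = 2   # signed 16-bit


def floats_to_s16le(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit little-endian PCM."""
    # Clip first so out-of-range input cannot overflow the 16-bit range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_chunks(pcm, seconds_per_chunk=0.5):
    """Split PCM bytes into chunks holding a whole number of samples."""
    size = int(SAMPLING_RATE * seconds_per_chunk) * BYTES_PER_SAMPLE
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


async def stream(pcm, url="ws://localhost:8000/transcribe",
                 seconds_per_chunk=0.5):
    """Send PCM as binary frames at roughly real-time pace, print replies."""
    import websockets  # third-party package, assumed to be installed

    async with websockets.connect(url) as ws:

        async def print_replies():
            # Assumption: every server message can be printed as-is.
            async for message in ws:
                print(message)

        printer = asyncio.create_task(print_replies())
        for chunk in split_chunks(pcm, seconds_per_chunk):
            await ws.send(chunk)  # bytes are sent as a binary frame
            await asyncio.sleep(seconds_per_chunk)
        await asyncio.sleep(2.0)  # arbitrary grace period for late replies
        printer.cancel()
```

Given raw audio already in the required format, asyncio.run(stream(pcm_bytes)) would send it to a locally running server. Pacing the chunks with a sleep mimics a live microphone; a client that captures audio live would send each chunk as soon as it is recorded instead.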

Configurations

You can find and modify the following configurations inside the batch file:

BACKEND

faster-whisper, whisper-timestamped, openai-whisper

MODEL_SIZE

tiny.en, tiny, base.en, base, small.en, small, medium.en, medium, large-v1, large-v2, large-v3, large, large-v3-turbo

LANGUAGE

af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, zh

SAMPLING_RATE (cannot currently be modified)
MINIMUM_CHUNK_SIZE $\in \mathbb{N}$ (excluding zero)
USE_VOICE_ACTIVITY_CONTROLLER $\in \{True, False\}$
USE_VOICE_ACTIVITY_DETECTION $\in \{True, False\}$

Important: We recommend MODEL_SIZE=medium for transcribing audio spoken in German.
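The allowed values listed above can be checked before starting the server. The following is a minimal sketch, not the project's own validation: validate_config is a hypothetical helper that takes the settings as strings (as they appear in environment variables) and returns a list of problems. It only checks each value against the lists in this section and does not verify that a given backend supports a given model size.

```python
BACKENDS = {"faster-whisper", "whisper-timestamped", "openai-whisper"}

MODEL_SIZES = set(
    "tiny.en tiny base.en base small.en small medium.en medium "
    "large-v1 large-v2 large-v3 large large-v3-turbo".split()
)

LANGUAGES = set(
    "af am ar as az ba be bg bn bo br bs ca cs cy da de el en es et eu fa "
    "fi fo fr gl gu ha haw he hi hr ht hu hy id is it ja jw ka kk km kn ko "
    "la lb ln lo lt lv mg mi mk ml mn mr ms mt my ne nl nn no oc pa pl ps "
    "pt ro ru sa sd si sk sl sn so sq sr su sv sw ta te tg th tk tl tr tt "
    "uk ur uz vi yi yo zh".split()
)


def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if cfg.get("BACKEND") not in BACKENDS:
        problems.append("BACKEND is not one of the supported backends")
    if cfg.get("MODEL_SIZE") not in MODEL_SIZES:
        problems.append("MODEL_SIZE is not one of the listed model sizes")
    if cfg.get("LANGUAGE") not in LANGUAGES:
        problems.append("LANGUAGE is not one of the listed language codes")
    if cfg.get("SAMPLING_RATE", "16000") != "16000":
        problems.append("SAMPLING_RATE must stay at 16000")
    chunk = cfg.get("MINIMUM_CHUNK_SIZE", "")
    # A natural number excluding zero: ASCII digits only, value above 0.
    if not (chunk.isascii() and chunk.isdigit() and int(chunk) > 0):
        problems.append("MINIMUM_CHUNK_SIZE must be a positive integer")
    for key in ("USE_VOICE_ACTIVITY_CONTROLLER",
                "USE_VOICE_ACTIVITY_DETECTION"):
        if cfg.get(key) not in ("True", "False"):
            problems.append(f"{key} must be True or False")
    return problems
```

Passing dict(os.environ) to validate_config checks the variables set in the current shell.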

Possible Problems

Here you can find a list of known errors that we ran into, along with fixes. Please note that these issues are outside our control (e.g. third-party proprietary dependencies), so the fixes are custom workarounds.

Could not locate cudnn_ops64_9.dll. Please make sure it is in your library path! Invalid handle. Cannot load symbol cudnnCreateTensorDescriptor

We experienced this problem on machines running Microsoft Windows. First stop the server (for example with CTRL + C). Then run copy_cuda_dlls.bat as administrator. It will prompt you before copying the required DLLs. After that, go back to step number 5 and continue from there. This should fix the problem, provided the NVIDIA CUDA Toolkit and NVIDIA cuDNN Library are installed correctly and in a supported version.
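To see whether the DLL named in the error message is reachable at all, a quick diagnostic can help before and after running the batch file. The sketch below is an assumption-laden aid, not part of the project: find_on_path is a hypothetical helper that only searches the directories listed in the PATH environment variable, which is one of several places Windows looks for DLLs, so an empty result is a hint rather than proof.

```python
import os
from pathlib import Path


def find_on_path(filename, path_value=None):
    """Return every PATH directory entry that contains the given file."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing separator
        candidate = Path(directory) / filename
        if candidate.is_file():
            hits.append(candidate)
    return hits
```

Calling find_on_path("cudnn_ops64_9.dll") lists the locations where the file was found; an empty list suggests that the cuDNN directory is not on PATH.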

Acknowledgement

This project was inspired by:

And employed code from:

https://github.com/ufal/whisper_streaming (heavily in use)
https://github.com/snakers4/silero-vad
