Masihtabaei/reswhis

reswhis 🎙️

Important

This project is licensed under the MIT License.

Video

Here you can find an explainer video: Explainer Video on YouTube

Status

ruff-Linting uv

Tech Stack

Python Badge FastAPI Badge uv Badge Ruff Badge Git Badge GitHub Badge

Background

After countless hours of testing various packages, libraries, and frameworks, we realized there was no remote, robust, language-agnostic solution for real-time audio transcription using OpenAI’s Whisper. Existing solutions were usually error-prone, restricted to local use, or hard to install and integrate.

Inspired by the stream-socket server implementation of whisper-streaming (in a nutshell, direct communication over TCP), we decided to develop our own WebSocket server for Whisper-based streaming transcription.

Main characteristics of our implementation:

  • 🔰 Simple: Built by an undergraduate student with simplicity in mind.
  • 🚀 Fast: Thanks to FastAPI.
  • 🌐 WebSocket-based: Broader client support (can also be integrated into web apps without native socket support).
  • 🔀 Parallel server: Capable of handling multiple clients simultaneously.

We all know the struggle of naming a project: it’s almost as hard as the project itself. But every creation deserves a name, and this one is no exception. The name reswhis is a blend of Remote Streaming Whisper.

Requirements

Important: It is worth scrolling down to the end of this page if you run into trouble installing these requirements.

General and independent requirements:

  1. uv for managing the project, packages and also dependencies
  2. FFmpeg (2024-12-19-git-494c961379-full_build-www.gyan.dev was tested)

Requirements for the faster-whisper backend (recommended for systems with NVIDIA GPUs):

  1. NVIDIA CUDA Toolkit (version 12.6 Update 3 was tested)
  2. NVIDIA cuDNN Library (version 9.6.0 was tested)

Requirements for the whisper-timestamped backend: None. We took care of everything for you.

Requirements for the openai-whisper backend:

  1. An API key from OpenAI

Requirements for using our web client for testing (can be ignored if you develop your own client):

  1. Browser of your choice
  2. A working microphone

Requirements for using our test client on a machine running Microsoft Windows (can be ignored if you use the web client or prefer your own client):

  1. websocat (v1.14.0 was tested)
  2. A working microphone

Usage

  1. Clone the repository:
git clone https://github.com/Masihtabaei/reswhis.git
  2. Change into the project directory:
cd reswhis
  3. Install the dependencies with uv:
uv sync
  4. Open the following file in the code editor of your choice:
run.bat
  5. Change the configuration as needed and save the file (more info in the Configurations section).
  6. Double-click the batch file to run it.

Important: You can also run the server on Linux or macOS without the batch file. First set the following environment variables (the exact commands depend on the operating system, and the values depend on your use case; see the Configurations section for details):

BACKEND=<value>
MODEL_SIZE=<value>
LANGUAGE=<value>
SAMPLING_RATE=16000 # Fixed value (DO NOT CHANGE)
MINIMUM_CHUNK_SIZE=<value>
USE_VOICE_ACTIVITY_CONTROLLER=<value>
USE_VOICE_ACTIVITY_DETECTION=<value>

Then you can run the server directly as follows:

uv run uvicorn main:app
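
For Linux or macOS, setting the variables and starting the server can also be scripted. The following is a minimal Python sketch, assuming the server reads the variables exactly as named above. The values in EXAMPLE_CONFIG are illustrative picks from the Configurations section, and the helper names build_env and start_server are hypothetical, not part of reswhis.

```python
import os
import subprocess

# Example values only (assumption); choose your own from the Configurations section.
EXAMPLE_CONFIG = {
    "BACKEND": "faster-whisper",
    "MODEL_SIZE": "medium",
    "LANGUAGE": "de",
    "SAMPLING_RATE": "16000",  # fixed value, do not change
    "MINIMUM_CHUNK_SIZE": "1",
    "USE_VOICE_ACTIVITY_CONTROLLER": "False",
    "USE_VOICE_ACTIVITY_DETECTION": "True",
}


def build_env(config, base=None):
    """Return a copy of the base environment with the server settings added."""
    env = dict(os.environ if base is None else base)
    env.update({key: str(value) for key, value in config.items()})
    return env


def start_server(config=EXAMPLE_CONFIG):
    """Run the documented start command with the settings applied."""
    command = ["uv", "run", "uvicorn", "main:app"]
    return subprocess.run(command, env=build_env(config), check=True)


# To launch the server from the repository root:
# start_server()
```

Passing the environment to the child process keeps the settings out of your shell session, so a second run with different values does not inherit stale ones.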
  7. Open the browser of your choice and go to the following address, or send an HTTP GET request to this endpoint using, for example, curl, Wget, or Postman:
protocol://ip:port/info

Important: This REST endpoint can be used for pinging the server and for checking the compatibility of the configuration used and specified. If you run the server locally without changing the default configuration and port 8000 is not otherwise in use, you can use this address:

http://localhost:8000/info
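
The same check can be done from a script. This is a small sketch using only the Python standard library; the helper names info_url and fetch_info are hypothetical, and the body is returned as plain text because the response format of /info is not described here.

```python
import urllib.request


def info_url(host="localhost", port=8000, protocol="http"):
    """Build the protocol://ip:port/info address described above."""
    return f"{protocol}://{host}:{port}/info"


def fetch_info(url, timeout=5.0):
    """Send a GET request and return (status code, body text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


# Requires a running server:
# status, body = fetch_info(info_url())
# print(status, body)
```

If the server is not reachable, urlopen raises urllib.error.URLError, which is a convenient signal that the server has not started or is listening on a different port.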

Hurray 🔥! Now you are officially done! You have three options for using this server:

  1. Web-based client
  2. Console-based client
  3. Custom client

For the web-based client:

  1. Change into the clients directory:
cd clients
  2. Open the following HTML file in your browser:
web_client.html

For the console-based client:

  1. Run the following command to find the name of the microphone you want to use:
ffmpeg -list_devices true -f dshow -i dummy
  2. Start the transcription with the following command, replacing <microphone-name> with the microphone's name:
ffmpeg -loglevel debug -f dshow -i audio="<microphone-name>" -ac 1 -ar 16000 -f s16le - | websocat.x86_64-pc-windows-gnu --binary -n ws://localhost:8000/transcribe

For your custom client:

Feel free to use the language, framework, or library of your choice. However, the following points must be considered:

  1. Default sampling rate is 16000 Hz (16 kHz).
  2. Audio should be mono (single channel).
  3. Data must be transferred as signed 16-bit little-endian integers (s16le).
  4. /info is a REST endpoint and /transcribe is a WebSocket one.
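
To illustrate these constraints, here is a minimal Python sketch of a custom client. The encoding and chunking helpers use only the standard library. The stream function relies on the third-party websockets package (an assumption, installed separately), and it simply prints whatever the server sends back, since the reply format is not specified here. All function names and the half-second chunk length are illustrative.

```python
import asyncio
import math
import struct

SAMPLE_RATE = 16000  # Hz, as required above


def sine_tone(frequency=440.0, seconds=1.0):
    """Generate a mono test tone as floats in [-1.0, 1.0]."""
    count = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * frequency * n / SAMPLE_RATE) for n in range(count)]


def floats_to_s16le(samples):
    """Encode mono float samples as signed 16-bit little-endian PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_bytes(pcm, seconds=0.5):
    """Yield PCM pieces holding `seconds` of audio each (2 bytes per sample)."""
    step = int(SAMPLE_RATE * seconds) * 2
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


async def stream(pcm, uri="ws://localhost:8000/transcribe", seconds=0.5):
    """Send PCM as binary frames, paced roughly in real time."""
    import websockets  # third-party, assumed installed: pip install websockets

    async with websockets.connect(uri) as ws:
        for chunk in chunk_bytes(pcm, seconds):
            await ws.send(chunk)  # bytes are sent as a binary frame
            await asyncio.sleep(seconds)
        try:
            while True:  # print replies until the server goes quiet
                print(await asyncio.wait_for(ws.recv(), timeout=2.0))
        except asyncio.TimeoutError:
            pass


# Requires a running server:
# asyncio.run(stream(floats_to_s16le(sine_tone())))
```

A real client would read microphone samples instead of a test tone and would receive replies concurrently with sending, but the wire format stays the same.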

Configurations

You can find and modify the following configurations inside the batch file:

  1. BACKEND
faster-whisper, whisper-timestamped, openai-whisper
  2. MODEL_SIZE
tiny.en, tiny, base.en, base, small.en, small, medium.en, medium, large-v1, large-v2, large-v3, large, large-v3-turbo
  3. LANGUAGE
af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, zh
  4. SAMPLING_RATE (can NOT be modified currently)
  5. MINIMUM_CHUNK_SIZE $\in \mathbb{N}$ (excluding zero)
  6. USE_VOICE_ACTIVITY_CONTROLLER $\in \{True, False\}$
  7. USE_VOICE_ACTIVITY_DETECTION $\in \{True, False\}$

Important: We recommend MODEL_SIZE=medium for transcribing audio spoken in German.
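
Before starting the server, the chosen values can be checked against the lists above. The following Python sketch is an illustration only: the allowed values are copied from this section, the function name validate_config is hypothetical, and the server itself may validate its settings differently.

```python
BACKENDS = {"faster-whisper", "whisper-timestamped", "openai-whisper"}
MODEL_SIZES = set(
    "tiny.en tiny base.en base small.en small medium.en medium "
    "large-v1 large-v2 large-v3 large large-v3-turbo".split()
)
LANGUAGES = set(
    "af am ar as az ba be bg bn bo br bs ca cs cy da de el en es et eu fa fi "
    "fo fr gl gu ha haw he hi hr ht hu hy id is it ja jw ka kk km kn ko la lb "
    "ln lo lt lv mg mi mk ml mn mr ms mt my ne nl nn no oc pa pl ps pt ro ru "
    "sa sd si sk sl sn so sq sr su sv sw ta te tg th tk tl tr tt uk ur uz vi "
    "yi yo zh".split()
)
BOOLEAN_KEYS = ("USE_VOICE_ACTIVITY_CONTROLLER", "USE_VOICE_ACTIVITY_DETECTION")


def validate_config(cfg):
    """Return a list of problems found in a dict of setting name to string value."""
    errors = []
    if cfg.get("BACKEND") not in BACKENDS:
        errors.append("BACKEND must be one of: " + ", ".join(sorted(BACKENDS)))
    if cfg.get("MODEL_SIZE") not in MODEL_SIZES:
        errors.append("MODEL_SIZE is not a listed model size")
    if cfg.get("LANGUAGE") not in LANGUAGES:
        errors.append("LANGUAGE is not a listed language code")
    if str(cfg.get("SAMPLING_RATE")) != "16000":
        errors.append("SAMPLING_RATE must stay at 16000")
    chunk = str(cfg.get("MINIMUM_CHUNK_SIZE", ""))
    if not (chunk.isascii() and chunk.isdigit() and int(chunk) > 0):
        errors.append("MINIMUM_CHUNK_SIZE must be a positive integer")
    for key in BOOLEAN_KEYS:
        if str(cfg.get(key)) not in ("True", "False"):
            errors.append(key + " must be True or False")
    return errors
```

An empty list means every value matches the documented options; each entry otherwise names the setting to fix.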

Possible Problems

Here you can find a list of known errors that we ran into, along with fixes. Please note that these issues are outside our control (e.g. third-party proprietary dependencies), so we came up with custom workarounds.

  1. Could not locate cudnn_ops64_9.dll. Please make sure it is in your library path! Invalid handle. Cannot load symbol cudnnCreateTensorDescriptor

We experienced this problem on machines running Microsoft Windows. First, stop the server (for example with CTRL + C). Then run copy_cuda_dlls.bat as administrator. It will prompt you about copying the required DLLs so that the problem can be fixed. After that, go back to step 5 and continue from there. If the NVIDIA CUDA Toolkit and the NVIDIA cuDNN Library are installed correctly and in a supported version, this should fix the problem.
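
To see whether the DLL named in the error is visible at all, a quick diagnostic is to look for it in the directories listed in PATH. This Python sketch is an assumption-laden aid, not part of reswhis: the function name find_on_path is hypothetical, and Windows also searches locations other than PATH, so an empty result is a hint and not proof.

```python
import os


def find_on_path(filename, search_path=None):
    """Return the directories on the search path that contain `filename`."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        # Skip empty entries produced by stray separators.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


# Example:
# print(find_on_path("cudnn_ops64_9.dll") or "not found on PATH")
```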

Acknowledgement

This project was inspired by:

And employed code from:

About

A lightweight, WebSocket-based server for real-time, remote audio transcription using OpenAI’s Whisper, with integrations for Unity Engine and web.
