This repository contains a ROS1 node for real-time speech-to-text transcription using the Whisper model. The node subscribes to an audio stream and publishes the transcribed text to a specified topic. This project is built on top of whisper.cpp by Georgi Gerganov, which provides an efficient implementation of the Whisper model in C++.
Demo video: `whisper_ros.mp4`
- **g++-10**: Ensure you have g++ version 10 or higher installed on your system.

  ```shell
  sudo apt-get install g++-10
  ```
  If you already have a different version of g++ installed and want to use g++-10 as the default, you can update the alternatives:

  ```shell
  sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 100
  sudo update-alternatives --config g++
  ```
  **Note for Ubuntu 18.04 (Bionic Beaver) users:**

  The default repositories for Ubuntu 18.04 do not include g++-10. To install it, you need to add the Ubuntu Toolchain PPA:

  ```shell
  sudo apt update
  sudo apt install software-properties-common
  sudo add-apt-repository ppa:ubuntu-toolchain-r/test
  sudo apt update
  sudo apt install g++-10
  ```

  After installation, set g++-10 as the default compiler:

  ```shell
  sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 100
  ```
- **Clone the Repository:**

  ```shell
  cd ~/catkin_ws/src
  git clone https://github.com/zzl410/whisper_ros1.git
  cd whisper_ros1
  ```
- **Build the Whisper Library:**

  Before running `catkin_make`, you need to compile the Whisper library:

  ```shell
  # Enter the whisper source directory
  cd whisper_ros1/lib/whisper.cpp

  # Create and enter the build directory
  mkdir build
  cd build

  # Configure the project
  cmake ..

  # Build the library
  make

  # Install the library (requires sudo)
  sudo make install
  ```
- **Download the Whisper Model:**

  The Whisper model needs to be downloaded and placed in the appropriate directory. You can use the provided script to download the medium model:

  ```shell
  sh ./lib/whisper.cpp/models/download-ggml-model.sh medium
  ```

  Alternatively, you can manually download the model from the following locations:
- **Update the Model Path:**

  Edit the `./launch/whisper.launch` file to point to the correct path of the downloaded model:

  ```xml
  <param name="model" value="/path/to/ggml-medium.bin" />
  ```

  Replace `/path/to/ggml-medium.bin` with the actual path to your model file.
- **Build the ROS Workspace:**

  ```shell
  cd ~/catkin_ws
  catkin_make
  ```
- **Source the Workspace:**

  ```shell
  source devel/setup.bash
  ```
To build a ROS node that publishes audio data, you can refer to the following Git repository for guidance:
To start the Whisper ROS1 node, run the following command:

```shell
roslaunch whisper_ros1 whisper.launch
```
The node is configured via the `whisper.launch` file. Below are the key parameters you can adjust:
| Parameter | Description |
|---|---|
| `n_threads` | Number of threads to use for processing. |
| `step_ms` | Audio step length in milliseconds. |
| `length_ms` | Length of audio to process in milliseconds. |
| `keep_ms` | Length of audio to keep in the buffer in milliseconds. |
| `capture_id` | Audio device ID. |
| `max_tokens` | Maximum number of tokens in the transcription. |
| `audio_ctx` | Audio context length. |
| `vad_thold` | VAD (Voice Activity Detection) threshold. |
| `freq_thold` | Frequency threshold. |
| `translate` | Whether to translate the audio to English. |
| `no_fallback` | Disable fallback to smaller models. |
| `print_special` | Print special tokens. |
| `no_context` | Disable context from previous audio. |
| `no_timestamps` | Disable timestamps in the transcription. |
| `tinydiarize` | Enable tinydiarize mode. |
| `save_audio` | Save the processed audio. |
| `use_gpu` | Enable GPU acceleration (if supported). |
| `flash_attn` | Enable flash attention. |
| `language` | Language of the audio (e.g., `"zh"` for Chinese). |
| `rosnode_name` | Name of the ROS node. |
| `subscriber_topic` | Topic to subscribe to for audio input. |
| `publisher_topic` | Topic to publish the transcribed text on. |
| `fname_out` | Output file name (if saving the transcription to a file). |
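For reference, a launch file wiring several of these parameters together might look like the sketch below. The `pkg`/`type`/topic values here are illustrative placeholders, not the repository's actual defaults; check the `whisper.launch` shipped with the package for the real node type and default values.

```xml
<launch>
  <!-- Placeholder node declaration; substitute the real executable name -->
  <node name="whisper_ros1" pkg="whisper_ros1" type="whisper_node" output="screen">
    <param name="model" value="/path/to/ggml-medium.bin" />
    <param name="n_threads" value="4" />
    <param name="step_ms" value="3000" />
    <param name="vad_thold" value="0.6" />
    <param name="language" value="en" />
    <param name="subscriber_topic" value="/audio" />
    <param name="publisher_topic" value="/whisper/text" />
  </node>
</launch>
```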
The node uses a basic VAD detector to determine when to transcribe audio. The `vad_thold` parameter controls the sensitivity of the VAD: a higher value makes the detector more sensitive to silence. A value around `0.6` is generally recommended, but you may need to tune it for your specific use case.

When silence is detected, the node transcribes the last `length_ms` milliseconds of audio and outputs a transcription block suitable for parsing.
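To make the role of `vad_thold` concrete, the function below is a simplified Python rendition of the energy-based idea behind whisper.cpp's `vad_simple` detector. It is a sketch only: the real implementation is C++ and also applies a high-pass filter controlled by `freq_thold`, which is omitted here, and the function name and signature are illustrative.

```python
def vad_simple(samples, sample_rate, last_ms, vad_thold):
    """Return True when the tail of the buffer looks like silence.

    Compares the average energy of the trailing `last_ms` window against
    `vad_thold` times the average energy of the whole buffer. If the
    recent energy has dropped below that fraction, the speaker has
    likely gone quiet and the buffered audio can be transcribed.
    """
    n_last = (sample_rate * last_ms) // 1000
    if n_last == 0 or n_last > len(samples):
        return False  # not enough audio to judge

    energy_all = sum(abs(s) for s in samples) / len(samples)
    energy_last = sum(abs(s) for s in samples[-n_last:]) / n_last

    # Recent energy still high relative to the buffer -> speech ongoing.
    return energy_last <= vad_thold * energy_all
```

With a higher `vad_thold`, the trailing energy is compared against a larger fraction of the overall energy, so the detector declares silence more readily.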
This project is licensed under the MIT License. See the LICENSE file for details.
- Whisper model by OpenAI.
- whisper.cpp by Georgi Gerganov.