This repository contains a ROS1 node for real-time speech-to-text transcription using the Whisper model. The node subscribes to an audio stream and publishes the transcribed text to a specified topic. This project is built on top of whisper.cpp by Georgi Gerganov, which provides an efficient implementation of the Whisper model in C++.
Demo video: `whisper_ros.mp4`
- **g++-10**: Ensure you have g++ version 10 or higher installed on your system.

  ```shell
  sudo apt-get install g++-10
  ```
  If you already have a different version of g++ installed and want to use g++-10 as the default, you can update the alternatives:

  ```shell
  sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 100
  sudo update-alternatives --config g++
  ```
  **Note for Ubuntu 18.04 (Bionic Beaver) users:**

  The default repositories for Ubuntu 18.04 do not include g++-10. To install it, you need to add the Ubuntu Toolchain PPA:

  ```shell
  sudo apt update
  sudo apt install software-properties-common
  sudo add-apt-repository ppa:ubuntu-toolchain-r/test
  sudo apt update
  sudo apt install g++-10
  ```

  After installation, set g++-10 as the default compiler:

  ```shell
  sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 100
  ```
- **Clone the Repository:**

  ```shell
  cd ~/catkin_ws/src
  git clone https://github.com/zzl410/whisper_ros1.git
  cd whisper_ros1
  ```
- **Build the Whisper Library:**

  Before running `catkin_make`, you need to compile the Whisper library:

  ```shell
  # Enter the whisper source directory
  cd whisper_ros1/lib/whisper.cpp

  # Create and enter the build directory
  mkdir build
  cd build

  # Configure the project
  cmake ..

  # Build the library
  make

  # Install the library (requires sudo)
  sudo make install
  ```
- **Download the Whisper Model:**

  The Whisper model needs to be downloaded and placed in the appropriate directory. You can use the provided script to download the medium model:

  ```shell
  sh ./lib/whisper.cpp/models/download-ggml-model.sh medium
  ```

  Alternatively, you can manually download the model from the following locations:
- **Update the Model Path:**

  Edit the `./launch/whisper.launch` file to point to the correct path of the downloaded model:

  ```xml
  <param name="model" value="/path/to/ggml-medium.bin" />
  ```

  Replace `/path/to/ggml-medium.bin` with the actual path to your model file.
- **Build the ROS Workspace:**

  ```shell
  cd ~/catkin_ws
  catkin_make
  ```
- **Source the Workspace:**

  ```shell
  source devel/setup.bash
  ```
To build a ROS node that publishes audio data, you can refer to the following Git repository for guidance:
To start the Whisper ROS1 node, run the following command:

```shell
roslaunch whisper_ros1 whisper.launch
```
The node is configured via the `whisper.launch` file. Below are the key parameters you can adjust:
| Parameter | Description |
|---|---|
| `n_threads` | Number of threads to use for processing. |
| `step_ms` | Audio step length in milliseconds. |
| `length_ms` | Length of audio to process in milliseconds. |
| `keep_ms` | Length of audio to keep in the buffer in milliseconds. |
| `capture_id` | Audio device ID. |
| `max_tokens` | Maximum number of tokens in the transcription. |
| `audio_ctx` | Audio context length. |
| `vad_thold` | VAD (Voice Activity Detection) threshold. |
| `freq_thold` | Frequency threshold. |
| `translate` | Whether to translate the audio to English. |
| `no_fallback` | Disable fallback to smaller models. |
| `print_special` | Print special tokens. |
| `no_context` | Disable context from previous audio. |
| `no_timestamps` | Disable timestamps in the transcription. |
| `tinydiarize` | Enable tinydiarize mode. |
| `save_audio` | Save the processed audio. |
| `use_gpu` | Enable GPU acceleration (if supported). |
| `flash_attn` | Enable flash attention. |
| `language` | Language of the audio (e.g., `"zh"` for Chinese). |
| `rosnode_name` | Name of the ROS node. |
| `subscriber_topic` | Topic to subscribe to for audio input. |
| `publisher_topic` | Topic to publish the transcribed text on. |
| `fname_out` | Output file name (if saving the transcription to a file). |
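For reference, a launch file wiring several of these parameters together might look like the sketch below. The `pkg`/`type`/topic values here are illustrative placeholders, not the repository's actual defaults; check the `whisper.launch` shipped with the package for the real node type and default values.

```xml
<launch>
  <!-- Placeholder node declaration; substitute the real executable name -->
  <node name="whisper_ros1" pkg="whisper_ros1" type="whisper_node" output="screen">
    <param name="model" value="/path/to/ggml-medium.bin" />
    <param name="n_threads" value="4" />
    <param name="step_ms" value="3000" />
    <param name="vad_thold" value="0.6" />
    <param name="language" value="en" />
    <param name="subscriber_topic" value="/audio" />
    <param name="publisher_topic" value="/whisper/text" />
  </node>
</launch>
```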
The node uses a basic VAD detector to determine when to transcribe audio. The `vad_thold` parameter controls the sensitivity of the VAD: a higher value makes the detector more sensitive to silence. A value around `0.6` is generally recommended, but you may need to tune it for your specific use case.

When silence is detected, the node transcribes the last `length_ms` milliseconds of audio and outputs a transcription block suitable for parsing.
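To make the role of `vad_thold` concrete, the function below is a simplified Python rendition of the energy-based idea behind whisper.cpp's `vad_simple` detector. It is a sketch only: the real implementation is C++ and also applies a high-pass filter controlled by `freq_thold`, which is omitted here, and the function name and signature are illustrative.

```python
def vad_simple(samples, sample_rate, last_ms, vad_thold):
    """Return True when the tail of the buffer looks like silence.

    Compares the average energy of the trailing `last_ms` window against
    `vad_thold` times the average energy of the whole buffer. If the
    recent energy has dropped below that fraction, the speaker has
    likely gone quiet and the buffered audio can be transcribed.
    """
    n_last = (sample_rate * last_ms) // 1000
    if n_last == 0 or n_last > len(samples):
        return False  # not enough audio to judge

    energy_all = sum(abs(s) for s in samples) / len(samples)
    energy_last = sum(abs(s) for s in samples[-n_last:]) / n_last

    # Recent energy still high relative to the buffer -> speech ongoing.
    return energy_last <= vad_thold * energy_all
```

With a higher `vad_thold`, the trailing energy is compared against a larger fraction of the overall energy, so the detector declares silence more readily.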
This project is licensed under the MIT License. See the LICENSE file for details.
- Whisper model by OpenAI.
- whisper.cpp by Georgi Gerganov.