The speech-llm-speech project is a ROS-based conversational AI system that processes user speech input, analyzes it through chatbot APIs, and responds with synthesized speech. The solution is implemented in C++, adheres to ROS standards, and is containerized for seamless deployment using Docker and Docker Compose.
The system consists of three main nodes:
Whisper ASR Node
- Converts speech input to text using Whisper.cpp
- Publishes transcribed text to the `/recognized_speech` topic
- Handles multiple audio formats, including WAV and MP3
Decision Maker Node
- Processes text from the `/recognized_speech` topic
- Interfaces with multiple LLM APIs (OpenAI, HuggingFace, Ollama)
- Publishes the selected response to the `/text_to_speak` topic
Google TTS Node
- Converts text from the `/text_to_speak` topic to speech
- Uses the Google Text-to-Speech API
- Outputs synthesized audio through the speakers
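The three-node flow above can be sketched in plain Python (no ROS required). The node and topic names match the description above; the function bodies are illustrative stand-ins, not the repository's actual implementations.

```python
def whisper_asr(audio_path: str) -> str:
    """Stand-in for the Whisper ASR node: audio file in, transcript out
    (published on /recognized_speech in the real system)."""
    return f"transcript of {audio_path}"

def decision_maker(recognized_speech: str) -> str:
    """Stand-in for the Decision Maker node: picks a response via an LLM
    (published on /text_to_speak in the real system)."""
    return f"response to: {recognized_speech}"

def google_tts(text_to_speak: str) -> bytes:
    """Stand-in for the Google TTS node: text in, synthesized audio bytes out."""
    return text_to_speak.encode("utf-8")

# The pipeline is a simple chain: ASR -> decision -> TTS.
audio = google_tts(decision_maker(whisper_asr("samples/jack.wav")))
print(audio)
```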
- Docker and Docker Compose
- ROS2 (tested on Iron)
- Ubuntu 22.04 or higher
- Install Docker on Ubuntu 22.04
  Follow the guide: How to Install and Use Docker on Ubuntu 22.04
- Perform the post-installation steps for Docker
  Ensure you complete the post-installation steps as outlined here: Post-Installation Steps for Docker on Linux
- Install the NVIDIA Container Toolkit (for GPU users)
  If you're using a GPU, install the NVIDIA Container Toolkit by referring to: NVIDIA Container Toolkit Installation Guide
- Install Docker Compose on Ubuntu 22.04
  Set up Docker Compose using the instructions here: How to Install and Use Docker Compose on Ubuntu 22.04
- Install the dependencies for connecting multiple containers:
sudo apt-get install gnome-terminal -y
git clone https://github.com/yourusername/speech-llm-speech.git
cd speech-llm-speech
The local ROS workspace location needs to be configured for the quick start.
OPENAI_API_KEY=your_key_here # Optional
HF_API_KEY=your_key_here # Optional
OLLAMA_MODEL=qwen:0.5b # Optional
MOCK_MODE=0 # Set to 1 to use mock LLM responses # Optional
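A minimal sketch of how a node might read these optional variables at startup, with the defaults implied by the listing above (the parsing logic here is an assumption, not the repository's actual code):

```python
import os

# Optional configuration, read from the environment; names match the README.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
HF_API_KEY = os.environ.get("HF_API_KEY", "")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "qwen:0.5b")
MOCK_MODE = os.environ.get("MOCK_MODE", "0") == "1"  # 1 enables mock LLM responses

print(f"mock mode enabled: {MOCK_MODE}")
```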
Pull the pre-built Docker images:
docker pull naren200/google_tts_node:v1
docker pull naren200/decision_maker_node:v1
docker pull naren200/whisper_asr_node:v1
Step 1: Ensure all the dependencies are installed as described here.
Step 2: Pull the latest images using the commands outlined here.
Demo video: https://youtu.be/7YaoBxjnQag
The command below starts all the Docker containers in one go.
./start_all_docker.sh
The generated audio will be saved to the following location: `google_tts/synthesized_speech.wav`.
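As a quick sanity check on the synthesized output, a short helper can report the WAV file's duration using only the standard library (the helper name is ours; the path comes from the README):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds, as a quick check
    that synthesis actually produced audio."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Example (path from the README):
# wav_duration_seconds("google_tts/synthesized_speech.wav")
```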
To stop the Docker containers, run the command below. Stopping the containers is recommended if you stumble into any issue.
./stop_docker.sh
Build the Docker image using the command mentioned here.
The command below starts the Docker container, copies the necessary files into it, and builds and sources the workspace for launch. The launch file is executed automatically through Flask.
# For text-to-speech
./start_docker.sh speak
Then publish topic data in another terminal:
docker exec -it $(docker ps -q) bash # Attach to the running container
ros2 topic pub /text_to_speak std_msgs/msg/String "data: 'Hello, I am audio generated by Google text-to-speech synthesis through the ROS middleware.'" --once ## Publish the string only once
You can find the generated audio in the location shown below.
speech-llm-speech/
├── google_tts/
│   └── output/
│       └── synthetic_audio.wav
Note: please stop the current Docker container if you run into any issue, as documented here.
The command below starts the Docker container, copies the necessary files into it, and builds and sources the workspace for launch. The launch file is executed automatically through Flask.
# For best response finder
./start_docker.sh decide
Then publish topic data in another terminal:
docker exec -it $(docker ps -q) bash # Attach to the running container
ros2 topic pub /recognized_speech std_msgs/msg/String "data: 'How to reach eternity during human life?'" --once ## Publish a string only once
Note: please stop the current Docker container if you run into any issue, as documented here.
The command below starts the Docker container, copies the necessary files into it, and builds and sources the workspace for launch. The launch file is executed automatically through Flask.
# For speech-to-text
./start_docker.sh transcribe
The transcribed text will be printed on the screen.
Note: please stop the current Docker container if you run into any issue, as documented here.
Configure the `AUDIO_FILE_NAME` variable in `start_docker.sh` and place the audio file in the location shown below.
speech-llm-speech/
├── whisper_asr/
│   └── samples/
│       └── jack.wav
Then set the variable in `start_docker.sh` as follows:
export AUDIO_FILE_NAME='jack.wav'
Optional: you may run another transcription model by replacing the bash script under `whisper_asr/assets`.
speech-llm-speech/
├── whisper_asr/
│   └── assets/
│       └── download-ggml-model.sh
Then set the `MODEL_NAME` environment variable in `whisper_asr/start_in_docker.sh` to the name of the corresponding bash script:
export MODEL_NAME=download-ggml-model.sh
You may find new models under this repository: https://github.com/ggerganov/whisper.cpp/tree/master/models
./stop_docker.sh
Pull the pre-built images:
docker pull naren200/google_tts_node:v1
docker pull naren200/decision_maker_node:v1
docker pull naren200/whisper_asr_node:v1
docker build -t naren200/decision_maker_node:v1 -f ./Dockerfiles/Dockerfile_decision_maker .
docker build -t naren200/whisper_asr_node:v1 -f ./Dockerfiles/Dockerfile_whisper_asr .
docker build -t naren200/google_tts_node:v1 -f ./Dockerfiles/Dockerfile_google_tts .
- Purpose: This mode is designed for straightforward execution of the system without any development-related overhead.
- How to Use: run `./start_docker.sh <mode>`, replacing `<mode>` with the desired node (`transcribe`, `decide`, or `speak`).
- Features:
- Automatically sets up and runs the required Docker container.
- Pre-configured for smooth operation with minimal user intervention.
- Suitable for deployment scenarios where you don't need to rebuild or modify the Docker images.
Configuration: Enable developer mode by setting the environment variable:
export DEVELOPER=True
Usage with --developer=true:
./start_docker.sh <mode> --developer=true
This mode attaches you to the running Docker container, allowing you to:
- Run individual nodes or launch files
- Access and modify container files
- Inspect logs and diagnose issues
- Test configurations without rebuilding
Usage with --build=true:
./start_docker.sh <mode> --build=true
This mode:
- Forces a complete rebuild of the Docker image
- Recreates all containers from scratch
- Updates any code changes in the image
- Restarts services with the new build
You can combine both flags if needed:
./start_docker.sh <mode> --developer=true --build=true
This will rebuild the image and then attach you to the container for development.
By abstracting the complexity into `start_docker.sh`, the system becomes easier to use in real-world scenarios, reducing the need for technical expertise.
Several difficulties arose while integrating Whisper ASR into the system. Here is a summary:
- CMake Path Issues:
  - The CMakeLists file did not include the necessary directories (e.g., header files, source files).
  - Folder paths had to be added manually so the build system could locate all dependencies.
- Whisper.cpp Challenges:
  - Ensuring compatibility between source files, header files, and application build configurations.
  - Parsing audio files (e.g., MP3) was computationally expensive on a system with limited resources.
  - These issues highlight the importance of optimizing the build process and leveraging higher computational power for real-time ASR.
- Build Automation via Docker:
  - The `start_docker.sh` script plays a pivotal role in automating the build and launch process.
  - It copies the necessary files into the Docker image, sources the ROS workspace, and runs the desired node or application.
  - In development mode, it enables manual interaction and debugging, rebuilding the Docker image if `--build=true` is set.
- Modular Architecture: each node is containerized separately for better scalability and maintenance, making the system easy to scale and deploy.
- Error Handling: implements robust error handling for API failures, thread failures, and audio-processing issues.
- Mock API Integration: allows development without incurring actual API costs.
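The mock-API idea can be sketched as a simple guard on the `MOCK_MODE` variable: when it is set to 1, the node returns a canned reply instead of calling a paid LLM API. The function name and reply text below are assumptions for illustration, not the repository's actual API.

```python
import os

def get_llm_response(prompt: str) -> str:
    """Return a canned reply when MOCK_MODE=1; otherwise a real LLM call
    (OpenAI / HuggingFace / Ollama) would go here."""
    if os.environ.get("MOCK_MODE", "0") == "1":
        return f"[mock reply] {prompt}"
    raise RuntimeError("real API call not implemented in this sketch")
```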
- Solve current challenges with `whisper_asr`:
  - Although the integration of `whisper.cpp` into the repository and `CMakeLists.txt` was successful, the parsing of audio data in the custom package is not working as intended. The output often appears as question marks, even though the whisper package runs without errors. These parsing issues will be investigated and resolved in future iterations.
- Optimize Whisper Integration:
  - Pre-convert audio files to a compatible format (e.g., WAV) to reduce processing overhead.
  - Utilize hardware acceleration (e.g., GPU) for `whisper_asr` and `decision_maker`.
- Streamline CMake Configuration:
  - Use `target_include_directories` and `target_link_libraries` for modular and maintainable CMake setups.
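The pre-conversion point above can be supported by a quick format check: whisper.cpp expects 16-kHz mono 16-bit PCM WAV input, so files already in that format need no costly conversion at runtime. The helper below is a sketch using only the standard library (the function name is ours):

```python
import wave

def is_whisper_ready(path: str) -> bool:
    """Return True if the file is already 16-kHz mono 16-bit WAV,
    the input format whisper.cpp expects."""
    try:
        with wave.open(path, "rb") as w:
            return (w.getnchannels() == 1
                    and w.getframerate() == 16000
                    and w.getsampwidth() == 2)
    except (wave.Error, FileNotFoundError, EOFError):
        return False
```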
This project is licensed under the MIT License.
- Whisper.cpp for ASR functionality
- Google Cloud TTS for speech synthesis
- ROS2 community for middleware support