The speech-llm-speech project is a ROS-based conversational AI system that processes user speech input, analyzes it through chatbot APIs, and responds with synthesized speech. The solution is implemented in C++, adheres to ROS standards, and is containerized for seamless deployment using Docker and Docker Compose.
The system consists of three main nodes:
Whisper ASR Node
- Converts speech input to text using Whisper.cpp
- Publishes transcribed text to the `/recognized_speech` topic
- Handles multiple audio formats, including WAV and MP3
Decision Maker Node
- Processes text from the `/recognized_speech` topic
- Interfaces with multiple LLM APIs (OpenAI, HuggingFace, Ollama)
- Publishes the selected response to the `/text_to_speak` topic
Google TTS Node
- Converts text from the `/text_to_speak` topic to speech
- Uses the Google Text-to-Speech API
- Outputs synthesized audio through the speakers
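The three-node flow above can be sketched in plain Python (no ROS required). The node and topic names match the description above; the function bodies are illustrative stand-ins, not the repository's actual implementations.

```python
def whisper_asr(audio_path: str) -> str:
    """Stand-in for the Whisper ASR node: audio file in, transcript out
    (published on /recognized_speech in the real system)."""
    return f"transcript of {audio_path}"

def decision_maker(recognized_speech: str) -> str:
    """Stand-in for the Decision Maker node: picks a response via an LLM
    (published on /text_to_speak in the real system)."""
    return f"response to: {recognized_speech}"

def google_tts(text_to_speak: str) -> bytes:
    """Stand-in for the Google TTS node: text in, synthesized audio bytes out."""
    return text_to_speak.encode("utf-8")

# The pipeline is a simple chain: ASR -> decision -> TTS.
audio = google_tts(decision_maker(whisper_asr("samples/jack.wav")))
print(audio)
```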
- Docker and Docker Compose
- ROS2 (tested on Iron)
- Ubuntu 22.04 or higher
- Install Docker on Ubuntu 22.04
  Follow the guide: How to Install and Use Docker on Ubuntu 22.04
- Perform the post-installation steps for Docker
  Ensure you complete the post-installation steps as outlined here: Post-Installation Steps for Docker on Linux
- Install the NVIDIA Container Toolkit (for GPU users)
  If you're using a GPU, install the NVIDIA Container Toolkit by referring to: NVIDIA Container Toolkit Installation Guide
- Install Docker Compose on Ubuntu 22.04
  Set up Docker Compose using the instructions here: How to Install and Use Docker Compose on Ubuntu 22.04
- Install the dependencies for connecting multiple containers:
sudo apt-get install gnome-terminal -y
git clone https://github.com/yourusername/speech-llm-speech.git
cd speech-llm-speech
The local ROS workspace location needs to be configured for the quick start.
OPENAI_API_KEY=your_key_here # Optional
HF_API_KEY=your_key_here # Optional
OLLAMA_MODEL=qwen:0.5b # Optional
MOCK_MODE=0 # Set to 1 to use mock LLM responses # Optional
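A minimal sketch of how a node might read these optional variables at startup, with the defaults implied by the listing above (the parsing logic here is an assumption, not the repository's actual code):

```python
import os

# Optional configuration, read from the environment; names match the README.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
HF_API_KEY = os.environ.get("HF_API_KEY", "")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "qwen:0.5b")
MOCK_MODE = os.environ.get("MOCK_MODE", "0") == "1"  # 1 enables mock LLM responses

print(f"mock mode enabled: {MOCK_MODE}")
```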
Pull the pre-built Docker images:
docker pull naren200/google_tts_node:v1
docker pull naren200/decision_maker_node:v1
docker pull naren200/whisper_asr_node:v1
Step 1: Ensure all the dependencies are installed as described here.
Step 2: Pull the latest images using the commands outlined here.
Demo video: https://youtu.be/7YaoBxjnQag
The command below starts all the Docker containers in one go.
./start_all_docker.sh
The generated audio will be saved to the following location: `google_tts/synthesized_speech.wav`.
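As a quick sanity check on the synthesized output, a short helper can report the WAV file's duration using only the standard library (the helper name is ours; the path comes from the README):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds, as a quick check
    that synthesis actually produced audio."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Example (path from the README):
# wav_duration_seconds("google_tts/synthesized_speech.wav")
```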
To stop the Docker containers, run the command below. Stopping the containers is recommended if you stumble into any issue.
./stop_docker.sh
Build the Docker image using the command mentioned here.
The command below starts the Docker container, copies the necessary files into it, and builds and sources the workspace for launch. The launch file is executed automatically through Flask.
# For text-to-speech
./start_docker.sh speak
Then publish topic data in another terminal:
docker exec -it $(docker ps -q) bash # Attach to the running container
ros2 topic pub /text_to_speak std_msgs/msg/String "data: 'Hello, I am audio generated by Google text-to-speech synthesis through the ROS middleware.'" --once ## Publish the string only once
You can find the generated audio in the location shown below.
speech-llm-speech/
├── google_tts/
│   └── output/
│       └── synthetic_audio.wav
Note: please stop the current Docker container if you run into any issue, as documented here.
The command below starts the Docker container, copies the necessary files into it, and builds and sources the workspace for launch. The launch file is executed automatically through Flask.
# For best response finder
./start_docker.sh decide
Then publish topic data in another terminal:
docker exec -it $(docker ps -q) bash # Attach to the running container
ros2 topic pub /recognized_speech std_msgs/msg/String "data: 'How to reach eternity during human life?'" --once ## Publish a string only once
Note: please stop the current Docker container if you run into any issue, as documented here.
The command below starts the Docker container, copies the necessary files into it, and builds and sources the workspace for launch. The launch file is executed automatically through Flask.
# For speech-to-text
./start_docker.sh transcribe
The transcribed text will be printed on the screen.
Note: please stop the current Docker container if you run into any issue, as documented here.
Configure the `AUDIO_FILE_NAME` variable in `start_docker.sh` and place the audio file in the location shown below.
speech-llm-speech/
├── whisper_asr/
│   └── samples/
│       └── jack.wav
Then set the variable in `start_docker.sh` as follows:
export AUDIO_FILE_NAME='jack.wav'
Optional: you may run another transcription model by replacing the bash script under `whisper_asr/assets`.
speech-llm-speech/
├── whisper_asr/
│   └── assets/
│       └── download-ggml-model.sh
Then set the `MODEL_NAME` environment variable in `whisper_asr/start_in_docker.sh` to the name of the corresponding bash script:
export MODEL_NAME=download-ggml-model.sh
You may find new models under this repository: https://github.com/ggerganov/whisper.cpp/tree/master/models
./stop_docker.sh
Pull the pre-built images:
docker pull naren200/google_tts_node:v1
docker pull naren200/decision_maker_node:v1
docker pull naren200/whisper_asr_node:v1
docker build -t naren200/decision_maker_node:v1 -f ./Dockerfiles/Dockerfile_decision_maker .
docker build -t naren200/whisper_asr_node:v1 -f ./Dockerfiles/Dockerfile_whisper_asr .
docker build -t naren200/google_tts_node:v1 -f ./Dockerfiles/Dockerfile_google_tts .
- Purpose: This mode is designed for straightforward execution of the system without any development-related overhead.
- How to Use: run `./start_docker.sh <mode>`, replacing `<mode>` with the desired node (`transcribe`, `decide`, or `speak`).
- Features:
- Automatically sets up and runs the required Docker container.
- Pre-configured for smooth operation with minimal user intervention.
- Suitable for deployment scenarios where you don't need to rebuild or modify the Docker images.
Configuration: Enable developer mode by setting the environment variable:
export DEVELOPER=True
Usage with --developer=true:
./start_docker.sh <mode> --developer=true
This mode attaches you to the running Docker container, allowing you to:
- Run individual nodes or launch files
- Access and modify container files
- Inspect logs and diagnose issues
- Test configurations without rebuilding
Usage with --build=true:
./start_docker.sh <mode> --build=true
This mode:
- Forces a complete rebuild of the Docker image
- Recreates all containers from scratch
- Updates any code changes in the image
- Restarts services with the new build
You can combine both flags if needed:
./start_docker.sh <mode> --developer=true --build=true
This will rebuild the image and then attach you to the container for development.
By abstracting the complexity into `start_docker.sh`, the system becomes easier to use in real-world scenarios, reducing the need for technical expertise.
Several difficulties arose while integrating Whisper ASR into the system. Here is a summary:
- CMake Path Issues:
  - The CMakeLists file did not include the necessary directories (e.g., header files, source files).
  - Folder paths had to be added manually so the build system could locate all dependencies.
- Whisper.cpp Challenges:
  - Ensuring compatibility between source files, header files, and application build configurations.
  - Parsing audio files (e.g., MP3) was computationally expensive on a system with limited resources.
  - These issues highlight the importance of optimizing the build process and leveraging higher computational power for real-time ASR.
- Build Automation via Docker:
  - The `start_docker.sh` script plays a pivotal role in automating the build and launch process.
  - It copies the necessary files into the Docker image, sources the ROS workspace, and runs the desired node or application.
  - In development mode, it enables manual interaction and debugging, rebuilding the Docker image if `--build=true` is set.
- Modular Architecture: each node is containerized separately for better scalability and maintenance, making the system easy to scale and deploy.
- Error Handling: implements robust error handling for API failures, thread failures, and audio-processing issues.
- Mock API Integration: allows development without incurring actual API costs.
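The mock-API idea can be sketched as a simple guard on the `MOCK_MODE` variable: when it is set to 1, the node returns a canned reply instead of calling a paid LLM API. The function name and reply text below are assumptions for illustration, not the repository's actual API.

```python
import os

def get_llm_response(prompt: str) -> str:
    """Return a canned reply when MOCK_MODE=1; otherwise a real LLM call
    (OpenAI / HuggingFace / Ollama) would go here."""
    if os.environ.get("MOCK_MODE", "0") == "1":
        return f"[mock reply] {prompt}"
    raise RuntimeError("real API call not implemented in this sketch")
```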
- Solve current challenges with `whisper_asr`:
  - Although the integration of `whisper.cpp` into the repository and `CMakeLists.txt` was successful, the parsing of audio data in the custom package is not working as intended. The output often appears as question marks, even though the whisper package runs without errors. These parsing issues will be investigated and resolved in future iterations.
- Optimize Whisper Integration:
  - Pre-convert audio files to a compatible format (e.g., WAV) to reduce processing overhead.
  - Utilize hardware acceleration (e.g., GPU) for `whisper_asr` and `decision_maker`.
- Streamline CMake Configuration:
  - Use `target_include_directories` and `target_link_libraries` for modular and maintainable CMake setups.
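The pre-conversion point above can be supported by a quick format check: whisper.cpp expects 16-kHz mono 16-bit PCM WAV input, so files already in that format need no costly conversion at runtime. The helper below is a sketch using only the standard library (the function name is ours):

```python
import wave

def is_whisper_ready(path: str) -> bool:
    """Return True if the file is already 16-kHz mono 16-bit WAV,
    the input format whisper.cpp expects."""
    try:
        with wave.open(path, "rb") as w:
            return (w.getnchannels() == 1
                    and w.getframerate() == 16000
                    and w.getsampwidth() == 2)
    except (wave.Error, FileNotFoundError, EOFError):
        return False
```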
This project is licensed under the MIT License.
- Whisper.cpp for ASR functionality
- Google Cloud TTS for speech synthesis
- ROS2 community for middleware support