
A ROS2-based Conversational AI system that processes speech input, interacts with various chatbot APIs (including ChatGPT, Hugging Face, and local LLMs via Ollama), and generates spoken responses using independently Dockerized ROS2 nodes.


Speech-LLM-Speech System

Overview

The speech-llm-speech project is a ROS2-based Conversational AI system that processes user speech input, analyzes it through chatbot APIs, and responds with synthesized speech. The system is implemented in C++, follows ROS2 conventions, and is containerized for seamless deployment using Docker and Docker Compose.

System Architecture

The system consists of three main nodes:

Whisper ASR Node

  • Converts speech input to text using Whisper.cpp
  • Publishes transcribed text to /recognized_speech topic
  • Handles multiple audio formats including WAV and MP3

Decision Maker Node

  • Processes text from /recognized_speech topic
  • Interfaces with multiple LLM APIs (OpenAI, HuggingFace, Ollama)
  • Publishes selected response to /text_to_speak topic

Google TTS Node

  • Converts text from /text_to_speak topic to speech
  • Uses Google Text-to-Speech API
  • Outputs synthesized audio through speakers
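The three nodes above form a simple text pipeline over ROS2 topics. As a rough illustration only, the toy shell sketch below simulates that flow with plain-text stand-ins; these functions are not the real nodes or topics:

```shell
# Toy simulation of the three-stage pipeline (stand-ins, not the real nodes):
asr() { echo "hello world"; }                       # would publish /recognized_speech
decide() { read -r text; echo "You said: $text"; }  # would publish /text_to_speak
tts() { read -r reply; echo "[audio] $reply"; }     # would play audio on the speakers
asr | decide | tts   # prints: [audio] You said: hello world
```

In the real system each stage runs in its own container, and the "pipes" are the ROS2 topics named above.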

Prerequisites

  • Docker and Docker Compose
  • ROS2 (tested on Iron)
  • Ubuntu 22.04 or higher

Dependencies Installation Guide

  1. Install Docker on Ubuntu 22.04
    Follow the guide to install Docker:
    How to Install and Use Docker on Ubuntu 22.04

  2. Perform Post-Installation Steps for Docker
    Ensure you complete the post-installation steps as outlined here:
    Post-Installation Steps for Docker on Linux

  3. Install NVIDIA Container Toolkit (For GPU Users)
    If you're using a GPU, install the NVIDIA Container Toolkit by referring to:
    NVIDIA Container Toolkit Installation Guide

  4. Install Docker Compose on Ubuntu 22.04
    Set up Docker Compose using the instructions here:
    How to Install and Use Docker Compose on Ubuntu 22.04

  5. Install dependencies for connecting multiple containers. Complete the following instructions:

sudo apt-get install gnome-terminal -y

Quick Start

1. Clone the repository:

git clone https://github.com/yourusername/speech-llm-speech.git
cd speech-llm-speech

2. Configure the following environment variables in the start_docker.sh file:

The local ROS2 workspace location also needs to be configured for quick start.

OPENAI_API_KEY=your_key_here # Optional
HF_API_KEY=your_key_here # Optional
OLLAMA_MODEL=qwen:0.5b # Optional
MOCK_MODE=0  # Set to 1 to use mock LLM responses # Optional
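As an illustration of how these variables interact, here is a hedged sketch of backend selection; the function name and exact logic are assumptions, not the actual contents of start_docker.sh:

```shell
# Hypothetical backend-selection logic (the real start_docker.sh may differ):
check_backend() {
  if [ "${MOCK_MODE:-0}" = "1" ]; then
    echo "mock mode"                                     # mock LLM responses
  elif [ -n "${OPENAI_API_KEY:-}" ] || [ -n "${HF_API_KEY:-}" ]; then
    echo "api key set"                                   # hosted LLM API
  else
    echo "falling back to Ollama model ${OLLAMA_MODEL:-qwen:0.5b}"  # local LLM
  fi
}
MOCK_MODE=1 check_backend   # prints: mock mode
```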

3. Pull the Docker images

Pull the pre-built Docker images:

docker pull naren200/google_tts_node:v1
docker pull naren200/decision_maker_node:v1
docker pull naren200/whisper_asr_node:v1

4a. Test: multiple containers in one Docker Compose, with ROS2 communication between containers

Step 1: Ensure all the dependencies are installed as described here

Step 2: Pull the latest images using the commands outlined here

Demo video: https://youtu.be/7YaoBxjnQag

The command below starts all the Docker containers in one go.

./start_all_docker.sh

The generated audio will be written to the following location: google_tts/synthesized_speech.wav.

To stop the Docker containers, run the command below. It's recommended to stop the containers whenever you run into an issue.

./stop_docker.sh

4b. Test each node separately:

Build the Docker images using the commands mentioned here

start node: google_tts

The command below starts the Docker container, copies the necessary files into it, and builds and sources the workspace for launch. The launch file is executed automatically through Flask.

# For text-to-speech
./start_docker.sh speak

Then publish topic data in another terminal:

docker exec -it $(docker ps -q) bash # Attach to the running container
ros2 topic pub /text_to_speak std_msgs/msg/String "data: 'Hello, I am an audio generated by Google's text-to-speech synthesis through the ROS middleware.'" --once ## Publish the string only once

You can find the generated audio in the directory shown below.

speech-llm-speech/
├── google_tts/
│   ├── output
│       |── synthetic_audio.wav

Demo video: https://youtu.be/qFFDkh0DOK8

Note: Stop the current Docker container if you run into any issue, as documented here.

start node: decision_maker

The command below starts the Docker container, copies the necessary files into it, and builds and sources the workspace for launch. The launch file is executed automatically through Flask.

# For best response finder
./start_docker.sh decide

Then publish topic data in another terminal:

docker exec -it $(docker ps -q) bash # Attach to the running container
ros2 topic pub /recognized_speech std_msgs/msg/String "data: 'How to reach eternity during human life?'" --once ## Publish a string only once

Demo video: https://youtu.be/MSFU5G0aQJo

Note: Stop the current Docker container if you run into any issue, as documented here.

start node: whisper_asr

The command below starts the Docker container, copies the necessary files into it, and builds and sources the workspace for launch. The launch file is executed automatically through Flask.

# For speech-to-text
./start_docker.sh transcribe

The transcribed text will be printed to the screen.

Note: Stop the current Docker container if you run into any issue, as documented here.

Extra Configuration: whisper_asr

Configure the AUDIO_FILE_NAME variable in start_docker.sh and place the audio file in the location shown below.

speech-llm-speech/
├── whisper_asr/
│   ├── samples
│       |── jack.wav

Then set the variable in start_docker.sh as follows:

export AUDIO_FILE_NAME='jack.wav'
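A quick sanity check before launching can save a failed run; this hedged snippet only assumes the directory layout shown above:

```shell
# Verify the sample exists before starting the container (path from the tree above)
AUDIO_FILE_NAME='jack.wav'
sample="whisper_asr/samples/$AUDIO_FILE_NAME"
[ -f "$sample" ] || echo "missing sample: $sample"
```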

Optional: You may run another transcription model by replacing the bash script under whisper_asr/assets.

speech-llm-speech/
├── whisper_asr/
│   ├── assets
│       |── download-ggml-model.sh

Then set the MODEL_NAME environment variable in the whisper_asr/start_in_docker.sh file to the name of the corresponding bash script:

export MODEL_NAME=download-ggml-model.sh

You may find new models under this repository: https://github.com/ggerganov/whisper.cpp/tree/master/models


Stop any container running in the background

./stop_docker.sh

Docker Image

Quick start

Pull the pre-built images:

docker pull naren200/google_tts_node:v1
docker pull naren200/decision_maker_node:v1
docker pull naren200/whisper_asr_node:v1

Steps to build Docker Image from scratch

docker build -t naren200/decision_maker_node:v1 -f ./Dockerfiles/Dockerfile_decision_maker . 
docker build -t naren200/whisper_asr_node:v1 -f ./Dockerfiles/Dockerfile_whisper_asr . 
docker build -t naren200/google_tts_node:v1 -f ./Dockerfiles/Dockerfile_google_tts . 

Code Structure and Methods

Build Modes

1. Normal Mode

  • Purpose: This mode is designed for straightforward execution of the system without any development-related overhead.
  • How to Use:
    ./start_docker.sh <mode>
    Replace <mode> with the desired node (transcribe, decide, or speak).
  • Features:
    • Automatically sets up and runs the required Docker container.
    • Pre-configured for smooth operation with minimal user intervention.
    • Suitable for deployment scenarios where you don't need to rebuild or modify the Docker images.

2. Developer Mode

Configuration: Enable developer mode by setting the environment variable:

export DEVELOPER=True

Usage with --developer=true:

./start_docker.sh <mode> --developer=true

This mode attaches you to the running Docker container, allowing you to:

  • Run individual nodes or launch files
  • Access and modify container files
  • Inspect logs and diagnose issues
  • Test configurations without rebuilding

3. Build Mode

Usage with --build=true:

./start_docker.sh <mode> --build=true

This mode:

  • Forces a complete rebuild of the Docker image
  • Recreates all containers from scratch
  • Updates any code changes in the image
  • Restarts services with the new build

You can combine both flags if needed:

./start_docker.sh <mode> --developer=true --build=true

This will rebuild the image and then attach you to the container for development.
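For intuition, the flag handling above might be parsed along these lines; this is a hypothetical sketch, not the actual start_docker.sh implementation:

```shell
# Hypothetical flag parsing (the real start_docker.sh may differ):
parse_flags() {
  MODE=$1; shift                       # transcribe, decide, or speak
  DEVELOPER=false; BUILD=false
  for arg in "$@"; do
    case "$arg" in
      --developer=true) DEVELOPER=true ;;   # attach to the container
      --build=true)     BUILD=true ;;       # force a full image rebuild
    esac
  done
  echo "mode=$MODE developer=$DEVELOPER build=$BUILD"
}
parse_flags speak --developer=true --build=true
# prints: mode=speak developer=true build=true
```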

4. Scalability:

By abstracting the complexity into start_docker.sh, the system becomes easier to use in real-world scenarios, reducing the need for deep technical expertise.


Challenges with CMake and Whisper.cpp Integration

Integrating Whisper ASR into the system involved several difficulties. Here is a summary:

  1. CMake Path Issues:

    • The CMake list file did not include necessary directories (e.g., header files, source files).
    • Required adding folder paths manually to ensure the build system could locate all dependencies.
  2. Whisper.cpp Challenges:

    • Ensuring compatibility between source files, header files, and application build configurations.
    • Parsing audio files (e.g., MP3) was computationally expensive on a system with limited resources.
    • These issues highlight the importance of optimizing the build process and leveraging higher computational power for real-time ASR.
  3. Build Automation via Docker:

    • The start_docker.sh script plays a pivotal role in automating the build and launch process.
    • It copies necessary files into the Docker image, sources the ROS workspace, and runs the desired node or application.
    • For development mode, it enables manual interaction and debugging by rebuilding the Docker image if --build=true is set.

Critical Design Choices

  1. Modular Architecture: Each node is containerized separately, making the system easily scalable, maintainable, and deployable.
  2. Error Handling: Implements robust error handling for API failures, thread failures, and audio-processing issues.
  3. Mock API Integration: Allows development and testing without incurring API costs.

Plans to improve

  • Solve current challenges with whisper_asr
    • Although the integration of whisper.cpp into the repository and CMakeLists.txt was successful, the parsing of audio data in the custom package is not working as intended. The output often appears as question marks, despite the whisper package running without errors. These parsing issues will be investigated and resolved in future iterations.
  • Optimize Whisper Integration:
    • Pre-convert audio files to a compatible format (e.g., WAV) to reduce processing overhead.
    • Utilize hardware acceleration (e.g., GPU) for whisper_asr and decision_maker.
  • Streamline CMake Configuration:
    • Use target_include_directories and target_link_libraries for modular and maintainable CMake setups.
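As an illustration of that CMake suggestion, here is a hedged fragment; the target name and paths are assumptions for illustration, not the project's actual CMakeLists.txt:

```cmake
# Hypothetical fragment: scope includes and libraries to the target
# instead of using global include_directories()/link_directories().
add_executable(whisper_asr_node src/whisper_asr_node.cpp)
target_include_directories(whisper_asr_node PRIVATE
  ${CMAKE_CURRENT_SOURCE_DIR}/include   # assumed package header location
  ${whisper_SOURCE_DIR})                # assumed: whisper.cpp via add_subdirectory
target_link_libraries(whisper_asr_node PRIVATE whisper)
```

Target-scoped commands keep each node's dependencies explicit, which avoids the manual path additions described in the challenges section above.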

License

This project is licensed under the MIT License.

Acknowledgments

  • Whisper.cpp for ASR functionality
  • Google Cloud TTS for speech synthesis
  • ROS2 community for middleware support
