
😋 Speech To Text (STT)

Check the CHANGELOG file for a global overview of the latest modifications! 😋

Project structure

├── architectures            : utilities for model architectures
│   ├── layers               : custom layer implementations
│   ├── transformers         : transformer architecture implementations
│   │   └── whisper_arch.py  : Whisper architecture
│   ├── generation_utils.py  : utilities for text and sequence generation
│   ├── hparams.py           : hyperparameter management
│   └── simple_models.py     : defines classical models such as CNN / RNN / MLP and siamese networks
├── custom_train_objects     : custom objects used in training / testing
├── loggers                  : logging utilities for tracking experiment progress
├── models                   : main directory for model classes
│   ├── interfaces           : directory for interface classes
│   ├── stt                  : STT implementations
│   │   ├── base_stt.py      : abstract base class for all STT models
│   │   └── whisper.py       : Whisper implementation
│   └── weights_converter.py : utilities to convert weights between different models
├── tests                    : unit and integration tests for model validation
├── utils                    : utility functions for data processing and visualization
├── LICENCE                  : project license file
├── README.md                : this file
├── requirements.txt         : required packages
└── speech_to_text.ipynb     : notebook demonstrating model creation + STT features

Check the main project for more information about the unextended modules / structure / main classes.

Available features

  • Speech-To-Text (module models.stt):

| Feature        | Function / class | Description                                               |
| :------------- | :--------------- | :-------------------------------------------------------- |
| Speech-To-Text | `stt`            | Perform STT on audio / video files                         |
| Search         | `search`         | Search for words in audio / video and display timestamps  |

The speech_to_text notebook provides a concrete demonstration of the stt and search functions.
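In practice, they can be called along these lines (an illustrative sketch only: the exact signatures and return formats are defined in models.stt, and the file path is hypothetical; see the notebook for the real API):

```python
from models.stt import stt, search

# Transcribe an audio / video file (hypothetical path)
result = stt('audio/sample.wav')

# Search a word in the audio and get the timestamps of (approximate) matches
matches = search('cat', 'audio/sample.wav')
```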

Available models

Model architectures

Available architectures:

  • Whisper: OpenAI's Whisper multilingual STT model with transformer architecture

Model weights

The Whisper models are automatically downloaded and converted from the transformers library.
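For reference, here is a minimal example of fetching one of the original checkpoints directly with the transformers API (purely illustrative of where the weights come from; the project performs the download and conversion for you):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Original OpenAI checkpoint hosted on the transformers hub
model     = WhisperForConditionalGeneration.from_pretrained('openai/whisper-base')
processor = WhisperProcessor.from_pretrained('openai/whisper-base')
```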

Installation and usage

See the installation guide for a step-by-step installation 😄

Here is a summary of the installation procedure, assuming you have a working Python environment:

  1. Clone this repository: git clone https://github.com/yui-mhcp/speech_to_text.git
  2. Go to the root of this repository: cd speech_to_text
  3. Install requirements: pip install -r requirements.txt
  4. Open the speech_to_text notebook and follow the instructions!

Important Note: TensorRT-LLM support for Whisper is currently limited to version 0.15.0 of the library, which requires a Python 3.10 environment. See the installation guide mentioned above for a step-by-step installation ;)

TO-DO list:

  • Make the TO-DO list
  • Comment the code
  • Add multilingual model support (Whisper)
  • Add Beam-Search text decoding
  • Add streaming support
  • Convert Whisper pretrained models from the transformers hub
  • Support TensorRT-LLM for inference

Search and partial alignment

Even though Whisper produces high-quality transcriptions, it can still make mistakes, making exact-match searches ineffective. To address this limitation, the proposed search method leverages the Edit distance to compute a similarity score between the search text and the produced transcription. This allows matches to be defined based on a tolerance threshold rather than exact matching!

For instance, searching `cat` in `the ct is on the chair` will not find an exact match for `cat`, while `ct` has only 1 mismatch (the missing `a`).

The Levenshtein distance produces an alignment between `cat` and `ct` with the following distance matrix:

```
      c   t
  0   1   2
c 1   0   1
a 2   1   1
t 3   2   1
```

The bottom-right value is 1, which is the total number of operations (additions / deletions / replacements) needed to transform the hypothesis (`ct`) into the reference (`cat`).

The value at index i, j is the minimum of:

  • matrix[i-1][j] + the cost of adding character i of the reference (which is missing from the hypothesis)
  • matrix[i-1][j-1] + the cost of replacing character j of the hypothesis with character i of the reference (equal to 0 if both are the same character)
  • matrix[i][j-1] + the cost of deleting character j of the hypothesis (an extra character)

Note: To simplify the examples, all costs have been set to 1, but they can be specified in the edit_distance function (e.g., punctuation may have a cost of 0).
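As an illustration, here is a minimal Python sketch of this recurrence with all costs fixed to 1 (the actual edit_distance function is more configurable):

```python
def edit_distance(hypothesis, reference):
    """ Minimal sketch of the recurrence above (all costs fixed to 1). """
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # matrix[i][j] = cost of transforming hypothesis[:j] into reference[:i]
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows): matrix[i][0] = i  # add all reference characters
    for j in range(cols): matrix[0][j] = j  # delete all hypothesis characters

    for i in range(1, rows):
        for j in range(1, cols):
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,  # addition of reference[i - 1]
                matrix[i][j - 1] + 1,  # deletion of hypothesis[j - 1]
                # replacement (free if both characters match)
                matrix[i - 1][j - 1] + (0 if reference[i - 1] == hypothesis[j - 1] else 1)
            )
    return matrix[-1][-1]

print(edit_distance('ct', 'cat'))  # 1 : the missing 'a'
```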

The objective is to align `cat` at every position of the transcript (`the ct is`). For this purpose, the first row of the matrix is set to 0, so that an alignment can start at any position without penalty:

```
      t  h  e     c  t     i  s
   0  0  0  0  0  0  0  0  0  0
c  1  1  1  1  1  0  1  1  1  1
a  2  2  2  2  2  1  1  2  2  2
t  3  2  3  3  3  2  1  2  3  3
```

(the blank header columns correspond to the spaces in `the ct is`)

Note: Scores will be more relevant for longer search terms, as they're less influenced by small variations.
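The best partial match can then be read from the last row: its minimum is the alignment score, and the column where it occurs marks the end of the match. Here is a simplified sketch of this idea (the real search additionally maps character positions back to timestamps):

```python
def partial_alignment_score(keyword, transcript):
    """ Aligns `keyword` at every position of `transcript` (first row set to 0). """
    rows, cols = len(keyword) + 1, len(transcript) + 1
    matrix = [[0] * cols for _ in range(rows)]  # first row stays at 0
    for i in range(1, rows): matrix[i][0] = i

    for i in range(1, rows):
        for j in range(1, cols):
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,
                matrix[i][j - 1] + 1,
                matrix[i - 1][j - 1] + (0 if keyword[i - 1] == transcript[j - 1] else 1)
            )
    # minimum of the last row = best score, its column = end position of the match
    best_score = min(matrix[-1][1:])
    best_end   = matrix[-1].index(best_score)
    return best_score, best_end

score, end = partial_alignment_score('cat', 'the ct is')
print(score, end)          # 1 6 : one mismatch, the match ends after 'the ct'
print(score / len('cat'))  # ~0.33, which can be compared to a tolerance threshold
```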

An example is provided in the speech_to_text notebook to better illustrate how the search works.

Notes and references

This section summarizes key concepts and popular approaches to learn more about Speech-To-Text (STT) techniques, models and frameworks.

Key Concepts in STT

  1. Acoustic Modeling: Converting audio signals into phonetic representations
  2. Language Modeling: Determining the probability of word sequences
  3. Feature Extraction: Converting raw audio into spectrograms or MFCCs (Mel-frequency cepstral coefficients), as sketched after this list
  4. Decoding: Translating acoustic features into text transcriptions
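As a quick illustration of step 3, here is a hedged sketch using the librosa library (an assumption for the example, not necessarily a dependency of this project; the file path is hypothetical):

```python
import librosa

# Load and resample the audio to 16 kHz, the rate expected by Whisper
audio, sr = librosa.load('audio/sample.wav', sr = 16000)

# Log-mel spectrogram : the input representation used by Whisper (80 mel bins)
log_mel = librosa.power_to_db(librosa.feature.melspectrogram(y = audio, sr = sr, n_mels = 80))

# MFCC : a classical alternative feature
mfcc = librosa.feature.mfcc(y = audio, sr = sr, n_mfcc = 13)
```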

Popular STT Approaches

  1. CTC (Connectionist Temporal Classification): Used in DeepSpeech and Jasper (see the sketch after this list)
  2. Seq2Seq with Attention: Used in models like Listen, Attend and Spell
  3. Transformer-based approaches: Used in Whisper and SpeechT5
  4. RNN-Transducer (RNN-T): Used in production systems like Google's speech recognition
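As a tiny illustration of the CTC objective mentioned in item 1, here is a hedged sketch using PyTorch (an assumption for the example, not a dependency of this project): the loss aligns an unsegmented target sequence with a longer sequence of per-frame class probabilities.

```python
import torch

# 50 audio frames, batch of 1, vocabulary of 20 tokens + the CTC blank at index 0
ctc_loss  = torch.nn.CTCLoss(blank = 0)
log_probs = torch.randn(50, 1, 21).log_softmax(dim = -1)  # (time, batch, classes)
targets   = torch.randint(1, 21, (1, 10))                 # 10 target token ids (no blanks)

loss = ctc_loss(log_probs, targets, torch.tensor([50]), torch.tensor([10]))
print(loss.item())
```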


Contacts and licence


This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENCE file for details.

This license allows you to use, modify, and distribute the code, as long as you include the original copyright and license notice in any copy of the software/source. Additionally, if you modify the code and distribute it, or run it on a server as a service, you must make your modified version available under the same license.

For more information about the AGPL-3.0 license, please visit the official website

Citation

If you find this project useful in your work, please add this citation to give it more visibility! 😋

@misc{yui-mhcp,
    author       = {yui},
    title        = {A Deep Learning projects centralization},
    year         = {2021},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/yui-mhcp}}
}