Check the CHANGELOG file to have a global overview of the latest modifications!
├── architectures : utilities for model architectures
│   ├── layers : custom layer implementations
│   ├── transformers : transformer architecture implementations
│   │   ├── whisper_arch.py : Whisper architecture
│   ├── generation_utils.py : utilities for text and sequence generation
│   ├── hparams.py : hyperparameter management
│   ├── simple_models.py : defines classical models such as CNN / RNN / MLP and siamese
├── custom_train_objects : custom objects used in training / testing
├── loggers : logging utilities for tracking experiment progress
├── models : main directory for model classes
│   ├── interfaces : directories for interface classes
│   ├── stt : STT implementations
│   │   ├── base_stt.py : abstract base class for all STT models
│   │   ├── whisper.py : Whisper implementation
│   ├── weights_converter.py : utilities to convert weights between different models
├── tests : unit and integration tests for model validation
├── utils : utility functions for data processing and visualization
├── LICENCE : project license file
├── README.md : this file
├── requirements.txt : required packages
├── speech_to_text.ipynb : notebook demonstrating model creation + STT features
Check the main project for more information about the unextended modules / structure / main classes.
- Speech-To-Text (module `models.stt`) :
| Feature | Function / class | Description |
|---|---|---|
| Speech-To-Text | `stt` | Perform STT on audio / video files |
| Search | `search` | Search for words in audio / video and display timestamps |
The `speech_to_text` notebook provides a concrete demonstration of the `stt` and `search` functions.
Available architectures:
- Whisper: OpenAI's Whisper multilingual STT model with transformer architecture
The `Whisper` models are automatically downloaded and converted from the `transformers` library.
See the installation guide for a step-by-step installation.
Here is a summary of the installation procedure, if you have a working python environment :
- Clone this repository: `git clone https://github.com/xxxxx/speech_to_text.git`
- Go to the root of this repository: `cd speech_to_text`
- Install requirements: `pip install -r requirements.txt`
- Open the `speech_to_text` notebook and follow the instructions!
Important Note : The `TensorRT-LLM` support for `Whisper` is currently limited to version `0.15.0` of the library, which requires a Python `3.10` environment. See the installation guide mentioned above for a step-by-step installation ;)
- Make the TO-DO list
- Comment the code
- Add multilingual model support (`Whisper`)
- Add Beam-Search text decoding
- Add streaming support
- Convert `Whisper` pretrained models from the `transformers` hub
- Support TensorRT-LLM for inference
Even though `Whisper` produces high-quality transcriptions, it can still make mistakes, making exact-match searches ineffective. To address this limitation, the proposed `search` method leverages the edit distance to compute a similarity score between the search text and the produced transcription. This allows matches to be defined based on a tolerance threshold rather than exact matching!
For instance, searching for `cat` in `the ct is on the chair` will not find an exact match for `cat`, even though `ct` differs from it by only 1 mismatch (the missing `a`).
The Levenshtein distance produces an alignment between cat and ct with a distance matrix:
|   |   | c | t |
|---|---|---|---|
|   | 0 | 1 | 2 |
| c | 1 | 0 | 1 |
| a | 2 | 1 | 1 |
| t | 3 | 2 | 1 |
The bottom-right value is 1, which represents the total number of operations (insertions, deletions, replacements) needed to transform the hypothesis (`ct`) into the reference (`cat`).
The value at index `i, j` is the minimum between:
- `matrix[i-1][j]` + deletion cost of character `i` (in hypothesis)
- `matrix[i-1][j-1]` + replacement cost of character `i` (in hypothesis) and character `j` (of truth) (equal to 0 if both are the same character)
- `matrix[i][j-1]` + insertion cost of character `j` (of truth)
Note: To simplify the examples, all costs have been set to 1, but they can be specified in the `edit_distance` function (e.g., punctuation may have a cost of 0).
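The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not the project's `edit_distance` implementation: the function name `levenshtein_matrix`, its argument names and the scalar cost parameters are assumptions made for this example. Rows follow the first argument and columns the second, so calling it with `cat` and `ct` reproduces the layout of the table above.

```python
def levenshtein_matrix(source, target, insertion_cost=1, deletion_cost=1,
                       replacement_cost=1):
    """Return the (len(source) + 1) x (len(target) + 1) edit-distance matrix.

    Hypothetical helper for illustration only: rows index `source`,
    columns index `target`, and all costs are plain numbers.
    """
    n, m = len(source), len(target)
    matrix = [[0] * (m + 1) for _ in range(n + 1)]

    # First column: delete every character of `source` seen so far.
    for i in range(1, n + 1):
        matrix[i][0] = matrix[i - 1][0] + deletion_cost
    # First row: insert every character of `target` seen so far.
    for j in range(1, m + 1):
        matrix[0][j] = matrix[0][j - 1] + insertion_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            matrix[i][j] = min(
                matrix[i - 1][j] + deletion_cost,
                matrix[i - 1][j - 1] + (0 if same else replacement_cost),
                matrix[i][j - 1] + insertion_cost,
            )
    return matrix


matrix = levenshtein_matrix("cat", "ct")
for row in matrix:
    print(row)
print(matrix[-1][-1])  # 1
```

With unit costs the distance is symmetric, so swapping the two arguments changes the shape of the matrix but not the bottom-right value.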
The objective is to align `cat` at every position of the transcript (`the ct is`). For this purpose, the solution sets the first row to 0, so that an alignment can start at any position without being penalized for where it starts:
|   |   | t | h | e | ␣ | c | t | ␣ | i | s |
|---|---|---|---|---|---|---|---|---|---|---|
|   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| c | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| a | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 2 | 2 |
| t | 3 | 2 | 3 | 3 | 3 | 2 | 1 | 2 | 3 | 3 |
Note: Scores will be more relevant for longer search terms, as they're less influenced by small variations.
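The zeroed first row can be sketched as follows. This is an illustrative approximation rather than the project's `search` function: the names `search_matrix` and `find_matches`, the unit costs, and the `max_distance` tolerance are all assumptions for this example. The last row of the matrix gives, for each end position in the transcript, the cost of the best alignment of the query ending there.

```python
def search_matrix(query, transcript):
    """Edit-distance matrix whose first row is all zeros.

    Hypothetical helper: rows index `query`, columns index `transcript`,
    and every operation costs 1.
    """
    m = len(transcript)
    # First row set to 0: an alignment may start anywhere for free.
    matrix = [[0] * (m + 1)]
    for i in range(1, len(query) + 1):
        # First column: the query prefix matched against an empty transcript.
        row = [i] + [0] * m
        for j in range(1, m + 1):
            same = query[i - 1] == transcript[j - 1]
            row[j] = min(
                matrix[i - 1][j] + 1,                     # skip a query character
                matrix[i - 1][j - 1] + (0 if same else 1),  # match / replace
                row[j - 1] + 1,                           # skip a transcript character
            )
        matrix.append(row)
    return matrix


def find_matches(query, transcript, max_distance=1):
    """Return (end_index, distance) pairs whose distance is within tolerance.

    `end_index` counts the transcript characters consumed, so the match
    ends just before `transcript[end_index]`.
    """
    last_row = search_matrix(query, transcript)[-1]
    return [
        (end, distance)
        for end, distance in enumerate(last_row)
        if end > 0 and distance <= max_distance
    ]


print("cat" in "the ct is")               # False: no exact match
print(find_matches("cat", "the ct is"))   # [(6, 1)]
```

Here the only position within a tolerance of 1 is end index 6, which is where `ct` ends in `the ct is`. A real implementation would typically normalize the distance by the query length to obtain a similarity score and would also recover the start position of each match.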
An example is provided in the `speech_to_text` notebook to better illustrate how the search works.
This section proposes useful projects, papers and tutorials to learn more about Speech-To-Text (STT)
techniques, models and frameworks.
- Acoustic Modeling: Converting audio signals into phonetic representations
- Language Modeling: Determining the probability of word sequences
- Feature Extraction: Converting raw audio into spectrograms or MFCCs (Mel-frequency cepstral coefficients)
- Decoding: Translating acoustic features into text transcriptions
- CTC (Connectionist Temporal Classification): Used in DeepSpeech and Jasper
- Seq2Seq with Attention: Used in models like Listen, Attend and Spell
- Transformer-based approaches: Used in Whisper and SpeechT5
- RNN-Transducer (RNN-T): Used in production systems like Google's speech recognition
- Self-Supervised Learning for Speech Recognition: Overview paper on self-supervised approaches
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin: Original DeepSpeech2 paper
- Jasper: An End-to-End Convolutional Neural Acoustic Model: The original Jasper paper
- Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition: Original SpeechTransformer paper
- A technique for computer detection and correction of spelling errors: Levenshtein distance paper
- RNN-T for Latency Controlled ASR WITH IMPROVED BEAM SEARCH: RNN-Transducer paper
- Conformer: Convolution-augmented Transformer for Speech Recognition: Conformer original paper
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision: OpenAI's Whisper paper
- Speech Recognition with TensorFlow: Official TensorFlow tutorial to get started with audio processing
- Introduction to Automatic Speech Recognition: Hugging Face course on ASR basics
- Speech Recognition with Wav2Vec2: Fine-tuning Wav2Vec2 for English ASR
- End-to-End Speech Recognition Systems: Visual explanation of CTC and end-to-end systems
- Keras tutorial: Tutorial on speech recognition with Transformers
- Levenshtein distance computation: A Step-by-Step computation of the Levenshtein distance
- NVIDIA NeMo project: Main website for NVIDIA NeMo project, containing many tutorials on NLP (ASR, TTS, etc.)
- LibriSpeech ASR with PyTorch: PyTorch example using the LibriSpeech dataset
- Mozilla DeepSpeech Examples: Practical examples using Mozilla's implementation
- Whisper Fine-Tuning Examples: Hugging Face examples for fine-tuning Whisper
- NVIDIA's Jasper project: Original Jasper code
- NVIDIA's NeMo project: Provides a PyTorch implementation of the `Conformer` and `RNN-T` models
- Automatic Speech Recognition project: DeepSpeech2 implementation
- OpenAI's Whisper: The official OpenAI implementation of Whisper (in PyTorch)
- ESPnet: End-to-End Speech Processing Toolkit with various ASR implementations
- SpeechBrain: PyTorch-based speech toolkit covering various speech tasks
Contacts:
- Mail: [email protected]
- Discord: yui0732
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENSE file for details.
This license allows you to use, modify, and distribute the code, as long as you include the original copyright and license notice in any copy of the software/source. Additionally, if you modify the code and distribute it, or run it on a server as a service, you must make your modified version available under the same license.
For more information about the AGPL-3.0 license, please visit the official website.
If you find this project useful in your work, please add this citation to give it more visibility!
@misc{yui-mhcp,
author = {yui},
title = {A Deep Learning projects centralization},
year = {2021},
publisher = {GitHub},
howpublished = {\url{https://github.com/yui-mhcp}}
}