
😋 Speech To Text (STT)

Check the CHANGELOG file for a global overview of the latest modifications! 😋

Project structure

├── architectures            : utilities for model architectures
│   ├── layers               : custom layer implementations
│   ├── transformers         : transformer architecture implementations
│   │   └── whisper_arch.py  : Whisper architecture
│   ├── generation_utils.py  : utilities for text and sequence generation
│   ├── hparams.py           : hyperparameter management
│   └── simple_models.py     : defines classical models such as CNN / RNN / MLP and siamese networks
├── custom_train_objects     : custom objects used in training / testing
├── loggers                  : logging utilities for tracking experiment progress
├── models                   : main directory for model classes
│   ├── interfaces           : directory for interface classes
│   ├── stt                  : STT implementations
│   │   ├── base_stt.py      : abstract base class for all STT models
│   │   └── whisper.py       : Whisper implementation
│   └── weights_converter.py : utilities to convert weights between different models
├── tests                    : unit and integration tests for model validation
├── utils                    : utility functions for data processing and visualization
├── LICENCE                  : project license file
├── README.md                : this file
├── requirements.txt         : required packages
└── speech_to_text.ipynb     : notebook demonstrating model creation + STT features

Check the main project for more information about the unextended modules / structure / main classes.

Available features

  • Speech-To-Text (module models.stt):

| Feature        | Function / class | Description                                               |
| :------------- | :--------------- | :-------------------------------------------------------- |
| Speech-To-Text | `stt`            | Perform STT on audio / video files                         |
| Search         | `search`         | Search for words in audio / video and display timestamps  |

The speech_to_text notebook provides a concrete demonstration of the stt and search functions.
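In practice, they can be called along these lines (an illustrative sketch only: the exact signatures and return formats are defined in models.stt, and the file path is hypothetical; see the notebook for the real API):

```python
from models.stt import stt, search

# Transcribe an audio / video file (hypothetical path)
result = stt('audio/sample.wav')

# Search a word in the audio and get the timestamps of (approximate) matches
matches = search('cat', 'audio/sample.wav')
```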

Available models

Model architectures

Available architectures:

  • Whisper: OpenAI's Whisper multilingual STT model with transformer architecture

Model weights

The Whisper models are automatically downloaded and converted from the transformers library.
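For reference, here is a minimal example of fetching one of the original checkpoints directly with the transformers API (purely illustrative of where the weights come from; the project performs the download and conversion for you):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Original OpenAI checkpoint hosted on the transformers hub
model     = WhisperForConditionalGeneration.from_pretrained('openai/whisper-base')
processor = WhisperProcessor.from_pretrained('openai/whisper-base')
```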

Installation and usage

See the installation guide for a step-by-step installation 😄

Here is a summary of the installation procedure, assuming you have a working Python environment:

  1. Clone this repository: git clone https://github.com/yui-mhcp/speech_to_text.git
  2. Go to the root of this repository: cd speech_to_text
  3. Install requirements: pip install -r requirements.txt
  4. Open the speech_to_text notebook and follow the instructions!

Important Note: TensorRT-LLM support for Whisper is currently limited to version 0.15.0 of the library, which requires a Python 3.10 environment. See the installation guide mentioned above for a step-by-step installation ;)

TO-DO list:

  • Make the TO-DO list
  • Comment the code
  • Add multilingual model support (Whisper)
  • Add Beam-Search text decoding
  • Add streaming support
  • Convert Whisper pretrained models from the transformers hub
  • Support TensorRT-LLM for inference

Search and partial alignment

Even though Whisper produces high-quality transcriptions, it can still make mistakes, making exact-match searches ineffective. To address this limitation, the proposed search method leverages the Edit distance to compute a similarity score between the search text and the produced transcription. This allows matches to be defined based on a tolerance threshold rather than exact matching!

For instance, searching `cat` in `the ct is on the chair` will not find an exact match for `cat`, while `ct` has only 1 mismatch (the missing `a`).

The Levenshtein distance produces an alignment between `cat` and `ct` with the following distance matrix:

```
      c   t
  0   1   2
c 1   0   1
a 2   1   1
t 3   2   1
```

The bottom-right value is 1, which is the total number of operations (additions / deletions / replacements) needed to transform the hypothesis (`ct`) into the reference (`cat`).

The value at index i, j is the minimum of:

  • matrix[i-1][j] + the cost of adding character i of the reference (which is missing from the hypothesis)
  • matrix[i-1][j-1] + the cost of replacing character j of the hypothesis with character i of the reference (equal to 0 if both are the same character)
  • matrix[i][j-1] + the cost of deleting character j of the hypothesis (an extra character)

Note: To simplify the examples, all costs have been set to 1, but they can be specified in the edit_distance function (e.g., punctuation may have a cost of 0).
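As an illustration, here is a minimal Python sketch of this recurrence with all costs fixed to 1 (the actual edit_distance function is more configurable):

```python
def edit_distance(hypothesis, reference):
    """ Minimal sketch of the recurrence above (all costs fixed to 1). """
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # matrix[i][j] = cost of transforming hypothesis[:j] into reference[:i]
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(rows): matrix[i][0] = i  # add all reference characters
    for j in range(cols): matrix[0][j] = j  # delete all hypothesis characters

    for i in range(1, rows):
        for j in range(1, cols):
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,  # addition of reference[i - 1]
                matrix[i][j - 1] + 1,  # deletion of hypothesis[j - 1]
                # replacement (free if both characters match)
                matrix[i - 1][j - 1] + (0 if reference[i - 1] == hypothesis[j - 1] else 1)
            )
    return matrix[-1][-1]

print(edit_distance('ct', 'cat'))  # 1 : the missing 'a'
```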

The objective is to align `cat` at every position of the transcript (`the ct is`). For this purpose, the first row of the matrix is set to 0, so that an alignment can start at any position without penalty:

```
      t  h  e     c  t     i  s
   0  0  0  0  0  0  0  0  0  0
c  1  1  1  1  1  0  1  1  1  1
a  2  2  2  2  2  1  1  2  2  2
t  3  2  3  3  3  2  1  2  3  3
```

(the blank header columns correspond to the spaces in `the ct is`)

Note: Scores will be more relevant for longer search terms, as they're less influenced by small variations.
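The best partial match can then be read from the last row: its minimum is the alignment score, and the column where it occurs marks the end of the match. Here is a simplified sketch of this idea (the real search additionally maps character positions back to timestamps):

```python
def partial_alignment_score(keyword, transcript):
    """ Aligns `keyword` at every position of `transcript` (first row set to 0). """
    rows, cols = len(keyword) + 1, len(transcript) + 1
    matrix = [[0] * cols for _ in range(rows)]  # first row stays at 0
    for i in range(1, rows): matrix[i][0] = i

    for i in range(1, rows):
        for j in range(1, cols):
            matrix[i][j] = min(
                matrix[i - 1][j] + 1,
                matrix[i][j - 1] + 1,
                matrix[i - 1][j - 1] + (0 if keyword[i - 1] == transcript[j - 1] else 1)
            )
    # minimum of the last row = best score, its column = end position of the match
    best_score = min(matrix[-1][1:])
    best_end   = matrix[-1].index(best_score)
    return best_score, best_end

score, end = partial_alignment_score('cat', 'the ct is')
print(score, end)          # 1 6 : one mismatch, the match ends after 'the ct'
print(score / len('cat'))  # ~0.33, which can be compared to a tolerance threshold
```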

An example is provided in the speech_to_text notebook to better illustrate how the search works.

Notes and references

This section summarizes key concepts and popular approaches to learn more about Speech-To-Text (STT) techniques, models and frameworks.

Key Concepts in STT

  1. Acoustic Modeling: Converting audio signals into phonetic representations
  2. Language Modeling: Determining the probability of word sequences
  3. Feature Extraction: Converting raw audio into spectrograms or MFCCs (Mel-frequency cepstral coefficients), as sketched after this list
  4. Decoding: Translating acoustic features into text transcriptions
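As a quick illustration of step 3, here is a hedged sketch using the librosa library (an assumption for the example, not necessarily a dependency of this project; the file path is hypothetical):

```python
import librosa

# Load and resample the audio to 16 kHz, the rate expected by Whisper
audio, sr = librosa.load('audio/sample.wav', sr = 16000)

# Log-mel spectrogram : the input representation used by Whisper (80 mel bins)
log_mel = librosa.power_to_db(librosa.feature.melspectrogram(y = audio, sr = sr, n_mels = 80))

# MFCC : a classical alternative feature
mfcc = librosa.feature.mfcc(y = audio, sr = sr, n_mfcc = 13)
```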

Popular STT Approaches

  1. CTC (Connectionist Temporal Classification): Used in DeepSpeech and Jasper (see the sketch after this list)
  2. Seq2Seq with Attention: Used in models like Listen, Attend and Spell
  3. Transformer-based approaches: Used in Whisper and SpeechT5
  4. RNN-Transducer (RNN-T): Used in production systems like Google's speech recognition
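As a tiny illustration of the CTC objective mentioned in item 1, here is a hedged sketch using PyTorch (an assumption for the example, not a dependency of this project): the loss aligns an unsegmented target sequence with a longer sequence of per-frame class probabilities.

```python
import torch

# 50 audio frames, batch of 1, vocabulary of 20 tokens + the CTC blank at index 0
ctc_loss  = torch.nn.CTCLoss(blank = 0)
log_probs = torch.randn(50, 1, 21).log_softmax(dim = -1)  # (time, batch, classes)
targets   = torch.randint(1, 21, (1, 10))                 # 10 target token ids (no blanks)

loss = ctc_loss(log_probs, targets, torch.tensor([50]), torch.tensor([10]))
print(loss.item())
```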


Contacts and licence


This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See the LICENCE file for details.

This license allows you to use, modify, and distribute the code, as long as you include the original copyright and license notice in any copy of the software/source. Additionally, if you modify the code and distribute it, or run it on a server as a service, you must make your modified version available under the same license.

For more information about the AGPL-3.0 license, please visit the official website

Citation

If you find this project useful in your work, please add this citation to give it more visibility! 😋

@misc{yui-mhcp,
    author       = {yui},
    title        = {A Deep Learning projects centralization},
    year         = {2021},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/yui-mhcp}}
}