This repository contains an implementation of the paper Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, an ASR model for end-to-end speech-to-text transcription with deep learning. The implementation leverages Lightning AI ⚡ for efficient training and experimentation.
- ✅ Gated Recurrent Neural Networks
- ✅ Deep Speech 2: End-to-End Speech Recognition
- ✅ KenLM
- ✅ Boosting Sequence Generation Performance with Beam Search Language Model Decoding
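Under the hood, the network follows the Deep Speech 2 recipe: a stack of residual convolutional layers over log-mel spectrogram features, followed by bidirectional RNN layers (LSTM or GRU) and a linear classifier trained with CTC loss. The snippet below is a condensed, illustrative sketch of that architecture, not the repository's exact implementation; layer names, sizes, and the output alphabet are assumptions.

```python
# Illustrative Deep Speech 2-style network (ResCNN -> BiRNN -> CTC head).
# Layer sizes and names are assumptions, not the repository's exact code.
import torch
import torch.nn as nn

class ResidualCNN(nn.Module):
    """2-D convolutional block with a skip connection over spectrogram features."""
    def __init__(self, channels=32, kernel=3, dropout=0.1):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels), nn.GELU(), nn.Dropout(dropout),
            nn.Conv2d(channels, channels, kernel, padding=kernel // 2),
            nn.BatchNorm2d(channels), nn.GELU(), nn.Dropout(dropout),
            nn.Conv2d(channels, channels, kernel, padding=kernel // 2),
        )

    def forward(self, x):              # x: (batch, channels, n_mels, time)
        return x + self.block(x)

class DeepSpeech2(nn.Module):
    def __init__(self, n_mels=80, n_classes=29, resnet_layers=2,
                 rnn_layers=3, rnn_dim=512, rnn_type="lstm"):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1)
        self.rescnn = nn.Sequential(*[ResidualCNN(32) for _ in range(resnet_layers)])
        rnn_cls = nn.LSTM if rnn_type == "lstm" else nn.GRU
        self.rnn = rnn_cls(32 * (n_mels // 2), rnn_dim, num_layers=rnn_layers,
                           bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * rnn_dim, n_classes)  # characters + CTC blank

    def forward(self, x):              # x: (batch, 1, n_mels, time)
        x = self.rescnn(self.stem(x))  # downsample, then residual conv blocks
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)        # (batch, time, features)
        x, _ = self.rnn(x)
        return self.classifier(x).log_softmax(dim=-1)          # log-probs for CTC loss
```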
- Clone the repository:

  ```bash
  git clone https://github.com/LuluW8071/Deep-Speech-2.git
  cd Deep-Speech-2
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Ensure you have PyTorch and Lightning AI installed.
Important: Before training, make sure to set your Comet ML API key and project name in the `.env` file.
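If you want to see how those values can feed a Comet-enabled Lightning logger, the sketch below is illustrative only: the variable names `COMET_API_KEY` and `COMET_PROJECT_NAME` are assumptions, so check `train.py` (or the provided `.env` template) for the names the repository actually reads.

```python
# Illustrative: load Comet credentials from .env and build a Lightning Comet logger.
# COMET_API_KEY / COMET_PROJECT_NAME are assumed variable names, not confirmed ones.
import os
from dotenv import load_dotenv                      # pip install python-dotenv
from lightning.pytorch.loggers import CometLogger

load_dotenv()                                       # reads key=value pairs from .env
logger = CometLogger(
    api_key=os.getenv("COMET_API_KEY"),
    project_name=os.getenv("COMET_PROJECT_NAME"),
)
# The logger would then be handed to the Trainer, e.g. L.Trainer(logger=logger, ...)
```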
To train the Deep Speech 2 model with default configurations:

```bash
python3 train.py
```
To customize the training parameters, modify `train.py` or pass arguments on the command line (see the example after the table):
| Argument | Description | Default |
|---|---|---|
| `-g`, `--gpus` | Number of GPUs per node | `1` |
| `-w`, `--num_workers` | Number of data loading workers | `4` |
| `-db`, `--dist_backend` | Distributed backend | `'ddp_find_unused_parameters_true'` |
| `-m`, `--model_type` | Type of RNN (`lstm` or `gru`) | `'lstm'` |
| `-cl`, `--resnet_layers` | Number of residual CNN layers | `2` |
| `-nl`, `--rnn_layers` | Number of RNN layers | `3` |
| `-rd`, `--rnn_dim` | RNN hidden size | `512` |
| `--epochs` | Number of training epochs | `50` |
| `--batch_size` | Batch size | `32` |
| `-gc`, `--grad_clip` | Gradient clipping value | `0.6` |
| `-lr`, `--learning_rate` | Learning rate | `2e-4` |
| `--precision` | Precision mode | `'16-mixed'` |
| `--checkpoint_path` | Path to checkpoint file | `None` |
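As an illustration, a run that switches to a GRU model on two GPUs with a larger batch (the values here are chosen arbitrarily, not recommended settings) could look like:

```bash
python3 train.py -g 2 -m gru -nl 3 -rd 512 --epochs 25 --batch_size 64 -gc 0.6 -lr 2e-4
```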
To freeze (export) a trained checkpoint into an optimized model for inference:

```bash
python3 freeze.py --model_checkpoint saved_checkpoint/deepspeech2.ckpt
```
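Internally, an export step of this kind typically loads the trained Lightning checkpoint and saves a self-contained TorchScript module that the demo can run without the training code. A hedged sketch, with a placeholder module class and import path, might look like:

```python
# Sketch only: "ASRModule" and its import path are placeholders, not the repository's names.
import torch
from deepspeech2.model import ASRModule            # hypothetical import path

model = ASRModule.load_from_checkpoint("saved_checkpoint/deepspeech2.ckpt")
model.eval()
scripted = model.to_torchscript(method="script")   # LightningModule convenience helper
torch.jit.save(scripted, "optimized_model.pt")
```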
To perform inference using a trained model:

```bash
python3 demo.py --model_path optimized_model.pt --share
```
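Outside the interactive demo, a minimal way to run the exported model on a single audio file is greedy CTC decoding over log-mel features. The sketch below assumes 16 kHz audio, 80 mel bins, and a 29-symbol character set with the blank at index 0; treat `demo.py` as the reference for the real preprocessing.

```python
# Illustrative greedy CTC inference with the exported TorchScript model.
# Sample rate, mel settings, and the label set are assumptions, not the repo's values.
import torch
import torchaudio

LABELS = "_'abcdefghijklmnopqrstuvwxyz "            # index 0 = CTC blank (assumed)

model = torch.jit.load("optimized_model.pt").eval()
waveform, sr = torchaudio.load("sample.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16_000, n_mels=80)(waveform)
log_mel = torch.log(mel + 1e-9).unsqueeze(0)        # (batch=1, channel, n_mels, time)

with torch.no_grad():
    log_probs = model(log_mel)                      # (batch, time, n_classes)

# Greedy CTC decode: argmax per frame, collapse repeats, drop blanks.
ids, prev, chars = log_probs.argmax(dim=-1)[0], -1, []
for i in ids.tolist():
    if i != prev and i != 0:
        chars.append(LABELS[i])
    prev = i
print("".join(chars))
```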
The model was trained on the LibriSpeech train set (100 + 360 + 500 hours) and validated on the LibriSpeech test set (~10.5 hours) using 16-bit mixed precision.
🔗 Download Checkpoint: Google Drive Link
| Model Type | ResCNN Layers | RNN Layers | RNN Dim | Epochs | Batch Size | Grad Clip | LR |
|---|---|---|---|---|---|---|---|
| BiLSTM | 2 | 3 | 512 | 25 | 64 | 0.6 | 2e-4 |
| Word Score | LM Weight | N-gram LM | Beam Size | Beam Threshold |
|---|---|---|---|---|
| -0.26 | 0.3 | 4-gram | 25 | 10 |
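These settings map directly onto torchaudio's lexicon-based CTC beam-search decoder with a KenLM language model. The sketch below mirrors the table's hyperparameters; the file paths, token list, and dummy emissions are placeholders, and it is not necessarily how this repository wires up its decoder.

```python
# Sketch: CTC beam-search decoding with a KenLM 4-gram LM via torchaudio
# (requires torchaudio's flashlight-text decoder support). Paths are placeholders.
import torch
from torchaudio.models.decoder import ctc_decoder

decoder = ctc_decoder(
    lexicon="lexicon.txt",       # word -> spelling mapping (placeholder path)
    tokens="tokens.txt",         # acoustic model's output alphabet (placeholder path)
    lm="4gram.arpa",             # KenLM 4-gram language model (placeholder path)
    beam_size=25,
    beam_threshold=10,
    lm_weight=0.3,
    word_score=-0.26,
)

emissions = torch.randn(1, 100, 29).log_softmax(-1)   # dummy (batch, frames, n_classes)
hypotheses = decoder(emissions)
print(" ".join(hypotheses[0][0].words))                # best hypothesis for utterance 0
```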
```bibtex
@misc{amodei2015deepspeech2endtoend,
  title={Deep Speech 2: End-to-End Speech Recognition in English and Mandarin},
  author={Dario Amodei and Rishita Anubhai and Eric Battenberg and Carl Case and others},
  year={2015},
  url={https://arxiv.org/abs/1512.02595}
}
```