Deep Speech 2


This repository contains an implementation of the paper Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, an end-to-end ASR model for speech-to-text transcription. The model supports both bidirectional GRU and bidirectional LSTM recurrent layers, and the implementation leverages Lightning AI ⚡ for efficient training and experimentation.


🚀 Installation

  1. Clone the repository:

    git clone https://github.com/LuluW8071/Deep-Speech-2.git
    cd Deep-Speech-2
  2. Install dependencies:

    pip install -r requirements.txt

    Ensure you have PyTorch and Lightning AI installed.


📖 Usage

🔥 Training

Important: Before training, make sure to set your Comet ML API key and project name in the .env file.
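For reference, a minimal `.env` might look like the following. The exact variable names expected by the training script are an assumption; check what `train.py` reads before relying on them:

```env
# Hypothetical variable names — verify against the repository's code
COMET_API_KEY=your-comet-api-key
COMET_PROJECT_NAME=deep-speech-2
```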

To train the Deep Speech 2 model with default configurations:

python3 train.py

To customize the training parameters, modify train.py or pass arguments:

| Argument | Description | Default |
|----------|-------------|---------|
| `-g, --gpus` | Number of GPUs per node | `1` |
| `-w, --num_workers` | Number of data-loading workers | `4` |
| `-db, --dist_backend` | Distributed backend | `'ddp_find_unused_parameters_true'` |
| `-m, --model_type` | Type of RNN (`lstm` or `gru`) | `'lstm'` |
| `-cl, --resnet_layers` | Number of residual CNN layers | `2` |
| `-nl, --rnn_layers` | Number of RNN layers | `3` |
| `-rd, --rnn_dim` | RNN hidden size | `512` |
| `--epochs` | Number of training epochs | `50` |
| `--batch_size` | Batch size | `32` |
| `-gc, --grad_clip` | Gradient-clipping value | `0.6` |
| `-lr, --learning_rate` | Learning rate | `2e-4` |
| `--precision` | Precision mode | `'16-mixed'` |
| `--checkpoint_path` | Path to a checkpoint file | `None` |
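As a rough sketch of how these flags could be wired up with `argparse` (flag names are taken from the table above; the actual `train.py` may parse them differently):

```python
import argparse

# Hedged sketch of a CLI for the flags listed in the table above;
# only a subset is shown, and defaults mirror the documented ones.
def build_parser():
    p = argparse.ArgumentParser(description="Deep Speech 2 training (sketch)")
    p.add_argument("-g", "--gpus", type=int, default=1,
                   help="Number of GPUs per node")
    p.add_argument("-m", "--model_type", choices=["lstm", "gru"],
                   default="lstm", help="Type of RNN")
    p.add_argument("-nl", "--rnn_layers", type=int, default=3)
    p.add_argument("-rd", "--rnn_dim", type=int, default=512)
    p.add_argument("--epochs", type=int, default=50)
    p.add_argument("--batch_size", type=int, default=32)
    p.add_argument("-lr", "--learning_rate", type=float, default=2e-4)
    return p

# Example: override the model type and epoch count, keep other defaults
args = build_parser().parse_args(["-m", "gru", "--epochs", "25"])
```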

🧊 Export TorchScript Model

python3 freeze.py --model_checkpoint saved_checkpoint/deepspeech2.ckpt

🎙️ Inference

To perform inference using a trained model:

python3 demo.py --model_path optimized_model.pt --share

📊 Experiment Results

The model was trained on the LibriSpeech training sets (train-clean-100, train-clean-360, and train-other-500; roughly 960 hours in total) and validated on the LibriSpeech test set (~10.5 hours) using 16-bit mixed precision.
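The WER and CER figures reported below are normalized edit distances: word-level edits over the reference word count for WER, and character-level edits over the reference length for CER. A minimal sketch (not the project's actual metric code):

```python
def edit_distance(ref, hyp):
    # Levenshtein distance with a rolling 1-D DP row
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(reference, hypothesis):
    # word error rate: word-level edits / reference word count
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # character error rate: character-level edits / reference length
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `wer("the cat sat", "the bat sat")` is 1/3, since one of three reference words is substituted.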

🔗 Download Checkpoint: Google Drive Link

Model Performance

| Model Type | ResCNN Layers | RNN Layers | RNN Dim | Epochs | Batch Size | Grad Clip | LR |
|------------|---------------|------------|---------|--------|------------|-----------|------|
| BiLSTM     | 2             | 3          | 512     | 25     | 64         | 0.6       | 2e-4 |

📉 Loss Curves

[Figure: training and validation loss curves]

📝 WER & CER Metrics (Greedy Decoding)

[Figure: WER and CER under greedy decoding]
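Greedy CTC decoding takes the argmax token at each frame, collapses consecutive repeats, and then removes blanks. A minimal sketch of that procedure (the blank index is an assumption, not necessarily the repository's value):

```python
def ctc_greedy_decode(frame_probs, blank=0):
    # frame_probs: (time, vocab) per-frame probabilities (or log-probs)
    # 1) take the argmax token at each frame (the "best path")
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # 2) collapse consecutive repeats, 3) drop blank tokens
    decoded, prev = [], None
    for tok in best_path:
        if tok != prev and tok != blank:
            decoded.append(tok)
        prev = tok
    return decoded
```

For instance, a best path of `[1, 1, 0, 1, 2, 2]` (with blank index 0) decodes to `[1, 1, 2]`: the repeated 1s separated by a blank survive as two tokens, while adjacent repeats collapse.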

🔍 Beam Search Decoding

| Word Score | LM Weight | N-gram LM | Beam Size | Beam Threshold |
|------------|-----------|-----------|-----------|----------------|
| -0.26      | 0.3       | 4-gram    | 25        | 10             |
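In lexicon-based CTC beam search decoders (e.g. torchaudio-style decoders), each hypothesis is typically scored as the acoustic log-probability plus the LM log-probability scaled by the LM weight, plus the word score once per emitted word. Assuming that scoring rule, the table's parameters combine as:

```python
def hypothesis_score(acoustic_logprob, lm_logprob, num_words,
                     lm_weight=0.3, word_score=-0.26):
    # Combined beam-search objective, with the table's values as defaults:
    # acoustic score + lm_weight * LM score + word_score * word count.
    # A negative word_score penalizes word insertions.
    return acoustic_logprob + lm_weight * lm_logprob + word_score * num_words
```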

[Figure: WER and CER under beam search decoding]

🔎 Alignments Visualization

[Figure: frame-level alignment visualization]


🔗 Citations

@misc{amodei2015deepspeech2endtoend,
      title={Deep Speech 2: End-to-End Speech Recognition in English and Mandarin},
      author={Dario Amodei and Rishita Anubhai and Eric Battenberg and Carl Case and others},
      year={2015},
      url={https://arxiv.org/abs/1512.02595}
}
