speech_separation

This is a repository for speech separation tasks.

This project is heavily inspired by paper[1]; work to improve its performance is ongoing.

Data

AVSpeech dataset : contains 4,700 hours of video segments from a total of 290k YouTube videos.

The lib contains several preprocessing functions, including STFT, iSTFT, and power-law compression.
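As a rough illustration of this preprocessing, the sketch below computes an STFT with NumPy and applies power-law compression to the magnitude while keeping the phase. The function names, the frame parameters (`n_fft=512`, `hop=256`), and the exponent `p=0.3` are assumptions made for this example and are not taken from the lib.

```python
import numpy as np

def stft(signal, n_fft=512, hop=256):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def power_law_compress(spec, p=0.3):
    """Raise the magnitude to the power p and leave the phase unchanged."""
    return (np.abs(spec) ** p) * np.exp(1j * np.angle(spec))

def power_law_decompress(spec, p=0.3):
    """Invert power_law_compress by raising the magnitude to 1 / p."""
    return (np.abs(spec) ** (1.0 / p)) * np.exp(1j * np.angle(spec))

# Hypothetical input: one second of noise standing in for 16 kHz audio.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)

spec = stft(audio)
compressed = power_law_compress(spec)
restored = power_law_decompress(compressed)
```

Compression reduces the dynamic range of the spectrogram, so quiet time-frequency bins are not swamped by loud ones during training. The matching decompression would be applied before the iSTFT.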

The visual frames are converted to 512-dimensional face embeddings with a pre-trained facenet model[2].

Audio part : Dilated CNN + Bidirectional LSTM.

Video part : Still in progress.

Loss function : modified discriminative loss function inspired by paper[3].
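For orientation, here is a minimal sketch of one commonly used discriminative loss for two-source separation: it rewards each prediction for matching its own target and penalizes it for resembling the other source. Whether this matches the form in [3] or the modification used in this repository is an assumption. The function name and `gamma=0.1` are illustrative.

```python
import numpy as np

def discriminative_loss(pred1, pred2, true1, true2, gamma=0.1):
    """Squared error to the matching targets minus gamma times the
    squared error to the opposite targets (assumed form, see lead-in)."""
    match = np.sum((pred1 - true1) ** 2) + np.sum((pred2 - true2) ** 2)
    cross = np.sum((pred1 - true2) ** 2) + np.sum((pred2 - true1) ** 2)
    return match - gamma * cross

# Toy targets: perfect predictions give zero matching error,
# so only the (negative) cross term remains.
source1 = np.ones(4)
source2 = np.zeros(4)
loss = discriminative_loss(source1, source2, source1, source2)
```

A larger `gamma` pushes the two outputs further apart, at the risk of distorting each estimate.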

Apply a complex ratio mask (cRM) to enhance the phase spectrum. The mask is passed through a hyperbolic tangent function to maintain quality during the transformation.[4]
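The masking step can be sketched as follows, assuming the hyperbolic tangent is applied separately to the real and imaginary parts of the mask in the form K * tanh(C * x / 2). The constants `K=10.0` and `C=0.1`, the clipping limit, and all function names are assumptions for illustration and are not confirmed by the repository.

```python
import numpy as np

def compress_crm(mask, K=10.0, C=0.1):
    """Squash real and imaginary parts of the mask into (-K, K) with tanh."""
    return K * np.tanh(0.5 * C * mask.real) + 1j * K * np.tanh(0.5 * C * mask.imag)

def decompress_crm(compressed, K=10.0, C=0.1, limit=1.0 - 1e-7):
    """Invert compress_crm; clip first so arctanh never receives +/-1."""
    re = np.clip(compressed.real / K, -limit, limit)
    im = np.clip(compressed.imag / K, -limit, limit)
    return (2.0 / C) * np.arctanh(re) + 1j * (2.0 / C) * np.arctanh(im)

# Hypothetical spectrograms: a clean source and a mixture (clean + noise).
rng = np.random.default_rng(0)
shape = (10, 257)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mix = clean + noise

ideal_mask = clean / (mix + 1e-8)        # ideal cRM, complex division
compressed = compress_crm(ideal_mask)    # bounded training target
recovered = decompress_crm(compressed)   # mask used at inference time
estimate = recovered * mix               # complex multiply adjusts magnitude and phase
```

Because the ideal mask is unbounded wherever the mixture is close to zero, the bounded version is a more stable regression target. Very large mask values are clipped on the way back, so the round trip is only approximate for those bins.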

The model will be evaluated by signal-to-distortion ratio.
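A simplified version of this metric can be sketched as the ratio of reference energy to error energy in decibels. This is only an illustration: full BSS Eval style SDR also accounts for allowed distortions of the reference, and the function name and `eps` value here are assumptions.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-10):
    """Simplified SDR: 10 * log10(||reference||^2 / ||reference - estimate||^2)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_power + eps) / (error_power + eps))

# Toy check: an estimate that is 10% too loud has an error with 1% of the
# reference energy, which corresponds to about 20 dB.
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
estimate = 1.1 * clean
score = sdr_db(clean, estimate)
```

Higher values indicate a separated signal closer to the reference.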