This GitHub project introduces a novel approach to audio quality assessment using a transformer-based deep learning architecture. The proposed model leverages transformers to process audio data, providing better performance than traditional approaches. This README gives an overview of the architecture, the model configuration, and the tools used in this project.
The proposed model employs a transformer-based deep learning approach to assess audio quality. It takes hand-crafted features concatenated into a vector as input and is trained against the corresponding ground-truth labels. The transformer architecture, an encoder-decoder structure built from Multi-Head Attention (MHA) and feed-forward layers, processes the data. We use four encoder layers, set the number of heads (h) in each MHA to four, and train with the Adam optimizer. The model outputs a single continuous value representing audio quality in the range 1 to 5. This configuration lets the attention mechanism weigh the concatenated feature vector jointly rather than treating each feature in isolation.
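For reference, the sketch below shows one way this configuration could be realised in PyTorch. The four encoder layers, four attention heads, Adam optimizer, and 1-5 output range come from the description above; the model dimension, feed-forward width, learning rate, pooling strategy, and names such as `AudioQualityTransformer` are illustrative assumptions, not values taken from the paper or this repository.

```python
# Minimal sketch of the transformer-based quality regressor described above.
# Only the 4 encoder layers, 4 attention heads, Adam optimizer and 1-5 output
# range come from the text; everything else is an illustrative assumption.
import torch
import torch.nn as nn

class AudioQualityTransformer(nn.Module):
    def __init__(self, feature_dim, d_model=128, n_heads=4, n_layers=4, ff_dim=256):
        super().__init__()
        # Project the concatenated hand-crafted feature vector into the model dimension.
        self.input_proj = nn.Linear(feature_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ff_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Single regression head; output is rescaled to the 1-5 quality range.
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):  # x: (batch, seq_len, feature_dim)
        z = self.encoder(self.input_proj(x))
        score = torch.sigmoid(self.head(z.mean(dim=1)))  # pool over the sequence
        return 1.0 + 4.0 * score.squeeze(-1)             # map to [1, 5]

model = AudioQualityTransformer(feature_dim=60)  # feature_dim depends on the concatenated features
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```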
We integrated the dual-encoder cross-attention mechanism proposed in [2] into the model proposed in [1]. The network has four layers, and each layer uses three attention blocks with four attention heads each. The first two blocks each take their own query, key, and value as inputs; the third block takes the output of block 1 as its query and value and the output of block 2 as its key. This proposed model gives better results, as shown in the Results section.
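The sketch below illustrates, in PyTorch, one possible wiring of a single dual-encoder cross-attention layer as described above. The block connectivity follows the description; the model dimension, the omission of feed-forward and normalisation sub-layers, and names such as `DualEncoderCrossAttentionLayer` are illustrative assumptions.

```python
# Minimal sketch of one dual-encoder cross-attention layer as described above.
# Only the wiring (two self-attention blocks feeding a third cross-attention
# block) follows the description; dimensions and names are assumptions.
import torch.nn as nn

class DualEncoderCrossAttentionLayer(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.block1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.block2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.block3 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x1, x2):
        # Blocks 1 and 2: self-attention on their own query/key/value.
        out1, _ = self.block1(x1, x1, x1)
        out2, _ = self.block2(x2, x2, x2)
        # Block 3: query and value come from block 1, key comes from block 2.
        out3, _ = self.block3(out1, out2, out1)  # (query, key, value)
        return out3
```

In the full model, four such layers would be stacked, as described above.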
The proposed architecture with dual-encoder cross-attention was trained on the concatenated features of MFCC + Mel Spectrogram + Chroma CQT. The results are shown in Table 1.
Table 1: Performance of the proposed model compared with the model in [1], which outperforms other quality-assessment techniques
| Model | PLCC | SRCC | KRCC |
|---|---|---|---|
| Proposed model | 0.828 | 0.823 | 0.629 |
| Model proposed in [1] | 0.816 | 0.812 | 0.613 |
| Proposed model with 4 attention heads in the cross-attention block | 0.823 | 0.821 | 0.619 |
Table 2: Performance of the model proposed in [1] trained on individual features

| Features | PLCC | SRCC | KRCC |
|---|---|---|---|
| MFCC | 0.642 | 0.623 | 0.449 |
| Mel Spectrogram | 0.578 | 0.566 | 0.400 |
| Chroma CQT | 0.321 | 0.345 | 0.241 |
| Spectral Contrast | 0.227 | 0.207 | 0.141 |
To study the contribution of individual features, we trained the model proposed in [1] on each feature separately. The correlation between the trained model's predictions and the ground-truth values is shown in Table 2. The features rank as MFCC > Mel Spectrogram > Chroma CQT > Spectral Contrast. None of the other features we tried (PNCC, Spectral Centroid) showed promising results.
Table 3: Performance of the model proposed in [1] trained on combinations of features

| Experiment | PLCC | SRCC | KRCC |
|---|---|---|---|
| MFCC + Mel Spectrogram + Chroma CQT | 0.816 | 0.812 | 0.613 |
| MFCC + Mel Spectrogram + Spectral Contrast | 0.747 | 0.736 | 0.543 |
| MFCC + Mel Spectrogram + Chroma CQT + Spectral Contrast | 0.730 | 0.726 | 0.538 |
| MFCC + Mel Spectrogram + Chroma CQT + Spectral Contrast + PNCC | 0.721 | 0.716 | 0.530 |
| MFCC + Mel Spectrogram + Chroma CQT + PNCC | 0.297 | 0.445 | 0.305 |
Since individual contributions alone are not enough to draw a conclusion, we also studied the model's performance on different combinations of features. We trained the model proposed in [1] on the combinations shown in Table 3; the correlations between its predictions and the ground-truth values are listed there as well. Because the dataset contains only 2075 audio samples, concatenating too many features, or features with large dimensions, degrades the results due to the curse of dimensionality.
We propose that, with a larger dataset (which could also be built using data augmentation), MFCC + Mel Spectrogram + Chroma CQT + Spectral Contrast should be used; for this study we used MFCC + Mel Spectrogram + Chroma CQT.
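For reference, the sketch below shows one way the MFCC + Mel Spectrogram + Chroma CQT features could be extracted and concatenated with librosa. The sampling rate, numbers of coefficients and bands, the time-averaging, and the function name `extract_features` are illustrative assumptions, not the exact pipeline used in this repository.

```python
# Minimal sketch of feature extraction for MFCC + Mel Spectrogram + Chroma CQT,
# assuming librosa is installed. Frame parameters and averaging are assumptions.
import librosa
import numpy as np

def extract_features(path, sr=16000, n_mfcc=20, n_mels=64):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    # Average each feature over time and concatenate into a single vector.
    return np.concatenate([f.mean(axis=1) for f in (mfcc, mel, chroma)])
```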
To install the required dependencies, simply run the following command:
```bash
pip install -r requirements.txt
```
Please ensure that you have these libraries installed to run the project.
To use this project, follow these steps:
- Clone this GitHub repository.
- Install the required dependencies.
- Train the model on your audio quality assessment dataset.
- Evaluate the model's performance.
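As a minimal sketch of the evaluation step, assuming arrays of predicted and ground-truth quality scores are available, the three metrics reported in the tables above can be computed with SciPy (PLCC = Pearson, SRCC = Spearman, KRCC = Kendall); the function name `evaluate` is illustrative.

```python
# Minimal sketch of computing PLCC, SRCC and KRCC between predictions and
# ground-truth quality scores, assuming SciPy is installed.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def evaluate(predicted, ground_truth):
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    plcc, _ = pearsonr(predicted, ground_truth)
    srcc, _ = spearmanr(predicted, ground_truth)
    krcc, _ = kendalltau(predicted, ground_truth)
    return {"PLCC": plcc, "SRCC": srcc, "KRCC": krcc}
```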
We would like to acknowledge the support and contributions of the open-source community in making this project possible. Additionally, we extend our gratitude to the following researchers and their papers:
1. Transformer-based quality assessment model for generalized user-generated multimedia audio content
Dataset Credits: The dataset used in this project was generously provided by Mumtaz, D., Jena, A., Jakhetiya, V., Nathwani, K., and Guntuku, S.C. as described in their paper, "Transformer-based quality assessment model for generalized user-generated multimedia audio content" (Proc. Interspeech 2022, 674-678, doi: 10.21437/Interspeech.2022-10386).
2. Improved Transformer Model for Enhanced Monthly Streamflow Predictions of the Yangtze River
We acknowledge the work of C. Liu, D. Liu, and L. Mu as described in their paper, "Improved Transformer Model for Enhanced Monthly Streamflow Predictions of the Yangtze River" (IEEE Access, vol. 10, pp. 58240-58253, 2022, doi: 10.1109/ACCESS.2022.3178521).
We appreciate the valuable contributions of these researchers and the resources they provided for our project.
This project was made possible by the efforts of our team members:
- Ashutosh Chauhan
- Dakshi Goel
- Aman Kumar
- Devin Chugh
- Shreya Jain
We welcome contributions to enhance this project. If you would like to contribute, please follow the standard GitHub pull request process.
For any questions or issues, please open a GitHub issue in this repository.
Thank you for your interest in our audio quality assessment project!