This GitHub project introduces a novel approach to audio quality assessment using a transformer-based deep learning architecture. The proposed model leverages transformers to process audio data, providing better performance than traditional approaches. This README gives an overview of the architecture, the model configuration, and the tools used in this project.
The proposed model employs a transformer-based deep learning approach to assess audio quality. It takes hand-crafted features concatenated into a vector as input and is trained against the corresponding ground-truth labels. The transformer architecture, an encoder-decoder structure built from Multi-Head Attention (MHA) and feed-forward layers, processes the data. We use four encoder layers, set the number of heads (h) in each MHA to four, and train with the Adam optimizer. The model outputs a single continuous value representing audio quality in the range 1 to 5. This configuration lets the attention mechanism weigh the concatenated feature vector jointly rather than treating each feature in isolation.
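For reference, the sketch below shows one way this configuration could be realised in PyTorch. The four encoder layers, four attention heads, Adam optimizer, and 1-5 output range come from the description above; the model dimension, feed-forward width, learning rate, pooling strategy, and names such as `AudioQualityTransformer` are illustrative assumptions, not values taken from the paper or this repository.

```python
# Minimal sketch of the transformer-based quality regressor described above.
# Only the 4 encoder layers, 4 attention heads, Adam optimizer and 1-5 output
# range come from the text; everything else is an illustrative assumption.
import torch
import torch.nn as nn

class AudioQualityTransformer(nn.Module):
    def __init__(self, feature_dim, d_model=128, n_heads=4, n_layers=4, ff_dim=256):
        super().__init__()
        # Project the concatenated hand-crafted feature vector into the model dimension.
        self.input_proj = nn.Linear(feature_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ff_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Single regression head; output is rescaled to the 1-5 quality range.
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):  # x: (batch, seq_len, feature_dim)
        z = self.encoder(self.input_proj(x))
        score = torch.sigmoid(self.head(z.mean(dim=1)))  # pool over the sequence
        return 1.0 + 4.0 * score.squeeze(-1)             # map to [1, 5]

model = AudioQualityTransformer(feature_dim=60)  # feature_dim depends on the concatenated features
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```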
We integrated the dual-encoder cross-attention mechanism proposed in [2] into the model proposed in [1]. The network has four layers, and each layer uses three attention blocks with four attention heads each. The first two blocks each take their own query, key, and value as inputs; the third block takes the output of block 1 as its query and value and the output of block 2 as its key. This proposed model gives better results, as shown in the Results section.
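The sketch below illustrates, in PyTorch, one possible wiring of a single dual-encoder cross-attention layer as described above. The block connectivity follows the description; the model dimension, the omission of feed-forward and normalisation sub-layers, and names such as `DualEncoderCrossAttentionLayer` are illustrative assumptions.

```python
# Minimal sketch of one dual-encoder cross-attention layer as described above.
# Only the wiring (two self-attention blocks feeding a third cross-attention
# block) follows the description; dimensions and names are assumptions.
import torch.nn as nn

class DualEncoderCrossAttentionLayer(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.block1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.block2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.block3 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x1, x2):
        # Blocks 1 and 2: self-attention on their own query/key/value.
        out1, _ = self.block1(x1, x1, x1)
        out2, _ = self.block2(x2, x2, x2)
        # Block 3: query and value come from block 1, key comes from block 2.
        out3, _ = self.block3(out1, out2, out1)  # (query, key, value)
        return out3
```

In the full model, four such layers would be stacked, as described above.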
The proposed architecture with dual-encoder cross-attention was trained on the concatenated features of MFCC + Mel Spectrogram + Chroma CQT. The results are shown in Table 1.
Table 1: Performance of the proposed model compared with the model in [1], which outperforms other quality-assessment techniques
| Model | PLCC | SRCC | KRCC |
|---|---|---|---|
| Proposed model | 0.828 | 0.823 | 0.629 |
| Model proposed in [1] | 0.816 | 0.812 | 0.613 |
| Proposed model with 4 attention heads in the cross-attention block | 0.823 | 0.821 | 0.619 |
Table 2: Performance of the model proposed in [1] trained on individual features

| Features | PLCC | SRCC | KRCC |
|---|---|---|---|
| MFCC | 0.642 | 0.623 | 0.449 |
| Mel Spectrogram | 0.578 | 0.566 | 0.400 |
| Chroma CQT | 0.321 | 0.345 | 0.241 |
| Spectral Contrast | 0.227 | 0.207 | 0.141 |
To study the contribution of individual features, we trained the model proposed in [1] on each feature separately. The correlation between the trained model's predictions and the ground-truth values is shown in Table 2. The features rank as MFCC > Mel Spectrogram > Chroma CQT > Spectral Contrast. None of the other features we tried (PNCC, Spectral Centroid) showed promising results.
Table 3: Performance of the model proposed in [1] trained on combinations of features

| Experiment | PLCC | SRCC | KRCC |
|---|---|---|---|
| MFCC + Mel Spectrogram + Chroma CQT | 0.816 | 0.812 | 0.613 |
| MFCC + Mel Spectrogram + Spectral Contrast | 0.747 | 0.736 | 0.543 |
| MFCC + Mel Spectrogram + Chroma CQT + Spectral Contrast | 0.730 | 0.726 | 0.538 |
| MFCC + Mel Spectrogram + Chroma CQT + Spectral Contrast + PNCC | 0.721 | 0.716 | 0.530 |
| MFCC + Mel Spectrogram + Chroma CQT + PNCC | 0.297 | 0.445 | 0.305 |
Since individual contributions alone are not enough to draw a conclusion, we also studied the model's performance on different combinations of features. We trained the model proposed in [1] on the combinations shown in Table 3; the correlations between its predictions and the ground-truth values are listed there as well. Because the dataset contains only 2075 audio samples, concatenating too many features, or features with large dimensions, degrades the results due to the curse of dimensionality.
We propose that, with a larger dataset (which could also be built using data augmentation), MFCC + Mel Spectrogram + Chroma CQT + Spectral Contrast should be used; for this study we used MFCC + Mel Spectrogram + Chroma CQT.
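For reference, the sketch below shows one way the MFCC + Mel Spectrogram + Chroma CQT features could be extracted and concatenated with librosa. The sampling rate, numbers of coefficients and bands, the time-averaging, and the function name `extract_features` are illustrative assumptions, not the exact pipeline used in this repository.

```python
# Minimal sketch of feature extraction for MFCC + Mel Spectrogram + Chroma CQT,
# assuming librosa is installed. Frame parameters and averaging are assumptions.
import librosa
import numpy as np

def extract_features(path, sr=16000, n_mfcc=20, n_mels=64):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    # Average each feature over time and concatenate into a single vector.
    return np.concatenate([f.mean(axis=1) for f in (mfcc, mel, chroma)])
```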
To install the required dependencies, simply run the following command:
```bash
pip install -r requirements.txt
```
Please ensure that you have these libraries installed to run the project.
To use this project, follow these steps:
- Clone this GitHub repository.
- Install the required dependencies.
- Train the model on your audio quality assessment dataset.
- Evaluate the model's performance.
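As a minimal sketch of the evaluation step, assuming arrays of predicted and ground-truth quality scores are available, the three metrics reported in the tables above can be computed with SciPy (PLCC = Pearson, SRCC = Spearman, KRCC = Kendall); the function name `evaluate` is illustrative.

```python
# Minimal sketch of computing PLCC, SRCC and KRCC between predictions and
# ground-truth quality scores, assuming SciPy is installed.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def evaluate(predicted, ground_truth):
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    plcc, _ = pearsonr(predicted, ground_truth)
    srcc, _ = spearmanr(predicted, ground_truth)
    krcc, _ = kendalltau(predicted, ground_truth)
    return {"PLCC": plcc, "SRCC": srcc, "KRCC": krcc}
```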
We would like to acknowledge the support and contributions of the open-source community in making this project possible. Additionally, we extend our gratitude to the following researchers and their papers:
1. Transformer-based quality assessment model for generalized user-generated multimedia audio content
Dataset Credits: The dataset used in this project was generously provided by Mumtaz, D., Jena, A., Jakhetiya, V., Nathwani, K., and Guntuku, S.C. as described in their paper, "Transformer-based quality assessment model for generalized user-generated multimedia audio content" (Proc. Interspeech 2022, 674-678, doi: 10.21437/Interspeech.2022-10386).
2. Improved Transformer Model for Enhanced Monthly Streamflow Predictions of the Yangtze River
We acknowledge the work of C. Liu, D. Liu, and L. Mu as described in their paper, "Improved Transformer Model for Enhanced Monthly Streamflow Predictions of the Yangtze River" (IEEE Access, vol. 10, pp. 58240-58253, 2022, doi: 10.1109/ACCESS.2022.3178521).
We appreciate the valuable contributions of these researchers and the resources they provided for our project.
This project was made possible by the efforts of our team members:
- Ashutosh Chauhan
- Dakshi Goel
- Aman Kumar
- Devin Chugh
- Shreya Jain
We welcome contributions to enhance this project. If you would like to contribute, please follow the standard GitHub pull request process.
For any questions or issues, please open a GitHub issue in this repository.
Thank you for your interest in our audio quality assessment project!