AudioScore assesses the accuracy of audio descriptions by computing the product of the extracted audio-visual-text unit features: CLIP is used to extract features for the video frames and their corresponding descriptions, and PaSST is used for the audio waveforms. We set `c=10` empirically, and choose values for `a` and `b` (
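To make the computation concrete, below is a minimal sketch of the feature-product idea, assuming CLIP's public Python API and a PaSST embedding that has already been projected into the shared feature space (the alignment the TriLip checkpoint provides). The function name `audio_score` and the argument `passt_embedding` are illustrative, not the actual `score.py` interface.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

def audio_score(frame: Image.Image, description: str, passt_embedding: torch.Tensor) -> float:
    """Score one audio-visual-text unit via products of normalized features (sketch)."""
    with torch.no_grad():
        image_feat = model.encode_image(preprocess(frame).unsqueeze(0).to(device)).float()
        text_feat = model.encode_text(clip.tokenize([description]).to(device)).float()
    # L2-normalize so the inner products behave like cosine similarities.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    audio_feat = passt_embedding / passt_embedding.norm(dim=-1, keepdim=True)
    # Product of the audio-text and image-text similarities; assumes
    # passt_embedding lies in the same space as the CLIP features.
    return float((audio_feat @ text_feat.T) * (image_feat @ text_feat.T))
```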
- Activate the environment:

  ```bash
  conda activate FAVDBench
  ```
- Install CLIP:

  ```bash
  pip install ftfy regex tqdm
  pip install git+https://github.com/openai/CLIP.git
  ```
- ViT-B/32 needs to be downloaded within the Python environment:

  ```python
  import clip
  clip.load("ViT-B/32", jit=False)
  ```
- To proceed, download the required pretrained TriLip model below and place it in the `models` directory:

  ```
  Metrics/AudioScore
  |-- models
  |   |-- TriLip.bin
  ```

  |        | URL           | md5sum                           |
  | ------ | ------------- | -------------------------------- |
  | TriLip | 👍 TriLip.bin | 6baef8a9b383fa7c94a4c56856b0af6d |
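  After downloading, you can verify the checkpoint against the checksum above:

  ```bash
  # Compare against the published md5sum
  md5sum Metrics/AudioScore/models/TriLip.bin
  # expected: 6baef8a9b383fa7c94a4c56856b0af6d
  ```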
```bash
python score.py \
    --pred_path PATH_to_PREDICTION_JSON_in_COCO_FORMAT \
```
📝Note:
- Predictions are required to be converted into COCO format (see the sketch below).
- By default, results are stored in the `output` directory; `result.csv` lists the per-sample results computed by AudioScore.
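The following is a minimal sketch of producing a COCO-caption-style prediction file. The key names (`image_id`, `caption`) follow the standard COCO caption results format and are an assumption here; check `score.py` for the exact schema it expects.

```python
import json

# Hypothetical raw predictions keyed by sample id.
raw_predictions = {
    "video_0001": "a dog barks while a car passes by",
    "video_0002": "rain falls on a tin roof",
}

# Assumed COCO caption-results layout: a list of {"image_id", "caption"}
# records; field names are an assumption based on the standard COCO format.
coco_results = [
    {"image_id": sample_id, "caption": caption}
    for sample_id, caption in raw_predictions.items()
]

with open("predictions_coco.json", "w") as f:
    json.dump(coco_results, f, indent=2)
```

The resulting `predictions_coco.json` is then passed to `score.py` via `--pred_path`.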