AudioScore assesses the accuracy of audio descriptions by computing the product of the extracted audio-visual-text unit features: CLIP is used to extract features for the video frames and their corresponding descriptions, and PaSST is used for the audio waveforms. We set `c=10` empirically, and choose values for `a` and `b` (
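To make the computation concrete, below is a minimal sketch of the feature-product idea, assuming CLIP's public Python API and a PaSST embedding that has already been projected into the shared feature space (the alignment the TriLip checkpoint provides). The function name `audio_score` and the argument `passt_embedding` are illustrative, not the actual `score.py` interface.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)

def audio_score(frame: Image.Image, description: str, passt_embedding: torch.Tensor) -> float:
    """Score one audio-visual-text unit via products of normalized features (sketch)."""
    with torch.no_grad():
        image_feat = model.encode_image(preprocess(frame).unsqueeze(0).to(device)).float()
        text_feat = model.encode_text(clip.tokenize([description]).to(device)).float()
    # L2-normalize so the inner products behave like cosine similarities.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    audio_feat = passt_embedding / passt_embedding.norm(dim=-1, keepdim=True)
    # Product of the audio-text and image-text similarities; assumes
    # passt_embedding lies in the same space as the CLIP features.
    return float((audio_feat @ text_feat.T) * (image_feat @ text_feat.T))
```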
- Activate the environment:

  ```bash
  conda activate FAVDBench
  ```
- Install CLIP:

  ```bash
  pip install ftfy regex tqdm
  pip install git+https://github.com/openai/CLIP.git
  ```
- ViT-B/32 needs to be downloaded within the Python environment:

  ```python
  import clip
  clip.load("ViT-B/32", jit=False)
  ```
- To proceed, download the required pretrained TriLip model below and place it in the `models` directory:

  ```
  Metrics/AudioScore
  |-- models
  |   |-- TriLip.bin
  ```

  |        | URL           | md5sum                           |
  | ------ | ------------- | -------------------------------- |
  | TriLip | 👍 TriLip.bin | 6baef8a9b383fa7c94a4c56856b0af6d |
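  After downloading, you can verify the checkpoint against the checksum above:

  ```bash
  # Compare against the published md5sum
  md5sum Metrics/AudioScore/models/TriLip.bin
  # expected: 6baef8a9b383fa7c94a4c56856b0af6d
  ```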
```bash
python score.py \
    --pred_path PATH_to_PREDICTION_JSON_in_COCO_FORMAT \
```
📝Note:
- Predictions are required to be converted into COCO format (see the sketch below).
- By default, results are stored in the `output` directory; `result.csv` lists the per-sample results computed by AudioScore.
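The following is a minimal sketch of producing a COCO-caption-style prediction file. The key names (`image_id`, `caption`) follow the standard COCO caption results format and are an assumption here; check `score.py` for the exact schema it expects.

```python
import json

# Hypothetical raw predictions keyed by sample id.
raw_predictions = {
    "video_0001": "a dog barks while a car passes by",
    "video_0002": "rain falls on a tin roof",
}

# Assumed COCO caption-results layout: a list of {"image_id", "caption"}
# records; field names are an assumption based on the standard COCO format.
coco_results = [
    {"image_id": sample_id, "caption": caption}
    for sample_id, caption in raw_predictions.items()
]

with open("predictions_coco.json", "w") as f:
    json.dump(coco_results, f, indent=2)
```

The resulting `predictions_coco.json` is then passed to `score.py` via `--pred_path`.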