
Add step to whisperx role to cache models required for transcription #1607

Merged
9 commits merged into main from pm-whisperx-models on Feb 4, 2025

Conversation

@philmcmahon (Contributor) commented on Jan 22, 2025:

What does this change?

To run transcription, whisperx relies on a number of different models that need to be fetched from Hugging Face or PyTorch. I want to pre-bake all of these models into the AMI so that transcription can run offline.

Fetching the models is a somewhat fiddly process; the best I could do was write this Python script to download them: https://github.com/guardian/transcription-service/blob/add-whisperx-support/scripts/download_whisperx_models.py. Hopefully I'll be able to add it as a function to whisperx itself in future. The script is fetched from the amigo-data buckets; see guardian/transcription-service#130 for details of how it gets there.
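Roughly, the role ends up with two steps: copy the script out of the relevant amigo-data bucket, then run it with a Hugging Face token (needed for the diarization models). The sketch below is condensed from the snippets quoted in the review threads further down; it is illustrative rather than the exact diff.

- name: Download models script
  shell: |
    aws s3 cp s3://amigo-data-{{ model_script_stage.lower() }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py

- name: Download whisperx models
  command: "python3 /tmp/download_whisperx_models.py --whisper-models --diarization-models --torch-align-models --huggingface-token {{ huggingface_token }}"
  become: yes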

Also: remove the torchvision role, which isn't required for transcription.

How to test

I've tested this on CODE; the resulting AMI was able to transcribe files without an internet connection.

@philmcmahon changed the title from "Pm whisperx models" to "Add step to whisperx role to cache models required for transcription" on Jan 23, 2025
@philmcmahon marked this pull request as ready for review on January 27, 2025
@philmcmahon requested a review from a team as a code owner on January 27, 2025
@philmcmahon enabled auto-merge on January 27, 2025
Comment on lines 23 to 24
url: "https://raw.githubusercontent.com/guardian/transcription-service/refs/heads/add-whisperx-support/scripts/download_whisperx_models.py"
dest: "/tmp/download_whisperx_models.py"
@akash1810 (Member) commented on Jan 29, 2025:

Referencing a file on a feature branch feels a little fragile, as it'll break if/when the branch is deleted. Should guardian/transcription-service perform an aws-s3 deployment (from main) with Riff-Raff and AMIgo reference that file?

@philmcmahon (Contributor, Author) replied:

I think it would have to deploy to one of the amigo-data buckets for this to work without adding a new cross-account role or making the bucket public, so I'd need a new riff-raff.yaml file with deploy tools set as the stack so that Riff-Raff uses the right keyring to deploy. Do you think that sounds like a good solution?
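(For illustration only, such a riff-raff.yaml might look roughly like the sketch below. The deployment name matches the S3 prefix this role reads from, but the bucket and parameter values are assumptions rather than the real config, which lives in guardian/transcription-service#130.)

stacks:
  - deploy # deploy-tools stack, so Riff-Raff uses the deploy-tools keyring
regions:
  - eu-west-1
deployments:
  whisperx-model-fetch:
    type: aws-s3
    parameters:
      # assumption: the real bucket is stage-suffixed (amigo-data-<stage>)
      bucket: amigo-data-prod
      cacheControl: private
      publicReadAcl: false
      # with Riff-Raff's default prefixing this should upload under
      # <stack>/<stage>/<package>/, i.e. deploy/<STAGE>/whisperx-model-fetch/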

@akash1810 (Member) replied:

> Do you think that sounds like a good solution?

Yep, this sounds good to me. Might be worth adding inline comments to the file too to explain what's happening for future travellers.

@philmcmahon (Contributor, Author) replied:

Good shout; I've made those changes and added the documentation.

@philmcmahon (Contributor, Author) commented on Jan 31, 2025:

(associated transcription-service PR: guardian/transcription-service#130)

@akash1810 (Member) left a comment:

Couple of non-blocking comments.

- name: Download models script
  shell: |
    aws --quiet s3 cp s3://amigo-data-{{ model_script_stage.lower() }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py
    exit 0
@akash1810 (Member) commented:

Is this exit 0 needed?

# - https://github.com/guardian/transcription-service/pull/130
- name: Download models script
  shell: |
    aws --quiet s3 cp s3://amigo-data-{{ model_script_stage.lower() }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py
@akash1810 (Member) commented:

Should we make the bucket a parameter too? Interestingly, the cdk-base role uses a bucket without the stage suffix.
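(For illustration, the bucket could be pulled out into a role variable alongside the stage; the variable name below is hypothetical.)

# defaults/main.yml (hypothetical variable name)
model_script_bucket: "amigo-data-{{ model_script_stage.lower() }}"

# the download task would then read:
- name: Download models script
  shell: |
    aws s3 cp s3://{{ model_script_bucket }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py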

# If you are changing these parameters it may be helpful to run it locally to test the changes.
- name: Download whisperx models
  command: "python3 /tmp/download_whisperx_models.py --whisper-models --diarization-models --torch-align-models --huggingface-token {{ huggingface_token }}"
  become: yes
@akash1810 (Member) commented:

become: yes?!?! Ansible's API is confusing! 😅

@philmcmahon merged commit 5a982e7 into main on Feb 4, 2025
4 checks passed
@philmcmahon deleted the pm-whisperx-models branch on February 4, 2025