
Add step to whisperx role to cache models required for transcription #1607

Merged
9 commits merged into main from pm-whisperx-models on Feb 4, 2025

Conversation

@philmcmahon (Contributor) commented on Jan 22, 2025:

What does this change?

To run transcription, whisperx relies on a number of different models that need to be fetched from Hugging Face or PyTorch. I want to pre-bake all of these models into the AMI so that transcription can run offline.

Fetching the models is a somewhat fiddly process; the best I could do was write this Python script to download them: https://github.com/guardian/transcription-service/blob/add-whisperx-support/scripts/download_whisperx_models.py. Hopefully I'll be able to add it as a function to whisperx itself in future. The script is fetched from the amigo-data buckets; see guardian/transcription-service#130 for details of how it gets there.
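Roughly, the role ends up with two steps: copy the script out of the relevant amigo-data bucket, then run it with a Hugging Face token (needed for the diarization models). The sketch below is condensed from the snippets quoted in the review threads further down; it is illustrative rather than the exact diff.

- name: Download models script
  shell: |
    aws s3 cp s3://amigo-data-{{ model_script_stage.lower() }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py

- name: Download whisperx models
  command: "python3 /tmp/download_whisperx_models.py --whisper-models --diarization-models --torch-align-models --huggingface-token {{ huggingface_token }}"
  become: yes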

Also: remove the torchvision role, which isn't required for transcription.

How to test

I've tested this on CODE; the resulting AMI was able to transcribe files without an internet connection.

@philmcmahon changed the title from "Pm whisperx models" to "Add step to whisperx role to cache models required for transcription" on Jan 23, 2025
@philmcmahon marked this pull request as ready for review on January 27, 2025
@philmcmahon requested a review from a team as a code owner on January 27, 2025
@philmcmahon enabled auto-merge on January 27, 2025
Comment on lines 23 to 24
url: "https://raw.githubusercontent.com/guardian/transcription-service/refs/heads/add-whisperx-support/scripts/download_whisperx_models.py"
dest: "/tmp/download_whisperx_models.py"
@akash1810 (Member) commented on Jan 29, 2025:

Referencing a file on a feature branch feels a little fragile, as it'll break if/when the branch is deleted. Should guardian/transcription-service perform an aws-s3 deployment (from main) with Riff-Raff and AMIgo reference that file?

@philmcmahon (Contributor, Author) replied:

I think it would have to deploy to one of the amigo-data buckets for this to work without adding a new cross-account role or making the bucket public, so I'd need a new riff-raff.yaml file with deploy tools set as the stack so that Riff-Raff uses the right keyring to deploy. Do you think that sounds like a good solution?
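(For illustration only, such a riff-raff.yaml might look roughly like the sketch below. The deployment name matches the S3 prefix this role reads from, but the bucket and parameter values are assumptions rather than the real config, which lives in guardian/transcription-service#130.)

stacks:
  - deploy # deploy-tools stack, so Riff-Raff uses the deploy-tools keyring
regions:
  - eu-west-1
deployments:
  whisperx-model-fetch:
    type: aws-s3
    parameters:
      # assumption: the real bucket is stage-suffixed (amigo-data-<stage>)
      bucket: amigo-data-prod
      cacheControl: private
      publicReadAcl: false
      # with Riff-Raff's default prefixing this should upload under
      # <stack>/<stage>/<package>/, i.e. deploy/<STAGE>/whisperx-model-fetch/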

@akash1810 (Member) replied:

> Do you think that sounds like a good solution?

Yep, this sounds good to me. Might be worth adding inline comments to the file too to explain what's happening for future travellers.

@philmcmahon (Contributor, Author) replied:

Good shout; I've made those changes and added the documentation.

@philmcmahon (Contributor, Author) commented on Jan 31, 2025:

(associated transcription-service PR: guardian/transcription-service#130)

@akash1810 (Member) left a comment:

Couple of non-blocking comments.

- name: Download models script
  shell: |
    aws --quiet s3 cp s3://amigo-data-{{ model_script_stage.lower() }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py
    exit 0
@akash1810 (Member) commented:

Is this exit 0 needed?

# - https://github.com/guardian/transcription-service/pull/130
- name: Download models script
  shell: |
    aws --quiet s3 cp s3://amigo-data-{{ model_script_stage.lower() }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py
@akash1810 (Member) commented:

Should we make the bucket a parameter too? Interestingly, the cdk-base role uses a bucket without the stage suffix.
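(For illustration, the bucket could be pulled out into a role variable alongside the stage; the variable name below is hypothetical.)

# defaults/main.yml (hypothetical variable name)
model_script_bucket: "amigo-data-{{ model_script_stage.lower() }}"

# the download task would then read:
- name: Download models script
  shell: |
    aws s3 cp s3://{{ model_script_bucket }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py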

# If you are changing these parameters it may be helpful to run it locally to test the changes.
- name: Download whisperx models
  command: "python3 /tmp/download_whisperx_models.py --whisper-models --diarization-models --torch-align-models --huggingface-token {{ huggingface_token }}"
  become: yes
@akash1810 (Member) commented:

become: yes?!?! Ansible's API is confusing! 😅

@philmcmahon merged commit 5a982e7 into main on Feb 4, 2025
4 checks passed
@philmcmahon deleted the pm-whisperx-models branch on February 4, 2025