-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add step to whisperx role to cache models required for transcription #1607
Conversation
f7b8e7b
to
c0b5225
Compare
750cc32
to
580f8bd
Compare
roles/whisperx/tasks/main.yml
Outdated
url: "https://raw.githubusercontent.com/guardian/transcription-service/refs/heads/add-whisperx-support/scripts/download_whisperx_models.py" | ||
dest: "/tmp/download_whisperx_models.py" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Referencing a file on a feature branch feels a little fragile, as it'll break if/when the branch is deleted. Should guardian/transcription-service
perform an aws-s3
deployment (from main
) with Riff-Raff and AMIgo reference that file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would have to deploy to one of the amigo-data
buckets for this to work without having a new cross account role/making the bucket public, so I'd need a new riff-raff.yaml file with deploy tools set as the stack so riffraff uses the right keyring to deploy. Do you think that sounds like a good solution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think that sounds like a good solution?
Yep, this sounds good to me. Might be worth adding inline comments to the file too to explain what's happening for future travellers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good shout, I've made those changes and added the documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(associated transcription pr guardian/transcription-service#130)
d412ebe
to
9e22529
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of non-blocking comments.
- name: Download models script | ||
shell: | | ||
aws --quiet s3 cp s3://amigo-data-{{ model_script_stage.lower() }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py | ||
exit 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this exit 0
needed?
# - https://github.com/guardian/transcription-service/pull/130 | ||
- name: Download models script | ||
shell: | | ||
aws --quiet s3 cp s3://amigo-data-{{ model_script_stage.lower() }}/deploy/{{ model_script_stage }}/whisperx-model-fetch/download_whisperx_models.py /tmp/download_whisperx_models.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make the bucket a parameter too? Interestingly, the cdk-base
role uses a bucket w/out the stage suffix.
# If you are changing these parameters it may be helpful to run it locally to test the changes. | ||
- name: Download whisperx models | ||
command: "python3 /tmp/download_whisperx_models.py --whisper-models --diarization-models --torch-align-models --huggingface-token {{ huggingface_token }}" | ||
become: yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
become: yes
?!?! Ansible's API is confusing! 😅
What does this change?
In order to run transcription, whisperx relies on a number of different models that need to be fetched from hugging face or pytorch. I want to pre-bake all of these models into the AMI so that transcription can run offline.
Fetching the models is a somewhat fiddly process - the best I could do was write this python script to download the models https://github.com/guardian/transcription-service/blob/add-whisperx-support/scripts/download_whisperx_models.py - though I'll hopefully in future add it as a function to whisperx. The script is fetched from the amigo-data buckets, see guardian/transcription-service#130 for details of how it gets there
Also - remove torchvision role which isn't required for transcription.
How to test
I've tested this on CODE - the resulting AMI was able to transcribe files without an internet connection