This repository contains the code used to generate subtitles for the PostgresFM podcast. The process is illustrated here with the 60th episode.
To auto-generate the initial transcript, I used Deepgram (Whisper). The json file was also formatted for easier reading.
```bash
export YOUR_SECRET=INSERT_DEEPGRAM_KEY
curl -X POST \
  -H "Authorization: Token $YOUR_SECRET" \
  -H 'content-type: application/json' \
  -d '{"url":"https://rupostgres.org/060%20Decoupled%20storage%20and%20compute.m4a"}' \
  "https://api.deepgram.com/v1/listen?model=whisper-large&punctuate=true&smart_format=true&diarize=true" > result.json
```
```bash
cat result.json | jq -r '.' > formatted.json
```
## 2. Splitting words into phrases

`formatted.json` gives us separate words with their timecodes. The goal here was to split them into phrases based on the following factors:
- who is speaking at the moment
- character count
- punctuation

Besides that, the subtitles are written out in the YouTube-friendly .srt format.
The algorithm can be found in `whisper2srt.py`.
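For reference, here is a minimal sketch of that phrase-splitting idea, not the repository's actual implementation. The field names (`punctuated_word`, `start`, `end`, `speaker`) assume Deepgram's response shape with `punctuate` and `diarize` enabled, and `MAX_CHARS` is an assumed threshold:

```python
import json

MAX_CHARS = 80                 # assumed character budget per subtitle
END_PUNCT = (".", "!", "?")    # punctuation that closes a phrase


def fmt_time(t: float) -> str:
    """Format seconds as an SRT timecode: HH:MM:SS,mmm."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    ms = int(round((t - int(t)) * 1000))
    return f"{h:02}:{m:02}:{s:02},{ms:03}"


def words_to_srt(words: list[dict]) -> str:
    """Group Deepgram word entries into phrases and render them as SRT."""
    blocks, phrase, prev_speaker = [], [], None

    def flush():
        if not phrase:
            return
        text = " ".join(w["punctuated_word"] for w in phrase)
        start, end = phrase[0]["start"], phrase[-1]["end"]
        blocks.append(f"{len(blocks) + 1}\n{fmt_time(start)} --> {fmt_time(end)}\n{text}\n")
        phrase.clear()

    for w in words:
        # factor 1: a speaker change starts a new phrase
        if prev_speaker is not None and w.get("speaker") != prev_speaker:
            flush()
        phrase.append(w)
        prev_speaker = w.get("speaker")
        # factors 2 and 3: sentence-ending punctuation or the length budget close it
        too_long = sum(len(x["punctuated_word"]) + 1 for x in phrase) > MAX_CHARS
        if w["punctuated_word"].endswith(END_PUNCT) or too_long:
            flush()
    flush()
    return "\n".join(blocks)


if __name__ == "__main__":
    with open("formatted.json") as f:
        data = json.load(f)
    # Deepgram nests the word list here for pre-recorded audio
    words = data["results"]["channels"][0]["alternatives"][0]["words"]
    print(words_to_srt(words))
```

Flushing on a speaker change keeps each subtitle attributable to a single speaker, which matters for a two-host podcast.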
## 3. Enhancing subtitles using the ChatGPT API

After running the script from the previous step, we get subtitles that are better than typical auto-generated ones but still far from perfect. The next goal is to use ChatGPT to fix misspellings and incorrectly recognized words. This is the prompt that was used (`prompt.txt`):
```
Below are subtitles, auto-generated by OpenAI Whisper, for a podcast PostgresFM episode
- This is auto-generated (voice2text), help me improve it
- DO NOT CHANGE WORDS! ONLY FIX TYPOS AND TERMS!
- Do NOT add anything, subtitles need to stay true to the video
- Do not change meanings
- Hunt for typos and incorrectly recognized terms, but do not correct grammar mistakes
- Use the glossary to fix incorrectly recognized words
- Leave it in the subtitle format

Glossary:
PostgresFM
Postgres, PostgreSQL
...
```
The glossary continues with a long list of context-specific words that are the most likely to be recognized incorrectly.
The algorithm can be found in `subs_gpt_no_key.py`.
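As an illustration (not the repository's exact code), a chunked pass over the .srt file could look like the sketch below. The model name, chunk size, and the `060.srt` / `060_fixed.srt` file names are assumptions:

```python
import os

from openai import OpenAI  # pip install openai

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

CHUNK_BLOCKS = 50  # assumed number of SRT blocks per request

# prompt.txt is the system prompt shown above
with open("prompt.txt") as f:
    prompt = f.read()

with open("060.srt") as f:  # hypothetical name of the raw subtitle file
    blocks = f.read().strip().split("\n\n")

fixed = []
for i in range(0, len(blocks), CHUNK_BLOCKS):
    chunk = "\n\n".join(blocks[i:i + CHUNK_BLOCKS])
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed; use whichever model is available to you
        temperature=0,  # keep the edits as conservative as possible
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": chunk},
        ],
    )
    fixed.append(resp.choices[0].message.content)

with open("060_fixed.srt", "w") as f:
    f.write("\n\n".join(fixed) + "\n")
```

Sending the subtitles in chunks keeps each request within the model's context window, and a temperature of 0 discourages the model from rewording lines it should only be spell-checking.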
Of course, the subtitles produced by this process are not perfect and still require manual correction. However, they made generating accurate subtitles for the podcast much easier and faster.