Help on improving transcription quality #1948

i4lina · 2024-03-11T10:51:42Z

I am currently exploring the utilization of Whisper.cpp for transcribing Italian recordings. Following successful compilation of whisper.cpp in WSL2 with CUDA support, I found that the provided JFK samples transcribed without any issues. However, when attempting to transcribe my recordings using the largest quantized model available, the results were considerably suboptimal.

Attached is a snippet of the audio I am working with, which I converted to a WAV file using FFmpeg, as outlined in the README (ffmpeg -i input.m4a -ar 16000 -ss 300 -ac 1 -c:a pcm_s16le -t 100 samples/test_rec.wav). I had to convert it to MP4 format to facilitate uploading to GitHub. You can access the audio snippet via the following link: Audio Snippet.

Upon running Whisper with the command ./main -m models/ggml-model-whisper-large-q5_0.bin -l auto samples/test_rec.wav, the output exhibited varied transcription quality:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-model-whisper-large-q5_0.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 8
whisper_model_load: qntvr         = 1
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  1080.21 MB
whisper_model_load: model size    = 1080.10 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   34.82 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 4 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0

main: processing 'samples/test_rec.wav' (1600000 samples, 100.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: it (p = 0.998164)

[00:00:00.000 --> 00:00:02.000]   "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"!!!!"
[00:00:02.000 --> 00:00:14.000]   "Un vecchio che ha lavorato tutta la vita, si è sacrificato per i figli, ha comprato le case ai figli, e poi non ha i soldi e muore. Non è bello.
[00:00:14.000 --> 00:00:22.000]   I figli vendono la casa per pagare le cure del padre e della madre. È una cosa brutta, schifo, fanno pena.
[00:00:22.000 --> 00:00:52.000]  ! Non!!!!!! Un!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!..!!...!!..!..........................................................................................................................................
[00:00:52.000 --> 00:01:12.440]  ! Dob!iamo trovare un modello di finanziamento della sanità che non rimborsi i poveri, quindi
[00:01:12.440 --> 00:01:18.040]  !!!!!!!!!!!.non ridist!!!!.ribuis!ca in!!.. base al reddito, ma ridistribuisca in base
[00:01:18.040 --> 00:01:39.940]   all!età e!!!!!!!!!!!!!!!quindi hanno pensato all'assicurazione privata, quindi l'assicurazione privata ragazzi nasce proprio con questa idea, perché!!!!!!!!!!l'!assicurazione!!!! privata come funziona? Io pago i miei soldi di premio assicurativo, in teoria io ho


whisper_print_timings:     load time =  3945.97 ms
whisper_print_timings:     fallbacks =  11 p /   2 h
whisper_print_timings:      mel time =    76.61 ms
whisper_print_timings:   sample time =  8044.14 ms / 15973 runs (    0.50 ms per run)
whisper_print_timings:   encode time =   147.74 ms /     5 runs (   29.55 ms per run)
whisper_print_timings:   decode time =   319.11 ms /     1 runs (  319.11 ms per run)
whisper_print_timings:   batchd time = 44058.27 ms / 15928 runs (    2.77 ms per run)
whisper_print_timings:   prompt time =   821.50 ms /   634 runs (    1.30 ms per run)
whisper_print_timings:    total time = 57465.73 ms

While the transcription accurately captured some segments, it struggled significantly with others, despite their apparent audio similarities. I experimented with adjusting parameters such as beam size and entropy threshold, yet observed minimal, if any, improvement. What do the exclamation points mean?
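A rough way to compare runs is to count how many transcript segments contain long runs of `!` or `.`, and to pull out the `fallbacks` line from the timings. The sketch below assumes the whole run (stdout and stderr) was saved to a log file; the function name `summarize_run` and the run length of 5 are arbitrary choices for illustration, not anything defined by whisper.cpp.

```shell
# Summarize one saved whisper.cpp run.
# Assumption: LOG holds both the transcript and the timings,
# e.g. captured with `./main ... > run.log 2>&1`.
summarize_run() {
    log=$1
    # Segments look like "[hh:mm:ss.mmm --> hh:mm:ss.mmm]  text".
    # Count those whose text has 5 or more consecutive '!' or '.' characters.
    garbled=$(grep -c -E '^\[[0-9:.]+ --> [0-9:.]+].*(!{5,}|\.{5,})' "$log" || true)
    # Extract the two counters from "fallbacks =  N p /  M h".
    fallbacks=$(sed -n 's/.*fallbacks = *\([0-9]*\) p \/ *\([0-9]*\) h.*/\1 p, \2 h/p' "$log")
    echo "garbled_segments=$garbled fallbacks=$fallbacks"
}
```

Calling `summarize_run run.log` after each experiment gives one line per run that is easy to compare. It is only a crude heuristic: a segment such as `! Dob!iamo trovare ...` has stray marks but no long run, so it would not be counted.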

I am seeking insights into potential areas of concern. Could the issue be attributed to the sampling rate of the WAV file? Is there a flaw in my setup? Additionally, I am open to exploring preprocessing steps that could enhance transcription performance on the original audio.
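On the sampling-rate question: the log above reports 1600000 samples for 100.0 sec, which is 16000 samples per second and matches the `-ar 16000` in the ffmpeg command. To check a file directly, the sketch below reads the channel count, sample rate and bit depth from the WAV header. It assumes a canonical RIFF/WAVE layout with the `fmt ` chunk starting at byte 12; the helper names `wav_field` and `wav_info` are made up for this example.

```shell
# Decode NBYTES little-endian bytes at OFFSET in FILE into a decimal number.
# usage: wav_field FILE OFFSET NBYTES
wav_field() {
    val=0
    mul=1
    for b in $(od -An -tu1 -j"$2" -N"$3" "$1"); do
        val=$((val + b * mul))
        mul=$((mul * 256))
    done
    echo "$val"
}

# Print channel count, sample rate and bit depth from the header.
# Assumption: canonical layout, so channels sit at byte 22,
# sample rate at byte 24 and bits per sample at byte 34.
wav_info() {
    echo "channels=$(wav_field "$1" 22 2) rate=$(wav_field "$1" 24 4) bits=$(wav_field "$1" 34 2)"
}
```

For example, `wav_info samples/test_rec.wav` prints the three values on one line. A file written by the ffmpeg command above should show one channel, 16000 Hz and 16 bits if the conversion did what was asked.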

I appreciate any assistance or guidance provided.

Thank you.

Answered by zubbyy

zubbyy · 2024-03-12T17:26:58Z

Ran on Arch Linux 6.7.5, Hyprland, i7-8565U, NVIDIA MX130, 8 GB RAM.

Hey.
I'm no expert in this field, but since I'll need whisper.cpp for a future project and wanted to get more familiar with it, I took a look at your issue.
My approach was noise reduction, so I downloaded your MP4.

I first converted it to a .wav file:
ffmpeg -i input.mp4 -ar 16000 -ac 1 -c:a pcm_s16le -t 100 output_dirty.wav

For the noise reduction I used ffmpeg's afftdn filter:
ffmpeg -i output_dirty.wav -af "afftdn=nr=20:nf=-20:tn=1" output.wav

I then ran it through my large model:
./main -m models/ggml-large-v3.bin -l auto samples/output.wav

The output was the following:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load:      CPU total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   36.26 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: en (p = 0.929830)

[00:00:00.300 --> 00:00:09.000]   And so, my fellow Americans, ask not what your country can do for you, ask what you
[00:00:09.000 --> 00:00:11.000]   can do for your country.


whisper_print_timings:     load time =  2301.97 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    20.84 ms
whisper_print_timings:   sample time =   103.78 ms /   147 runs (    0.71 ms per run)
whisper_print_timings:   encode time = 96743.59 ms /     2 runs (48371.80 ms per run)
whisper_print_timings:   decode time =   163.93 ms /     1 runs (  163.93 ms per run)
whisper_print_timings:   batchd time =  6590.31 ms /   145 runs (   45.45 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 105936.38 ms
[zubby@rusty whisper.cpp]$ ./main -m models/ggml-large-v3.bin -l auto samples/output.wav 
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
whisper_model_load:      CPU total size =  3094.36 MB
whisper_model_load: model size    = 3094.36 MB
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   36.26 MB
whisper_init_state: compute buffer (encode) =  926.66 MB
whisper_init_state: compute buffer (cross)  =    9.38 MB
whisper_init_state: compute buffer (decode) =  209.26 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/output.wav' (1600000 samples, 100.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = auto, task = transcribe, timestamps = 1 ...

whisper_full_with_state: auto-detected language: it (p = 0.984625)

[00:00:00.000 --> 00:00:09.280]   Magari un vecchio che ha lavorato tutta la vita, che è stato sacrificato per i figli, ha comprato le case ai figli,
[00:00:09.280 --> 00:00:13.580]   poi cosa fa? Non ha i soldi, muore. Non è bello.
[00:00:13.580 --> 00:00:18.680]   Allora i figli vengono da casa per pagarle pure il padre, la madre.
[00:00:18.680 --> 00:00:21.940]   E' una cosa brutta, fa schifo, fa pena.
[00:00:21.940 --> 00:00:26.420]   Un vecchio che si è cresciuto, e questa è la vostra madre, il vostro padre.
[00:00:26.940 --> 00:00:29.980]   Questo padre è abbandonato. Non è una cosa bella.
[00:00:29.980 --> 00:00:39.440]   Quindi questa cosa del problema degli anziani è un problema molto sentito anche negli Stati Uniti.
[00:00:39.440 --> 00:00:41.740]   Perché appunto non è che sono cattivi.
[00:00:41.740 --> 00:00:46.020]   Quindi i poveri no, però gli anziani facevano pena.
[00:00:46.020 --> 00:00:48.900]   Quindi non è una cosa bella.
[00:00:48.900 --> 00:00:56.340]   Quindi anche in America, bambini e anziani, bisogna trovare un modo,
[00:00:56.340 --> 00:00:56.920]   per esempio, di farlo.
[00:00:56.920 --> 00:00:58.900]   Per riuscire a sistemare il pochettino.
[00:00:58.900 --> 00:01:01.580]   Quindi cosa hanno pensato?
[00:01:01.580 --> 00:01:07.660]   Hanno pensato che dobbiamo trovare un modello di finanziamento della sanità
[00:01:07.660 --> 00:01:15.340]   che non rimborsi i poveri, quindi non redistribuisca in base al reddito,
[00:01:15.340 --> 00:01:18.900]   ma redistribuisca in base all'età.
[00:01:18.900 --> 00:01:23.640]   E quindi hanno pensato all'assicurazione privata.
[00:01:23.640 --> 00:01:26.460]   Quindi l'assicurazione privata, ragazzi,
[00:01:26.840 --> 00:01:28.740]   è proprio con questa idea.
[00:01:28.740 --> 00:01:29.660]   Perché?
[00:01:29.660 --> 00:01:33.140]   L'assicurazione privata come funziona?
[00:01:33.140 --> 00:01:37.580]   Io pago i miei soldi di premia assicurativa.
[00:01:37.580 --> 00:01:40.000]   In teoria, io pago i miei soldi di premia assicurativa.


whisper_print_timings:     load time =  5547.42 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =   256.76 ms
whisper_print_timings:   sample time =  1626.03 ms /  2060 runs (    0.79 ms per run)
whisper_print_timings:   encode time = 249195.28 ms /     5 runs (49839.05 ms per run)
whisper_print_timings:   decode time =   887.24 ms /     7 runs (  126.75 ms per run)
whisper_print_timings:   batchd time = 94935.29 ms /  2037 runs (   46.61 ms per run)
whisper_print_timings:   prompt time = 21945.01 ms /   580 runs (   37.84 ms per run)
whisper_print_timings:    total time = 374486.72 ms

So yeah, it's definitely better than your previous output, and with some tuning of afftdn the transcription should get more precise.
By the way, I assume you're Italian. In my opinion the transcription isn't bad at all, and I think with a bit of tuning it would come out perfectly. Maybe your professor could get a better microphone :) Anyway, if you're using this for your final-year school project (maturità) or something similar, contact me, we could collaborate on something.
Good luck
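One way to approach the afftdn tuning mentioned above is to generate several denoised variants with different `nr` (noise reduction, in dB) values and transcribe each one. The sketch below only prints the ffmpeg commands so they can be reviewed first; the function name `afftdn_sweep`, the output naming scheme and the candidate values 10, 20 and 30 are assumptions for illustration, while `nf=-20` and `tn=1` are kept from the command in this answer.

```shell
# Print one ffmpeg command per candidate noise-reduction strength.
# Nothing is executed here; pipe the output to `sh` to actually run it.
# usage: afftdn_sweep INPUT.wav NR [NR ...]
# Note: file names containing spaces are not handled in this sketch.
afftdn_sweep() {
    in_wav=$1
    shift
    for nr in "$@"; do
        # Output name is derived from the input, e.g. foo.wav -> foo_nr20.wav
        printf 'ffmpeg -y -i %s -af "afftdn=nr=%s:nf=-20:tn=1" %s\n' \
            "$in_wav" "$nr" "${in_wav%.wav}_nr${nr}.wav"
    done
}

afftdn_sweep output_dirty.wav 10 20 30
```

Each resulting file can then be passed to `./main` the same way as `output.wav` above, and the transcripts compared by ear against the recording.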

1 reply

i4lina Mar 13, 2024
Author

Hey, can I have some contact info to keep this conversation going somewhere more appropriate? @zubbyy

zubbyy · 2024-04-03T16:47:14Z

Sorry, I've only just seen the email for some reason(?). You can find me on Telegram at @zubbyTM.
