Try to add AVX 512-bit support #95
Conversation
Hi @ggerganov, I gave it a try out of curiosity on my i7-1165G7 on Ubuntu 22.04. It does not work, but unfortunately there isn't much to report: the script just runs forever. Let me know if there is a way to provide verbose logs.
And my cpuinfo:
Yes, it's interesting!
Seems it's much slower for all models.
CPU info (all): fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave
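If it helps with debugging, here is a quick standalone check of what the running CPU actually reports (a sketch using GCC/Clang's __builtin_cpu_supports builtin; it is not part of whisper.cpp):

```c
#include <stdio.h>

int main(void) {
    // __builtin_cpu_supports is a GCC/Clang builtin for x86 feature detection.
    printf("AVX2:     %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("AVX-512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}
```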
@lerela @ArtyomZemlyak
Still not working (I tried with the …). There is a lot of variance between runs, but here are some timings:
Called with: …
Tested on my Xeon(R) Silver 4210R CPU @ 2.40GHz (VM with 8 cores only), avx512 branch:
master:
ggml.c (outdated)
const __m512 sum23 = _mm512_add_ps(sum2, sum3);
const __m512 sum0123 = _mm512_add_ps(sum01, sum23);

sumf = sum0123[0] + sum0123[1] + sum0123[2] + sum0123[3] + sum0123[4] + sum0123[5] + sum0123[6] + sum0123[7];
I think this is the bug: it should sum up to sum0123[15].
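For reference, a minimal sketch of the full 16-lane reduction, assuming AVX-512F is available (the helper name is illustrative; _mm512_reduce_add_ps is the AVX-512F intrinsic that sums all 16 lanes):

```c
#include <immintrin.h>

// Horizontal sum of all 16 float lanes of a __m512 (requires AVX-512F).
// Equivalent to sum0123[0] + sum0123[1] + ... + sum0123[15].
static inline float reduce_add_f32x16(const __m512 v) {
    return _mm512_reduce_add_ps(v);
}

// e.g. in the snippet above: sumf = reduce_add_f32x16(sum0123);
```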
@jaybinks
I'll test it soon...
What timezone are you in?
I'm in GMT+10 (Australia). Happy to jump on a call and test with you if you like.
On Sun, 6 Nov 2022, 6:57 am Georgi Gerganov wrote:
> @jaybinks <https://github.com/jaybinks> Just pushed another fix
I have re-tested, and it seems to be no better (possibly worse).
Before pulling your recent change:
After commit c435035:
I'm not convinced it's worse; some runs were "total time 12 sec".
I found a machine with an AVX-512 CPU and fixed the code. It now produces correct results, but the performance compared to AVX2 is worse. Either I am not utilising the 512-bit instruction set correctly, or it simply does not provide any benefit for this type of computation. I'll leave this draft PR open if other people are interested in giving it a try, but for now I am not going to merge it.
Hey, do you have any plans to add GPU support to whisper.cpp? Just curious what your plans are.
Adding GPU support is not out of the question, but it's low priority atm. Here are some additional thoughts on this: #126
May I suggest #define-ing the AVX instructions per SIMD flavor instead of #define-ing big code blocks? It's much easier that way to support different flavors of SIMD. You need to #define the number of floats that fit in the SIMD register of course, and alter all the code to use SIMD_Register_Float_Size instead of the hardcoded sizes. It turns out not to be much work. The example below senses AVX and uses 256-bit AVX, else regresses to 128-bit. For Whisper you probably want a 512 and a 256 case.

    #if defined(AVX)
    #define SIMD_Int m256i_i32
    #else
    #define SIMD_Int m128i_i32
    #endif
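Expanding that idea into a minimal sketch (the macro names such as SIMD_FLOAT and SIMD_Register_Float_Size are illustrative, not taken from ggml.c, and the preprocessor tests assume the usual compiler-defined __AVX512F__/__AVX__/__FMA__ macros):

```c
#include <immintrin.h>

// Pick a register width once; kernels are then written against these macros
// and step by SIMD_Register_Float_Size floats per iteration.
#if defined(__AVX512F__)
    #define SIMD_FLOAT                __m512
    #define SIMD_Register_Float_Size  16
    #define SIMD_SETZERO_PS()         _mm512_setzero_ps()
    #define SIMD_LOADU_PS(p)          _mm512_loadu_ps(p)
    #define SIMD_FMADD_PS(a, b, c)    _mm512_fmadd_ps(a, b, c)
#elif defined(__AVX__) && defined(__FMA__)
    #define SIMD_FLOAT                __m256
    #define SIMD_Register_Float_Size  8
    #define SIMD_SETZERO_PS()         _mm256_setzero_ps()
    #define SIMD_LOADU_PS(p)          _mm256_loadu_ps(p)
    #define SIMD_FMADD_PS(a, b, c)    _mm256_fmadd_ps(a, b, c)
#else
    #define SIMD_FLOAT                __m128
    #define SIMD_Register_Float_Size  4
    #define SIMD_SETZERO_PS()         _mm_setzero_ps()
    #define SIMD_LOADU_PS(p)          _mm_loadu_ps(p)
    #define SIMD_FMADD_PS(a, b, c)    _mm_add_ps(_mm_mul_ps(a, b), c)
#endif
```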
FWIW, I remember from my own tests several years ago (around the first Coffee Lake processors) that using AVX-512 intrinsics caused frequency throttling that ended up in performance losses. This was confirmed by Intel engineers. I don't know if that's still the case (and I can't test it), or what's happening on ARM platforms, though.
Update:
This PR adds AVX 512-bit support. The performance compared to AVX2 is worse. Either I am not utilising the 512-bit instruction set correctly, or it simply does not provide any benefit for this type of computation. I'll leave this draft PR open if other people are interested in giving it a try, but for now I am not going to merge it.
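For context, the core of the change is a 512-bit FMA dot product along these lines (a simplified sketch rather than the exact ggml.c code; it assumes AVX-512F and that n is a multiple of 16):

```c
#include <immintrin.h>

// Simplified f32 dot product using 512-bit registers (16 floats per iteration).
// The real code also handles f16 inputs and leftover elements.
static float dot_f32_avx512(const float * x, const float * y, const int n) {
    __m512 acc = _mm512_setzero_ps();
    for (int i = 0; i < n; i += 16) {
        const __m512 vx = _mm512_loadu_ps(x + i);
        const __m512 vy = _mm512_loadu_ps(y + i);
        acc = _mm512_fmadd_ps(vx, vy, acc); // acc += vx * vy
    }
    return _mm512_reduce_add_ps(acc);       // horizontal sum of the 16 lanes
}
```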
OUTDATED BELOW
Work in progress
This is not tested because I don't have an AVX 512-bit CPU, so it is very likely that the code will fail.
Still, I would appreciate it if someone gives it a try and reports any issues.
@ArtyomZemlyak
Are you interested in giving it a try? I noticed you have a CPU with AVX 512-bit support.