
metal : separate scale and mask from QKT in FA kernel #9189

Merged: ggerganov merged 3 commits into master from gg/metal-fix-fa-3 on Aug 26, 2024

Conversation

ggerganov (Member)
Alternative to #9187.

With this change, each thread in the simdgroup accesses the same data (`ss[j*TF + tiisg]`) when applying the scale, mask, and logit softcap, and again in the softmax that follows. This way, no synchronization is necessary.
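A minimal MSL sketch of the access pattern described above (a hypothetical stand-alone kernel, not the actual llama.cpp code: the signature, buffer layout, and `logit_softcap` handling are assumptions, and in the real FA kernel `ss` is a threadgroup buffer inside a much larger kernel rather than a separate pass):

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: each SIMD lane reads and writes only its own slot
// ss[j*TF + tiisg], the same slot it produced during the QK^T
// accumulation, so scale/mask/softcap and the following softmax
// need no barrier between them.
kernel void scale_mask_softcap(
        device       half  * ss            [[buffer(0)]], // scores, Q rows of stride TF
        device const half  * mask          [[buffer(1)]], // additive attention mask
        constant     float & scale         [[buffer(2)]], // e.g. 1/sqrt(head_dim)
        constant     float & logit_softcap [[buffer(3)]], // 0.0f disables capping
        constant     int   & TF            [[buffer(4)]],
        constant     int   & Q             [[buffer(5)]],
        ushort tiisg [[thread_index_in_simdgroup]]) {
    for (int j = 0; j < Q; ++j) {
        float s = float(ss[j*TF + tiisg]);

        s *= scale;
        if (logit_softcap != 0.0f) {
            s = logit_softcap * tanh(s); // optional logit soft-capping
        }
        s += float(mask[j*TF + tiisg]);  // additive mask (-INF drops a position)

        ss[j*TF + tiisg] = half(s);
    }
}
```

The point of separating this from the QK^T step, as the title says, is that each lane then only ever touches data it already owns, instead of participating in a fused step that needs cross-lane coordination.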

slaren (Member) commented on Aug 26, 2024:

This also works on M3 Max. Though if the data is actually local to the thread, then storing it in a local variable may be faster than using shared memory.
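A hedged sketch of that suggestion, reusing the assumed names from the sketch above (again not the actual kernel code): since each lane only ever touches its own `ss[j*TF + tiisg]` slot, the running scores can live in a thread-local (register) array instead of threadgroup memory, which is what the follow-up commit "metal : keep data in local memory" does.

```metal
#include <metal_stdlib>
using namespace metal;

constexpr constant int Q = 8; // assumed rows per simdgroup (compile-time constant)

// Sketch: the per-lane score column is kept in a thread-local array
// s[Q] (registers) for the whole scale/mask/softmax pass, avoiding
// the shared-memory round trip entirely.
kernel void scores_in_registers(
        device       half  * out   [[buffer(0)]], // QK^T results, Q rows of stride TF
        device const half  * mask  [[buffer(1)]],
        constant     float & scale [[buffer(2)]],
        constant     int   & TF    [[buffer(3)]],
        ushort tiisg [[thread_index_in_simdgroup]]) {
    float s[Q]; // lane-private: no barrier, no threadgroup traffic

    for (int j = 0; j < Q; ++j) {
        s[j] = float(out[j*TF + tiisg])*scale + float(mask[j*TF + tiisg]);
    }

    // The softmax would continue on s[] here; per-row reductions across
    // the simdgroup can use simd_max()/simd_sum() instead of shared memory.

    for (int j = 0; j < Q; ++j) {
        out[j*TF + tiisg] = half(s[j]);
    }
}
```

Threadgroup memory is fast on Apple GPUs, so the expected win from a change like this is small, mostly the removed synchronization and the saved loads/stores, which is consistent with the ~1% improvements reported below.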

ggerganov (Member, Author):
Yup, there is a small improvement by keeping it local:

```sh
make -j && ./scripts/compare-commits.sh \
    e865686c218a5eb2be1a537e3a5e7b9a2acefdde \
    ff23e8e9f09be1eefefd9e1940aa1018854604fe \
    -m ./models/tinyllama-1b/ggml-model-f16.gguf \
    -m ./models/tinyllama-1b/ggml-model-q8_0.gguf \
    -m ./models/tinyllama-1b/ggml-model-q4_0.gguf \
    -m ./models/llama-8b-v3/ggml-model-f16.gguf -r 10 -fa 1 -t 4
```
| CPU | Model | Model Size [GiB] | Test | t/s e865686 | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| M2 Ultra | llama 1B F16 | 2.05 | pp512 | 7620.16 | 7711.19 | 1.01 |
| M2 Ultra | llama 1B F16 | 2.05 | tg128 | 150.25 | 151.36 | 1.01 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | pp512 | 6965.17 | 7041.40 | 1.01 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | tg128 | 242.68 | 243.96 | 1.01 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | pp512 | 6875.29 | 6924.46 | 1.01 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | tg128 | 207.83 | 207.43 | 1.00 |
| M2 Ultra | llama 8B F16 | 14.96 | pp512 | 1358.81 | 1396.95 | 1.03 |
| M2 Ultra | llama 8B F16 | 14.96 | tg128 | 38.84 | 38.89 | 1.00 |

slaren (Member) commented on Aug 26, 2024:

I tried to test it, but I am afraid that this MBP doesn't handle the summer heat very well, and the results are very inconsistent even with a large number of repetitions. Overall it seems faster, though.

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp512 | 4921.15 | 4908.74 | 1.00 |
| | llama 7B F16 | 12.55 | 6738415616 | pp512 | 823.43 | 861.02 | 1.05 |

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp2048 | 4425.88 | 3649.76 | 0.82 |
| | llama 7B F16 | 12.55 | 6738415616 | pp2048 | 627.21 | 452.36 | 0.72 |
| | llama 7B Q4_0 | 3.56 | 6738415616 | pp2048 | 660.62 | 375.44 | 0.57 |

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp2048 | 2503.88 | 2598.13 | 1.04 |
| | llama 7B F16 | 12.55 | 6738415616 | pp2048 | 331.05 | 372.34 | 1.12 |
| | llama 7B Q4_0 | 3.56 | 6738415616 | pp2048 | 440.86 | 473.36 | 1.07 |

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp2048 | 2529.39 | 2985.77 | 1.18 |
| | llama 7B F16 | 12.55 | 6738415616 | pp2048 | 392.69 | 426.41 | 1.09 |
| | llama 7B Q4_0 | 3.56 | 6738415616 | pp2048 | 500.22 | 511.85 | 1.02 |

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp512 | 3095.63 | 3463.95 | 1.12 |
| | llama 7B F16 | 12.55 | 6738415616 | pp512 | 425.79 | 434.49 | 1.02 |
| | llama 7B Q4_0 | 3.56 | 6738415616 | pp512 | 526.61 | 523.61 | 0.99 |

ggerganov merged commit 06658ad into master on Aug 26, 2024. 49 checks passed.
ggerganov deleted the gg/metal-fix-fa-3 branch on August 26, 2024 at 15:31.
arthw pushed commits to arthw/llama.cpp that referenced this pull request on Nov 15, 2024 and Nov 18, 2024, and Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced it on Feb 25, 2025. Each carried the PR's three commits:

* metal : separate scale and mask from QKT in FA kernel
* metal : ne01 check no longer necessary
* metal : keep data in local memory