
metal : separate scale and mask from QKT in FA kernel #9189

Merged: ggerganov merged 3 commits into master from gg/metal-fix-fa-3 on Aug 26, 2024

Conversation

ggerganov (Member)
Alternative to #9187.

With this change, each thread in the simdgroup accesses the same data (`ss[j*TF + tiisg]`) when applying the scale, mask, and logit softcap, and again in the softmax that follows. This way, no synchronization is necessary.
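A minimal MSL sketch of the access pattern described above (a hypothetical stand-alone kernel, not the actual llama.cpp code: the signature, buffer layout, and `logit_softcap` handling are assumptions, and in the real FA kernel `ss` is a threadgroup buffer inside a much larger kernel rather than a separate pass):

```metal
#include <metal_stdlib>
using namespace metal;

// Sketch: each SIMD lane reads and writes only its own slot
// ss[j*TF + tiisg], the same slot it produced during the QK^T
// accumulation, so scale/mask/softcap and the following softmax
// need no barrier between them.
kernel void scale_mask_softcap(
        device       half  * ss            [[buffer(0)]], // scores, Q rows of stride TF
        device const half  * mask          [[buffer(1)]], // additive attention mask
        constant     float & scale         [[buffer(2)]], // e.g. 1/sqrt(head_dim)
        constant     float & logit_softcap [[buffer(3)]], // 0.0f disables capping
        constant     int   & TF            [[buffer(4)]],
        constant     int   & Q             [[buffer(5)]],
        ushort tiisg [[thread_index_in_simdgroup]]) {
    for (int j = 0; j < Q; ++j) {
        float s = float(ss[j*TF + tiisg]);

        s *= scale;
        if (logit_softcap != 0.0f) {
            s = logit_softcap * tanh(s); // optional logit soft-capping
        }
        s += float(mask[j*TF + tiisg]);  // additive mask (-INF drops a position)

        ss[j*TF + tiisg] = half(s);
    }
}
```

The point of separating this from the QK^T step, as the title says, is that each lane then only ever touches data it already owns, instead of participating in a fused step that needs cross-lane coordination.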

slaren (Member) commented on Aug 26, 2024:

This also works on M3 Max. Though if the data is actually local to the thread, then storing it in a local variable may be faster than using shared memory.
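A hedged sketch of that suggestion, reusing the assumed names from the sketch above (again not the actual kernel code): since each lane only ever touches its own `ss[j*TF + tiisg]` slot, the running scores can live in a thread-local (register) array instead of threadgroup memory, which is what the follow-up commit "metal : keep data in local memory" does.

```metal
#include <metal_stdlib>
using namespace metal;

constexpr constant int Q = 8; // assumed rows per simdgroup (compile-time constant)

// Sketch: the per-lane score column is kept in a thread-local array
// s[Q] (registers) for the whole scale/mask/softmax pass, avoiding
// the shared-memory round trip entirely.
kernel void scores_in_registers(
        device       half  * out   [[buffer(0)]], // QK^T results, Q rows of stride TF
        device const half  * mask  [[buffer(1)]],
        constant     float & scale [[buffer(2)]],
        constant     int   & TF    [[buffer(3)]],
        ushort tiisg [[thread_index_in_simdgroup]]) {
    float s[Q]; // lane-private: no barrier, no threadgroup traffic

    for (int j = 0; j < Q; ++j) {
        s[j] = float(out[j*TF + tiisg])*scale + float(mask[j*TF + tiisg]);
    }

    // The softmax would continue on s[] here; per-row reductions across
    // the simdgroup can use simd_max()/simd_sum() instead of shared memory.

    for (int j = 0; j < Q; ++j) {
        out[j*TF + tiisg] = half(s[j]);
    }
}
```

Threadgroup memory is fast on Apple GPUs, so the expected win from a change like this is small, mostly the removed synchronization and the saved loads/stores, which is consistent with the ~1% improvements reported below.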

ggerganov (Member, Author):
Yup, there is a small improvement by keeping it local:

```sh
make -j && ./scripts/compare-commits.sh \
    e865686c218a5eb2be1a537e3a5e7b9a2acefdde \
    ff23e8e9f09be1eefefd9e1940aa1018854604fe \
    -m ./models/tinyllama-1b/ggml-model-f16.gguf \
    -m ./models/tinyllama-1b/ggml-model-q8_0.gguf \
    -m ./models/tinyllama-1b/ggml-model-q4_0.gguf \
    -m ./models/llama-8b-v3/ggml-model-f16.gguf -r 10 -fa 1 -t 4
```
| CPU | Model | Model Size [GiB] | Test | t/s e865686 | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| M2 Ultra | llama 1B F16 | 2.05 | pp512 | 7620.16 | 7711.19 | 1.01 |
| M2 Ultra | llama 1B F16 | 2.05 | tg128 | 150.25 | 151.36 | 1.01 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | pp512 | 6965.17 | 7041.40 | 1.01 |
| M2 Ultra | llama 1B Q4_0 | 0.59 | tg128 | 242.68 | 243.96 | 1.01 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | pp512 | 6875.29 | 6924.46 | 1.01 |
| M2 Ultra | llama 1B Q8_0 | 1.09 | tg128 | 207.83 | 207.43 | 1.00 |
| M2 Ultra | llama 8B F16 | 14.96 | pp512 | 1358.81 | 1396.95 | 1.03 |
| M2 Ultra | llama 8B F16 | 14.96 | tg128 | 38.84 | 38.89 | 1.00 |

slaren (Member) commented on Aug 26, 2024:

I tried to test it, but I am afraid that this MBP doesn't handle the summer heat very well, and the results are very inconsistent even with a large number of repetitions. Overall it seems faster, though.

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp512 | 4921.15 | 4908.74 | 1.00 |
| | llama 7B F16 | 12.55 | 6738415616 | pp512 | 823.43 | 861.02 | 1.05 |

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp2048 | 4425.88 | 3649.76 | 0.82 |
| | llama 7B F16 | 12.55 | 6738415616 | pp2048 | 627.21 | 452.36 | 0.72 |
| | llama 7B Q4_0 | 3.56 | 6738415616 | pp2048 | 660.62 | 375.44 | 0.57 |

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp2048 | 2503.88 | 2598.13 | 1.04 |
| | llama 7B F16 | 12.55 | 6738415616 | pp2048 | 331.05 | 372.34 | 1.12 |
| | llama 7B Q4_0 | 3.56 | 6738415616 | pp2048 | 440.86 | 473.36 | 1.07 |

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp2048 | 2529.39 | 2985.77 | 1.18 |
| | llama 7B F16 | 12.55 | 6738415616 | pp2048 | 392.69 | 426.41 | 1.09 |
| | llama 7B Q4_0 | 3.56 | 6738415616 | pp2048 | 500.22 | 511.85 | 1.02 |

| CPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s gg/metal-fix-fa-3 | Speedup |
| --- | --- | ---: | ---: | --- | ---: | ---: | ---: |
| | llama 1B Q8_0 | 1.09 | 1100048384 | pp512 | 3095.63 | 3463.95 | 1.12 |
| | llama 7B F16 | 12.55 | 6738415616 | pp512 | 425.79 | 434.49 | 1.02 |
| | llama 7B Q4_0 | 3.56 | 6738415616 | pp512 | 526.61 | 523.61 | 0.99 |

ggerganov merged commit 06658ad into master on Aug 26, 2024. 49 checks passed.
ggerganov deleted the gg/metal-fix-fa-3 branch on August 26, 2024 at 15:31.
arthw pushed commits to arthw/llama.cpp that referenced this pull request on Nov 15, 2024 and Nov 18, 2024, and Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced it on Feb 25, 2025. Each carried the PR's three commits:

* metal : separate scale and mask from QKT in FA kernel
* metal : ne01 check no longer necessary
* metal : keep data in local memory