
Rebalancing Metal thread workload in dot product kernel kernel_mul_mv_f16_f32_l4 #7522

Open
wants to merge 4 commits into master

Conversation

@izard commented May 24, 2024

This pull request is related to issue #6089. When profiling the Metal implementation for large (16k+ token) prompts, I found that most of the time is spent in the kernel_mul_mv_f16_f32_l4 Metal kernel. During this time GPU ALU utilization is 7%, because the current implementation fires as many threads as there are tokens, and each thread only performs 4 FP operations (plus a reduction), so the GPU is mostly busy starting and stopping threads. This applies to non-batched generation; with batching, utilization goes up.

This change spawns 32x fewer threads, with each thread performing 32x more operations. This brings GPU ALU utilization to 99% and provides a significant generation-speed improvement for large contexts.
For a 16384-token context, I measured a 1.3x improvement on M2 Max; for a 96k context, I measured a 1.8x improvement on M2 Max and a 2.4x improvement on M3 Max.

For small contexts (less than 1k) I measure the same or slightly worse performance; to avoid this, the kernel selector line
if (ne01 > 128) {
could be replaced with, e.g.,
if (ne01 > 8192) {
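
To illustrate the rebalancing idea, here is a deliberately simplified, self-contained sketch (not the actual ggml kernel: the kernel name, buffer layout, and the NCHUNK constant are illustrative assumptions). Each thread accumulates NCHUNK consecutive float4 chunks of the dot product instead of a single one, so the host dispatches NCHUNK-times fewer threads for the same amount of work:

```metal
// Illustrative sketch only, not the ggml kernel: each thread handles NCHUNK
// 4-wide chunks (128 FP mul-adds for NCHUNK = 32) instead of a single chunk,
// so NCHUNK-times fewer threads are dispatched.
#include <metal_stdlib>
using namespace metal;

constant uint NCHUNK = 32; // assumed rebalancing factor (32x fewer threads)

kernel void dot_f16_f32_chunked(
        device const half4  * x   [[buffer(0)]], // f16 row data, viewed as half4
        device const float4 * y   [[buffer(1)]], // f32 vector, viewed as float4
        device       float  * dst [[buffer(2)]], // one partial sum per thread
        constant     uint   & n4  [[buffer(3)]], // row length in float4 units
        uint tid [[thread_position_in_grid]]) {
    float sum = 0.0f;
    // walk NCHUNK consecutive float4 chunks, clamped to the row length
    for (uint i = tid*NCHUNK; i < min((tid + 1)*NCHUNK, n4); ++i) {
        sum += dot(float4(x[i]), y[i]);
    }
    dst[tid] = sum; // partial sums are reduced in a separate (omitted) pass
}
```

The host side would then dispatch correspondingly fewer threads, analogous to the ne01/32 vs. ne01 threadgroup dispatch used in this PR.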

izard added 3 commits May 24, 2024 11:50
Most of the time, kernel_mul_mv_f16_f32_l4 is called to perform 4 FP ops per thread. Added kernel_mul_mv_f16_f32_l4_large, which performs 128 FP ops per thread with 32x fewer threads.
…l4_large

Replaced the call to kernel_mul_mv_f16_f32_l4 with kernel_mul_mv_f16_f32_l4_large for vectors larger than 128 elements.
@mofosyne added the "Review Complexity: Medium" and "Apple Metal" labels on May 25, 2024
@ggerganov (Member)

The following command generates garbage:

make -j && ./main -m ./models/mistral-7b-v0.2/ggml-model-fp16.gguf -p "I believe the meaning of life is" -n 64 -s 2 -ngl 99 --temp 0 -t 4

<s> I believe the meaning of life is to work▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Here is a possible fix:

diff --git a/ggml-metal.m b/ggml-metal.m
index 3b525071..7a758fb2 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -1574,6 +1574,8 @@ static enum ggml_status ggml_metal_graph_compute(
 
                             id<MTLComputePipelineState> pipeline = nil;
 
+                            bool is_large = false;
+
                             // use custom matrix x vector kernel
                             switch (src0t) {
                                 case GGML_TYPE_F32:
@@ -1592,6 +1594,7 @@ static enum ggml_status ggml_metal_graph_compute(
                                             } else if (ne00 >= 128 && ne01 >= 8 && ne00%4 == 0) {
                                                 if (ne01 > 128) {
                                                     pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MV_F16_F32_L4_LARGE].pipeline;
+                                                    is_large = true;
                                                 } else {
                                                     pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MV_F16_F32_L4].pipeline;
                                                 }
@@ -1784,7 +1787,7 @@ static enum ggml_status ggml_metal_graph_compute(
                                 [encoder dispatchThreadgroups:MTLSizeMake((ne01 + 1)/2, ne11, ne12*ne13) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
                             } else {
                                 const int64_t ny = (ne11 + nrows - 1)/nrows;
-                                if (ne01 > 128) {
+                                if (is_large) {
                                     [encoder dispatchThreadgroups:MTLSizeMake(ne01/32, ny, ne12*ne13) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
                                 } else {
                                     [encoder dispatchThreadgroups:MTLSizeMake(ne01, ny, ne12*ne13) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];

@ggerganov (Member) commented May 25, 2024

This change is not clear-cut. For example, for Mistral 7B, where the head size is 128, there is indeed a performance improvement:

make -j llama-bench && ./scripts/compare-commits.sh master pr/7522 -m models/mistral-7b-v0.2/ggml-model-fp16.gguf -t 4 -p 0 -n 0 -pg 512,128 -pg 1024,128 -pg 2048,128
| CPU | Model | Test | t/s master | t/s pr/7522 | Speedup |
| --- | --- | --- | --- | --- | --- |
| M2 Ultra | llama 7B F16 | pp512+tg128 | 171.77 | 170.79 | 0.99 |
| M2 Ultra | llama 7B F16 | pp1024+tg128 | 274.28 | 276.58 | 1.01 |
| M2 Ultra | llama 7B F16 | pp2048+tg128 | 417.75 | 432.48 | 1.04 |

However, for Gemma 2B, where the head size is 256, there is a significant regression:

make -j llama-bench && ./scripts/compare-commits.sh master pr/7522 -m models/gemma-2b/ggml-model-f16.gguf -t 4 -p 0 -n 0 -pg 512,128 -pg 1024,128 -pg 2048,128
| CPU | Model | Test | t/s master | t/s pr/7522 | Speedup |
| --- | --- | --- | --- | --- | --- |
| M2 Ultra | gemma 2B F16 (guessed) | pp512+tg128 | 441.31 | 374.80 | 0.85 |
| M2 Ultra | gemma 2B F16 (guessed) | pp1024+tg128 | 723.41 | 590.80 | 0.82 |
| M2 Ultra | gemma 2B F16 (guessed) | pp2048+tg128 | 1152.24 | 888.42 | 0.77 |

For Phi-3 where the head size is 96, there is no difference between this PR and master:

make -j llama-bench && ./scripts/compare-commits.sh master pr/7522 -m models/phi-3-mini-128k-instruct/ggml-model-f16.gguf -t 4 -p 0 -n 0 -pg 512,128 -pg 2048,128 -pg 8192,128 -pg 32768,128
| CPU | Model | Test | t/s master | t/s pr/7522 | Speedup |
| --- | --- | --- | --- | --- | --- |
| M2 Ultra | phi3 3B F16 | pp512+tg128 | 259.57 | 259.96 | 1.00 |
| M2 Ultra | phi3 3B F16 | pp2048+tg128 | 613.04 | 613.62 | 1.00 |
| M2 Ultra | phi3 3B F16 | pp8192+tg128 | 922.70 | 923.41 | 1.00 |

These tests are all with Flash Attention disabled. If we enable Flash Attention, this kernel is never executed, and the performance is universally better thanks to the more efficient attention computation via FA.

izard added a commit: large dot product kernel selection is now consistent
@izard (Author) commented May 25, 2024

Thank you very much for the fix which lets it consistently select the new kernel.

Regarding the performance difference, I have to admit that I overlooked testing smaller models and small context sizes (I only ran several sanity checks), and I committed it with the >128 check rather than >8192 only so that testing would catch more bugs.

Thank you for bringing up Flash Attention. I quickly ran several tests on my M2 Max and M3 Max (I need to get to the office on Tuesday to run more tests on various Mac systems). I see mixed results comparing the performance of Flash Attention vs. this patch, so I'll profile Flash Attention on a large context on Tuesday to see whether the corresponding hotspot Metal kernel is utilising memory bandwidth and ALUs well.

Command line:
bin/main -m model.gguf -f prompt.txt -c 49152 -n 128 --temp 0
Results (mind the Y axis range, in milliseconds per token eval time):
[Graph: eval time per token in milliseconds; image not reproduced]
(Updated the graph; re-ran the benchmarks with "high power mode" on. The numbers did not change much since originally posted, but the difference is now somewhat smaller.)
Looking at this performance data, I think it might make sense to add this as a possible optimisation behind a command line parameter.

@izard (Author) commented May 29, 2024

With Flash Attention on, kernel_flash_attn_ext_vec_f16_h128 is where ~70% of the time is spent during large-context eval. It is very far from being limited by memory throughput, and average GPU ALU utilization is 28%. Actually, utilization is 35% on active cores, but every 5th GPU core is just sitting idle. The kernel is quite complex, so it would take me some time to understand how to improve GPU utilization, but dealing with the idle GPU cores should be easier and could yield up to ~12% perf improvement.

Unlike on MacBook Pro, when benchmarking different LLMs on Mac Studio I see consistently better performance with Flash Attention compared to this dot product fix.

@ggerganov (Member)

> Unlike on MacBook Pro, when benchmarking different LLMs on Mac Studio I see consistently better performance with Flash Attention compared to this dot product fix.

I guess it's because I'm developing on a Mac Studio, so the kernel performs better in that case. It goes back to the same problem described in #6089 (comment): I'm not sure what the proper way is to write the Metal kernels so that they perform optimally on all chips and models.

@izard (Author) commented May 29, 2024

I profiled the Flash Attention kernel on Mac Studio, and while it is still significantly faster than the original attention and the attention with this fix, it consistently leaves 20% of the GPU cores completely idle on every Mac type I tested, and slightly underutilizes the 80% of GPU cores it is running on.

So I no longer see a benefit to including this patch: it is significantly faster than the original attention, but somewhat slower than Flash Attention in most configs, except for a few corner cases end users probably would not care about.

So I'll check how to improve flash attention Metal kernel performance now.

@ggerganov (Member)

> So I'll check how to improve flash attention Metal kernel performance now.

Yes, we can potentially benefit a lot from optimizing the Flash Attention kernels. Also, one big limitation is that the Head Size = 256 kernels run out of registers, so as of now they are disabled. This means that models like Gemma that use HS = 256 cannot run with Flash Attention enabled.

@izard (Author) commented May 30, 2024

> Head Size = 256 kernels run out of registers, so as of now they are disabled

This is platform dependent; newer GPUs no longer have this constraint (they use cache as registers, then just spill to memory). I'll check it too; I found your fix 3 days ago.

@ggerganov (Member)

Yes, it does work on M2 Ultra, maybe thanks to the new mechanism to spill into memory, but it is very slow.

@izard (Author) commented May 30, 2024

> Yes, it does work on M2 Ultra, maybe thanks to the new mechanism to spill into memory, but it is very slow.

The spilling mechanism becomes pretty efficient with the M3 GPU.

@izard (Author) commented Jun 7, 2024

It differs between models, but the first issue is that for most models I tried, the Flash Attention kernel starts only 32 threadgroups, and threadgroups are statically scheduled to cores, so on Macs with more than 32 GPU cores (so most Macs) some cores are just idle. So for a 40-core GPU the optimization potential is ~1.25x. I'll try to see how to move work from threads to threadgroups to improve GPU core utilization.
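
Spelling out the arithmetic behind that estimate (these are just the numbers above, restated):

$$1 - \frac{32\ \text{threadgroups}}{40\ \text{cores}} = 20\%\ \text{idle cores}, \qquad \text{potential speedup} \approx \frac{40}{32} = 1.25\times$$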

The second issue is that the kernel is long and complex, so the resulting ALU utilization on active GPU cores is less than 35%, while memory throughput is not a limiting factor. I suspect that with clever optimizations there is up to another ~1.6x improvement potential here; I'll try to see if I can do something about it after core utilization is fixed.

@mengbingrock

> This pull request is related to issue #6089. When profiling the Metal implementation for large (16k+ token) prompts, I found that most of the time is spent in the kernel_mul_mv_f16_f32_l4 Metal kernel. During this time GPU ALU utilization is 7%, because the current implementation fires as many threads as there are tokens, and each thread only performs 4 FP operations (plus a reduction), so the GPU is mostly busy starting and stopping threads. This applies to non-batched generation; with batching, utilization goes up.
>
> This change spawns 32x fewer threads, with each thread performing 32x more operations. This brings GPU ALU utilization to 99% and provides a significant generation-speed improvement for large contexts. For a 16384-token context, I measured a 1.3x improvement on M2 Max; for a 96k context, I measured a 1.8x improvement on M2 Max and a 2.4x improvement on M3 Max.
>
> For small contexts (less than 1k) I measure the same or slightly worse performance; to avoid this, the kernel selector line if (ne01 > 128) { could be replaced with, e.g., if (ne01 > 8192) {

Hi @izard,
I'm new to Metal debugging; could you please give me a gentle hint on how to get started? Should I compile llama.cpp in Xcode using the Package.swift file, or is the Xcode IDE not needed at all?
Thank you very much for your kind help!

Best
