
Rebalancing Metal thread workload in dot product kernel kernel_mul_mv_f16_f32_l4 #7522

Open
wants to merge 4 commits into master

Conversation

@izard commented May 24, 2024

This pull request is related to issue #6089. When profiling the Metal implementation for large (16k+ token) prompts, I found that most of the time is spent in the kernel_mul_mv_f16_f32_l4 Metal kernel. During this time GPU ALU utilization is 7%, because the current implementation fires as many threads as there are tokens, and each thread only performs 4 FP operations (plus a reduction), so the GPU is mostly busy starting and stopping threads. This applies to non-batched generation; with batching, utilization goes up.

This change spawns 32x fewer threads, with each thread performing 32x more operations. This brings GPU ALU utilization to 99% and provides a significant generation-speed improvement for large contexts.
For a 16384-token context, I measured a 1.3x improvement on M2 Max; for a 96k context, I measured a 1.8x improvement on M2 Max and a 2.4x improvement on M3 Max.

For small contexts (less than 1k) I measure the same or slightly worse performance; to avoid this, the kernel selector line
if (ne01 > 128) {
could be replaced with, e.g.,
if (ne01 > 8192) {
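
To illustrate the rebalancing idea, here is a deliberately simplified, self-contained sketch (not the actual ggml kernel: the kernel name, buffer layout, and the NCHUNK constant are illustrative assumptions). Each thread accumulates NCHUNK consecutive float4 chunks of the dot product instead of a single one, so the host dispatches NCHUNK-times fewer threads for the same amount of work:

```metal
// Illustrative sketch only, not the ggml kernel: each thread handles NCHUNK
// 4-wide chunks (128 FP mul-adds for NCHUNK = 32) instead of a single chunk,
// so NCHUNK-times fewer threads are dispatched.
#include <metal_stdlib>
using namespace metal;

constant uint NCHUNK = 32; // assumed rebalancing factor (32x fewer threads)

kernel void dot_f16_f32_chunked(
        device const half4  * x   [[buffer(0)]], // f16 row data, viewed as half4
        device const float4 * y   [[buffer(1)]], // f32 vector, viewed as float4
        device       float  * dst [[buffer(2)]], // one partial sum per thread
        constant     uint   & n4  [[buffer(3)]], // row length in float4 units
        uint tid [[thread_position_in_grid]]) {
    float sum = 0.0f;
    // walk NCHUNK consecutive float4 chunks, clamped to the row length
    for (uint i = tid*NCHUNK; i < min((tid + 1)*NCHUNK, n4); ++i) {
        sum += dot(float4(x[i]), y[i]);
    }
    dst[tid] = sum; // partial sums are reduced in a separate (omitted) pass
}
```

The host side would then dispatch correspondingly fewer threads, analogous to the ne01/32 vs. ne01 threadgroup dispatch used in this PR.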

izard added 3 commits May 24, 2024 11:50
Most of the time, kernel_mul_mv_f16_f32_l4 is called to perform 4 FP ops per thread. Added kernel_mul_mv_f16_f32_l4_large, which performs 128 FP ops per thread with 32x fewer threads.
…l4_large

Replaced the call to kernel_mul_mv_f16_f32_l4 with kernel_mul_mv_f16_f32_l4_large for vectors larger than 128 elements.
@mofosyne added the "Review Complexity: Medium" and "Apple Metal" labels on May 25, 2024
@ggerganov (Member)

The following command generates garbage:

make -j && ./main -m ./models/mistral-7b-v0.2/ggml-model-fp16.gguf -p "I believe the meaning of life is" -n 64 -s 2 -ngl 99 --temp 0 -t 4

<s> I believe the meaning of life is to work▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅

Here is a possible fix:

diff --git a/ggml-metal.m b/ggml-metal.m
index 3b525071..7a758fb2 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -1574,6 +1574,8 @@ static enum ggml_status ggml_metal_graph_compute(
 
                             id<MTLComputePipelineState> pipeline = nil;
 
+                            bool is_large = false;
+
                             // use custom matrix x vector kernel
                             switch (src0t) {
                                 case GGML_TYPE_F32:
@@ -1592,6 +1594,7 @@ static enum ggml_status ggml_metal_graph_compute(
                                             } else if (ne00 >= 128 && ne01 >= 8 && ne00%4 == 0) {
                                                 if (ne01 > 128) {
                                                     pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MV_F16_F32_L4_LARGE].pipeline;
+                                                    is_large = true;
                                                 } else {
                                                     pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_MUL_MV_F16_F32_L4].pipeline;
                                                 }
@@ -1784,7 +1787,7 @@ static enum ggml_status ggml_metal_graph_compute(
                                 [encoder dispatchThreadgroups:MTLSizeMake((ne01 + 1)/2, ne11, ne12*ne13) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
                             } else {
                                 const int64_t ny = (ne11 + nrows - 1)/nrows;
-                                if (ne01 > 128) {
+                                if (is_large) {
                                     [encoder dispatchThreadgroups:MTLSizeMake(ne01/32, ny, ne12*ne13) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];
                                 } else {
                                     [encoder dispatchThreadgroups:MTLSizeMake(ne01, ny, ne12*ne13) threadsPerThreadgroup:MTLSizeMake(nth0, nth1, 1)];

@ggerganov (Member) commented May 25, 2024

This change is not clear-cut. For example, for Mistral 7B, where the head size is 128, there is indeed a performance improvement:

make -j llama-bench && ./scripts/compare-commits.sh master pr/7522 -m models/mistral-7b-v0.2/ggml-model-fp16.gguf -t 4 -p 0 -n 0 -pg 512,128 -pg 1024,128 -pg 2048,128
| CPU | Model | Test | t/s master | t/s pr/7522 | Speedup |
| --- | --- | --- | --- | --- | --- |
| M2 Ultra | llama 7B F16 | pp512+tg128 | 171.77 | 170.79 | 0.99 |
| M2 Ultra | llama 7B F16 | pp1024+tg128 | 274.28 | 276.58 | 1.01 |
| M2 Ultra | llama 7B F16 | pp2048+tg128 | 417.75 | 432.48 | 1.04 |

However, for Gemma 2B, where the head size is 256, there is a significant regression:

make -j llama-bench && ./scripts/compare-commits.sh master pr/7522 -m models/gemma-2b/ggml-model-f16.gguf -t 4 -p 0 -n 0 -pg 512,128 -pg 1024,128 -pg 2048,128
| CPU | Model | Test | t/s master | t/s pr/7522 | Speedup |
| --- | --- | --- | --- | --- | --- |
| M2 Ultra | gemma 2B F16 (guessed) | pp512+tg128 | 441.31 | 374.80 | 0.85 |
| M2 Ultra | gemma 2B F16 (guessed) | pp1024+tg128 | 723.41 | 590.80 | 0.82 |
| M2 Ultra | gemma 2B F16 (guessed) | pp2048+tg128 | 1152.24 | 888.42 | 0.77 |

For Phi-3 where the head size is 96, there is no difference between this PR and master:

make -j llama-bench && ./scripts/compare-commits.sh master pr/7522 -m models/phi-3-mini-128k-instruct/ggml-model-f16.gguf -t 4 -p 0 -n 0 -pg 512,128 -pg 2048,128 -pg 8192,128 -pg 32768,128
| CPU | Model | Test | t/s master | t/s pr/7522 | Speedup |
| --- | --- | --- | --- | --- | --- |
| M2 Ultra | phi3 3B F16 | pp512+tg128 | 259.57 | 259.96 | 1.00 |
| M2 Ultra | phi3 3B F16 | pp2048+tg128 | 613.04 | 613.62 | 1.00 |
| M2 Ultra | phi3 3B F16 | pp8192+tg128 | 922.70 | 923.41 | 1.00 |

These tests are all with Flash Attention disabled. If we enable Flash Attention, this kernel is never executed, and the performance is universally better thanks to the more efficient attention computation via FA.

izard added a commit: large dot product kernel selection is now consistent
@izard (Author) commented May 25, 2024

Thank you very much for the fix which lets it consistently select the new kernel.

Regarding the performance difference, I have to admit that I overlooked testing smaller models and small context sizes (I only ran several sanity checks), and I committed it with the >128 check rather than >8192 only so that testing would catch more bugs.

Thank you for bringing up Flash Attention. I quickly ran several tests on my M2 Max and M3 Max (I need to get to the office on Tuesday to run more tests on various Mac systems). I see mixed results comparing the performance of Flash Attention vs. this patch, so I'll profile Flash Attention on a large context on Tuesday to see whether the corresponding hotspot Metal kernel is utilising memory bandwidth and ALUs well.

Command line:
bin/main -m model.gguf -f prompt.txt -c 49152 -n 128 --temp 0
Results (mind the Y axis range, in milliseconds per token eval time):
[Graph: eval time per token in milliseconds; image not reproduced]
(Updated the graph; re-ran the benchmarks with "high power mode" on. The numbers did not change much since originally posted, but the difference is now somewhat smaller.)
Looking at this performance data, I think it might make sense to add this as a possible optimisation behind a command line parameter.

@izard (Author) commented May 29, 2024

With Flash Attention on, kernel_flash_attn_ext_vec_f16_h128 is where ~70% of the time is spent during large-context eval. It is very far from being limited by memory throughput, and average GPU ALU utilization is 28%. Actually, utilization is 35% on active cores, but every 5th GPU core is just sitting idle. The kernel is quite complex, so it would take me some time to understand how to improve GPU utilization, but dealing with the idle GPU cores should be easier and could yield up to ~12% perf improvement.

Unlike on MacBook Pro, when benchmarking different LLMs on Mac Studio I see consistently better performance with Flash Attention compared to this dot product fix.

@ggerganov (Member)

> Unlike on MacBook Pro, when benchmarking different LLMs on Mac Studio I see consistently better performance with Flash Attention compared to this dot product fix.

I guess it's because I'm developing on a Mac Studio, so the kernel performs better in that case. It goes back to the same problem described in #6089 (comment): I'm not sure what the proper way is to write the Metal kernels so that they perform optimally on all chips and models.

@izard (Author) commented May 29, 2024

I profiled the Flash Attention kernel on Mac Studio, and while it is still significantly faster than the original attention and the attention with this fix, it consistently leaves 20% of the GPU cores completely idle on every Mac type I tested, and slightly underutilizes the 80% of GPU cores it is running on.

So I no longer see a benefit to including this patch: it is significantly faster than the original attention, but somewhat slower than Flash Attention in most configs, except for a few corner cases end users probably would not care about.

So I'll check how to improve flash attention Metal kernel performance now.

@ggerganov (Member)

> So I'll check how to improve flash attention Metal kernel performance now.

Yes, we can potentially benefit a lot from optimizing the Flash Attention kernels. Also, one big limitation is that the Head Size = 256 kernels run out of registers, so as of now they are disabled. This means that models like Gemma that use HS = 256 cannot run with Flash Attention enabled.

@izard (Author) commented May 30, 2024

> Head Size = 256 kernels run out of registers, so as of now they are disabled

This is platform dependent; newer GPUs no longer have this constraint (they use cache as registers, then just spill to memory). I'll check it too; I found your fix 3 days ago.

@ggerganov (Member)

Yes, it does work on M2 Ultra, maybe thanks to the new mechanism to spill into memory, but it is very slow.

@izard (Author) commented May 30, 2024

> Yes, it does work on M2 Ultra, maybe thanks to the new mechanism to spill into memory, but it is very slow.

The spilling mechanism becomes pretty efficient with the M3 GPU.

@izard (Author) commented Jun 7, 2024

It differs between models, but the first issue is that for most models I tried, the Flash Attention kernel starts only 32 threadgroups, and threadgroups are statically scheduled to cores, so on Macs with more than 32 GPU cores (so most Macs) some cores are just idle. So for a 40-core GPU the optimization potential is ~1.25x. I'll try to see how to move work from threads to threadgroups to improve GPU core utilization.
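
Spelling out the arithmetic behind that estimate (these are just the numbers above, restated):

$$1 - \frac{32\ \text{threadgroups}}{40\ \text{cores}} = 20\%\ \text{idle cores}, \qquad \text{potential speedup} \approx \frac{40}{32} = 1.25\times$$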

The second issue is that the kernel is long and complex, so the resulting ALU utilization on active GPU cores is less than 35%, while memory throughput is not a limiting factor. I suspect that with clever optimizations there is up to another ~1.6x improvement potential here; I'll try to see if I can do something about it after core utilization is fixed.

@mengbingrock

> This pull request is related to issue #6089. When profiling the Metal implementation for large (16k+ token) prompts, I found that most of the time is spent in the kernel_mul_mv_f16_f32_l4 Metal kernel. During this time GPU ALU utilization is 7%, because the current implementation fires as many threads as there are tokens, and each thread only performs 4 FP operations (plus a reduction), so the GPU is mostly busy starting and stopping threads. This applies to non-batched generation; with batching, utilization goes up.
>
> This change spawns 32x fewer threads, with each thread performing 32x more operations. This brings GPU ALU utilization to 99% and provides a significant generation-speed improvement for large contexts. For a 16384-token context, I measured a 1.3x improvement on M2 Max; for a 96k context, I measured a 1.8x improvement on M2 Max and a 2.4x improvement on M3 Max.
>
> For small contexts (less than 1k) I measure the same or slightly worse performance; to avoid this, the kernel selector line if (ne01 > 128) { could be replaced with, e.g., if (ne01 > 8192) {

Hi @izard,
I'm new to Metal debugging; could you please give me a gentle hint on how to get started? Should I compile llama.cpp in Xcode using the Package.swift file, or is the Xcode IDE not needed at all?
Thank you very much for your kind help!

Best
