Replies: 3 comments 4 replies
-
Strange... When I tested PR #2248 I checked the disassembled Example 0 and then Example 1, so I concluded that the speed-up is due to reduced … I just disassembled the code with the suggested change:
Details
Disassembled: Details
This code loads correctly, but the speed is clearly slower...
-
So, I don't know what the Metal compiler on your computer does, but on my M2 Max, the kernel below gives me a run time of 18.7 ms/token for 7B.
-
Well, yes, it looks like the Metal compiler does need a hand here and there.
This is not my experience. In fact, letting a thread in a simd group compute half a block at a time for …
-
Apple doesn't provide any kind of disassembler as part of its developer tools that would let us take a look at the low-level stuff. However, thanks to the efforts of the open source community, we do have a functional disassembler.
Usage
Clone the repository https://github.com/dougallj/applegpu, and run the command:
python compiler_explorer.py test.metal
You can have any kind of macros or templates in your `.metal` file, but you can only have one `kernel` function. Detailed explanations for each instruction can be found at https://dougallj.github.io/applegpu/docs.html. I feel that the instructions for the Apple GPU are a bit RISC-like: there are instructions to load and store values between memory and registers, and other instructions to operate on registers, but no "load-operate" instructions.
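As a quick sanity check of the workflow, here is a hypothetical minimal `test.metal` (my own example, unrelated to llama.cpp) with exactly one `kernel` function that you could feed to `compiler_explorer.py`:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical minimal test.metal: exactly one kernel function, which is
// the single entry point compiler_explorer.py will compile and disassemble.
kernel void test_add(device const float * a   [[buffer(0)]],
                     device const float * b   [[buffer(1)]],
                     device       float * dst [[buffer(2)]],
                     uint gid [[thread_position_in_grid]])
{
    dst[gid] = a[gid] + b[gid];
}
```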
Example 0
Here is the `kernel_mul_mat_q4_0_f32` function from the master branch. I removed some of the logic for simplicity.
Details
And here is the disassembled code (very long):
Details
The kernel starts at the label `compute shader:`, and I usually analyze the structure by finding the `jmp` and `device_load` instructions. In this code there are two `jmp_exec_any`, at `0x9b2` and `0x9cc`, jumping to `0x298` and `0x164`; these correspond to the two loops in our code. Between `0x164` and `0x298` we can see a block of `device_load`:

The first `device_load` loads 4 `i32` into the 32-bit registers `r5,r6,r7,r8` from the address stored in `r39,r40`. Together, the 8 `device_load` load 32 `float`; those are the 32 `float` stored in `y_curr` before the inner loop starts. The `wait 1` means waiting until the `group 1` load instructions have finished, which are the first four `device_load` in this block.
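To make that concrete, here is a rough sketch (my own illustration, not the actual llama.cpp kernel) of the kind of copy that produces this pattern: each `float4` copy typically compiles to one `device_load` of 4 `i32` into consecutive 32-bit registers, so eight of them cover the 32 floats.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical sketch: stage 32 floats into a thread-local buffer before an
// inner loop. Each float4 copy usually becomes one device_load of 4 x i32,
// so eight loads fill the whole 32-float buffer.
kernel void stage_y(device const float * y   [[buffer(0)]],
                    device       float * out [[buffer(1)]],
                    uint gid [[thread_position_in_grid]])
{
    float4 y_curr[8];                                       // 8 x float4 = 32 floats
    device const float4 * src = (device const float4 *)(y + 32 * gid);
    for (int i = 0; i < 8; ++i) {
        y_curr[i] = src[i];                                 // one device_load of 4 x i32
    }

    // Dummy reduction so the loads are not optimized away.
    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) {
        sum += y_curr[i].x + y_curr[i].y + y_curr[i].z + y_curr[i].w;
    }
    out[gid] = sum;
}
```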
Between `0x298` and `0x9b2` is our inner loop; we still look for `device_load`. Notice that `r48h` means the high 16 bits of register `r48` and `r48l` means the low 16 bits.

In our code the inner loop runs 4 times, loading one block each time. In the assembly the inner loop actually runs 2 times, loading 2 blocks each time. When it loads a block, it first loads one 16-bit value, then another 16-bit value, then four 16-bit values, then two 16-bit values, and finally one last 16-bit value. Before and after these `device_load` we also see a lot of `mov` and `bfeil`, meaning the GPU copies a 16-bit value to another register and masks its high or low 8 bits.
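Here is a rough sketch of why that happens (my own illustration, assuming the usual q4_0 layout of a `half` scale followed by 16 bytes of packed 4-bit quants; it is not the actual kernel code): reading the quants one byte at a time forces the compiler to emit narrow loads plus the `mov`/`bfeil` style masking seen above.

```metal
#include <metal_stdlib>
using namespace metal;

// Assumed q4_0 layout: a half scale plus 16 bytes holding 32 packed 4-bit quants.
typedef struct {
    half    d;        // scale
    uint8_t qs[16];   // 32 x 4-bit quants, two per byte
} block_q4_0;

// Byte-wise access: each qs[i] read is an 8-bit load, and extracting the two
// nibbles needs extra register copies and bitfield masking (mov / bfeil).
kernel void dot_bytewise(device const block_q4_0 * x   [[buffer(0)]],
                         device const float      * y   [[buffer(1)]],
                         device       float      * out [[buffer(2)]],
                         uint gid [[thread_position_in_grid]])
{
    device const block_q4_0 * b  = x + gid;
    device const float      * yb = y + 32 * gid;

    float sum = 0.0f;
    for (int i = 0; i < 16; ++i) {
        int q = b->qs[i];                            // 8-bit load
        sum += float((q & 0x0F) - 8) * yb[i];        // low nibble
        sum += float((q >>   4) - 8) * yb[i + 16];   // high nibble
    }
    out[gid] = sum * float(b->d);
}
```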
Example 1
Here is the `kernel_mul_mat_q4_0_f32` function from PR #2248. I removed some of the logic for simplicity.
Details
And here is the disassembled code (very long):
Details
The structure is similar to Example 0, so we only analyze the inner loop:
Now when it loads a block, it first loads one 16-bit value, then four 16-bit values, then four more 16-bit values. Before and after these `device_load` we no longer see `mov` and `bfeil`, because now we operate directly on 16-bit values.
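For comparison with the byte-wise sketch above, here is the same rough idea written against 16-bit words (again my own illustration, not the PR's actual code): the quants are read as `ushort` values and the nibbles are shifted out directly, so the loads happen at 16-bit granularity and no byte-level masking is needed.

```metal
#include <metal_stdlib>
using namespace metal;

// Same assumed q4_0 payload as before, but viewed as 8 x 16-bit words.
typedef struct {
    half     d;       // scale
    uint16_t qs[8];   // the same 16 bytes of quants, as 16-bit words
} block_q4_0_u16;

// Word-wise access: each qs[i] read is a single 16-bit load, and the nibbles
// are extracted from the ushort directly, with no mov/bfeil byte masking.
kernel void dot_wordwise(device const block_q4_0_u16 * x   [[buffer(0)]],
                         device const float          * y   [[buffer(1)]],
                         device       float          * out [[buffer(2)]],
                         uint gid [[thread_position_in_grid]])
{
    device const block_q4_0_u16 * b  = x + gid;
    device const float          * yb = y + 32 * gid;

    float sum = 0.0f;
    for (int i = 0; i < 8; ++i) {
        int w = b->qs[i];                                         // 16-bit load
        sum += float(( w        & 0x0F) - 8) * yb[2 * i];         // byte 2i,   low nibble
        sum += float(((w >>  8) & 0x0F) - 8) * yb[2 * i + 1];     // byte 2i+1, low nibble
        sum += float(((w >>  4) & 0x0F) - 8) * yb[2 * i + 16];    // byte 2i,   high nibble
        sum += float(( w >> 12        ) - 8) * yb[2 * i + 17];    // byte 2i+1, high nibble
    }
    out[gid] = sum * float(b->d);
}
```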