CUDA: Quantized matrix matrix multiplication #2160
Conversation
I think this looks good. I imagine that you are already planning on doing this, but as long as cuBLAS may be faster, it would be good to have the option to use it. But I think that at this performance level we could already use this by default.
My current plan is to implement matrix vector kernels based on integer intrinsics for k-quants and then try to come up with a good way to create a template for matrix matrix multiplication. Ideally, adding tensor cores won't be too difficult after that, and this PR can be merged as a universal upgrade (but still with the option to use cuBLAS).
Those results are stunning!
Force-pushed from 0f2a62c to 60df883
After looking into it I think tensor cores should get their own kernel since the use of shared memory will need to be different. Instead I've done some performance optimizations:
I think this performance would be good enough to merge and use as the default; it's possible that the performance for other quantization formats will be worse though.
Force-pushed from 60df883 to 31f229c
I pushed a template for matrix matrix multiplication. The template accepts three functions:
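As a rough illustration of the approach (the names and signatures below are assumptions made for the sketch, not the PR's actual code), a matrix multiplication kernel can be templated on three per-quantization-format functions so that only those three need to be written for each format:

```cpp
// Sketch only: illustrative names and signatures, not the PR's actual template.

// set up the shared-memory tiles for the quantized weights and their scales
typedef void (*allocate_tiles_t)(int ** x_qs, float ** x_d);
// copy one tile of quantized weights from global memory into the shared-memory tiles
typedef void (*load_tiles_t)(const void * vx, int * x_qs, float * x_d, int row, int k);
// compute one dot product on the tile data using integer intrinsics
typedef float (*vec_dot_t)(const int * x_qs, const float * x_d,
                           const int * y_qs, const float * y_d, int i, int j, int k);

template <allocate_tiles_t allocate_tiles, load_tiles_t load_tiles, vec_dot_t vec_dot>
static __global__ void mul_mat_q_sketch(
    const void * vx, const void * vy, float * dst,
    int ncols_x, int nrows_x, int ncols_y);

// A new quantization format would then only supply its three functions, e.g. (hypothetical):
// mul_mat_q_sketch<allocate_tiles_q4_0, load_tiles_q4_0, vec_dot_q4_0_q8_1>
//     <<<grid, block>>>(x, y, dst, ncols_x, nrows_x, ncols_y);
```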
While I was testing your new changes (now including your last commit) on an RTX 3060 12GB, I found an error with tulu-13B-q4_0 (43 layers offloaded) during the prompt processing step that doesn't occur on master. When I process fewer than 512 tokens (510t), I get: I tested with pure llama-13b-q4_0 and I don't get this error, so maybe it is just this specific llama finetune. I will download others and test later. Related to the prompt processing speed results, I cannot continue to use tulu, but this is what I got so far for a prompt 533 tokens long:
For llama-13B q4_0, for the same prompt and same 43 layers offloaded to the RTX 3060 12GB: main PP (533t): PR PP (533t):
With base llama there are out-of-bounds memory accesses as well; I just haven't gotten around to fixing them yet because they randomly don't matter for my testing.
This looks so great, I can't wait to get rid of cuBLAS entirely.
The out-of-bounds memory accesses should be fixed now.
Force-pushed from d8e2697 to a3b096b
I am still getting a similar error. I will paste the output in case it is helpful. No layers are being offloaded this time, just to check whether the problem was the number of layers I was offloading to the GPU.
I understand if you are not interested in this particular error and want to concentrate on the matrix x matrix multiplication code first. Also, this is a model with an n_vocab of 32001, which adds extra complexity to the already difficult task you are undertaking.
Force-pushed from 84f787e to cf0a505
I implemented all of the older quantization formats and rebased onto master. These are the results:
Using cuBLAS the performance is very consistent. The performance of the new kernels varies a lot depending on the quantization format. I think the fundamental reason is that for matrix matrix multiplication you are compute bound and I/O is much less important. So it can be faster to dequantize the entire weight matrix once and then work with f16/f32 values which you can multiply directly (and which is comparatively fast on GPUs) than it is to do multiple integer/logical operations on the quantized data to get the result.

Notably, the performance for q5_0 and q5_1 is bad, presumably because the 5th bits are ordered in an inconvenient way that requires 4 bit shifts, bit-wise ANDs, and bit-wise ORs. This could be reduced to a single bit shift, bit-wise AND, and bit-wise OR (the reordering could be done when the weights are loaded into VRAM if the different bit order is bad for CPU performance). The performance of q4_1 on the P40 is also relatively bad, the problem being that for good performance some f16 calculations are necessary which have bad performance on the P40 (an f32 workaround is used instead). Overall I expect the performance for q2_k, q3_k, and q5_k to not be good due to the large number of operations per data value that will be necessary.

Caveats: the performance can probably still be optimized a lot. In particular I think there is still potential to optimize memory bank conflicts and tile sizes. For Ampere or newer it's also possible to utilize asynchronous data loading.

@ggerganov @slaren What is your judgement regarding prompt processing speed vs. VRAM usage? I personally would prefer lower VRAM usage as the default because I find prompt processing to be fast enough either way. Also, my findings may apply to a lesser extent to CPU matrix matrix multiplication as well. CPUs generally have comparatively fast integer arithmetic though, and I think the memory bandwidth is a much more severe bottleneck for CPUs than for GPUs.
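To make the q5 point concrete, here is a minimal sketch of inserting the 5th bits of four values into a packed 32-bit integer; the bit positions and names are assumptions chosen for illustration, not the PR's code, but they show the four-operation pattern versus the single-operation pattern a reordered layout would allow:

```cpp
// Current-style layout: the four high bits sit at consecutive positions of qh,
// so each one needs its own shift and AND before being ORed into its byte lane.
static __device__ int insert_high_bits_current(int v, int qh) {
    v |= (qh <<  4) & 0x00000010;  // high bit of value 0 -> bit  4
    v |= (qh << 11) & 0x00001000;  // high bit of value 1 -> bit 12
    v |= (qh << 18) & 0x00100000;  // high bit of value 2 -> bit 20
    v |= (qh << 25) & 0x10000000;  // high bit of value 3 -> bit 28
    return v;
}

// Hypothetical reordered layout: the same four bits stored 8 positions apart,
// so a single shift, a single AND, and a single OR suffice.
static __device__ int insert_high_bits_reordered(int v, int qh_reordered) {
    return v | ((qh_reordered << 4) & 0x10101010);
}
```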
I think that for llama.cpp it is reasonable to prefer VRAM usage over prompt processing speed by default. In some cases, such as summarization or code completion, prompt processing may be more important. When the backends interface is completed, this could be an option selected at runtime when initializing the ggml-cuda backend (if the binary was built with cuBLAS support enabled).
Looks like great progress so far.
It's hard to say since prompt processing has its applications and is also important during perplexity computations, so there is probably no right answer. But I still think that we can achieve at least parity with cuBLAS performance when using quantized mmm for all types. So I'm inclined to say that we can accept the pp speed regression in some cases now and hope we will solve it eventually. If we don't succeed, then we'll probably provide options as @slaren suggested.
I ran a test on falcon 7B and 40B (7B with MATRIX_ROW_PADDING set to 64, 40B with the default). Here is the shape of a 7B multiplication:
Switching manually to cublas instead of the new mat_q works.
ggllm.cpp is not setting cmake CUDA architectures correctly, see ggml-org/ggml#389. The result is that on Pascal or newer the compiled PTX code does not match the runtime check.
That's interesting. I fixed that for the next update (though it did not change the results on my system). In total I have two problems with implementing the new CUDA code; I didn't want to oversaturate the report here with two issues yesterday, but they are linked: I noted that the QKV pure vector multiplication with the new direct dequant kernels failed as well, though that did not always happen and it only affected that specific matmul operation, not the others. This changed:
To get closer to the second issue:
So changing DMMV_X from 64 to 32 made a big difference for both the dequantization vec mulmat kernels and the matmul kernel. But it randomly works or fails depending on which shape you are feeding it.
Sorry but I don't see how this is related to the new kernels I'm implementing. They are only used for quantized data and the KV cache is f16.
I've spent hours on it; I wouldn't report it as an issue if it was not relevant. The same goes for the other cases where I have similar failures: completely normal multiplications that work perfectly fine using cuBLAS and fail using the new kernel. That's quite easy to test, all you have to do is add "false" into the branch that differentiates between mat_q and cublas in

Setting DMMV_X = 64 causes ALL ggml_cuda_op_mul_mat_q calls to fail reliably (output tensor zero); at 32 it fails for some of them.

I am aware that there is no priority or preference to get this working on Falcon, it's llama.cpp after all. But in my opinion a full matmul replacement should work as reliably as the cuBLAS one, i.e. on all legal tensor input shapes, and any unsupported shape should ASSERT.

I'll finish my work on upgrading and cleaning the current ggllm backend; I probably have to make a mix of the current and the previous version. Once that's done I can send you a branch to check out and verify the problem if you are interested.
Sorry, but I don't intend to provide extended support for
That's fine, I just wanted to point the problem out. Thanks for the update; most likely it is as you said and the vector and mulmat kernels just don't support those shapes.
Force-pushed from cf0a505 to 5fa1064
I'll try to finally get this PR in a state that can be merged this weekend. k-quant support is currently still missing. cuBLAS will still be a mandatory dependency because the KV cache needs it. Despite that I plan to make the switch between cuBLAS and the new kernels a compile-time option since my understanding is that the long-term goal is to drop cuBLAS as a mandatory dependency.
...wait, it's slower now with an RTX 3090??
Yes, but the VRAM usage is reduced by 700/970/1430 MiB for 7b/13b/33b. Compile with
Losing even 50% performance with the awesome qk_m models to gain 1 GB of VRAM?
Prompt processing usually takes up much less time than the actual generation so I think this tradeoff is worthwhile. On an RTX 3090 you can now run 33b with more context or with better quantization. On 16 GB RAM + 8 GB VRAM it should now be possible to run 33b q4 at 2048 context.
Would it be ok to ship the perplexity tool in the release builds using cuBLAS, while main/server use the non-cuBLAS kernels?
Yeah, that would make sense.
That may be the case with high-end hardware like P40s and 3090s, but not with more common hardware, especially not at a ctx of 2048 and over. For me, prompt processing even with cuBLAS takes a considerable amount of time. With a 2060 and a 13b model it's around 30 seconds (1800 tokens), plus 60s of generation, resulting in a total of around 90s (180 tokens generated), so prompt processing takes half as long as the generation does. In my opinion prompt processing speed is far more important than generation, because it makes the time for the AI to answer feel longer, while with generation you can see the tokens appear in real time using token streaming, so slower generation is not a big deal in my eyes.
until you outperform it, that is 😄
I appreciate the work done here, but in my opinion this should only be the default if generation speed and, more importantly, prompt processing speed are on par with cuBLAS for all quantization formats. Otherwise most people will be left wondering why the AI suddenly takes longer to answer.
Alright, I have tested it now @JohannesGaessler, this is the performance on midrange systems without 24 GB VRAM. RTX 2060, 32 GB RAM, Core i7 9750H.

Old implementation, 13b q5_1, 13 GPU layers, VRAM usage 5.1 GB:

New implementation, 17 layers, VRAM usage 5.2 GB:

As I expected, the slower prompt processing time outweighs the faster generation time. (Note I was being generous, by allowing 100 MB more VRAM usage and using an older build of llama.cpp with the old implementation.) So overall time is indeed slower with the new implementation. Still, if you chat with the model, which results in around 800 tokens processed on average instead of 1800, I guess it would end up being faster. But then again, you'd have to wait longer for the generation to start, which in my opinion is less than desirable.

For me personally, prompt processing speed as it is with the old implementation is good for a 13b model on systems with 6 and 8 GB VRAM. But I wouldn't want to run 33b on it because the prompt processing time would likely be twice as slow, so it wouldn't result in a good user experience, as you'd have to wait a long time for the generation to start. The new implementation would make it even less feasible for me to run 33b models at adequate speed.
Sorry for so many posts in a row, but I have some more data to share. I've found a good use case for this implementation on systems with 8 and 6 GB VRAM: q4_0 7b now runs entirely in VRAM. So if you want to run lower quality 7b models on this kind of hardware (or higher quality 7b k_m models on GPUs with 8 GB VRAM), this PR is indeed an excellent option to have, provided it is a separate flag that is not the default used by popular inference programs like text generation webui and koboldcpp. Likewise, GPUs with 12 GB VRAM and perhaps even 10 GB GPUs should be able to run 13b models entirely in VRAM comfortably with this PR. However, as many if not most people use llama.cpp to run models too big for their VRAM (and I'd argue partial offloading is llama.cpp's killer feature), my original point still stands. You won't get full GPU offloading on an 8 GB VRAM GPU with 13b, let alone 33b models, even at a ctx of 2048.
Are there any perplexity value comparisons with the non-QxQ version? Unless I grossly misunderstand how this works, I'd imagine the output of q4 x q4 directly without dequantizing is likely to be severely degraded in precision. Or is the matmul done in some intermediate format?
There is nothing to apologize for. I chose the default based on the overall goals of the project (to not use external BLAS libraries at all) and the way I use llama.cpp and what I care about when I do. I don't expect people to universally agree with my priorities.
The hidden state is quantized to 8 bits, so the matrix matrix multiplication is for example done as q4 x q8. This is the same way it's done on the CPU (or when using the mul_mat_vec_q kernels), and neither with that nor with this implementation have I observed worse perplexity.
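For illustration, the core of such a q4 x q8 dot product with integer intrinsics might look like the sketch below; the names, signatures, and data layout are assumptions, and the real kernels interleave the data differently and also handle the q4_0 offset of 8, which is omitted here:

```cpp
#include <cuda_runtime.h>

// requires a GPU with the dp4a instruction (compute capability 6.1 or newer)
static __device__ float vec_dot_q4_q8_sketch(const int * q4, const int * q8,
                                             const float d4, const float d8) {
    const int vi  = q4[0];                   // 8 packed 4-bit weights in one int
    const int vi0 = (vi >> 0) & 0x0F0F0F0F;  // lower nibbles -> one weight per byte lane
    const int vi1 = (vi >> 4) & 0x0F0F0F0F;  // upper nibbles -> one weight per byte lane

    int sumi = 0;
    sumi = __dp4a(vi0, q8[0], sumi);         // 4 int8 products + accumulate per instruction
    sumi = __dp4a(vi1, q8[1], sumi);         // q8[0]/q8[1]: the matching 8-bit activations

    return d4 * d8 * sumi;                   // rescale with the weight and activation block scales
}
```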
The latest update does not run for me on Windows 11 with nvcc 12.2 and a 4090, nor with the build released on GitHub. I get the following error: As a workaround, I inserted the following code at
The following did not work:
I can confirm this behavior (and the fix), also running on Windows 11. What is interesting is that Llama-2-13b works fine but openchat_v3.2 does not and asserts; I see they have vocab sizes of 32000 and 32002 respectively, not sure whether this is relevant.
I have this issue too on the latest commit; it happens when I try to offload the KV buffers. However, @dranger003's fix does not seem to work for me: adding the fix at that offset, I get
Extra info: Model used is Printing debug information for the associated variables:
Edit: apologies for the spam. I have narrowed it down to commit 11f3ca0 causing this issue. Initial state after line 4720: it happens regardless of whether mul_mat_q is set or not. I believe it may be related to
@LostRuins Thanks for the updates. I ran into this issue once more with another model
After #2043 and #2067 I've tried implementing a matrix matrix multiplication kernel using integer intrinsics (currently only q4_0). The results are mixed:
On my RTX 3090 the new kernel is slower but on my P40 it's faster. Since matrix matrix multiplications are compute bound I suspect that the reason is that (unlike the P40) the RTX 3090 is capable of executing floating point and integer (i.e. pointer) arithmetic in parallel. So using integer intrinsics instead of floating point operations leaves the floating point hardware underutilized. However, the same GPUs that can execute floating point and integer arithmetic in parallel also have tensor cores which should be much faster anyways so I think that this is not a problem long-term.
Due to the use of shared memory the implementation has gotten rather ugly; using the structs for quantized data in shared memory had terrible performance (most likely due to memory bank conflicts) so the current implementation dissects the data into quants and scales. This also has the unfortunate side effect of tying the allocation of shared memory closely to the quantization type. I very much do not want to implement an entire matrix matrix multiplication kernel for each quantization type but creating a good template could also be tricky; I'll need to think about it some more.
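For reference, here is a minimal sketch of the two shared-memory layouts being contrasted; the block layout follows the q4_0 format, while the tile sizes and names are arbitrary assumptions:

```cpp
#include <cstdint>
#include <cuda_fp16.h>

#define QK4_0 32
typedef struct {
    half    d;              // block scale
    uint8_t qs[QK4_0 / 2];  // 32 4-bit quants, two per byte
} block_q4_0;               // 18 bytes per block

#define TILE_ROWS   64
#define TILE_BLOCKS  4      // q4_0 blocks per tile row

static __global__ void tile_layouts_sketch() {
    // Variant 1: keep the quantization structs intact in shared memory. The 18-byte
    // stride maps neighboring elements onto the 4-byte banks irregularly; this is the
    // layout described above as having terrible performance (most likely bank conflicts).
    __shared__ block_q4_0 tile_structs[TILE_ROWS][TILE_BLOCKS];

    // Variant 2: dissect the blocks into separate, naturally aligned arrays for the
    // quants and the scales, as the current implementation does.
    __shared__ int  tile_qs[TILE_ROWS][TILE_BLOCKS * (QK4_0 / 8)];  // 8 packed quants per int
    __shared__ half tile_d [TILE_ROWS][TILE_BLOCKS];

    (void) tile_structs; (void) tile_qs; (void) tile_d;  // layout illustration only
}
```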