
Only call rocblas_initialize for versions < 4 to eliminate unnecessary VRAM allocation on some AMD cards #11080

Merged: 9 commits into ggerganov:master, Jan 28, 2025

Conversation

@sARY77 (Contributor) commented Jan 5, 2025

I have two identical AMD GPUs and noticed the discrepancy in the free memory reported.
I traced it down to the rocblas_initialize call after which the VRAM usage on one of the GPUs jumps up by 498 MiB.

PR that introduced the workaround:
ROCm Port

According to the discussion of the ROCm issue that made the workaround necessary, it was resolved long ago:
[Bug]: Incorrect results when using GPUs with different architectures

I tested my change on a model that offloads about 20 GiB on each of the GPUs and did not notice any differences.
I also ran the CI and no failures were reported.

Before:
llama_load_model_from_file: using device ROCm0 (Radeon RX 7900 XTX) - 24026 MiB free
llama_load_model_from_file: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free
After:
llama_load_model_from_file: using device ROCm0 (Radeon RX 7900 XTX) - 24524 MiB free
llama_load_model_from_file: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free

@github-actions bot added the Nvidia GPU and ggml labels Jan 5, 2025
@JohannesGaessler (Collaborator)

Is there a downside to having this call though? My understanding is that rocBLAS would be initialized anyways once it's being used so you wouldn't be saving any memory.

@sARY77 requested a review from ngxson as a code owner January 6, 2025 04:12
@github-actions bot added the build, nix, and devops labels Jan 6, 2025
@sARY77 (Contributor, Author) commented Jan 6, 2025

Is there a downside to having this call though? My understanding is that rocBLAS would be initialized anyways once it's being used so you wouldn't be saving any memory.

I was able to remove all references to rocBLAS from the code and makefiles, and llama-cli can still use both my GPUs. Does this mean it's no longer needed?

@JohannesGaessler (Collaborator)

To my knowledge rocBLAS is still used internally by HIP even if it is not referenced directly. More generally, while the bug seems to have been fixed for v6.0 I have not seen confirmation that it was fixed for v5.7. And I don't think this PR would provide any benefits other than cosmetic ones. So my stance is that the workaround should be kept.

@mjtalkiewicz

This fixed a segfault for me when using both a 7900xtx and a 7600xt.

@IMbackK (Collaborator) commented Jan 24, 2025

The difference between calling rocblas_initialize and not calling it is that with rocblas_initialize, rocBLAS loads all Tensile code objects for all operators and GPUs in the system, while without it only the object required for the current operation is loaded when that operation is first used. This means not calling rocblas_initialize avoids a small runtime cost, as there are potentially operators and/or GPUs in the system that llama.cpp will never use.

As to why this should have an impact on VRAM usage: Tensile may allocate some temporary buffers, but they should not last, so I have no idea.

@mjtalkiewicz's segfault is concerning. His case is special, as he has a GPU for which Tensile has no logic files to generate ASM kernels. In that case a HIP fallback library is loaded instead, but this is a recent addition and is AFAIK not tested by AMD as part of the CI. With rocblas_initialize these fallbacks are loaded, while without rocblas_initialize, if he only uses the XTX, this will not happen.

@mjtalkiewicz please describe in detail what versions of everything you are running and what GPUs you are using in the test that fails. I would also try git rocBLAS and git Tensile, as the fallback support is pretty new.

@JohannesGaessler (Collaborator)

As to why this should have an impact on VRAM usage: Tensile may allocate some temporary buffers, but they should not last, so I have no idea.

I think it's just an issue of when these prints occur. On master they occur after initialization; with this PR they would occur before initialization.

@JohannesGaessler (Collaborator)

@mjtalkiewicz How did you mean "This fixed a segfault for me when using both a 7900xtx and a 7600xt" to be interpreted? Do you mean that this PR fixed a segfault, or that the workaround that this PR would remove fixed a segfault?

@mjtalkiewicz

@JohannesGaessler I meant this PR had fixed the segfault.

@IMbackK Updating rocblas fixed the issue, and llama.cpp is now working without this PR.

For reference, the cards were an MSI 7900 XTX and a PowerColor 7600 XT, and the misbehaving version of rocBLAS was Fedora 41's rocblas-0:6.2.1-1.

@sARY77 (Contributor, Author) commented Jan 26, 2025

@JohannesGaessler @IMbackK I added logging to confirm that, with the exact same command line arguments, 688 MiB more VRAM is free at the end of generation if the rocblas_initialize() call is removed. Can you please look into this again?

Wasting a significant amount of VRAM just to maintain an old workaround does not make sense to me. Even if you believe this workaround is still needed for older cards, it should be behind a command line switch that is off by default.

With rocblas_initialize():
llama_model_load_from_file_impl: using device ROCm0 (Radeon RX 7900 XTX) - 24026 MiB free
llama_model_load_from_file_impl: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free
...
main: using device ROCm0 (Radeon RX 7900 XTX) - 3904 MiB free
main: using device ROCm1 (Radeon RX 7900 XTX) - 3992 MiB free

Without rocblas_initialize():
llama_model_load_from_file_impl: using device ROCm0 (Radeon RX 7900 XTX) - 24524 MiB free
llama_model_load_from_file_impl: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free
...
main: using device ROCm0 (Radeon RX 7900 XTX) - 4248 MiB free
main: using device ROCm1 (Radeon RX 7900 XTX) - 4336 MiB free

Command line to reproduce the results:
./build/bin/llama-cli -m ./models/Llama-3.3-70B-Instruct-Q4_K_M-128K.gguf -dev ROCm0,ROCm1 -ngl 99 -c 32 -s 0 --sampling-seq k --top-k 1 -p "2 * 2 = " -n 1 -no-cnv

@IMbackK (Collaborator) commented Jan 26, 2025

That's a bit weird and maybe a bug; I can't seem to reproduce it on MI100. Regardless of the VRAM issue, not calling rocblas_initialize would be a good thing to avoid the startup time cost of loading unused Tensile objects.

However, since there are still users of ROCm < 6 (Debian, for example, still packages 5.7), we should not simply remove the workaround, but instead call rocblas_get_version_string, match against any version with the bug (< 4.0.0), and only call rocblas_initialize there.
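
For illustration, a minimal sketch of that gate could look like the following. This is not the code from this PR (the helper name and buffer size are invented for the example, and the header path assumes a recent ROCm layout); the review excerpts further down show the PR's actual iterations.

    #include <rocblas/rocblas.h>  // <rocblas.h> on older ROCm releases

    // Hypothetical helper: keep the eager-initialization workaround only on
    // rocBLAS versions that may still carry the mixed-architecture bug.
    static void hip_maybe_rocblas_initialize(void) {
        char version[256] = {0};
        // The version string looks like "4.1.0.<commit>".
        const rocblas_status status = rocblas_get_version_string(version, sizeof(version));
        // Unknown version, or major version < 4: apply the workaround.
        if (status != rocblas_status_success || version[0] < '4') {
            rocblas_initialize();
        }
    }

Note that comparing only the first character would misfire on a hypothetical 10.x release, which is exactly the refinement the review below asks for.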

@sARY77 (Contributor, Author) commented Jan 26, 2025

@IMbackK I updated the PR with your recommendation. Please take a look. Thank you!

@sARY77 changed the title from "Remove obsolete HIP workaround" to "Only use rocBLAS workaround for versions < 4 to eliminate unnecessary VRAM allocation on some AMD cards" Jan 26, 2025
@sARY77 changed the title to "Only call rocblas_initialize for versions < 4 to eliminate unnecessary VRAM allocation on some AMD cards" Jan 26, 2025
@ngxson (Collaborator) commented Jan 26, 2025

Not sure why I was marked as reviewer here; I don't know about the AMD part.

@ngxson removed their request for review January 26, 2025 23:36
@IMbackK (Collaborator) left a review

I also think we should at least issue a debug print when the workaround is in play.

Review comment on:

    rocblas_initialize();
    CUDA_CHECK(cudaDeviceSynchronize());
    {
        char version_string[64];

While this is fine, it would be slightly better here to use rocblas_get_version_string_size to let rocBLAS tell you how big the buffer needs to be.

Review comment on:

    char version_string[64];
    version_string[0] = '\0';
    const rocblas_status status = rocblas_get_version_string(version_string, sizeof(version_string));
    if (status != rocblas_status_success || version_string[0] < '4') {

@IMbackK (Collaborator) commented Jan 27, 2025

I don't like this too much, as it will of course fail if rocBLAS ever changes its version to 10.0.0 or whatever. I think we should make a bit more effort to parse this properly. Looking at rocBLAS (https://github.com/ROCm/rocBLAS/blob/59825a7367a24eed4e7e8a483820592089eaf17e/library/src/buildinfo.cpp#L29) it seems we would be on the safe side to use string_split<int> here:

    static std::vector<T> string_split(const std::string & str, char delim) {

However, common.h is currently not used outside of the clients/examples and contains code that makes no sense in the backend.
@ggerganov maybe you can weigh in on whether it's OK to use this header here or whether we should move the function somewhere else.
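
For illustration only, that suggestion might read as follows if common.h were usable from the backend. The string_split<int> signature is the one quoted above; the surrounding helper is hypothetical, not code from this PR.

    #include <string>
    #include <vector>
    #include "common.h"  // string_split<T>; assumes the header were allowed here

    // Hypothetical: parse the major version robustly instead of checking one character.
    static bool rocblas_version_needs_workaround(const std::string & version) {
        const std::vector<int> parts = string_split<int>(version, '.');
        // Treat an unparsable version conservatively as "needs the workaround".
        return parts.empty() || parts[0] < 4;
    }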

@IMbackK (Collaborator) commented Jan 27, 2025

As to why this should have an impact on VRAM usage: Tensile may allocate some temporary buffers, but they should not last, so I have no idea.

Actually, the extra VRAM use of rocblas_initialize seems to be normal and expected.
rocBLAS moves the kernel code and some workspaces into VRAM when the Tensile object initializes, and these are not freed, by design.
This VRAM cost is also measured by rocBLAS itself in its clients here: https://github.com/ROCm/rocBLAS/blob/59825a7367a24eed4e7e8a483820592089eaf17e/clients/common/client_utility.cpp#L500 and indeed it is about 500 MB for RDNA2, while being smaller for MI100.

When we don't call rocblas_initialize, rocBLAS will only load the code objects and workspaces for the operations actually used, so we expect this to save some VRAM.

So I can confirm @sARY77's observations.

@sARY77 (Contributor, Author) commented Jan 28, 2025

@IMbackK I addressed your feedback in the new iteration: it calls rocblas_get_version_string_size, parses the full major version value (std::from_chars stops at the first invalid character and does not throw any exceptions), and adds a GGML_LOG_DEBUG message. Please take a look. Thank you!
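
For reference, a self-contained sketch of the shape of that final iteration (not the merged code verbatim: the function name is invented, and GGML_LOG_DEBUG is replaced by fprintf so the example stands alone):

    #include <charconv>  // std::from_chars
    #include <cstdio>
    #include <string>
    #include <rocblas/rocblas.h>

    // Hypothetical: gate rocblas_initialize() on the parsed major version.
    static void rocblas_initialize_if_needed(void) {
        size_t len = 0;
        if (rocblas_get_version_string_size(&len) != rocblas_status_success) {
            return;
        }
        std::string version(len, '\0');
        if (rocblas_get_version_string(version.data(), len) != rocblas_status_success) {
            return;
        }
        int major = 0;
        // std::from_chars stops at the first non-digit ('.') and never throws;
        // on failure, major stays 0 and the workaround is applied conservatively.
        std::from_chars(version.data(), version.data() + version.size(), major);
        if (major < 4) {
            // The real patch logs this via GGML_LOG_DEBUG.
            fprintf(stderr, "rocBLAS %s: applying rocblas_initialize workaround\n", version.c_str());
            rocblas_initialize();
        }
    }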

@sARY77 requested a review from IMbackK January 28, 2025 05:22
@IMbackK (Collaborator) left a review

Looks good to me now.

@IMbackK merged commit cae9fb4 into ggerganov:master Jan 28, 2025
45 checks passed
@sARY77 deleted the Remove_obsolete_HIP_workaround branch January 28, 2025 17:06