
Only call rocblas_initialize for versions < 4 to eliminate unnecessary VRAM allocation on some AMD cards #11080

Merged: 9 commits into ggerganov:master, Jan 28, 2025

Conversation

@sARY77 (Contributor) commented Jan 5, 2025

I have two identical AMD GPUs and noticed the discrepancy in the free memory reported.
I traced it down to the rocblas_initialize call after which the VRAM usage on one of the GPUs jumps up by 498 MiB.

PR that introduced the workaround:
ROCm Port

According to the discussion of the ROCm issue that made the workaround necessary, it was resolved long ago:
[Bug]: Incorrect results when using GPUs with different architectures

I tested my change on a model that offloads about 20 GiB on each of the GPUs and did not notice any differences.
I also ran the CI and no failures were reported.

Before:
llama_load_model_from_file: using device ROCm0 (Radeon RX 7900 XTX) - 24026 MiB free
llama_load_model_from_file: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free
After:
llama_load_model_from_file: using device ROCm0 (Radeon RX 7900 XTX) - 24524 MiB free
llama_load_model_from_file: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free

@github-actions bot added the Nvidia GPU and ggml labels Jan 5, 2025
@JohannesGaessler (Collaborator)

Is there a downside to having this call though? My understanding is that rocBLAS would be initialized anyways once it's being used so you wouldn't be saving any memory.

@sARY77 requested a review from ngxson as a code owner January 6, 2025 04:12
@github-actions bot added the build, nix, and devops labels Jan 6, 2025
@sARY77 (Contributor, Author) commented Jan 6, 2025

Is there a downside to having this call though? My understanding is that rocBLAS would be initialized anyways once it's being used so you wouldn't be saving any memory.

I was able to remove all references to rocBLAS from the code and makefiles, and llama-cli can still use both my GPUs. Does this mean it's no longer needed?

@JohannesGaessler (Collaborator)

To my knowledge rocBLAS is still used internally by HIP even if it is not referenced directly. More generally, while the bug seems to have been fixed for v6.0 I have not seen confirmation that it was fixed for v5.7. And I don't think this PR would provide any benefits other than cosmetic ones. So my stance is that the workaround should be kept.

@mjtalkiewicz

This fixed a segfault for me when using both a 7900xtx and a 7600xt.

@IMbackK (Collaborator) commented Jan 24, 2025

The difference between calling rocblas_initialize and not calling it is that with rocblas_initialize, rocBLAS loads all Tensile code objects for all operators and GPUs in the system, while without it only the object required for the current operation is loaded when that operation is first used. This means not calling rocblas_initialize avoids a small runtime cost, as there are potentially operators and/or GPUs in the system that llama.cpp will never use.

As to why this should have an impact on VRAM usage: Tensile may allocate some temporary buffers, but they should not last, so I have no idea.

@mjtalkiewicz's segfault is concerning. His case is special, as he has a GPU for which Tensile has no logic files to generate ASM kernels. In that case a HIP fallback library is loaded instead, but this is a recent addition and is AFAIK not tested by AMD as part of the CI. With rocblas_initialize these fallbacks are loaded, while without rocblas_initialize, if he only uses the XTX, this will not happen.

@mjtalkiewicz please describe in detail what versions of everything you are running and what GPUs you are using in the test that fails. I would also try git rocBLAS and git Tensile, as the fallback support is pretty new.

@JohannesGaessler (Collaborator)

As to why this should have an impact on VRAM usage: Tensile may allocate some temporary buffers, but they should not last, so I have no idea.

I think it's just an issue of when these prints occur. On master they occur after initialization; with this PR they would occur before initialization.

@JohannesGaessler (Collaborator)

@mjtalkiewicz How did you mean "This fixed a segfault for me when using both a 7900xtx and a 7600xt" to be interpreted? Do you mean that this PR fixed a segfault, or that the workaround that this PR would remove fixed a segfault?

@mjtalkiewicz

@JohannesGaessler I meant this PR had fixed the segfault.

@IMbackK Updating rocblas fixed the issue, and llama.cpp is now working without this PR.

For reference, the cards were an MSI 7900 XTX and a PowerColor 7600 XT, and the misbehaving version of rocBLAS was Fedora 41's rocblas-0:6.2.1-1.

@sARY77 (Contributor, Author) commented Jan 26, 2025

@JohannesGaessler @IMbackK I added logging to confirm that, with the exact same command line arguments, 688 MiB more VRAM is free at the end of generation if the rocblas_initialize() call is removed. Can you please look into this again?

Wasting a significant amount of VRAM just to maintain an old workaround does not make sense to me. Even if you believe this workaround is still needed for older cards, it should be behind a command line switch that is off by default.

With rocblas_initialize():
llama_model_load_from_file_impl: using device ROCm0 (Radeon RX 7900 XTX) - 24026 MiB free
llama_model_load_from_file_impl: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free
...
main: using device ROCm0 (Radeon RX 7900 XTX) - 3904 MiB free
main: using device ROCm1 (Radeon RX 7900 XTX) - 3992 MiB free

Without rocblas_initialize():
llama_model_load_from_file_impl: using device ROCm0 (Radeon RX 7900 XTX) - 24524 MiB free
llama_model_load_from_file_impl: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free
...
main: using device ROCm0 (Radeon RX 7900 XTX) - 4248 MiB free
main: using device ROCm1 (Radeon RX 7900 XTX) - 4336 MiB free

Command line to reproduce the results:
./build/bin/llama-cli -m ./models/Llama-3.3-70B-Instruct-Q4_K_M-128K.gguf -dev ROCm0,ROCm1 -ngl 99 -c 32 -s 0 --sampling-seq k --top-k 1 -p "2 * 2 = " -n 1 -no-cnv

@IMbackK (Collaborator) commented Jan 26, 2025

That's a bit weird and maybe a bug; I can't seem to reproduce it on MI100. Regardless of the VRAM issue, not calling rocblas_initialize would be a good thing to avoid the startup time cost of loading unused Tensile objects.

However, since there are still users of ROCm < 6 (Debian, for example, still packages 5.7), we should not simply remove the workaround, but instead call rocblas_get_version_string, match against any version with the bug (< 4.0.0), and only call rocblas_initialize there.
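
For illustration, a minimal sketch of that gate could look like the following. This is not the code from this PR (the helper name and buffer size are invented for the example, and the header path assumes a recent ROCm layout); the review excerpts further down show the PR's actual iterations.

    #include <rocblas/rocblas.h>  // <rocblas.h> on older ROCm releases

    // Hypothetical helper: keep the eager-initialization workaround only on
    // rocBLAS versions that may still carry the mixed-architecture bug.
    static void hip_maybe_rocblas_initialize(void) {
        char version[256] = {0};
        // The version string looks like "4.1.0.<commit>".
        const rocblas_status status = rocblas_get_version_string(version, sizeof(version));
        // Unknown version, or major version < 4: apply the workaround.
        if (status != rocblas_status_success || version[0] < '4') {
            rocblas_initialize();
        }
    }

Note that comparing only the first character would misfire on a hypothetical 10.x release, which is exactly the refinement the review below asks for.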

@sARY77 (Contributor, Author) commented Jan 26, 2025

@IMbackK I updated the PR with your recommendation. Please take a look. Thank you!

@sARY77 changed the title from "Remove obsolete HIP workaround" to "Only use rocBLAS workaround for versions < 4 to eliminate unnecessary VRAM allocation on some AMD cards" Jan 26, 2025
@sARY77 changed the title to "Only call rocblas_initialize for versions < 4 to eliminate unnecessary VRAM allocation on some AMD cards" Jan 26, 2025
@ngxson (Collaborator) commented Jan 26, 2025

Not sure why I was marked as reviewer here; I don't know about the AMD part.

@ngxson removed their request for review January 26, 2025 23:36
@IMbackK (Collaborator) left a review

I also think we should at least issue a debug print when the workaround is in play.

Review comment on:

    rocblas_initialize();
    CUDA_CHECK(cudaDeviceSynchronize());
    {
        char version_string[64];

While this is fine, it would be slightly better here to use rocblas_get_version_string_size to let rocBLAS tell you how big the buffer needs to be.

Review comment on:

    char version_string[64];
    version_string[0] = '\0';
    const rocblas_status status = rocblas_get_version_string(version_string, sizeof(version_string));
    if (status != rocblas_status_success || version_string[0] < '4') {

@IMbackK (Collaborator) commented Jan 27, 2025

I don't like this too much, as it will of course fail if rocBLAS ever changes its version to 10.0.0 or whatever. I think we should make a bit more effort to parse this properly. Looking at rocBLAS (https://github.com/ROCm/rocBLAS/blob/59825a7367a24eed4e7e8a483820592089eaf17e/library/src/buildinfo.cpp#L29) it seems we would be on the safe side to use string_split<int> here:

    static std::vector<T> string_split(const std::string & str, char delim) {

However, common.h is currently not used outside of the clients/examples and contains code that makes no sense in the backend.
@ggerganov maybe you can weigh in on whether it's OK to use this header here or whether we should move the function somewhere else.
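
For illustration only, that suggestion might read as follows if common.h were usable from the backend. The string_split<int> signature is the one quoted above; the surrounding helper is hypothetical, not code from this PR.

    #include <string>
    #include <vector>
    #include "common.h"  // string_split<T>; assumes the header were allowed here

    // Hypothetical: parse the major version robustly instead of checking one character.
    static bool rocblas_version_needs_workaround(const std::string & version) {
        const std::vector<int> parts = string_split<int>(version, '.');
        // Treat an unparsable version conservatively as "needs the workaround".
        return parts.empty() || parts[0] < 4;
    }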

@IMbackK (Collaborator) commented Jan 27, 2025

As to why this should have an impact on VRAM usage: Tensile may allocate some temporary buffers, but they should not last, so I have no idea.

Actually, the extra VRAM use of rocblas_initialize seems to be normal and expected.
rocBLAS moves the kernel code and some workspaces into VRAM when the Tensile object initializes, and these are not freed, by design.
This VRAM cost is also measured by rocBLAS itself in its clients here: https://github.com/ROCm/rocBLAS/blob/59825a7367a24eed4e7e8a483820592089eaf17e/clients/common/client_utility.cpp#L500 and indeed it is about 500 MB for RDNA2, while being smaller for MI100.

When we don't call rocblas_initialize, rocBLAS will only load the code objects and workspaces for the operations actually used, so we expect this to save some VRAM.

So I can confirm @sARY77's observations.

@sARY77 (Contributor, Author) commented Jan 28, 2025

@IMbackK I addressed your feedback in the new iteration: it calls rocblas_get_version_string_size, parses the full major version value (std::from_chars stops at the first invalid character and does not throw any exceptions), and adds a GGML_LOG_DEBUG message. Please take a look. Thank you!
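
For reference, a self-contained sketch of the shape of that final iteration (not the merged code verbatim: the function name is invented, and GGML_LOG_DEBUG is replaced by fprintf so the example stands alone):

    #include <charconv>  // std::from_chars
    #include <cstdio>
    #include <string>
    #include <rocblas/rocblas.h>

    // Hypothetical: gate rocblas_initialize() on the parsed major version.
    static void rocblas_initialize_if_needed(void) {
        size_t len = 0;
        if (rocblas_get_version_string_size(&len) != rocblas_status_success) {
            return;
        }
        std::string version(len, '\0');
        if (rocblas_get_version_string(version.data(), len) != rocblas_status_success) {
            return;
        }
        int major = 0;
        // std::from_chars stops at the first non-digit ('.') and never throws;
        // on failure, major stays 0 and the workaround is applied conservatively.
        std::from_chars(version.data(), version.data() + version.size(), major);
        if (major < 4) {
            // The real patch logs this via GGML_LOG_DEBUG.
            fprintf(stderr, "rocBLAS %s: applying rocblas_initialize workaround\n", version.c_str());
            rocblas_initialize();
        }
    }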

@sARY77 requested a review from IMbackK January 28, 2025 05:22
@IMbackK (Collaborator) left a review

Looks good to me now.

@IMbackK merged commit cae9fb4 into ggerganov:master Jan 28, 2025
45 checks passed
@sARY77 deleted the Remove_obsolete_HIP_workaround branch January 28, 2025 17:06