Only call rocblas_initialize for versions < 4 to eliminate unnecessary VRAM allocation on some AMD cards #11080
Conversation
Is there a downside to having this call though? My understanding is that rocBLAS would be initialized anyway once it's being used, so you wouldn't be saving any memory.
I was able to remove all references to rocBLAS from the code and makefiles, and llama-cli can still use both my GPUs. Does this mean it's no longer needed?
To my knowledge rocBLAS is still used internally by HIP even if it is not referenced directly. More generally, while the bug seems to have been fixed for v6.0, I have not seen confirmation that it was fixed for v5.7. And I don't think this PR would provide any benefits other than cosmetic ones. So my stance is that the workaround should be kept.
This fixed a segfault for me when using both a 7900xtx and a 7600xt.
The difference between calling rocblas_initialize and not calling it is that with rocblas_initialize, rocBLAS loads all Tensile code objects for all operators and GPUs in the system, while without it only the object required for the current operation is loaded when that operation is first used. Not calling rocblas_initialize therefore avoids a small runtime cost, as there are potentially operators and/or GPUs in the system that llama.cpp will never use. As to why this should have an impact on VRAM usage: Tensile may allocate some temporary buffers, but they should not last, so I have no idea.
@mjtalkiewicz's segfault is concerning. His case is special, as he has a GPU for which Tensile has no logic files to generate asm kernels. In that case a HIP fallback library is loaded, but this is a recent addition and is AFAIK not tested by AMD as part of the CI. With rocblas_initialize these fallbacks are loaded, while without it, if he only uses the XTX, this will not happen.
@mjtalkiewicz please describe in detail what versions of everything you are running and what GPUs you are using in the test that fails. I would also try git rocBLAS and git Tensile, as the fallback code is pretty new.
I think it's just an issue of when these prints occur. On master they occur after initialization, with this PR they would occur before initialization.
@mjtalkiewicz How did you mean "This fixed a segfault for me when using both a 7900xtx and a 7600xt" to be interpreted? Do you mean that this PR fixed a segfault, or that the workaround that this PR would remove fixed a segfault?
@JohannesGaessler I meant that this PR had fixed the segfault. @IMbackK Updating rocBLAS fixed the issue, and llama.cpp is now working without this PR. For reference, the cards were an MSI 7900 XTX and a PowerColor 7600 XT, and the misbehaving version of rocBLAS was Fedora 41's rocblas-0:6.2.1-1.
@JohannesGaessler @IMbackK I added logging to confirm that with the exact same command line arguments, 688 MiB more VRAM is free at the end of generation if the rocblas_initialize() call is removed. Can you please look into this again? Wasting a significant amount of VRAM just to maintain an old workaround does not make sense to me. Even if you believe this workaround is still needed for older cards, it should be behind a command line switch that's off by default.
With rocblas_initialize():
Without rocblas_initialize():
Command line to reproduce the results:
That's a bit weird and maybe a bug; I can't seem to reproduce it on an MI100. Regardless of the VRAM issue, not calling rocblas_initialize would be a good thing to avoid the startup-time cost of loading unused Tensile objects. However, since there are still users of ROCm < 6, like Debian, which still packages 5.7, we should not simply remove the workaround but instead call rocblas_get_version_string, match against any version with the bug (< 4.0.0), and only call rocblas_initialize there.
@IMbackK I updated the PR with your recommendation. Please take a look. Thank you!
Not sure why I was marked as reviewer here, I don't know about the AMD part.
I also think we should at least issue a debug print when the workaround is in play
ggml/src/ggml-cuda/ggml-cuda.cu (outdated):
```cpp
        rocblas_initialize();
        CUDA_CHECK(cudaDeviceSynchronize());
        {
            char version_string[64];
```
While this is fine it would be slightly better here to use rocblas_get_version_string_size to let rocblas tell you how big the buffer needs to be.
ggml/src/ggml-cuda/ggml-cuda.cu (outdated):
```cpp
        char version_string[64];
        version_string[0] = '\0';
        const rocblas_status status = rocblas_get_version_string(version_string, sizeof(version_string));
        if (status != rocblas_status_success || version_string[0] < '4') {
```
I don't like this too much, as it will of course fail if rocBLAS changes its version to 10.0.0 or whatever. I think we should make a bit more effort to parse this properly. Looking at rocBLAS (https://github.com/ROCm/rocBLAS/blob/59825a7367a24eed4e7e8a483820592089eaf17e/library/src/buildinfo.cpp#L29) it seems we would be on the safe side to use string_split<int> here:
Line 439 in df984e0:
```cpp
static std::vector<T> string_split(const std::string & str, char delim) {
```
however currently common.h is not used outside of the clients/examples and contains code that makes no sense in the backend.
@ggerganov maybe you can weigh in on whether it's OK to use this header here, or if we should move the function somewhere else.
Actually, the extra VRAM use of rocblas_initialize seems to be normal and expected. When we don't call rocblas_initialize, rocBLAS will only load the code objects and workspaces for the operations actually used, so we expect this to save some VRAM. So I can confirm @sARY77's observations.
@IMbackK I addressed your feedback in the new iteration: it calls rocblas_get_version_string_size, parses the full major version value (std::from_chars stops at the first invalid character and does not throw any exceptions), and adds a GGML_LOG_DEBUG message when the workaround is applied. Please take a look. Thank you!
Looks good to me now
I have two identical AMD GPUs and noticed the discrepancy in the free memory reported.
I traced it down to the rocblas_initialize call after which the VRAM usage on one of the GPUs jumps up by 498 MiB.
PR that introduced the workaround:
ROCm Port
According to the discussion of the ROCm issue that made the workaround necessary, it has been resolved a long time ago:
[Bug]: Incorrect results when using GPUs with different architectures
I tested my change on a model that offloads about 20 GiB on each of the GPUs and did not notice any differences.
I also ran the CI and there were no failures reported.
Before:
llama_load_model_from_file: using device ROCm0 (Radeon RX 7900 XTX) - 24026 MiB free
llama_load_model_from_file: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free
After:
llama_load_model_from_file: using device ROCm0 (Radeon RX 7900 XTX) - 24524 MiB free
llama_load_model_from_file: using device ROCm1 (Radeon RX 7900 XTX) - 24524 MiB free