AMD gfx1103 laptop GPU returning HIPBLAS_STATUS_UNKNOWN
#188
Thanks for filing a new issue. I noticed two things that are orthogonal to your issue, but that thanks to your issue could be improved. I've done so in 15e2339 and 67d97b5. Could you build llamafile at HEAD and try again? You should get the same error, but it'll give us something better to go on than "unknown error".
Thanks!
Here's the output. I chose to use the Mistral model instead, considering only 16G of VRAM is allocated to the GPU on this 7840U laptop:
Since it seems to be related with
Let me know if there's anything else I could provide to assist the diagnosis. Thanks,

Earlier @cgmb said in #122 (comment) that your gfx1103 laptop GPU "will only work for programs that limit themselves to a compatible subset of the ISA". In that case I'd ask @cgmb if hipBLAS returning HIPBLAS_STATUS_UNKNOWN
Also, @lovenemesis, I noticed you passed
Regardless of setting
If I rebooted into Win11 X64 on the same laptop, I could offload to the exact same gfx1103 GPU on the llamafile 0.6 release without any tweaking. Hence, I assume the GPU itself retains the capability to handle the workload. It works well with PyTorch + ROCm, too.
I don't know. When you run machine code that contains instructions that don't exist (or don't mean the same thing as the compiler expected) on the processor you're running it on... you've entered the land of undefined behavior. It's plausible that could be your problem, but it could be something else entirely. As far as I know, AMD does not test ROCm on gfx1103 hardware.
rocBLAS/hipBLAS was not designed to support gfx1103, regardless of whether
In the future, I'm hoping that the 'generic' ISAs proposed for introduction into LLVM will avoid the need for ugly hacks like
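If it helps diagnosis, here's a minimal sketch (mine, not from this thread; it assumes a working HIP runtime and hipcc) that asks the runtime which ISA it actually sees for device 0:

```c
// Hedged sketch: print the ISA the HIP runtime reports for device 0.
// On a 780M this normally shows gfx1103; if a gfx override is in
// effect, the overridden target is reported instead, which is what
// tricks rocBLAS into loading kernels built for a different ISA.
#include <stdio.h>
#include <hip/hip_runtime.h>

int main(void) {
  hipDeviceProp_t props;
  if (hipGetDeviceProperties(&props, 0) != hipSuccess) {
    fprintf(stderr, "hipGetDeviceProperties failed\n");
    return 1;
  }
  printf("device 0: %s (%s)\n", props.name, props.gcnArchName);
  return 0;
}
```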
If it helps, I could run the program again on a desktop equipped with an RX 7800XT (gfx1101) to see if things are different. Correct me if I'm wrong: the reason llamafile 0.6 could offload to gfx1103 on Win11 X64 is its use of tinyBLAS, not rocBLAS/hipBLAS. Right? Thanks,
Below is the result of running on the RX 7800XT (gfx1101) using Fedora 39. A different model is used, but it shouldn't matter in this case.
Though the 7800XT isn't officially listed, it should share more similarity with the 7900XT/XTX than the 780M does. Note I'm using the Fedora-packaged ROCm 5.7.1 as described here. Perhaps I jumped in too early, as it's only officially ready in Fedora 40, but if this turns out to be a packaging-related issue, I'm happy to bring it to the attention of the package maintainer. Please let me know if there's additional information I can provide to help narrow down the potential factor.
That would be interesting. Could you try passing the
This one works on Linux!
Eval time is much faster: 2.3× that of CPU-only mode.
Following a similar method, it works on Fedora 39 with the RX 7800XT, too:
Eval time is 8× that of CPU-only mode (the 5700G is weaker than the 7840U in terms of CPU performance).
I get the same issue on AMD: the first time it reports an error, but the next time it doesn't.
The log file is below. Anyway, you saved my life. Great job @jart and your team. What you did is so amazing. ❤️ ❤️ ❤️ ❤️ ❤️ I give you 1000 ❤️ for this project
Happy to hear it! cc team: @stlhood @mrdomino @ahgamut @jammm
It looks like everything is working as intended on the left side of your screenshot. tinyBLAS isn't as fast as rocBLAS yet, so we still try to build you a native rocBLAS library if you have ROCm installed, and fall back to the builtin tinyBLAS DSO when it can't be found. I'm not sure why it'd fall back to CPU inference on your first run, though. I'll try uninstalling ROCm on my machine later to recreate that failure path.
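For illustration, here's a hedged sketch of that load order. The library names and the loader function are made up to show the fallback behavior described above; llamafile's actual file names and loader code may differ:

```c
// Illustrative only: try backends from fastest to slowest.
#include <dlfcn.h>
#include <stdio.h>

static void *load_gpu_backend(void) {
  // Prefer a natively compiled rocBLAS-backed module when the ROCm SDK
  // managed to build one, since rocBLAS is still faster than tinyBLAS.
  void *h = dlopen("./ggml-rocm.so", RTLD_NOW);
  if (h) return h;
  // Otherwise fall back to the prebuilt tinyBLAS module.
  h = dlopen("./ggml-tinyblas.so", RTLD_NOW);
  if (h) return h;
  // Last resort: CPU inference.
  fprintf(stderr, "no GPU backend found; falling back to CPU\n");
  return NULL;
}
```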
In #92, you mentioned a positive result while using an RX 6800 on Windows. May I know if this card also works in your setup under Linux, assuming ROCm was installed via AMD packages on an officially supported distro (Ubuntu 22.04)? If your setup works while mine (ROCm SDK repackaged in Fedora) and this #214 (ROCm SDK repackaged in Arch) don't, this could be a packaging issue specific to those distros. I'm happy to make the distro packagers aware. Thanks a lot for all the work!
@lovenemesis, you should read about this HIP issue. It's a note for Windows, but you can take a look. They have an AMD device that was successfully deployed on Fedora: ggerganov/llama.cpp#1087 (comment)
My latest Linux computer has an AMD Radeon RX 7900 XTX and runs Debian 12. Note that while Debian isn't listed in AMD's list of supported OSes, Ubuntu 22.04 is listed and Debian 12 is what it's based on, so I'd assume Debian is supported by the transitive property. https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html I haven't tried putting my AMD Radeon RX 6800 in my Linux computer; I would assume it would work fine if I did. I used the official ROCm installer. Could you confirm that 67d97b5 earlier was successful in automating the compilation for your Fedora environment? I'd assume that's all you needed. It hasn't been rolled out into a release yet.
Besides this, it would be great to allow a fallback path to the built-in tinyBLAS kernel if hipBLAS/rocBLAS doesn't work under Linux, similar to Windows. Hope I'm not asking too much.
My thinking behind this ask: if that unlisted RX 6800 works with the official ROCm installer on Linux, the issue with my unlisted RX 7800XT not working may fall on the packaging side in Fedora. In that case, I could use a hint on which parts require a further look. At the beginning of my output, there are a few errors printed about:
Was
Meanwhile, I will try to install Debian 12 or Ubuntu 22.04 on an external hard drive to test the official ROCm installer.
I don't think we should color too far outside the lines AMD has drawn. The AMD Radeon RX 6800 is the lowest card in AMD's support vector on Windows for HIP SDK development (https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html), and llamafile is able to support a lot more cards than that on Windows because we ship prebuilt binaries that only depend on the graphics driver. AMD makes computers like this for Linux users:

But they'll let you squeak in with a Radeon RX 7900 XTX: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html

llamafile won't ship prebuilt GPU dynamic shared objects to Linux users, so we depend on the HIP SDK being available. Therefore, any hardware beyond that which we somehow manage to support is both a hack and a gift.
@jart I see weird behavior with tinyBLAS, but I want to make sure it's not a bug. Can you take a look at this issue: ggerganov/llama.cpp#3969 (comment)
@hiepxanh Have you tried setting the
@jart it seems it's not working, but I also see that only a few models have the issue; phi-2 works perfectly. I believe it may be a model quantization issue. NVIDIA cards have the same issue.
The Fedora packagers might be interested to know if you're having trouble with rocBLAS on the RX 7800 XT. IIRC, that was one of the GPUs they were looking into using for their local workstations.
A little over a year ago, AMD donated four RX 6800 GPUs to the Debian AI team to help them with packaging. I didn't get rocBLAS packaged for Debian in time for Bookworm, but the RX 6800 is perhaps the best-tested AMD GPU for the ROCm packages on Trixie. I am currently trying to migrate Debian's ROCm packages from LLVM 15 to LLVM 17 so we can enable RDNA 3 support on Trixie. Once that's done, I think we'll start looking at backports. I expect that you'll be able to install librocblas-dev enabled for all discrete RDNA 3 GPUs from bookworm-backports later this year.
@cgmb I see you're working at AMD, which is really cool. I have a small thought; I know it's silly. As far as I know, AMD currently lacks resources, so they decided to push on enterprise, which is fine. But for AMD users and fans, it would be great if they could test models more easily.
Even if it were possible to distribute prebuilt tinyBLAS .so files that worked on multiple Linux distros, we simply don't have room. There's 4.5MB of space left in the LLaVA llamafile, which is 3.995557952672243 gibibytes. We'd need at least another 23MB to distribute .so files for AMD and NVIDIA.
Now may not be the time for it (and this likely isn't the thread for it), but on-GPU decompression is something games have cared about for a while. It's possible we could benefit from those same routines. We might be able to compress the weights during preprocessing and then decompress them at load time. This would of course complicate a number of existing code paths, but there may come a time when it's worth the trade.
@ajbouh tinyBLAS carves weight tensors up into 2D blocks. @ahgamut what is the bottleneck on GPU right now? Is it computation or memory bandwidth? If it's the latter, would it be possible to read variable-length 2D blocks, Huffman decode them, run-length decode them, apply a zigzag transform, then an inverse discrete cosine transform, and finally q8 dequantize?
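To make one stage of that pipeline concrete, here's a minimal sketch (mine, assuming JPEG-style 8×8 blocks) of the inverse zigzag scan; the Huffman, run-length, inverse DCT, and dequantize stages would wrap around it:

```c
// Undo a JPEG-style zigzag scan: scatter a 64-entry zigzag-ordered
// block back into row-major 8x8 order. kZigZag[i] is the row-major
// index of the i-th coefficient in zigzag order.
static const int kZigZag[64] = {
   0,  1,  8, 16,  9,  2,  3, 10,
  17, 24, 32, 25, 18, 11,  4,  5,
  12, 19, 26, 33, 40, 48, 41, 34,
  27, 20, 13,  6,  7, 14, 21, 28,
  35, 42, 49, 56, 57, 50, 43, 36,
  29, 22, 15, 23, 30, 37, 44, 51,
  58, 59, 52, 45, 38, 31, 39, 46,
  53, 60, 61, 54, 47, 55, 62, 63,
};

static void unzigzag8x8(const float in[64], float out[64]) {
  for (int i = 0; i < 64; ++i)
    out[kZigZag[i]] = in[i];  // scatter back to row-major order
}
```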
We'd have to profile to be sure.
Loading times are almost certainly bottlenecked by bandwidth, aren't they? There's also the challenge of fitting below the 4GB limit that Windows users are facing. It would be a cool thing to reuse the same texture compression techniques that game developers have refined to push the boundaries of what's possible in AI consumer hardware.
If we could just gzip the byte stream I'd do it, but the entropy is too high for weights. OTOH, using texture compression or other forms of image compression would require changing the gguf file format, I think. That's not worth it for optimizing loading time, IMHO. What I'd like to see happen is for GPUs to support mmap(), where the kernel can track read-only pages that map to a file on disk. That would turn loading into a one-time cost when memory is free. Right now the closest thing I've seen to being able to do that is devices like Jetson that have unified memory. Then I can just say:

```c
// Map the weights file read-only, pin the pages for the GPU, and get a
// device pointer aliasing the same memory (zero-copy on unified-memory
// devices like Jetson).
void *map = mmap(0, N*sizeof(int), PROT_READ, MAP_SHARED, fd, 0);
CUDA_OR_DIE(cudaHostRegister(map, N*sizeof(int), CU_MEMHOSTREGISTER_READ_ONLY));
CUDA_OR_DIE(cudaHostGetDevicePointer((void **)&x, map, 0));
```

And it loads almost as fast as mmap(). I want it because I'd rather have ephemeral processes with a smart kernel than needing to run my own daemon that I talk to over HTTP/JSON.
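For completeness, a hedged sketch of the matching teardown for that snippet (same variables assumed):

```c
// Unpin with cudaHostUnregister before munmap so the CUDA driver drops
// its references to the pages first; then the mapping can be released.
CUDA_OR_DIE(cudaHostUnregister(map));
munmap(map, N*sizeof(int));
```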
At the risk of crossing the streams, it seems that Nvidia claims to have a variety of standard compression algorithms that run on GPU, including snappy, zstd, and lz4: https://developer.nvidia.com/nvcomp
I understand the space limitation here. I guess that's also why the below fails on the release file:
I honestly didn't know AMD's supported GPU list differs a lot between Windows and Linux. As the 6800 is listed while the 7800XT isn't, I won't ask for too much. As always, this is THE most accessible method to taste LLMs, and I've already recommended it to all my friends who want to learn. Thanks a lot!
As I remember, the 7800 should work with ROCm 6.0; 5.7 doesn't support it yet.
Hmm, that's a potential reason, though it works just fine with PyTorch + ROCm 5.7.
After upgrading to the latest ROCm on Fedora 40 as well as llamafile 0.8.4, things finally appear to be working on this 7840U with the 780M (gfx1103):
Thanks to @jart and the team for the effort to bundle AMD GPU Linux support in tinyBLAS.
For me, using gfx1101 on gfx1103 (Ryzen 7940HX) is faster than gfx1100 (with rocBLAS/hipBLAS...).
First, great work on getting AMD GPU support on Windows into such good shape within such a short period. Really appreciate your work!
However, once I switched to Fedora 39 on the same Ryzen 7840U with Radeon 780M laptop, things became a bit puzzling.
At first, it complains about not finding clang++ and hipcc:

Although, I do have clang++ and hipcc available in $PATH.

Then I figured out it might need a bit of manual help, hence adding an environment variable:
This time it compiles, but eventually fails due to a CUDA error.
May I know what additional steps I should take to get it working?
Thanks,
Originally posted by @lovenemesis in #92 (comment)