"CUDA error" when set resolution higher than 1280 x 1280 #156

Open
XienXX opened this issue Jan 23, 2024 · 7 comments

Comments

XienXX commented Jan 23, 2024

CUDA Version:12.3
GPU: RTX 4080 16G

The model works fine at 1024x1024, but if I set the resolution to 1280x1280 or above, the launch fails. See below:
1280x1280 resolution, failed:
PS D:\xien\stable-diffusion.cpp\build\bin\Release> .\sd.exe -m ../v2-1_768-nonema-pruned.safetensors --type f16 -p "a lovely cat" -H 1280 -W 1280
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:137 - loading model from '../v2-1_768-nonema-pruned.safetensors'
[INFO ] model.cpp:641 - load ../v2-1_768-nonema-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:163 - Stable Diffusion 2.x
[INFO ] stable-diffusion.cpp:169 - Stable Diffusion weight type: f16
[INFO ] stable-diffusion.cpp:268 - total memory buffer size = 2450.99MB (clip 684.18MB, unet 1662.34MB, vae 104.47MB)
[INFO ] stable-diffusion.cpp:270 - loading model from '../v2-1_768-nonema-pruned.safetensors' completed, taking 2.67s
[INFO ] stable-diffusion.cpp:282 - running in v-prediction mode
[INFO ] stable-diffusion.cpp:1182 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 28 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1235 - generating image: 1/1 - seed 42
|> | 0/20 - 0.00it/sCUDA error: the function failed to launch on the GPU
current device: 0, in function ggml_cuda_op_mul_mat_cublas at D:\xien\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:7650
cublasSgemm_v2(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
GGML_ASSERT: D:\xien\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:226: !"CUDA error"

1280x1024 resolution, worked:
PS D:\xien\stable-diffusion.cpp\build\bin\Release> .\sd.exe -m ../v2-1_768-nonema-pruned.safetensors --type f16 -p "a lovely cat" -H 1280 -W 1024
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:137 - loading model from '../v2-1_768-nonema-pruned.safetensors'
[INFO ] model.cpp:641 - load ../v2-1_768-nonema-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:163 - Stable Diffusion 2.x
[INFO ] stable-diffusion.cpp:169 - Stable Diffusion weight type: f16
[INFO ] stable-diffusion.cpp:268 - total memory buffer size = 2450.99MB (clip 684.18MB, unet 1662.34MB, vae 104.47MB)
[INFO ] stable-diffusion.cpp:270 - loading model from '../v2-1_768-nonema-pruned.safetensors' completed, taking 2.69s
[INFO ] stable-diffusion.cpp:282 - running in v-prediction mode
[INFO ] stable-diffusion.cpp:1182 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 30 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1235 - generating image: 1/1 - seed 42
|==================================================| 20/20 - 1.08it/s
[INFO ] stable-diffusion.cpp:1247 - sampling completed, taking 19.60s
[INFO ] stable-diffusion.cpp:1255 - generating 1 latent images completed, taking 19.61s
[INFO ] stable-diffusion.cpp:1257 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1267 - latent 1 decoded, taking 1.45s
[INFO ] stable-diffusion.cpp:1271 - decode_first_stage completed, taking 1.45s
[INFO ] stable-diffusion.cpp:1290 - txt2img completed in 21.09s
save result image to 'output.png'

Screenshot 2024-01-23 102337
image

Author

XienXX commented Jan 24, 2024

I switched to an RTX 5000 Ada (48G) and the model fails in the same way. Please help!

Contributor

FSSRepo commented Jan 27, 2024

It seems to be an error in the way matrix multiplications are performed in ggml. Does it work if you do it only with CPU?

Author

XienXX commented Jan 29, 2024

> It seems to be an error in the way matrix multiplications are performed in ggml. Does it work if you do it only with CPU?

image

It seems not. Should I re-run cmake?

Contributor

FSSRepo commented Jan 29, 2024

@XienXX cmake .. -DSD_CUBLAS=OFF

Author

XienXX commented Jan 29, 2024

> @XienXX cmake .. -DSD_CUBLAS=OFF

image
Yep, thanks, that works, but it is far too slow XD. How can I run it on the GPU?

errnoh commented Mar 25, 2024

I can replicate this with hipBLAS. 768x768 works, 768x1024 works, 1024x1024 fails, 1280x1280 fails.
It is also interesting that 1024x1024 and 1280x1280 fail in different ways.

EDIT: Actually, this seems to happen only with the v1.5 model. SDXL works fine at 1280x1280.

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 768 -W 768
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.09s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 24 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
  |==================================================| 20/20 - 1.96it/s
[INFO ] stable-diffusion.cpp:1775 - sampling completed, taking 10.33s
[INFO ] stable-diffusion.cpp:1783 - generating 1 latent images completed, taking 10.33s
[INFO ] stable-diffusion.cpp:1785 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1795 - latent 1 decoded, taking 1.51s
[INFO ] stable-diffusion.cpp:1799 - decode_first_stage completed, taking 1.51s
[INFO ] stable-diffusion.cpp:1818 - txt2img completed in 11.86s
save result image to 'output.png'

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 1024 -W 1024
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.11s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 24 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_op_scale at /build/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:10030
  hipGetLastError()
GGML_ASSERT: /build/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:255: !"CUDA error"
Aborted (core dumped)

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 1280 -W 1280
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.09s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 25 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
Memory access fault by GPU node-1 (Agent handle: 0x557c4d24b7d0) on address 0x7fc5fdc8b000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

DGdev91 commented May 2, 2024

Same here with hipBLAS on an RX 7900 XT. The largest image I managed on SD 1.5 is 960x1024, while on SDXL I got as far as 1920x1920 before hitting the same issue.
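
One pattern fits every data point reported in this thread, though nobody here has confirmed it as the cause: the largest self-attention score tensor (heads x tokens x tokens) may exceed what a signed 32-bit element count can hold. The Python sketch below is only a back-of-the-envelope check. The 8x VAE downscale, the head counts per model family, and the extra 2x downscale before SDXL's first attention level are assumptions about the model architectures, not values taken from the logs above, and the helper names are hypothetical.

```python
# Rough check: does heads * tokens^2 fit in a signed 32-bit integer?
# All architecture numbers below are ASSUMPTIONS, not read from the logs.

INT32_MAX = 2**31 - 1

# model -> (heads at the largest attention level, extra latent downscale there)
MODELS = {
    "SD 1.5": (8, 1),   # assumed: 8 heads at full latent resolution
    "SD 2.1": (5, 1),   # assumed: 320 channels / 64 per head
    "SDXL": (10, 2),    # assumed: first attention at half latent resolution
}


def attention_score_elements(model: str, width: int, height: int) -> int:
    """Element count of the largest attention score tensor: heads * tokens^2."""
    heads, attn_downscale = MODELS[model]
    scale = 8 * attn_downscale  # assumed 8x VAE downscale
    tokens = (width // scale) * (height // scale)
    return heads * tokens * tokens


def as_int32(n: int) -> int:
    """Value of n after wrapping to a signed 32-bit integer."""
    return (n + 2**31) % 2**32 - 2**31


# (model, width, height, outcome as reported in this thread)
REPORTS = [
    ("SD 2.1", 1024, 1024, "worked"),
    ("SD 2.1", 1024, 1280, "worked"),
    ("SD 2.1", 1280, 1280, "failed"),
    ("SD 1.5", 768, 768, "worked"),
    ("SD 1.5", 768, 1024, "worked"),
    ("SD 1.5", 960, 1024, "worked"),
    ("SD 1.5", 1024, 1024, "failed"),
    ("SD 1.5", 1280, 1280, "failed"),
    ("SDXL", 1280, 1280, "worked"),
    ("SDXL", 1920, 1920, "worked"),
]

for model, width, height, outcome in REPORTS:
    n = attention_score_elements(model, width, height)
    fits = "fits" if n <= INT32_MAX else "overflows"
    print(f"{model:7} {width}x{height}: {n:>13,} elements, "
          f"{fits} int32 (as int32: {as_int32(n):,}), reported: {outcome}")
```

Under these assumptions, every configuration reported as working stays below the int32 limit and every failing one exceeds it. SD 1.5 at 1024x1024 lands exactly on 2^31, which wraps to a negative count and would be consistent with the "invalid configuration argument" error if the kernel launch size is derived from a 32-bit count. That last step is a guess about ggml's CUDA code and would need to be checked against the source.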
