"CUDA error" when set resolution higher than 1280 x 1280 #156

XienXX · 2024-01-23T02:27:12Z

CUDA Version: 12.3
GPU: RTX 4080 16G

The model works fine at 1024 x 1024, but if I set the resolution to 1280x1280 or above, the launch fails. See below:
1280x1280 resolution, failed:
PS D:\xien\stable-diffusion.cpp\build\bin\Release> .\sd.exe -m ../v2-1_768-nonema-pruned.safetensors --type f16 -p "a lovely cat" -H 1280 -W 1280
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:137 - loading model from '../v2-1_768-nonema-pruned.safetensors'
[INFO ] model.cpp:641 - load ../v2-1_768-nonema-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:163 - Stable Diffusion 2.x
[INFO ] stable-diffusion.cpp:169 - Stable Diffusion weight type: f16
[INFO ] stable-diffusion.cpp:268 - total memory buffer size = 2450.99MB (clip 684.18MB, unet 1662.34MB, vae 104.47MB)
[INFO ] stable-diffusion.cpp:270 - loading model from '../v2-1_768-nonema-pruned.safetensors' completed, taking 2.67s
[INFO ] stable-diffusion.cpp:282 - running in v-prediction mode
[INFO ] stable-diffusion.cpp:1182 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 28 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1235 - generating image: 1/1 - seed 42
|> | 0/20 - 0.00it/sCUDA error: the function failed to launch on the GPU
current device: 0, in function ggml_cuda_op_mul_mat_cublas at D:\xien\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:7650
cublasSgemm_v2(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
GGML_ASSERT: D:\xien\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:226: !"CUDA error"

1280x1024 resolution, worked:
PS D:\xien\stable-diffusion.cpp\build\bin\Release> .\sd.exe -m ../v2-1_768-nonema-pruned.safetensors --type f16 -p "a lovely cat" -H 1280 -W 1024
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:137 - loading model from '../v2-1_768-nonema-pruned.safetensors'
[INFO ] model.cpp:641 - load ../v2-1_768-nonema-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:163 - Stable Diffusion 2.x
[INFO ] stable-diffusion.cpp:169 - Stable Diffusion weight type: f16
[INFO ] stable-diffusion.cpp:268 - total memory buffer size = 2450.99MB (clip 684.18MB, unet 1662.34MB, vae 104.47MB)
[INFO ] stable-diffusion.cpp:270 - loading model from '../v2-1_768-nonema-pruned.safetensors' completed, taking 2.69s
[INFO ] stable-diffusion.cpp:282 - running in v-prediction mode
[INFO ] stable-diffusion.cpp:1182 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 30 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1235 - generating image: 1/1 - seed 42
|==================================================| 20/20 - 1.08it/s
[INFO ] stable-diffusion.cpp:1247 - sampling completed, taking 19.60s
[INFO ] stable-diffusion.cpp:1255 - generating 1 latent images completed, taking 19.61s
[INFO ] stable-diffusion.cpp:1257 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1267 - latent 1 decoded, taking 1.45s
[INFO ] stable-diffusion.cpp:1271 - decode_first_stage completed, taking 1.45s
[INFO ] stable-diffusion.cpp:1290 - txt2img completed in 21.09s
save result image to 'output.png'

XienXX · 2024-01-24T09:28:27Z

I switched to an RTX 5000 Ada (48G) and the model behaves the same. Please help!

FSSRepo · 2024-01-27T17:19:22Z

It seems to be an error in the way matrix multiplications are performed in ggml. Does it work if you do it only with CPU?

XienXX · 2024-01-29T02:12:54Z

> It seems to be an error in the way matrix multiplications are performed in ggml. Does it work if you do it only with CPU?

It seems not. Should I re-run cmake?

FSSRepo · 2024-01-29T02:15:07Z

@XienXX cmake .. -DSD_CUBLAS=OFF

XienXX · 2024-01-29T02:41:36Z

> @XienXX cmake .. -DSD_CUBLAS=OFF

Yep, thanks, it works that way, but it is far too slow XD. How can I run it on the GPU?

errnoh · 2024-03-25T17:57:19Z

I can replicate this with hipBLAS. 768x768 works, 768x1024 works, 1024x1024 fails, 1280x1280 fails.
It is also interesting that 1024x1024 and 1280x1280 fail in different ways.

EDIT: Actually, this seems to happen only with the v1.5 model. SDXL works fine at 1280x1280.

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 768 -W 768
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.09s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 24 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
  |==================================================| 20/20 - 1.96it/s
[INFO ] stable-diffusion.cpp:1775 - sampling completed, taking 10.33s
[INFO ] stable-diffusion.cpp:1783 - generating 1 latent images completed, taking 10.33s
[INFO ] stable-diffusion.cpp:1785 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1795 - latent 1 decoded, taking 1.51s
[INFO ] stable-diffusion.cpp:1799 - decode_first_stage completed, taking 1.51s
[INFO ] stable-diffusion.cpp:1818 - txt2img completed in 11.86s
save result image to 'output.png'

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 1024 -W 1024
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.11s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 24 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_op_scale at /build/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:10030
  hipGetLastError()
GGML_ASSERT: /build/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:255: !"CUDA error"
Aborted (core dumped)

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 1280 -W 1280
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.09s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 25 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
Memory access fault by GPU node-1 (Agent handle: 0x557c4d24b7d0) on address 0x7fc5fdc8b000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
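
One possible reading of the two different failures above, not confirmed anywhere in this thread: if a kernel launcher holds the tensor's element count in a signed 32-bit integer, a count of 2^31 or more wraps around. The sketch below only mimics that arithmetic in Python. The 8 attention heads for SD 1.x, the 1/8 latent scale, and the block size of 256 are assumptions made for illustration, and `launch_blocks` is a hypothetical helper, not a ggml function.

```python
import ctypes

BLOCK_SIZE = 256  # assumed threads per block (illustrative value)
SD1_HEADS = 8     # assumed attention head count for SD 1.x


def as_int32(n: int) -> int:
    """Truncate a Python int to a signed 32-bit value, like a C cast to int."""
    return ctypes.c_int32(n).value


def c_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as in C."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def score_elements(width: int, height: int, heads: int = SD1_HEADS) -> int:
    """Elements in a heads x tokens x tokens attention score tensor."""
    tokens = (width // 8) * (height // 8)  # latent is assumed to be 1/8 scale
    return heads * tokens * tokens


def launch_blocks(n_elements: int) -> int:
    """Block count a hypothetical launcher computes from a 32-bit element count."""
    k = as_int32(n_elements)
    return c_div(as_int32(k + BLOCK_SIZE - 1), BLOCK_SIZE)


if __name__ == "__main__":
    for w, h in [(768, 1024), (1024, 1024), (1280, 1280)]:
        n = score_elements(w, h)
        print(f"{w}x{h}: true count {n}, as int32 {as_int32(n)}, "
              f"blocks {launch_blocks(n)}")
```

Under these assumptions, 1024x1024 yields exactly 2^31 elements, which wraps to a negative count and a negative block count; that would fit an "invalid configuration argument" at launch. 1280x1280 wraps to a positive but wrong count, which would let the launch proceed with a bad size and could fit a memory access fault instead. This is a hypothesis to check against the ggml source, not a diagnosis.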

DGdev91 · 2024-05-02T20:07:33Z

Same here with hipBLAS on an RX 7900 XT. The largest image I managed to generate with SD 1.5 is 960x1024, while with SDXL I managed a 1920x1920 picture before encountering the same issue.
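
The working and failing resolutions reported in this thread can be compared against a single threshold. The sketch below is a rough check, assuming the largest self-attention score tensor has heads x tokens x tokens elements and that trouble starts once that count exceeds the signed 32-bit maximum. The per-model head counts and the extra downscale for SDXL are assumptions about the model architectures, not values taken from the thread or from stable-diffusion.cpp.

```python
INT32_MAX = 2**31 - 1

# model -> (assumed heads, assumed extra downscale before the first attention layer)
ASSUMED_MODELS = {
    "SD 1.x": (8, 1),
    "SD 2.x": (5, 1),
    "SDXL": (10, 2),
}

# (model, width, height, worked) as reported by commenters in this thread
REPORTS = [
    ("SD 2.x", 1024, 1280, True),
    ("SD 2.x", 1280, 1280, False),
    ("SD 1.x", 768, 768, True),
    ("SD 1.x", 768, 1024, True),
    ("SD 1.x", 1024, 1024, False),
    ("SD 1.x", 1280, 1280, False),
    ("SD 1.x", 960, 1024, True),
    ("SDXL", 1280, 1280, True),
    ("SDXL", 1920, 1920, True),
]


def score_elements(model: str, width: int, height: int) -> int:
    """Elements in the largest attention score tensor under the assumptions above."""
    heads, downscale = ASSUMED_MODELS[model]
    scale = 8 * downscale  # latent is assumed to be 1/8 of the image size
    tokens = (width // scale) * (height // scale)
    return heads * tokens * tokens


def fits_int32(model: str, width: int, height: int) -> bool:
    """True if the element count is representable as a signed 32-bit int."""
    return score_elements(model, width, height) <= INT32_MAX


if __name__ == "__main__":
    for model, w, h, worked in REPORTS:
        n = score_elements(model, w, h)
        verdict = "fits" if fits_int32(model, w, h) else "exceeds int32"
        reported = "worked" if worked else "failed"
        print(f"{model:7s} {w}x{h}: {n:>13,d} elements, {verdict}, reported {reported}")
```

With these assumed values, every reported success stays under the limit and every reported failure exceeds it. That agreement is suggestive only; it does not show where in ggml an overflow would occur.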

DGdev91 mentioned this issue May 2, 2024

Unsafe code To safe code DarthAffe/StableDiffusion.NET#1

Open
