
On Kaggle : libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12 #134929

Closed
FurkanGozukara opened this issue Sep 2, 2024 · 23 comments
Assignees
Labels
high priority
module: binaries (Anything related to official binaries that we release to users)
module: cuda (Related to torch.cuda, and CUDA support in general)
module: regression (It used to work, and now it doesn't)
module: sparse (Related to torch.sparse)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Milestone

Comments

@FurkanGozukara

FurkanGozukara commented Sep 2, 2024

🐛 Describe the bug

I have tried everything but no luck.

Waiting for your input to try more.

I tried torch 2.4.0, 2.5-dev, cu118, cu121 and cu124 - all give the same error.

The code below produces the same error I get when using the famous ComfyUI via SwarmUI:

import os

# Set CUDA_HOME environment variable
os.environ['CUDA_HOME'] = '/opt/conda'

# Add CUDA binary directory to PATH
os.environ['PATH'] = f"/opt/conda/bin:{os.environ['PATH']}"

# Set LD_LIBRARY_PATH to include CUDA libraries
os.environ['LD_LIBRARY_PATH'] = f"/opt/conda/lib:{os.environ.get('LD_LIBRARY_PATH', '')}"

# Verify CUDA version
!nvcc --version

# Optional: Check if CUDA is available in Python
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")

It gives the error below:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[49], line 16
     13 get_ipython().system('nvcc --version')
     15 # Optional: Check if CUDA is available in Python
---> 16 import torch
     17 print(f"CUDA available: {torch.cuda.is_available()}")
     18 if torch.cuda.is_available():

File /opt/conda/lib/python3.10/site-packages/torch/__init__.py:368
    366     if USE_GLOBAL_DEPS:
    367         _load_global_deps()
--> 368     from torch._C import *  # noqa: F403
    371 class SymInt:
    372     """
    373     Like an int (including magic methods), but redirects all operations on the
    374     wrapped node. This is used in particular to symbolically record operations
    375     in the symbolic shape workflow.
    376     """

ImportError: /opt/conda/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12

Versions

Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.15.154+-x86_64-with-glibc2.35
Is CUDA available: N/A
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: 
GPU 0: Tesla T4
GPU 1: Tesla T4

Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               4
On-line CPU(s) list:                  0-3
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) CPU @ 2.00GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   2
Socket(s):                            1
Stepping:                             3
BogoMIPS:                             4000.36
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            64 KiB (2 instances)
L1i cache:                            64 KiB (2 instances)
L2 cache:                             2 MiB (2 instances)
L3 cache:                             38.5 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-3
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Mitigation; Clear CPU buffers; SMT Host state unknown

Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] onnx==1.16.2
[pip3] optree==0.11.0
[pip3] pytorch-ignite==0.5.1
[pip3] pytorch-lightning==2.4.0
[pip3] pytorch-triton==3.0.0+dedb7bdf33
[pip3] torch==2.5.0.dev20240901+cu124
[pip3] torchaudio==2.5.0.dev20240901+cu124
[pip3] torchinfo==1.8.0
[pip3] torchmetrics==1.4.1
[pip3] torchvision==0.20.0.dev20240901+cu124
[pip3] triton==3.0.0
[conda] magma-cuda121             2.6.1                         1    pytorch
[conda] mkl                       2024.2.1           ha957f24_103    conda-forge
[conda] numpy                     1.26.4          py310hb13e2d6_0    conda-forge
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] pytorch-ignite            0.5.1                    pypi_0    pypi
[conda] pytorch-lightning         2.4.0                    pypi_0    pypi
[conda] pytorch-triton            3.0.0+dedb7bdf33          pypi_0    pypi
[conda] torch                     2.5.0.dev20240901+cu124          pypi_0    pypi
[conda] torchaudio                2.5.0.dev20240901+cu124          pypi_0    pypi
[conda] torchinfo                 1.8.0                    pypi_0    pypi
[conda] torchmetrics              1.4.1                    pypi_0    pypi
[conda] torchvision               0.20.0.dev20240901+cu124          pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @seemethere @malfet @osalpekar @atalman @alexsamardzic @nikitaved @pearu @cpuhrsch @amjames @bhosmer @jcaip @ptrblck @eqy

@malfet malfet added module: binaries Anything related to official binaries that we release to users topic: binaries module: sparse Related to torch.sparse module: cuda Related to torch.cuda, and CUDA support in general needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user and removed topic: binaries labels Sep 3, 2024
@malfet
Contributor

malfet commented Sep 3, 2024

This sounds to me like a great topic to ask at https://discuss.pytorch.org, though we should also extend collect_env to print information about the nvidia-* packages one has installed.
If torch is installed via PyPI wheels one should not need to define CUDA_HOME or anything like that, but doing so can affect how torch searches for its dependencies, which seems to be what is happening here...
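To illustrate the dependency-search point: the CUDA libraries torch actually expects are the ones shipped in the PyPI `nvidia-*` wheels, which live under `site-packages/nvidia/<pkg>/lib`. A minimal sketch (a hypothetical helper, assuming that wheel layout) that collects those directories, so that if LD_LIBRARY_PATH must be set at all, it can be pointed at the wheel copies rather than a system install:

```python
import site
from pathlib import Path

def wheel_cuda_lib_dirs(site_dirs=None):
    """Collect the lib/ directories that PyPI `nvidia-*` wheels install
    under site-packages (e.g. nvidia/nvjitlink/lib). These hold the
    libraries the torch wheel was built against, so they should come
    before any system CUDA directory on LD_LIBRARY_PATH."""
    if site_dirs is None:
        site_dirs = site.getsitepackages()
    dirs = []
    for sp in site_dirs:
        root = Path(sp) / "nvidia"
        if root.is_dir():
            # one lib/ directory per wheel: cusparse, nvjitlink, ...
            dirs.extend(str(p) for p in sorted(root.glob("*/lib")))
    return dirs
```

Usage would be something like `os.environ["LD_LIBRARY_PATH"] = ":".join(wheel_cuda_lib_dirs() + [os.environ.get("LD_LIBRARY_PATH", "")])` before torch is imported in a fresh process.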

@FurkanGozukara
Author

This sounds to me like a great topic to ask at https://discuss.pytorch.org, though we should also extend collect_env to print information about the nvidia-* packages one has installed. If torch is installed via PyPI wheels one should not need to define CUDA_HOME or anything like that, but doing so can affect how torch searches for its dependencies, which seems to be what is happening here...

Well, it was working when ComfyUI was not using Torch 2.4, but this started after they moved. I think Kaggle still has Torch 2.3 by default. Do you know how I can fix this issue? I tried so many commands and none worked :/

So many people are waiting for me to fix this issue if I can. Thank you so much

@sarihl

sarihl commented Sep 4, 2024

Facing the same issue. It seems to occur only when using a notebook, with CUDA 12.4.
You can try downgrading to a previous CUDA version (worked for me),
or using a script instead of a notebook.

Both of these are workarounds though :/

@FurkanGozukara
Author

you try downgrading to a previous cuda version (worked for me)

How do we downgrade the notebook CUDA version?

@malfet malfet added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module module: regression It used to work, and now it doesn't labels Sep 4, 2024
@malfet
Contributor

malfet commented Sep 4, 2024

@sarihl is there a document somewhere that I can read thru that documents the installation process? I can run torch-2.4 from jupyter notebook just fine

@FurkanGozukara
Author

@sarihl is there a document somewhere that I can read thru that documents the installation process? I can run torch-2.4 from jupyter notebook just fine

You run it inside kaggle?

@malfet
Contributor

malfet commented Sep 4, 2024

@sarihl is there a document somewhere that I can read thru that documents the installation process? I can run torch-2.4 from jupyter notebook just fine

You run it inside kaggle?

Can you share a link to the notebook? I've tried running it and it seems to work fine for me: https://www.kaggle.com/code/malfet/check-torch-version

@FurkanGozukara
Author

@sarihl is there a document somewhere that I can read thru that documents the installation process? I can run torch-2.4 from jupyter notebook just fine

You run it inside kaggle?

Can you share a link to the notebook? I've tried running it and it seems to work fine for me: https://www.kaggle.com/code/malfet/check-torch-version

Here is the notebook.

To be able to see it you need to connect via ngrok, install the famous SwarmUI, and have it install the ComfyUI backend.

It is so easy and straightforward actually.

notebookc7ac6afeca.txt

@willlllllio

Side note, but @malfet, just be aware that using Kaggle for running diffusion WebUIs is against their ToS, so you might want to be careful with your Kaggle account when trying that. As FurkanGozukara surely knows, since he even commented on a thread there where a Kaggle staff member explains it: https://www.kaggle.com/discussions/product-feedback/440296

@malfet
Contributor

malfet commented Sep 5, 2024

@willlllllio thank you for the warning. I wasn't aware of that.

At this point it does not seem like a PyTorch issue, but may be a bug with SwarmUI or whatever creates the custom environment that forces libtorch to link against the wrong nvjitlink. So I think needs reproduction is the right label. I would look into it again if there were a link to Google Colab, Kaggle, or something that can be run end to end and reproduces the problem. And once again: if one installs CUDA-12.2 and points LD_LIBRARY_PATH at it, torch will be forced to link with the wrong version and fail, but there isn't much one can do on the torch side to fix that.
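The shadowing described here can be checked directly: the dynamic loader takes the first matching directory on LD_LIBRARY_PATH, so a sketch like the following (a hypothetical helper, not part of torch) shows which copy of libnvJitLink.so.12 would win:

```python
from pathlib import Path

def first_dir_providing(libname, search_path):
    """Return the first directory on a colon-separated search path that
    contains `libname` -- the same first-match rule the dynamic loader
    applies to LD_LIBRARY_PATH. If an older system CUDA directory comes
    first, its libnvJitLink.so.12 shadows the one in the pip wheel,
    which is exactly the undefined-symbol scenario in this issue."""
    for d in search_path.split(":"):
        if d and (Path(d) / libname).is_file():
            return d
    return None
```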

@malfet malfet removed the needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user label Oct 19, 2024
@malfet
Contributor

malfet commented Oct 19, 2024

@albanD showed me a reproducer, we need to add nvjitlink to rpath for libtorch_cuda.so
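For anyone wanting to verify that fix locally: whether a shared object carries such an rpath can be inspected with `readelf` (assuming binutils is installed). The libtorch path below is illustrative; the command is run against the python3 binary only so it works anywhere:

```shell
# Print the RUNPATH/RPATH entries baked into an ELF binary.
# For the actual check, point this at
# .../site-packages/torch/lib/libtorch_cuda.so instead.
target="$(command -v python3)"
readelf -d "$target" 2>/dev/null | grep -E 'RUNPATH|RPATH' || echo "no RUNPATH/RPATH entries"
```

If the fixed wheel is installed, libtorch_cuda.so should show a RUNPATH entry pointing at the wheel's nvjitlink lib directory.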

@jt-michels

I am having a similar issue, but using Paperspace VM, not Kaggle...

@yondonfu

FWIW I ran into this issue on a machine with system CUDA 12.2 (as reported by nvcc) and the latest ComfyUI, which installs torch 2.5.0 for CUDA 12.4. This lines up with the scenario mentioned in #138460, which notes that the dependency issue most commonly comes up when the binary for CUDA 12.4 is installed on a system with a global CUDA install < 12.4.

My current workaround is to just downgrade to a version of torch built for a previous CUDA version:

pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

@khanfarhan10

This issue is really persistent and looks serious.

A thread is present here : #111469

@albanD
Collaborator

albanD commented Dec 2, 2024

bumping priority for activity and the fact that we have a good idea how to fix

@apaz-cli

apaz-cli commented Jan 10, 2025

@janeyx99 @malfet

I still run into this every day. I'm in a venv, so my solution was to add .venv/lib/python3.10/site-packages/nvidia/nvjitlink/lib to my LD_LIBRARY_PATH.

This is also the same issue as #111469, which is closed by the author because they were unblocked, but the issue didn't get resolved in the general case.

My workaround is that I switch to the venv, then paste in

LD_LIBRARY_PATH=$(python -c "import site; print(site.getsitepackages()[0] + '/nvidia/nvjitlink/lib')"):$LD_LIBRARY_PATH

I would put in the PR to fix it myself, but I'm not quite sure how the import resolves object files or where the code that searches for it is.
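For context on where that resolution happens: before loading torch._C, torch's __init__.py preloads the nvidia wheel libraries with ctypes using RTLD_GLOBAL, which is what makes their symbols visible to libcusparse. A simplified sketch of that mechanism (not torch's actual code; function name is made up):

```python
import ctypes

def preload_globally(lib_names):
    """Load each library with RTLD_GLOBAL so the symbols it exports
    (e.g. __nvJitLinkComplete_12_4 from libnvJitLink) become visible to
    libraries loaded afterwards, such as libcusparse.so.12. Libraries
    that cannot be found are skipped; the names actually loaded are
    returned. Roughly how torch's _load_global_deps path behaves."""
    loaded = []
    for name in lib_names:
        try:
            ctypes.CDLL(name, mode=ctypes.RTLD_GLOBAL)
        except OSError:
            continue  # not found on the loader's search path
        loaded.append(name)
    return loaded
```

This is why the LD_LIBRARY_PATH / LD_PRELOAD workarounds in this thread work: they just make sure the right libnvJitLink is the one the loader finds first.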

@malfet
Contributor

malfet commented Jan 15, 2025

Anyone want to try the latest nightly? It includes #141063 (which will be included in the 2.6 release) and, to the best of my understanding, fixes the problem, though I have not tried reproducing it on Kaggle.

@atalman
Contributor

atalman commented Jan 22, 2025

Hi @FurkanGozukara, can you please confirm this is fixed, using the following install command? This should install the torch 2.6 release candidate:

pip3 install torch numpy --index-url https://download.pytorch.org/whl/test/cu124

@bilzard
Contributor

bilzard commented Jan 29, 2025

I checked that the current nightly version fixes the issue.
I also found a workaround for older versions.

Check on Nightly Version --> worked

Anyone wants to try latest nightly, which includes #141063 (will be included in 2.6 release) that to the best of my understanding fixes the problem, though I have not tired reproducing in on kaggle

Yes. It worked on 2.7.0.dev20250128+cu126.

Install nightly version of pytorch

! pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126

import pytorch --> worked

import torch

torch.__version__
/opt/conda/lib/python3.10/site-packages/torch/utils/_pytree.py:174: FutureWarning: optree is installed but the version is too old to support PyTorch Dynamo in C++ pytree. C++ pytree support is disabled. Please consider upgrading optree using `python3 -m pip install --upgrade 'optree>=0.13.0'`.
  warnings.warn(

'2.7.0.dev20250128+cu126'

Workaround for older version

Check NVCC version

!nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Install pytorch that matches nvcc CUDA version

! pip install -U torch torchvision torchaudio --target=/kaggle/working --index-url https://download.pytorch.org/whl/cu121

Set path to libnvJitLink.so in LD_PRELOAD environment variable

import os

ld_preload_path = !find /usr/local/cuda* -name "libnvJitLink.so*" | head -n 1
if ld_preload_path:
    os.environ["LD_PRELOAD"] = ld_preload_path[0]

print(os.environ.get("LD_PRELOAD"))

import pytorch --> works fine

import torch

torch.__version__
'2.5.1+cu121'

@FurkanGozukara
Author

@bilzard thank you so much

I am glad this is getting fixed

@atalman
Contributor

atalman commented Jan 31, 2025

Could someone please confirm whether it works with the latest release, 2.6:

pip3 install torch numpy

Then we can close this issue.

@bilzard
Contributor

bilzard commented Feb 3, 2025

@atalman I checked that the latest release, torch==2.6.0+cu124, fixes the issue.

requirements.txt:

torch
torchvision
torchaudio

! pip install -U -r requirements.txt --target=/kaggle/working

import torch

torch.__version__
'2.6.0+cu124'

@atalman
Contributor

atalman commented Feb 3, 2025

Closing this issue. Resolved in nightly and in release 2.6.

@atalman atalman closed this as completed Feb 3, 2025