
[Bug]: I want to integrate vllm into LLaMA-Factory, a transformers-based LLM training framework. However, I encountered two bugs: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method & RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details) #9469

Open
takagi97 opened this issue Oct 17, 2024 · 9 comments
Labels: bug (Something isn't working), unstale

Comments


takagi97 commented Oct 17, 2024

Your current environment

The output of `python collect_env.py`
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.0-50-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A800 80GB PCIe
GPU 1: NVIDIA A800 80GB PCIe
GPU 2: NVIDIA A800 80GB PCIe
GPU 3: NVIDIA A800 80GB PCIe
GPU 4: NVIDIA A800 80GB PCIe
GPU 5: NVIDIA A800 80GB PCIe
GPU 6: NVIDIA A800 80GB PCIe
GPU 7: NVIDIA A800 80GB PCIe

Nvidia driver version: 530.30.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
Stepping:                        6
CPU MHz:                         800.000
CPU max MHz:                     3400.0000
CPU min MHz:                     800.0000
BogoMIPS:                        5200.00
L1d cache:                       3 MiB
L1i cache:                       2 MiB
L2 cache:                        80 MiB
L3 cache:                        96 MiB
NUMA node0 CPU(s):               0-15,64-79
NUMA node1 CPU(s):               16-31,80-95
NUMA node2 CPU(s):               32-47,96-111
NUMA node3 CPU(s):               48-63,112-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.0
[pip3] triton==3.0.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.68                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.45.0                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A (dev)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    CPU Affinity    NUMA Affinity
GPU0     X      PIX     NV8     PIX     SYS     SYS     SYS     SYS     NODE    PIX     0-15,64-79      0
GPU1    PIX      X      PIX     NV8     SYS     SYS     SYS     SYS     NODE    PIX     0-15,64-79      0
GPU2    NV8     PIX      X      PIX     SYS     SYS     SYS     SYS     NODE    PIX     0-15,64-79      0
GPU3    PIX     NV8     PIX      X      SYS     SYS     SYS     SYS     NODE    PIX     0-15,64-79      0
GPU4    SYS     SYS     SYS     SYS      X      PIX     PIX     NV6     SYS     SYS     32-47,96-111    2
GPU5    SYS     SYS     SYS     SYS     PIX      X      NV4     PIX     SYS     SYS     32-47,96-111    2
GPU6    SYS     SYS     SYS     SYS     PIX     NV4      X      PIX     SYS     SYS     32-47,96-111    2
GPU7    SYS     SYS     SYS     SYS     NV6     PIX     PIX      X      SYS     SYS     32-47,96-111    2
NIC0    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE
NIC1    PIX     PIX     PIX     PIX     SYS     SYS     SYS     SYS     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

Model Input Dumps

No response

🐛 Describe the bug

I want to use vLLM to run LLM inference and efficiently obtain the output probability distribution for token-level knowledge distillation. To do this, I first run inference with vLLM and then use its outputs to train student models. For the implementation, I integrated vLLM (0.6.2) into LLaMA-Factory (0.9.0), a Transformers (4.45.0)-based LLM training framework. However, when I run my code, I encounter the following bug:
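
For context, this is roughly the kind of usage I am aiming for (the model path is a placeholder and the logprob handling is simplified; the distillation loss itself is omitted):

from vllm import LLM, SamplingParams

# Teacher model served by vLLM; per-token logprobs feed the student's distillation loss.
teacher = LLM(model="/path/to/teacher-model", tensor_parallel_size=2)
params = SamplingParams(temperature=0.8, top_p=0.95, logprobs=20)  # top-20 logprobs per generated token

outputs = teacher.generate(["Hello, my name is"], params)
for step in outputs[0].outputs[0].logprobs:
    # each step maps token_id -> Logprob(logprob=..., decoded_token=...)
    print({token_id: lp.logprob for token_id, lp in step.items()})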

ERROR-1
(VllmWorkerProcess pid=3624105) INFO 10-17 22:24:11 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]     torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]     torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]     raise RuntimeError(
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=3624105) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233] 
INFO 10-17 22:24:11 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=3624112) INFO 10-17 22:24:11 multiproc_worker_utils.py:218] Worker ready; awaiting tasks
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233] Exception in worker VllmWorkerProcess while processing method init_device: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method, Traceback (most recent call last):
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 226, in _run_worker_process
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]     torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]     torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233]     raise RuntimeError(
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=3624112) ERROR 10-17 22:24:11 multiproc_worker_utils.py:233] 
INFO 10-17 22:24:12 utils.py:993] Found nccl from library libnccl.so.2
INFO 10-17 22:24:12 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-17 22:24:12 utils.py:993] Found nccl from library libnccl.so.2
INFO 10-17 22:24:12 pynccl.py:63] vLLM is using nccl==2.20.5
[rank0]: Traceback (most recent call last):
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/tuner.py", line 61, in run_exp
[rank0]:     run_cwc(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/cwc/workflow.py", line 70, in run_cwc
[rank0]:     source_model = LLM(model=model_args.cwc_source_model_name_or_path, tensor_parallel_size=training_args.world_size)
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 214, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 564, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 176, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 451, in init_worker_distributed_environment
[rank0]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1059, in ensure_model_parallel_initialized
[rank0]:     initialize_model_parallel(tensor_model_parallel_size,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1023, in initialize_model_parallel
[rank0]:     _TP = init_model_parallel_group(group_ranks,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 864, in init_model_parallel_group
[rank0]:     return GroupCoordinator(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 214, in __init__
[rank0]:     self.pynccl_comm = PyNcclCommunicator(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
[rank0]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
[rank0]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 19, in launch
[rank1]:     run_exp()
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/tuner.py", line 61, in run_exp
[rank1]:     run_cwc(model_args, data_args, training_args, finetuning_args, callbacks)
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/cwc/workflow.py", line 70, in run_cwc
[rank1]:     source_model = LLM(model=model_args.cwc_source_model_name_or_path, tensor_parallel_size=training_args.world_size)
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 214, in __init__
[rank1]:     self.llm_engine = LLMEngine.from_engine_args(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 564, in from_engine_args
[rank1]:     engine = cls(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
[rank1]:     self.model_executor = executor_class(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
[rank1]:     super().__init__(*args, **kwargs)
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank1]:     self._init_executor()
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
[rank1]:     self._run_workers("init_device")
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
[rank1]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 176, in init_device
[rank1]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 451, in init_worker_distributed_environment
[rank1]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1059, in ensure_model_parallel_initialized
[rank1]:     initialize_model_parallel(tensor_model_parallel_size,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1023, in initialize_model_parallel
[rank1]:     _TP = init_model_parallel_group(group_ranks,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 864, in init_model_parallel_group
[rank1]:     return GroupCoordinator(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 214, in __init__
[rank1]:     self.pynccl_comm = PyNcclCommunicator(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
[rank1]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
[rank1]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank1]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank1]: RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
INFO 10-17 22:24:13 multiproc_worker_utils.py:124] Killing local vLLM worker processes
INFO 10-17 22:24:13 multiproc_worker_utils.py:124] Killing local vLLM worker processes

After debugging for hours and searching through existing issues, I realized that both LLaMA-Factory and Transformers call torch.cuda functions, such as torch.cuda.is_available(), before the vLLM model is initialized (source_model = LLM(model=model_args.cwc_source_model_name_or_path, tensor_parallel_size=training_args.world_size)), which triggers the bug. However, there are numerous calls to torch.cuda functions throughout the project, and I cannot remove all of them. :(
After reading this issue, I set export VLLM_WORKER_MULTIPROC_METHOD=spawn, but then ran into another bug.
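
For reference, I set it roughly like this (as far as I understand, the variable has to be set before vllm is imported in the parent process):

import os

# must be set before importing vllm so the worker processes are spawned instead of forked
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM  # imported only after the env var is in place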

ERROR-2
[rank1]: Traceback (most recent call last):
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 19, in launch
[rank1]:     run_exp()
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/tuner.py", line 61, in run_exp
[rank1]:     run_cwc(model_args, data_args, training_args, finetuning_args, callbacks)
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/cwc/workflow.py", line 70, in run_cwc
[rank1]:     source_model = LLM(model=model_args.cwc_source_model_name_or_path, tensor_parallel_size=training_args.world_size)
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 214, in __init__
[rank1]:     self.llm_engine = LLMEngine.from_engine_args(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 564, in from_engine_args
[rank1]:     engine = cls(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
[rank1]:     self.model_executor = executor_class(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
[rank1]:     super().__init__(*args, **kwargs)
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank1]:     self._init_executor()
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
[rank1]:     self._run_workers("init_device")
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
[rank1]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 176, in init_device
[rank1]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 451, in init_worker_distributed_environment
[rank1]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1059, in ensure_model_parallel_initialized
[rank1]:     initialize_model_parallel(tensor_model_parallel_size,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1023, in initialize_model_parallel
[rank1]:     _TP = init_model_parallel_group(group_ranks,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 864, in init_model_parallel_group
[rank1]:     return GroupCoordinator(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 214, in __init__
[rank1]:     self.pynccl_comm = PyNcclCommunicator(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
[rank1]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
[rank1]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank1]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank1]: RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/tuner.py", line 61, in run_exp
[rank0]:     run_cwc(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/cwc/workflow.py", line 70, in run_cwc
[rank0]:     source_model = LLM(model=model_args.cwc_source_model_name_or_path, tensor_parallel_size=training_args.world_size)
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 214, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 564, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 176, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 451, in init_worker_distributed_environment
[rank0]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1059, in ensure_model_parallel_initialized
[rank0]:     initialize_model_parallel(tensor_model_parallel_size,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1023, in initialize_model_parallel
[rank0]:     _TP = init_model_parallel_group(group_ranks,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 864, in init_model_parallel_group
[rank0]:     return GroupCoordinator(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 214, in __init__
[rank0]:     self.pynccl_comm = PyNcclCommunicator(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
[rank0]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
[rank0]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)

Then I checked my driver and hardware using the Python script provided at https://docs.vllm.ai/en/latest/getting_started/debugging.html. The report is below; everything seems to be okay.

driver & hardware checking report
CUDA_VISIBLE_DEVICES=6,7 NCCL_DEBUG=TRACE torchrun --nproc-per-node=2 test_vllm_env.py
W1017 23:04:55.958511 23456244184000 torch/distributed/run.py:779] 
W1017 23:04:55.958511 23456244184000 torch/distributed/run.py:779] *****************************************
W1017 23:04:55.958511 23456244184000 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1017 23:04:55.958511 23456244184000 torch/distributed/run.py:779] *****************************************
node05:3662915:3662915 [0] NCCL INFO Bootstrap : Using ibs110:192.168.99.105<0>
node05:3662915:3662915 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
node05:3662915:3662915 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.20.5+cuda12.4
node05:3662915:3663002 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [RO]; OOB ibs110:192.168.99.105<0>
node05:3662915:3663002 [0] NCCL INFO Using non-device net plugin version 0
node05:3662915:3663002 [0] NCCL INFO Using network IB
node05:3662916:3662916 [1] NCCL INFO cudaDriverVersion 12010
node05:3662916:3662916 [1] NCCL INFO Bootstrap : Using ibs110:192.168.99.105<0>
node05:3662916:3662916 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
node05:3662916:3663010 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [RO]; OOB ibs110:192.168.99.105<0>
node05:3662916:3663010 [1] NCCL INFO Using non-device net plugin version 0
node05:3662916:3663010 [1] NCCL INFO Using network IB
node05:3662915:3663002 [0] NCCL INFO comm 0x5555b6a34880 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId 9c000 commId 0x194370b6111a740b - Init START
node05:3662916:3663010 [1] NCCL INFO comm 0x5555b6a32e10 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId 9e000 commId 0x194370b6111a740b - Init START
node05:3662915:3663002 [0] NCCL INFO Setting affinity for GPU 6 to ffff,00000000,0000ffff,00000000
node05:3662916:3663010 [1] NCCL INFO Setting affinity for GPU 7 to ffff,00000000,0000ffff,00000000
node05:3662916:3663010 [1] NCCL INFO comm 0x5555b6a32e10 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
node05:3662916:3663010 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
node05:3662916:3663010 [1] NCCL INFO P2P Chunksize set to 131072
node05:3662915:3663002 [0] NCCL INFO comm 0x5555b6a34880 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
node05:3662915:3663002 [0] NCCL INFO Channel 00/04 :    0   1
node05:3662915:3663002 [0] NCCL INFO Channel 01/04 :    0   1
node05:3662915:3663002 [0] NCCL INFO Channel 02/04 :    0   1
node05:3662915:3663002 [0] NCCL INFO Channel 03/04 :    0   1
node05:3662915:3663002 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
node05:3662915:3663002 [0] NCCL INFO P2P Chunksize set to 131072
node05:3662916:3663010 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM
node05:3662915:3663002 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM
node05:3662916:3663010 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM
node05:3662915:3663002 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM
node05:3662916:3663010 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM
node05:3662915:3663002 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM
node05:3662916:3663010 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM
node05:3662915:3663002 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM
node05:3662916:3663010 [1] NCCL INFO Connected all rings
node05:3662916:3663010 [1] NCCL INFO Connected all trees
node05:3662916:3663010 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
node05:3662916:3663010 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
node05:3662915:3663002 [0] NCCL INFO Connected all rings
node05:3662915:3663002 [0] NCCL INFO Connected all trees
node05:3662915:3663002 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
node05:3662915:3663002 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
node05:3662916:3663010 [1] NCCL INFO comm 0x5555b6a32e10 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId 9e000 commId 0x194370b6111a740b - Init COMPLETE
node05:3662915:3663002 [0] NCCL INFO comm 0x5555b6a34880 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId 9c000 commId 0x194370b6111a740b - Init COMPLETE
PyTorch NCCL is successful!
PyTorch NCCL is successful!
PyTorch GLOO is successful!
PyTorch GLOO is successful!
/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO 10-17 23:05:03 utils.py:993] Found nccl from library libnccl.so.2
INFO 10-17 23:05:03 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-17 23:05:03 utils.py:993] Found nccl from library libnccl.so.2
INFO 10-17 23:05:03 pynccl.py:63] vLLM is using nccl==2.20.5
node05:3662916:3662916 [1] NCCL INFO Using non-device net plugin version 0
node05:3662916:3662916 [1] NCCL INFO Using network IB
node05:3662915:3662915 [0] NCCL INFO Using non-device net plugin version 0
node05:3662915:3662915 [0] NCCL INFO Using network IB
node05:3662915:3662915 [0] NCCL INFO comm 0x5555cbb18260 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId 9c000 commId 0x34dae9aadb417959 - Init START
node05:3662916:3662916 [1] NCCL INFO comm 0x5555cbb17600 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId 9e000 commId 0x34dae9aadb417959 - Init START
node05:3662915:3662915 [0] NCCL INFO Setting affinity for GPU 6 to ffff,00000000,0000ffff,00000000
node05:3662916:3662916 [1] NCCL INFO Setting affinity for GPU 7 to ffff,00000000,0000ffff,00000000
node05:3662916:3662916 [1] NCCL INFO comm 0x5555cbb17600 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
node05:3662916:3662916 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
node05:3662916:3662916 [1] NCCL INFO P2P Chunksize set to 131072
node05:3662915:3662915 [0] NCCL INFO comm 0x5555cbb18260 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
node05:3662915:3662915 [0] NCCL INFO Channel 00/04 :    0   1
node05:3662915:3662915 [0] NCCL INFO Channel 01/04 :    0   1
node05:3662915:3662915 [0] NCCL INFO Channel 02/04 :    0   1
node05:3662915:3662915 [0] NCCL INFO Channel 03/04 :    0   1
node05:3662915:3662915 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
node05:3662915:3662915 [0] NCCL INFO P2P Chunksize set to 131072
node05:3662916:3662916 [1] NCCL INFO Channel 00/0 : 1[7] -> 0[6] via P2P/CUMEM
node05:3662916:3662916 [1] NCCL INFO Channel 01/0 : 1[7] -> 0[6] via P2P/CUMEM
node05:3662916:3662916 [1] NCCL INFO Channel 02/0 : 1[7] -> 0[6] via P2P/CUMEM
node05:3662916:3662916 [1] NCCL INFO Channel 03/0 : 1[7] -> 0[6] via P2P/CUMEM
node05:3662915:3662915 [0] NCCL INFO Channel 00/0 : 0[6] -> 1[7] via P2P/CUMEM
node05:3662915:3662915 [0] NCCL INFO Channel 01/0 : 0[6] -> 1[7] via P2P/CUMEM
node05:3662915:3662915 [0] NCCL INFO Channel 02/0 : 0[6] -> 1[7] via P2P/CUMEM
node05:3662915:3662915 [0] NCCL INFO Channel 03/0 : 0[6] -> 1[7] via P2P/CUMEM
node05:3662916:3662916 [1] NCCL INFO Connected all rings
node05:3662916:3662916 [1] NCCL INFO Connected all trees
node05:3662915:3662915 [0] NCCL INFO Connected all rings
node05:3662916:3662916 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
node05:3662916:3662916 [1] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
node05:3662915:3662915 [0] NCCL INFO Connected all trees
node05:3662915:3662915 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
node05:3662915:3662915 [0] NCCL INFO 4 coll channels, 0 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
node05:3662915:3662915 [0] NCCL INFO comm 0x5555cbb18260 rank 0 nranks 2 cudaDev 0 nvmlDev 6 busId 9c000 commId 0x34dae9aadb417959 - Init COMPLETE
node05:3662916:3662916 [1] NCCL INFO comm 0x5555cbb17600 rank 1 nranks 2 cudaDev 1 nvmlDev 7 busId 9e000 commId 0x34dae9aadb417959 - Init COMPLETE
vLLM NCCL is successful!
vLLM NCCL is successful!
vLLM NCCL with cuda graph is successful!
node05:3662915:3663042 [0] NCCL INFO [Service thread] Connection closed by localRank 0
vLLM NCCL with cuda graph is successful!
node05:3662916:3663040 [1] NCCL INFO [Service thread] Connection closed by localRank 1
node05:3662915:3663083 [0] NCCL INFO comm 0x5555b6a34880 rank 0 nranks 2 cudaDev 0 busId 9c000 - Abort COMPLETE
node05:3662916:3663084 [1] NCCL INFO comm 0x5555b6a32e10 rank 1 nranks 2 cudaDev 1 busId 9e000 - Abort COMPLETE
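
For reference, the torch-only part of that check boils down to roughly the following; this is a simplified sketch, not the exact script from the docs:

# run with: torchrun --nproc-per-node=2 sanity_check.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# NCCL all-reduce on GPU
data = torch.ones(1, device=f"cuda:{local_rank}")
dist.all_reduce(data)
assert data.item() == dist.get_world_size()
print("PyTorch NCCL is successful!")

# Gloo all-reduce on CPU
gloo_group = dist.new_group(backend="gloo")
cpu_data = torch.ones(1)
dist.all_reduce(cpu_data, group=gloo_group)
print("PyTorch GLOO is successful!")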

Is there any way to fix this bug? @DarkLight1337 @youkaichao

Before submitting a new issue...

  • Make sure you have already searched for relevant issues and asked the chatbot at the bottom right corner of the documentation page, which can answer many frequently asked questions.
@takagi97 takagi97 added the bug Something isn't working label Oct 17, 2024

mgoin commented Oct 17, 2024

Can you try running vLLM with VLLM_WORKER_MULTIPROC_METHOD=spawn set?

russellb (Member) commented:

I'm also interested in whether 0.6.3 behaves better for you. It includes this change: #8823

With that change, vLLM should automatically use spawn once it detects that CUDA was already initialized.
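
A quick way to check whether that is what you are hitting is to see if CUDA is already initialized in the parent process right before the LLM(...) call, for example:

import torch

# if this prints True before LLM(...) is constructed, fork-based vLLM workers cannot be used
print("CUDA initialized:", torch.cuda.is_initialized())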

takagi97 (Author) commented:

Can you try running vLLM with VLLM_WORKER_MULTIPROC_METHOD=spawn set?

I tried this, but got an NCCL error:

NCCL bug
[rank1]: Traceback (most recent call last):
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 19, in launch
[rank1]:     run_exp()
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/tuner.py", line 61, in run_exp
[rank1]:     run_cwc(model_args, data_args, training_args, finetuning_args, callbacks)
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/cwc/workflow.py", line 70, in run_cwc
[rank1]:     source_model = LLM(model=model_args.cwc_source_model_name_or_path, tensor_parallel_size=training_args.world_size)
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 214, in __init__
[rank1]:     self.llm_engine = LLMEngine.from_engine_args(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 564, in from_engine_args
[rank1]:     engine = cls(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
[rank1]:     self.model_executor = executor_class(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
[rank1]:     super().__init__(*args, **kwargs)
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank1]:     self._init_executor()
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
[rank1]:     self._run_workers("init_device")
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
[rank1]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 176, in init_device
[rank1]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 451, in init_worker_distributed_environment
[rank1]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1059, in ensure_model_parallel_initialized
[rank1]:     initialize_model_parallel(tensor_model_parallel_size,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1023, in initialize_model_parallel
[rank1]:     _TP = init_model_parallel_group(group_ranks,
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 864, in init_model_parallel_group
[rank1]:     return GroupCoordinator(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 214, in __init__
[rank1]:     self.pynccl_comm = PyNcclCommunicator(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
[rank1]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
[rank1]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank1]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank1]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank1]: RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)
[rank0]: Traceback (most recent call last):
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 23, in <module>
[rank0]:     launch()
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 19, in launch
[rank0]:     run_exp()
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/tuner.py", line 61, in run_exp
[rank0]:     run_cwc(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/train/cwc/workflow.py", line 70, in run_cwc
[rank0]:     source_model = LLM(model=model_args.cwc_source_model_name_or_path, tensor_parallel_size=training_args.world_size)
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 214, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 564, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 325, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 110, in _init_executor
[rank0]:     self._run_workers("init_device")
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 176, in init_device
[rank0]:     init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 451, in init_worker_distributed_environment
[rank0]:     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1059, in ensure_model_parallel_initialized
[rank0]:     initialize_model_parallel(tensor_model_parallel_size,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 1023, in initialize_model_parallel
[rank0]:     _TP = init_model_parallel_group(group_ranks,
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 864, in init_model_parallel_group
[rank0]:     return GroupCoordinator(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 214, in __init__
[rank0]:     self.pynccl_comm = PyNcclCommunicator(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 89, in __init__
[rank0]:     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 244, in ncclCommInitRank
[rank0]:     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
[rank0]:   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 223, in NCCL_CHECK
[rank0]:     raise RuntimeError(f"NCCL error: {error_str}")
[rank0]: RuntimeError: NCCL error: invalid usage (run with NCCL_DEBUG=WARN for details)

russellb (Member) commented:

error: invalid usage (run with NCCL_DEBUG=WARN for details)

What do you get when you set NCCL_DEBUG=WARN?

takagi97 (Author) commented:

export VLLM_WORKER_MULTIPROC_METHOD=spawn

I tried vLLM 0.6.3.post2.dev12+g1ffc8a73 with the test demo below; however, there is still the spawn-related error.

demo code:


import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

from transformers.utils import (
    is_torch_bf16_gpu_available,
    is_torch_cuda_available,
    is_torch_mps_available,
    is_torch_npu_available,
    is_torch_xpu_available,
)
is_torch_cuda_available()  # calling this before constructing the vLLM LLM is what appears to trigger the fork issue

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "Hello, my name is",
    "Hello, my name is"
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# tensor_parallel_size=2 makes vLLM start worker processes, which is where the fork/spawn issue appears
llm = LLM(model="/localnvme/application/sc_new/muyongyu/original_weight/llama-3-8b", tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

ERROR:

(VllmWorkerProcess pid=2731824) INFO 10-18 16:04:07 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229] Exception in worker VllmWorkerProcess while processing method init_device.
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229] Traceback (most recent call last):
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/vllm/worker/worker.py", line 166, in init_device
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229]     torch.cuda.set_device(self.device)
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/torch/cuda/__init__.py", line 420, in set_device
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229]     torch._C._cuda_setDevice(device)
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229]   File "/localnvme/application/sc_new/miniconda3/envs/cwc/lib/python3.10/site-packages/torch/cuda/__init__.py", line 300, in _lazy_init
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229]     raise RuntimeError(
(VllmWorkerProcess pid=2731824) ERROR 10-18 16:04:07 multiproc_worker_utils.py:229] RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

youkaichao (Member) commented:

One solution is to use vLLM's OpenAI API server; then you don't need to worry about this problem.
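
For example, start the server in its own process (something like vllm serve /path/to/model --tensor-parallel-size 2 --port 8000) and query it from the training code. A rough sketch with the openai client, where the model path and port are placeholders:

# rough sketch; assumes a vLLM OpenAI-compatible server is already running on port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="/path/to/model",
    prompt="Hello, my name is",
    max_tokens=16,
    logprobs=5,  # per-token logprobs, if needed for distillation
)
print(resp.choices[0].text)
print(resp.choices[0].logprobs)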

Can you try running vLLM with VLLM_WORKER_MULTIPROC_METHOD=spawn set?

I tried this, but got an NCCL error:

@takagi97 you didn't provide the detailed NCCL log with export VLLM_WORKER_MULTIPROC_METHOD=spawn; export NCCL_DEBUG=TRACE


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Jan 22, 2025
songwang41 commented:

I also encountered this issue. Is any solution available?

@github-actions github-actions bot added unstale and removed stale labels Jan 26, 2025
DarkLight1337 (Member) commented:

See #12084
