bugs in trained save and evaluation for DPO training with deepspeed_zero3 #1584

Closed · yananchen1989 opened this issue Apr 24, 2024 · 4 comments


yananchen1989 commented Apr 24, 2024

I followed the usage guidance in https://huggingface.co/docs/trl/v0.8.5/en/customization for multi-GPU DPO training.
Here is the YAML config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 4
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
main_process_port: 29525
gpu_ids: 0,1,2,3,4,5,6,7

Launch script:

accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/dpo.py \
    --attn_implementation 'flash_attention_2' \
    --model_name_or_path="mistralai/Mistral-7B-Instruct-v0.2" \
    --per_device_train_batch_size 1 \
    --learning_rate 4e-4 \
    --gradient_accumulation_steps 4 \
    --logging_steps 1 \
    --output_dir="dpo_tp_mistral" \
    --optim rmsprop \
    --warmup_steps 150 \
    --report_to 'none' \
    --bf16 \
    --logging_first_step \
    --max_steps=-1 \
    --gradient_checkpointing \
    --no_remove_unused_columns \
    --do_eval False \
    --evaluation_strategy 'no' \
    --max_prompt_length 18000 \
    --max_length 22000 \
    --num_train_epochs 4 \
    --save_strategy "epoch" \
    --torch_dtype 'bfloat16' \
    --bf16_full_eval True \
    --logging_strategy "epoch"

My script works well with SFT, but not with DPO.
The training itself seems fine, but as soon as I enable evaluation (--do_eval True and --evaluation_strategy 'epoch'), it crashes.
If I remove the evaluation instead, training proceeds normally, but it then crashes at the point where trainer.save_model is called.
The error log is the same in both situations.
Any help with solving this would be appreciated, thanks.
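
For context, I have left the collective timeout at its default; the 1800000 ms in the log below is exactly that default. A hedged sketch (my assumption, not verified as a fix) of raising it through the ddp_timeout field of transformers.TrainingArguments would be:

# Sketch only (assumed mitigation, untested here): raise the process-group timeout so a
# slow ZeRO-3 parameter gather during evaluation or saving does not trip the NCCL watchdog.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dpo_tp_mistral",
    ddp_timeout=7200,  # seconds; the default of 1800 s matches the 1800000 ms timeout in the log
)

Equivalently, --ddp_timeout 7200 could be appended to the accelerate launch command above, since the example script parses TrainingArguments fields as CLI flags.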

[rank5]:[E ProcessGroupNCCL.cpp:523] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800114 milliseconds before timing out.
[rank4]:[E ProcessGroupNCCL.cpp:523] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800094 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800236 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800437 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800629 milliseconds before timing out.
[rank4]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E ProcessGroupNCCL.cpp:1182] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800094 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f9f187d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f3fa032f6e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f3fa0332c3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f3fa0333839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7f3ff31a4df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7f3ffbd2d609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3ffbaf8353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800094 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f9f187d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f3fa032f6e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f3fa0332c3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f3fa0333839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7f3ff31a4df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7f3ffbd2d609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f3ffbaf8353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f3f9f187d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7f3fa0089b11 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6df4 (0x7f3ff31a4df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x8609 (0x7f3ffbd2d609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f3ffbaf8353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800979 milliseconds before timing out.
[rank5]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E ProcessGroupNCCL.cpp:1182] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800114 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f309feb8d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f30a10606e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f30a1063c3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f30a1064839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7f30f3ed5df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7f30fca5e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f30fc829353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800114 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f309feb8d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f30a10606e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f30a1063c3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f30a1064839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7f30f3ed5df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7f30fca5e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f30fc829353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f309feb8d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7f30a0dbab11 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6df4 (0x7f30f3ed5df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x8609 (0x7f30fca5e609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f30fc829353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800437 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5049fafd87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f504b1576e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f504b15ac3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f504b15b839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7f509dfccdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7f50a6b55609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f50a6920353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800437 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5049fafd87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f504b1576e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f504b15ac3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f504b15b839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7f509dfccdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7f50a6b55609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f50a6920353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5049fafd87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7f504aeb1b11 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6df4 (0x7f509dfccdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x8609 (0x7f50a6b55609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f50a6920353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800236 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbcada96d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fbcaec3e6e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fbcaec41c3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fbcaec42839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7fbd01ab4df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7fbd0a63d609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fbd0a408353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800236 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbcada96d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fbcaec3e6e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fbcaec41c3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fbcaec42839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7fbd01ab4df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7fbd0a63d609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fbd0a408353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbcada96d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7fbcae998b11 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6df4 (0x7fbd01ab4df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x8609 (0x7fbd0a63d609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fbd0a408353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800629 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb984365d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fb98550d6e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fb985510c3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fb985511839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7fb9d8382df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7fb9e0f0b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb9e0cd6353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800629 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb984365d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fb98550d6e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fb985510c3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fb985511839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7fb9d8382df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7fb9e0f0b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb9e0cd6353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb984365d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7fb985267b11 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6df4 (0x7fb9d8382df4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x8609 (0x7fb9e0f0b609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fb9e0cd6353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800979 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f114b911d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f114cab96e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f114cabcc3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f114cabd839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7f119f92edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7f11a84b7609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f11a8282353 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=98947, OpType=_ALLGATHER_BASE, NumelIn=16384000, NumelOut=131072000, Timeout(ms)=1800000) ran for 1800979 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f114b911d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f114cab96e6 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f114cabcc3d in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f114cabd839 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd6df4 (0x7f119f92edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x8609 (0x7f11a84b7609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f11a8282353 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f114b911d87 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7f114c813b11 in /home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd6df4 (0x7f119f92edf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x8609 (0x7f11a84b7609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f11a8282353 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-04-24 17:49:59,412] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242308 closing signal SIGTERM
[2024-04-24 17:49:59,412] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242309 closing signal SIGTERM
[2024-04-24 17:49:59,412] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242310 closing signal SIGTERM
[2024-04-24 17:49:59,413] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242311 closing signal SIGTERM
[2024-04-24 17:49:59,413] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242314 closing signal SIGTERM
[2024-04-24 17:49:59,414] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 242315 closing signal SIGTERM
[2024-04-24 17:50:03,391] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 4 (pid: 242312) of binary: /home/chenyanan/anaconda3/envs/mp/bin/python
Traceback (most recent call last):
  File "/home/chenyanan/anaconda3/envs/mp/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1060, in launch_command
    deepspeed_launcher(args)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/accelerate/commands/launch.py", line 764, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chenyanan/anaconda3/envs/mp/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/chenyanan/trl/examples/scripts/dpo_tp.py FAILED

Failures:
[1]:
time : 2024-04-24_17:49:59
host : A40-36-111-143-5
rank : 5 (local_rank: 5)
exitcode : -6 (pid: 242313)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 242313

Root Cause (first observed failure):
[0]:
time : 2024-04-24_17:49:59
host : A40-36-111-143-5
rank : 4 (local_rank: 4)
exitcode : -6 (pid: 242312)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 242312

@yananchen1989 (Author)

My model-saving code:

@yananchen1989 (Author)

    # save_context is defined earlier in the TRL example script
    # (a no-op nullcontext unless TRL_USE_RICH is enabled)
    with save_context:
        trainer.model.generation_config = transformers.GenerationConfig(temperature=None, top_p=None)
        trainer.save_model(training_args.output_dir)
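
Worth noting for anyone reading along: under ZeRO-3, trainer.save_model gathers the sharded parameters (the _ALLGATHER_BASE op in the log), so it is a collective call that every rank has to enter. A minimal sketch of what I mean, assuming the standard accelerate-backed Trainer API (not presented as the fix):

# Sketch: the save is a ZeRO-3 collective, so it must run on ALL ranks.
# Wrapping it in `if trainer.accelerator.is_main_process:` would leave the other
# ranks out of the all-gather and produce exactly this kind of watchdog timeout.
trainer.accelerator.wait_for_everyone()       # sync ranks before gathering shards
trainer.save_model(training_args.output_dir)  # called unconditionally on every rank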

@yananchen1989 (Author)

A related post: #1121. Is there any place in the DPO training script where I can change this?

@renmengjie7

Hello,

When I try offload with ZeRO-3 for DPO, I get this error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I see that you trained successfully. What is your trl version, please?
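
In case it helps to compare environments, a trivial snippet for dumping the relevant versions (the package list is my guess at what matters here):

# Print the versions of the libraries involved in this setup.
import accelerate
import deepspeed
import torch
import transformers
import trl

for mod in (torch, transformers, accelerate, deepspeed, trl):
    print(f"{mod.__name__}=={mod.__version__}")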
