[BUG] terminate called after throwing an instance of 'std::bad_alloc' #3126
Comments
Please see the recent DeepSpeed Chat release: #3186
@shisi-cc, did the link above help? Can this issue be closed? Thanks!
Hi, I have encountered the same issue. I created a Docker container on two different machines and ran DeepSpeed-Chat/training/step1_supervised_finetuning/muti_node/run_66b.sh, but I hit the same error (my hostfile, ds_report output, and launcher context are attached). Both of my nodes can communicate with each other, and they are running inside Docker containers. Have you found a solution to this issue yet?
Any updates? I encountered the same problem when fine-tuning Whisper with DeepSpeed on multiple nodes.
Describe the bug
When I run the RLHF code with trlx using DeepSpeed on two nodes, I hit a strange error: "terminate called after throwing an instance of 'std::bad_alloc'". Neither system memory nor GPU memory is anywhere near exhausted. Running on a single machine works fine, but the error occurs as soon as two nodes are used. The problem appears when I run inside a Docker container, but not when I run without a container. I am also using an Anaconda environment.
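For reference, the distributed setup that a script like train_gptj_summarize.py typically performs on each node looks roughly like the sketch below (illustrative only; the actual script comes from the trlx examples). The DeepSpeed launcher starts one process per listed GPU on every host and injects --local_rank, and the script then initializes NCCL across the nodes, which is the comm.py init_distributed line visible in the log.

```python
# Minimal sketch (illustrative, not the actual trlx script) of the distributed
# setup a DeepSpeed training script performs on each node; the launcher passes
# --local_rank on the command line and MASTER_ADDR/MASTER_PORT via the environment.
import argparse

import deepspeed
import torch


def main():
    parser = argparse.ArgumentParser()
    # Injected by deepspeed.launcher.launch on every worker process.
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()

    # Sets up torch.distributed with the NCCL backend across the hostfile nodes;
    # this corresponds to the comm.py init_distributed line in the log below.
    deepspeed.init_distributed(dist_backend="nccl")

    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()
    torch.cuda.set_device(args.local_rank)
    print(f"rank {rank}/{world_size} ready on {torch.cuda.get_device_name(args.local_rank)}")


if __name__ == "__main__":
    main()
```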
Log output
(trlx_env) root@9a3cd98dd64f:/data/work/trlx_rlhf/sft# deepspeed --hostfile=../../hostfile train_gptj_summarize.py
[2023-04-03 10:49:33,397] [INFO] [runner.py:454:main] Using IP address of 10.0.128.5 for node localhost
[2023-04-03 10:49:33,398] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: localhost,deepspeed-18
[2023-04-03 10:49:33,398] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w localhost,deepspeed-18 export PYTHONPATH=/data/work/trlx_rlhf/sft; cd /data/work/trlx_rlhf/sft; /root/mambaforge/envs/trlx_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF0sICJkZWVwc3BlZWQtMTgiOiBbMF19 --node_rank=%n --master_addr=10.0.128.5 --master_port=29500 train_gptj_summarize.py
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:162:main] dist_world_size=2
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:162:main] dist_world_size=2
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
deepspeed-18: Tokenizer loaded!
localhost: Tokenizer loaded!
deepspeed-18: Model loaded!
deepspeed-18: Downloading and preparing dataset parquet/openai_summarize_tldr to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
localhost: Model loaded!
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Dataset loaded!
localhost: [2023-04-03 10:50:46,311] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 10941.66it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 1896.44it/s]
deepspeed-18: Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
deepspeed-18: Dataset loaded!
Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<00:00, 26.9kB/s]
deepspeed-18: terminate called after throwing an instance of 'std::bad_alloc'
deepspeed-18: what(): std::bad_alloc
deepspeed-18: [2023-04-03 10:51:15,307] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1231493
deepspeed-18: [2023-04-03 10:51:15,308] [ERROR] [launch.py:324:sigkill_handler] ['/root/mambaforge/envs/trlx_env/bin/python', '-u', 'train_gptj_summarize.py', '--local_rank=0'] exits with return code = -6
pdsh@9a3cd98dd64f: deepspeed-18: ssh exited with exit code 250
Hostfile
localhost slots=1
deepspeed-18 slots=1
Launcher context
deepspeed --hostfile=../../hostfile train_gptj_summarize.py
Docker context
The problem occurs when I run inside a Docker container, but not when I run without a container.
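To narrow down what differs between the containerized and bare-metal runs, a minimal diagnostic sketch like the following (illustrative; assumes only PyTorch is installed) can be run on each node in both environments to compare hostname resolution, the NCCL build, and NCCL-related environment variables:

```python
# Minimal diagnostic sketch: print the pieces of the environment that commonly
# differ between a bare-metal run and a containerized run of a multi-node job.
# Illustrative only; not part of the original training script.
import os
import socket

import torch


def main():
    print("hostname:", socket.gethostname())
    print("resolved ip:", socket.gethostbyname(socket.gethostname()))
    print("torch:", torch.__version__, "cuda:", torch.version.cuda)
    print("nccl version:", torch.cuda.nccl.version())
    # NCCL picks a network interface automatically unless told otherwise;
    # inside Docker the default interface can differ from the host's.
    for var in ("NCCL_SOCKET_IFNAME", "NCCL_IB_DISABLE", "NCCL_DEBUG",
                "MASTER_ADDR", "MASTER_PORT"):
        print(var, "=", os.environ.get(var))


if __name__ == "__main__":
    main()
```

Differences in the resolved IP or in the default network interface inside the container are a common reason multi-node NCCL jobs behave differently there than on the bare host.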