
[BUG] terminate called after throwing an instance of 'std::bad_alloc' #3126

Open · shisi-cc opened this issue Apr 3, 2023 · 4 comments
Labels: deepspeed-chat (Related to DeepSpeed-Chat)

shisi-cc commented Apr 3, 2023

Describe the bug
When I run the RLHF code with trlx using DeepSpeed across two nodes, I hit a strange error: "terminate called after throwing an instance of 'std::bad_alloc'". Neither system memory nor GPU memory is anywhere near exhausted. Running on a single machine works fine, but the error occurs as soon as two nodes are used. The problem appears when I run inside a Docker container, but not when I run without a container. I am also using an Anaconda environment.

Launcher output
(trlx_env) root@9a3cd98dd64f:/data/work/trlx_rlhf/sft# deepspeed --hostfile=../../hostfile train_gptj_summarize.py
[2023-04-03 10:49:33,397] [INFO] [runner.py:454:main] Using IP address of 10.0.128.5 for node localhost
[2023-04-03 10:49:33,398] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: localhost,deepspeed-18
[2023-04-03 10:49:33,398] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w localhost,deepspeed-18 export PYTHONPATH=/data/work/trlx_rlhf/sft; cd /data/work/trlx_rlhf/sft; /root/mambaforge/envs/trlx_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF0sICJkZWVwc3BlZWQtMTgiOiBbMF19 --node_rank=%n --master_addr=10.0.128.5 --master_port=29500 train_gptj_summarize.py
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:162:main] dist_world_size=2
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:162:main] dist_world_size=2
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
deepspeed-18: Tokenizer loaded!
localhost: Tokenizer loaded!
deepspeed-18: Model loaded!
deepspeed-18: Downloading and preparing dataset parquet/openai_summarize_tldr to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
localhost: Model loaded!
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Dataset loaded!
localhost: [2023-04-03 10:50:46,311] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 10941.66it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 1896.44it/s]
deepspeed-18: Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
deepspeed-18: Dataset loaded!
Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<00:00, 26.9kB/s]
deepspeed-18: terminate called after throwing an instance of 'std::bad_alloc'
deepspeed-18: what(): std::bad_alloc
deepspeed-18: [2023-04-03 10:51:15,307] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1231493
deepspeed-18: [2023-04-03 10:51:15,308] [ERROR] [launch.py:324:sigkill_handler] ['/root/mambaforge/envs/trlx_env/bin/python', '-u', 'train_gptj_summarize.py', '--local_rank=0'] exits with return code = -6
pdsh@9a3cd98dd64f: deepspeed-18: ssh exited with exit code 250

Hostfile
localhost slots=1
deepspeed-18 slots=1

Launcher context
deepspeed --hostfile=../../hostfile train_gptj_summarize.py

Docker context
The problem occurs when I run inside a Docker container, but not when I run without a container.
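
For reference, a sketch rather than a fix confirmed in this thread: a std::bad_alloc at communicator setup inside a container, while the same run works on bare metal, often points at the container's limited /dev/shm or at container networking blocking inter-node traffic. A container launch along the following lines usually avoids both; the image name and mount path are placeholders, not taken from this issue.

# Sketch only: container flags that commonly matter for multi-node DeepSpeed/NCCL.
docker run -d \
  --gpus all \
  --network=host \
  --ipc=host \
  -v /data/work:/data/work \
  my-deepspeed-image:latest \
  /usr/sbin/sshd -D
# --network=host lets ranks on other hosts reach master_addr:master_port directly.
# --ipc=host shares the host's /dev/shm; alternatively enlarge it with e.g. --shm-size=16g.
# sshd runs in the foreground because the deepspeed/pdsh launcher connects to each node over ssh.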

@shisi-cc shisi-cc added bug Something isn't working training labels Apr 3, 2023
tjruwase (Contributor) commented Apr 12, 2023

Please see the recent DeepSpeed Chat release (#3186):
https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat

tjruwase (Contributor) commented:

@shisi-cc, did the link above help? Can this issue be closed? Thanks!

Ancrilin commented:

Hi, I have encountered the same issue. I created a Docker container on each of two different machines and ran DeepSpeed-Chat/training/step1_supervised_finetuning/muti_node/run_66b.sh, but I hit the same error.

hostfile
node1 slots=8
node2 slots=8

Launcher output
[2023-04-26 06:34:22,975] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: node1,node2
[2023-04-26 06:34:22,975] [INFO] [runner.py:540:main] cmd = pdsh -S -f 1024 -w node1,node2 export NCCL_VERSION=2.12.10-1; export PYTHONPATH=/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning; cd /workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning; /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJub2RlMSI6IFszLCA1XSwgIm5vZGUyIjogWzAsIDFdfQ== --node_rank=%n --master_addr=10.176.50.36 --master_port=32783 main.py --data_path 'Dahoas/rm-static' --data_split '2,4,4' --model_name_or_path '/workspace/models/opt-1.3b' --per_device_train_batch_size '1' --per_device_eval_batch_size '1' --max_seq_len '512' --learning_rate '9.65e-6' --weight_decay '0.1' --num_train_epochs '2' --gradient_accumulation_steps '1' --lr_scheduler_type 'cosine' --num_warmup_steps '0' --seed '1234' --zero_stage '3' --deepspeed --output_dir './output'
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.12.10-1
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:229:main] WORLD INFO DICT: {'node1': [3, 5], 'node2': [0, 1]}
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:235:main] nnodes=2, num_local_procs=2, node_rank=0
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'node1': [0, 1], 'node2': [2, 3]})
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:247:main] dist_world_size=4
node1: [2023-04-26 06:34:26,299] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=3,5
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:222:main] 1 NCCL_VERSION=2.12.10-1
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:229:main] WORLD INFO DICT: {'node1': [3, 5], 'node2': [0, 1]}
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:235:main] nnodes=2, num_local_procs=2, node_rank=1
node2: [2023-04-26 06:34:28,198] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'node1': [0, 1], 'node2': [2, 3]})
node2: [2023-04-26 06:34:28,199] [INFO] [launch.py:247:main] dist_world_size=4
node2: [2023-04-26 06:34:28,199] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1
node1: [2023-04-26 06:34:30,128] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
node1: Traceback (most recent call last):
node1: File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 343, in
node1: main()
node1: File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 185, in main
node1: deepspeed.init_distributed()
node1: File "/opt/conda/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 588, in init_distributed
node1: cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
node1: File "/opt/conda/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 32, in init
node1: self.init_process_group(backend, timeout, init_method, rank, world_size)
node1: File "/opt/conda/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 58, in init_process_group
node1: torch.distributed.init_process_group(backend,
node1: File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
node1: store, rank, world_size = next(rendezvous_iterator)
node1: File "/opt/conda/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
node1: store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
node1: File "/opt/conda/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
node1: return TCPStore(
node1: RuntimeError: Stop_waiting response is expected
node1: terminate called after throwing an instance of 'std::bad_alloc'
node1: what(): std::bad_alloc
node1: [2023-04-26 06:34:31,325] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1846
node1: [2023-04-26 06:34:31,328] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1847
node1: [2023-04-26 06:34:31,328] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--model_name_or_path', '/workspace/models/opt-1.3b', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '3', '--deepspeed', '--output_dir', './output'] exits with return code = -6
pdsh@4e68f64d7185: node1: ssh exited with exit code 250
node2: terminate called after throwing an instance of 'std::bad_alloc'
node2: what(): std::bad_alloc
node2: terminate called after throwing an instance of 'std::bad_alloc'
node2: what(): std::bad_alloc
node2: [2023-04-26 06:34:37,245] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1395
node2: [2023-04-26 06:34:37,247] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1396
node2: [2023-04-26 06:34:37,247] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=1', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--model_name_or_path', '/workspace/models/opt-1.3b', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '3', '--deepspeed', '--output_dir', './output'] exits with return code = -6
pdsh@4e68f64d7185: node2: ssh exited with exit code 250

Launcher context
The container's shm size (ShmSize) is 10G.
deepspeed --hostfile=hostfile --master_port xxx --master_addr xxx main.py ....

Both of my nodes can communicate with each other, and both are running inside Docker containers. Have you found a solution to this issue yet?
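
For reference, another sketch that is not from this thread: the node1 traceback fails inside TCPStore during rendezvous ("Stop_waiting response is expected"), so it is worth checking from inside each container that the master endpoint is reachable and that NCCL/Gloo bind to a usable interface. The address and port below are taken from the log above; the interface name eth0 is an assumption.

# Run inside each worker container.
# Is the rendezvous (TCPStore) endpoint on the master node reachable?
nc -zv 10.176.50.36 32783

# Which interfaces exist inside the container?
ip addr
# If the wrong interface gets picked, pinning it explicitly is a common workaround:
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0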

@tjruwase tjruwase added deepspeed-chat Related to DeepSpeed-Chat and removed bug Something isn't working training labels May 15, 2023
xiyue961 commented:

Any updates? I encountered the same problem when fine-tuning Whisper with DeepSpeed across multiple nodes.
