[BUG] terminate called after throwing an instance of 'std::bad_alloc' #3126
Comments
Please see the recent DeepSpeed Chat release: #3186
@shisi-cc, did the link above help? Can this issue be closed? Thanks!
Hi, I have encountered the same issue. I created a Docker container on two different machines and ran DeepSpeed-Chat/training/step1_supervised_finetuning/muti_node/run_66b.sh, but I hit the same error (my hostfile, ds_report output, and launcher context are attached). Both of my nodes can communicate with each other, and they are running inside Docker containers. Have you found a solution to this issue yet?
Any updates? I encountered the same problem when fine-tuning Whisper with DeepSpeed on multiple nodes.
Describe the bug
When I run the RLHF code with trlx using DeepSpeed on two nodes, I hit a strange error: "terminate called after throwing an instance of 'std::bad_alloc'". Neither system memory nor GPU memory is anywhere near exhausted. Running on a single machine works fine, but the error occurs as soon as two nodes are used. The problem appears when I run inside a Docker container, but not when I run without a container. I am also using an Anaconda environment.
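For reference, the distributed setup that a script like train_gptj_summarize.py typically performs on each node looks roughly like the sketch below (illustrative only; the actual script comes from the trlx examples). The DeepSpeed launcher starts one process per listed GPU on every host and injects --local_rank, and the script then initializes NCCL across the nodes, which is the comm.py init_distributed line visible in the log.

```python
# Minimal sketch (illustrative, not the actual trlx script) of the distributed
# setup a DeepSpeed training script performs on each node; the launcher passes
# --local_rank on the command line and MASTER_ADDR/MASTER_PORT via the environment.
import argparse

import deepspeed
import torch


def main():
    parser = argparse.ArgumentParser()
    # Injected by deepspeed.launcher.launch on every worker process.
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args()

    # Sets up torch.distributed with the NCCL backend across the hostfile nodes;
    # this corresponds to the comm.py init_distributed line in the log below.
    deepspeed.init_distributed(dist_backend="nccl")

    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()
    torch.cuda.set_device(args.local_rank)
    print(f"rank {rank}/{world_size} ready on {torch.cuda.get_device_name(args.local_rank)}")


if __name__ == "__main__":
    main()
```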
Log output
(trlx_env) root@9a3cd98dd64f:/data/work/trlx_rlhf/sft# deepspeed --hostfile=../../hostfile train_gptj_summarize.py
[2023-04-03 10:49:33,397] [INFO] [runner.py:454:main] Using IP address of 10.0.128.5 for node localhost
[2023-04-03 10:49:33,398] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: localhost,deepspeed-18
[2023-04-03 10:49:33,398] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w localhost,deepspeed-18 export PYTHONPATH=/data/work/trlx_rlhf/sft; cd /data/work/trlx_rlhf/sft; /root/mambaforge/envs/trlx_env/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF0sICJkZWVwc3BlZWQtMTgiOiBbMF19 --node_rank=%n --master_addr=10.0.128.5 --master_port=29500 train_gptj_summarize.py
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:162:main] dist_world_size=2
deepspeed-18: [2023-04-03 10:49:35,192] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0], 'deepspeed-18': [0]}
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0], 'deepspeed-18': [1]})
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:162:main] dist_world_size=2
localhost: [2023-04-03 10:49:35,240] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
deepspeed-18: Tokenizer loaded!
localhost: Tokenizer loaded!
deepspeed-18: Model loaded!
deepspeed-18: Downloading and preparing dataset parquet/openai_summarize_tldr to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
localhost: Model loaded!
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-0275f923823d6c0b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
localhost: Dataset loaded!
localhost: [2023-04-03 10:50:46,311] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 10941.66it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 1896.44it/s]
deepspeed-18: Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Found cached dataset parquet (/root/.cache/huggingface/datasets/parquet/openai_summarize_tldr-bed27f7b4c8f201f/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
deepspeed-18: Dataset loaded!
Downloading builder script: 100%|██████████| 6.27k/6.27k [00:00<00:00, 26.9kB/s]
deepspeed-18: terminate called after throwing an instance of 'std::bad_alloc'
deepspeed-18: what(): std::bad_alloc
deepspeed-18: [2023-04-03 10:51:15,307] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 1231493
deepspeed-18: [2023-04-03 10:51:15,308] [ERROR] [launch.py:324:sigkill_handler] ['/root/mambaforge/envs/trlx_env/bin/python', '-u', 'train_gptj_summarize.py', '--local_rank=0'] exits with return code = -6
pdsh@9a3cd98dd64f: deepspeed-18: ssh exited with exit code 250
Hostfile
localhost slots=1
deepspeed-18 slots=1
Launcher context
deepspeed --hostfile=../../hostfile train_gptj_summarize.py
Docker context
The problem occurs when I run inside a Docker container, but not when I run without a container.
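To narrow down what differs between the containerized and bare-metal runs, a minimal diagnostic sketch like the following (illustrative; assumes only PyTorch is installed) can be run on each node in both environments to compare hostname resolution, the NCCL build, and NCCL-related environment variables:

```python
# Minimal diagnostic sketch: print the pieces of the environment that commonly
# differ between a bare-metal run and a containerized run of a multi-node job.
# Illustrative only; not part of the original training script.
import os
import socket

import torch


def main():
    print("hostname:", socket.gethostname())
    print("resolved ip:", socket.gethostbyname(socket.gethostname()))
    print("torch:", torch.__version__, "cuda:", torch.version.cuda)
    print("nccl version:", torch.cuda.nccl.version())
    # NCCL picks a network interface automatically unless told otherwise;
    # inside Docker the default interface can differ from the host's.
    for var in ("NCCL_SOCKET_IFNAME", "NCCL_IB_DISABLE", "NCCL_DEBUG",
                "MASTER_ADDR", "MASTER_PORT"):
        print(var, "=", os.environ.get(var))


if __name__ == "__main__":
    main()
```

Differences in the resolved IP or in the default network interface inside the container are a common reason multi-node NCCL jobs behave differently there than on the bare host.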