
Error during pre-training #9

Open
hulonghua-devin opened this issue Mar 14, 2024 · 6 comments

Comments

@hulonghua-devin

Hey, I ran sh tran.sh pre_tran.py after switching it to single-GPU mode. Why do I get the error below? Searching around, it looks like a case of "the model or data was not correctly moved to the corresponding device":
Number of trainable parameters = 1,431,996,416
0%| | 0/5195 [00:00<?, ?it/s]Traceback (most recent call last):
File "/data/hlh/MINI_LLM-main/pre_train.py", line 236, in
trainer.train(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/transformers/trainer.py", line 2007, in _inner_training_loop
self.optimizer.step()
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
self.optimizer.step(closure)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/adamw.py", line 184, in step
adamw(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/adamw.py", line 335, in adamw
func(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/adamw.py", line 509, in _multi_tensor_adamw
grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/optimizer.py", line 397, in _group_tensors_by_device_and_dtype
return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/utils/_foreach_utils.py", line 42, in _group_tensors_by_device_and_dtype
torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except step tensors that can be CPU and float32 notwithstanding
0%| | 0/5195 [00:14<?, ?it/s]
Traceback (most recent call last):
File "/home/alex/miniconda3/envs/ChatGLM2-6b/bin/accelerate", line 8, in
sys.exit(main())
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/alex/miniconda3/envs/ChatGLM2-6b/bin/python', 'pre_train.py']' returned non-zero exit status 1.

@hulonghua-devin
Author

The GPU I'm using is an A800.

@hulonghua-devin
Author

Solved it. It was a precision issue.

@razin13545adosjaj

What exactly was the problem? I'm hitting the same error.

@hulonghua-devin
Author

> What exactly was the problem? I'm hitting the same error.

It's a precision issue. Turn off the fp16 half-precision settings and it works: either comment out the related code, or change the fp16 option to False.
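For reference, a minimal sketch of what "turn off half precision" could look like in Hugging Face TrainingArguments (the output_dir value is illustrative; only the fp16/bf16 flags are the point here):

```python
from transformers import TrainingArguments

# Minimal sketch: disable both float16 and bfloat16 mixed precision so that
# parameters, gradients, and optimizer states all stay in float32 on one device,
# avoiding the "Tensors of the same index must be on the same device and the
# same dtype" error from the fused AdamW step.
training_args = TrainingArguments(
    output_dir="./output",  # illustrative path
    fp16=False,             # turn off float16 mixed precision
    bf16=False,             # turn off bfloat16 mixed precision
)
```

Note that both flags need to be off; disabling only bf16 still leaves fp16 active if it is set elsewhere (e.g. in a launch script or accelerate config).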

@wendongj

wendongj commented Apr 2, 2024

> > What exactly was the problem? I'm hitting the same error.
>
> It's a precision issue. Turn off the fp16 half-precision settings and it works: either comment out the related code, or change the fp16 option to False.

Hi, this is my first time training a large model. Could you say exactly where you made the change? I get the same error, and setting bf16=False in TrainingArguments didn't help.

@xiaochounikuaixiao

> > > What exactly was the problem? I'm hitting the same error.
> >
> > It's a precision issue. Turn off the fp16 half-precision settings and it works: either comment out the related code, or change the fp16 option to False.
>
> Hi, this is my first time training a large model. Could you say exactly where you made the change? I get the same error, and setting bf16=False in TrainingArguments didn't help.

Hi, I have the same problem. Did you ever solve it?
