Error encountered during pretraining #9
Comments
The GPU used is an A800.
Solved it; it was a precision issue.
What was the problem? I'm running into the same error.
It's a precision issue. Disable the half-precision (fp16/bf16) settings: either comment out the related code or set the half-precision option to False, and it works.
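A minimal sketch of what that change looks like, assuming the project builds a Hugging Face TrainingArguments in pre_train.py; the output_dir and batch size below are placeholders, not MINI_LLM's actual configuration:

```python
# Sketch only: turn off both half-precision flags passed to the Trainer.
# output_dir and per_device_train_batch_size are placeholder values.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",          # placeholder path
    per_device_train_batch_size=8,  # placeholder value
    fp16=False,                     # disable float16 mixed precision
    bf16=False,                     # disable bfloat16 mixed precision
)
```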
Hi, this is my first time training a large model. Could you say exactly what you changed? I get the same error, and setting bf16=False in TrainingArguments didn't fix it.
Hi, I have the same problem. Have you solved it?
Hey, I ran sh train.sh pre_train.py after switching it to single-GPU mode. Why does the error below appear? From searching around, it seems to mean "the model or data was not correctly moved to the corresponding device".
Number of trainable parameters = 1,431,996,416
0%| | 0/5195 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/data/hlh/MINI_LLM-main/pre_train.py", line 236, in
trainer.train(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/transformers/trainer.py", line 2007, in _inner_training_loop
self.optimizer.step()
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
self.optimizer.step(closure)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/adamw.py", line 184, in step
adamw(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/adamw.py", line 335, in adamw
func(
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/adamw.py", line 509, in _multi_tensor_adamw
grouped_tensors = Optimizer._group_tensors_by_device_and_dtype([
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/optim/optimizer.py", line 397, in _group_tensors_by_device_and_dtype
return _group_tensors_by_device_and_dtype(tensorlistlist, with_indices)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/torch/utils/_foreach_utils.py", line 42, in _group_tensors_by_device_and_dtype
torch._C._group_tensors_by_device_and_dtype(tensorlistlist, with_indices).items()
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
0%| | 0/5195 [00:14<?, ?it/s]
Traceback (most recent call last):
File "/home/alex/miniconda3/envs/ChatGLM2-6b/bin/accelerate", line 8, in
sys.exit(main())
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/home/alex/miniconda3/envs/ChatGLM2-6b/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/alex/miniconda3/envs/ChatGLM2-6b/bin/python', 'pre_train.py']' returned non-zero exit status 1.
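For context on the RuntimeError: AdamW's multi-tensor (foreach) path groups optimizer tensors by (device, dtype) and raises exactly this error when the model's parameters or their optimizer state end up with mixed devices or dtypes (e.g. part bfloat16, part float32), which matches the precision explanation in the comments above. A hypothetical diagnostic sketch (the helper name is my own, not from the project) for inspecting this before calling trainer.train():

```python
# Hypothetical diagnostic: count parameter tensors by (device, dtype).
# If more than one (device, dtype) pair shows up, the optimizer's foreach
# grouping can fail with the error reported in this issue.
from collections import Counter

def summarize_param_placement(model):
    counts = Counter((p.device, p.dtype) for p in model.parameters())
    for (device, dtype), n in counts.items():
        print(f"{n} parameter tensors on {device} with dtype {dtype}")
    return counts

# On a healthy single-GPU full-precision run this should print a single line,
# e.g. all tensors on cuda:0 with dtype torch.float32.
```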