Pretraining hits OutOfMemoryError at a specific step #23
Comments
The tokenizer call inside pre_train also needs truncation. Although the maximum length is set to 512 during data preprocessing, the tokenized output is not guaranteed to stay within 512 tokens, so GPU memory blows up. Adding truncation to the tokenizer call fixes it (see the sketch below).
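A minimal sketch of what the suggested fix might look like, assuming a Hugging Face tokenizer is used in pre_train.py; the function and variable names here are illustrative, not the repository's actual code:

```python
# Illustrative only: map raw text to token ids with truncation enabled.
# `tokenizer` is assumed to be a Hugging Face PreTrainedTokenizer;
# `max_seq_len` is an assumed cap matching the 512 used in preprocessing.
def tokenize_fn(examples, tokenizer, max_seq_len=512):
    return tokenizer(
        examples["text"],
        truncation=True,         # cap sequences at max_seq_len tokens
        max_length=max_seq_len,  # without this, a long document can exceed 512 tokens and spike memory
    )
```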
What was said above is correct; this is something I overlooked in the implementation. I will fix it later. Thanks for the explanation above.
The tokenizer truncation code has been merged into the latest version.
Does truncating the content directly like this have any negative impact on pretraining? Would it be better to split over-length samples into multiple chunks for training? (A chunking sketch follows below.)
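A minimal sketch of the chunking alternative raised in this comment, assuming token ids have already been produced; the names are illustrative and this is not code from the repository:

```python
# Illustrative only: split an over-length token sequence into multiple
# fixed-size chunks instead of discarding everything past max_seq_len.
def chunk_token_ids(token_ids, max_seq_len=512):
    return [
        token_ids[start:start + max_seq_len]
        for start in range(0, len(token_ids), max_seq_len)
    ]

# Example: a 1300-token document becomes chunks of 512, 512, and 276 tokens,
# so no text is dropped, at the cost of producing more training samples.
```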
In the code I downloaded, truncation=True is already set, but after pretraining for a while the GPU memory still spikes abnormally and then overflows. (A simple length check is sketched below.)
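One hedged way to check whether over-length samples are really the culprit is to scan the tokenized dataset for sequences longer than 512. This is an illustrative diagnostic under the assumption that the tokenized data is a datasets.Dataset with an "input_ids" column, not code from the repository:

```python
# Illustrative diagnostic: count tokenized samples that exceed the expected cap.
def count_overlong(tokenized_dataset, max_seq_len=512):
    overlong = sum(1 for ids in tokenized_dataset["input_ids"] if len(ids) > max_seq_len)
    print(f"{overlong} / {len(tokenized_dataset)} samples exceed {max_seq_len} tokens")
    return overlong
```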
Retried multiple times; pretraining always hits OutOfMemoryError at the same specific step.
61%|██████ | 12840/21167 [14:00:18<8:40:59, 3.75s/it]
61%|██████ | 12841/21167 [14:00:21<8:33:50, 3.70s/it]
61%|██████ | 12842/21167 [14:00:25<8:27:47, 3.66s/it]
61%|██████ | 12843/21167 [14:00:28<8:25:30, 3.64s/it]
61%|██████ | 12844/21167 [14:00:32<8:25:06, 3.64s/it]
61%|██████ | 12845/21167 [14:00:36<8:24:13, 3.64s/it]
61%|██████ | 12846/21167 [14:00:39<8:23:55, 3.63s/it]
61%|██████ | 12847/21167 [14:00:43<8:20:11, 3.61s/it]
61%|██████ | 12848/21167 [14:00:46<8:14:37, 3.57s/it]
61%|██████ | 12849/21167 [14:00:50<8:12:50, 3.56s/it]
61%|██████ | 12850/21167 [14:00:53<8:13:45, 3.56s/it]
Traceback (most recent call last):
File "/home/tiger/MINI_LLM/pre_train.py", line 262, in
trainer.train( #'model_save/pre/checkpoint-3400'
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2735, in training_step
loss = self.compute_loss(model, inputs)
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 2758, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 817, in forward
return model_forward(*args, **kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 805, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 784, in convert_to_fp32
return recursively_apply(_convert_to_fp32, tensor, test_type=_is_fp16_bf16_tensor)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 127, in recursively_apply
{
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 128, in
k: recursively_apply(
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 135, in recursively_apply
return func(data, *args, **kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/utils/operations.py", line 779, in _convert_to_fp32
return tensor.float()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.04 GiB. GPU 7 has a total capacity of 79.35 GiB of which 8.27 GiB is free. Process 2987497 has 71.07 GiB memory in use. Of the allocated memory 65.19 GiB is allocated by PyTorch, and 3.39 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
The data is uniformly truncated to a maximum length of 512, batch_size is set to 16, and gradient_accumulation_steps is set to 8. GPU memory was sufficient when training first started.
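The allocator hint printed in the OOM message above can be tried as a first mitigation. Below is a minimal sketch: the environment-variable setting comes straight from the PyTorch message, while the exact TrainingArguments values are an assumption that simply preserves the effective batch size described here.

```python
# Illustrative only: apply the allocator setting suggested in the PyTorch OOM message.
# It must be set before CUDA memory is first allocated (e.g. at the top of pre_train.py).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"  # reduce fragmentation

# If the spike persists, lowering the per-device batch size while raising accumulation
# keeps the per-device effective batch (16 * 8 = 128) unchanged; argument names follow
# transformers.TrainingArguments, and the values here are an assumption.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="model_save/pre",
    per_device_train_batch_size=8,   # was 16
    gradient_accumulation_steps=16,  # was 8, so the effective batch stays 128
)
```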