
RuntimeError: Overflow when unpacking long #364

Closed
nguyenhuuthuat09 opened this issue May 13, 2022 · 6 comments
@nguyenhuuthuat09

nguyenhuuthuat09 commented May 13, 2022

Environment info

  • Machine: Google Cloud TPU VM version v2-alpha
  • transformers: 4.18.0
  • accelerate: 0.9.0.dev0 (the same error happens with 0.8.0.dev0)

Script

I am training a GPT2 model using the PyTorch run_clm_no_trainer.py script.

Error

The error below happens while the model is saving checkpoints, but it seems to occur only at the second or third checkpoint.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/launch.py", line 55, in __call__
    self.launcher(*args)
  File "/home/nguyenhuuthuat09/gpt2/train_v1.py", line 553, in main
    accelerator.save_state(output_dir)      <---- this is line 564 in the original run_clm_no_trainer.py
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 799, in save_state
    save_location = save_accelerator_state(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/checkpointing.py", line 105, in save_accelerator_state
    states["xm_seed"] = torch.tensor(xm.get_rng_state())
RuntimeError: Overflow when unpacking long
Exception in device=TPU:0: Overflow when unpacking long
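The failing call wraps the XLA RNG state in `torch.tensor(...)`, which packs a Python int into a signed 64-bit C long. A TPU RNG state can be an unsigned 64-bit value, so anything at or above 2**63 overflows. A minimal sketch of the same overflow using only the standard library (the `struct` format codes stand in for torch's internal packing; this is an illustration, not accelerate's code):

```python
import struct

# Smallest value that no longer fits in a signed 64-bit integer,
# which is what torch.tensor() packs a Python int into.
seed = 2**63

try:
    struct.pack("q", seed)  # "q" = signed 64-bit long
except struct.error as e:
    print("overflow:", e)

# The same value fits fine when treated as unsigned 64-bit:
packed = struct.pack("Q", seed)  # "Q" = unsigned 64-bit
print(len(packed))  # 8 bytes
```

This is why the crash is intermittent: it only triggers when the RNG state happens to land in the upper half of the unsigned 64-bit range.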

Environment variables

  • export XRT_TPU_CONFIG="localservice;0;localhost:51011"
  • I run accelerate config and use accelerate launch to run the code.
  • After the error happened, I tried the two commands below, but they didn't help.
    export XLA_USE_BF16=1
    export XLA_TENSOR_ALLOCATOR_MAXSIZE=100000000

Related issue

Thank you for the great library!

@sgugger
Collaborator

sgugger commented May 13, 2022

Not sure why it's wrapped inside a Tensor in the first place, @muellerzr ?

@nguyenhuuthuat09
Author

nguyenhuuthuat09 commented May 13, 2022

Hi, I just tried changing:

states["xm_seed"] = torch.tensor(xm.get_rng_state())
-> states["xm_seed"] = torch.tensor(xm.get_rng_state(), dtype=torch.float32)

and the error doesn't seem to happen anymore. I saved checkpoints successfully ten times in a row. But I'm not sure that's the proper way to fix it.

@muellerzr muellerzr self-assigned this May 13, 2022
@muellerzr
Collaborator

@sgugger you're right, it shouldn't be. Not sure where I saw that happening when I was looking at it, but I will put in a fix today.

@sgugger
Collaborator

sgugger commented May 13, 2022

The seed is an int, not a float, @nguyenhuuthuat09; you won't be able to reload that RNG state if you save it as a float.
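The workaround is lossy because float32 has only a 24-bit mantissa, so integers above 2**24 get rounded and the exact seed cannot be recovered. A small standalone demonstration (the `to_float32` helper is illustrative, simulating the float32 round-trip with `struct`):

```python
import struct

def to_float32(x):
    # Round-trip an integer through a 4-byte IEEE-754 float,
    # mimicking what storing it in a float32 tensor would do.
    return struct.unpack("f", struct.pack("f", float(x)))[0]

seed = 2**63 + 12345
restored = int(to_float32(seed))
print(restored == seed)  # False: the low bits are rounded away

# Small integers (below 2**24) survive the round-trip exactly:
print(to_float32(12345) == 12345.0)  # True
```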

The proper fix is to just remove torch.tensor here.
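A minimal sketch of what removing `torch.tensor` looks like: the seed is kept as a plain Python int, which has arbitrary precision, so nothing overflows and the value round-trips exactly. The `save_rng_states` function and the lambda below are illustrative stand-ins, not accelerate's actual API:

```python
# Hypothetical sketch of the fix in accelerate/checkpointing.py:
# store the XLA RNG state as a plain int instead of wrapping it
# in torch.tensor(), which would pack it into a signed long.

def save_rng_states(get_rng_state):
    states = {}
    states["xm_seed"] = get_rng_state()  # plain int; no torch.tensor(...)
    return states

# A seed above the signed 64-bit range is preserved exactly:
states = save_rng_states(lambda: 2**63 + 7)
print(states["xm_seed"] == 2**63 + 7)  # True
```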

@nguyenhuuthuat09
Author

Great! Thank you so much!!!

@scorpiomaj27

Another thing that could cause this is accidentally setting a seed that is too long, for example by pasting it twice.
