
RuntimeError: Overflow when unpacking long #364

Closed
nguyenhuuthuat09 opened this issue May 13, 2022 · 6 comments
@nguyenhuuthuat09

nguyenhuuthuat09 commented May 13, 2022

Environment info

  • Machine: Google Cloud TPU VM version v2-alpha
  • transformers: 4.18.0
  • accelerate: 0.9.0.dev0 (the same error happens with 0.8.0.dev0)

Script

I am training a GPT2 model using the PyTorch run_clm_no_trainer.py script.

Error

The error below happens while the model is saving checkpoints, but it seems to occur only at the second or third checkpoint.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
    fn(gindex, *args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/launch.py", line 55, in __call__
    self.launcher(*args)
  File "/home/nguyenhuuthuat09/gpt2/train_v1.py", line 553, in main
    accelerator.save_state(output_dir)      <---- this is line 564 in the original run_clm_no_trainer.py
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 799, in save_state
    save_location = save_accelerator_state(
  File "/usr/local/lib/python3.8/dist-packages/accelerate/checkpointing.py", line 105, in save_accelerator_state
    states["xm_seed"] = torch.tensor(xm.get_rng_state())
RuntimeError: Overflow when unpacking long
Exception in device=TPU:0: Overflow when unpacking long
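The failing call wraps the XLA RNG state in `torch.tensor(...)`, which packs a Python int into a signed 64-bit C long. A TPU RNG state can be an unsigned 64-bit value, so anything at or above 2**63 overflows. A minimal sketch of the same overflow using only the standard library (the `struct` format codes stand in for torch's internal packing; this is an illustration, not accelerate's code):

```python
import struct

# Smallest value that no longer fits in a signed 64-bit integer,
# which is what torch.tensor() packs a Python int into.
seed = 2**63

try:
    struct.pack("q", seed)  # "q" = signed 64-bit long
except struct.error as e:
    print("overflow:", e)

# The same value fits fine when treated as unsigned 64-bit:
packed = struct.pack("Q", seed)  # "Q" = unsigned 64-bit
print(len(packed))  # 8 bytes
```

This is why the crash is intermittent: it only triggers when the RNG state happens to land in the upper half of the unsigned 64-bit range.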

Environment variables

  • export XRT_TPU_CONFIG="localservice;0;localhost:51011"
  • I run accelerate config and use accelerate launch to run the code.
  • After the error happened, I tried the two commands below, but they didn't help.
    export XLA_USE_BF16=1
    export XLA_TENSOR_ALLOCATOR_MAXSIZE=100000000

Related issue

Thank you for the great library!

@sgugger
Collaborator

sgugger commented May 13, 2022

Not sure why it's wrapped inside a Tensor in the first place, @muellerzr ?

@nguyenhuuthuat09
Author

nguyenhuuthuat09 commented May 13, 2022

Hi, I just tried changing:

states["xm_seed"] = torch.tensor(xm.get_rng_state())
-> states["xm_seed"] = torch.tensor(xm.get_rng_state(), dtype=torch.float32)

and the error doesn't seem to happen anymore. I saved checkpoints successfully ten times in a row. But I'm not sure that's the proper way to fix it.

@muellerzr muellerzr self-assigned this May 13, 2022
@muellerzr
Collaborator

@sgugger you're right, it shouldn't be. Not sure where I saw that happening when I was looking at it, but I will put in a fix today.

@sgugger
Collaborator

sgugger commented May 13, 2022

The seed is an int, not a float, @nguyenhuuthuat09; you won't be able to reload that RNG state if you save it as a float.
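The workaround is lossy because float32 has only a 24-bit mantissa, so integers above 2**24 get rounded and the exact seed cannot be recovered. A small standalone demonstration (the `to_float32` helper is illustrative, simulating the float32 round-trip with `struct`):

```python
import struct

def to_float32(x):
    # Round-trip an integer through a 4-byte IEEE-754 float,
    # mimicking what storing it in a float32 tensor would do.
    return struct.unpack("f", struct.pack("f", float(x)))[0]

seed = 2**63 + 12345
restored = int(to_float32(seed))
print(restored == seed)  # False: the low bits are rounded away

# Small integers (below 2**24) survive the round-trip exactly:
print(to_float32(12345) == 12345.0)  # True
```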

The proper fix is to just remove torch.tensor here.
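A minimal sketch of what removing `torch.tensor` looks like: the seed is kept as a plain Python int, which has arbitrary precision, so nothing overflows and the value round-trips exactly. The `save_rng_states` function and the lambda below are illustrative stand-ins, not accelerate's actual API:

```python
# Hypothetical sketch of the fix in accelerate/checkpointing.py:
# store the XLA RNG state as a plain int instead of wrapping it
# in torch.tensor(), which would pack it into a signed long.

def save_rng_states(get_rng_state):
    states = {}
    states["xm_seed"] = get_rng_state()  # plain int; no torch.tensor(...)
    return states

# A seed above the signed 64-bit range is preserved exactly:
states = save_rng_states(lambda: 2**63 + 7)
print(states["xm_seed"] == 2**63 + 7)  # True
```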

@nguyenhuuthuat09
Author

Great! Thank you so much!!!

@scorpiomaj27

Another thing that could cause this is accidentally setting a seed that is too long, for example by pasting it twice.
