Torchao opt resuming from ckpt requires weights_only=False #1885

Open
felipemello1 opened this issue Mar 13, 2025 · 2 comments

@felipemello1

In torchtune, you can't resume from a checkpoint when using a torchao low-bit optimizer:

  File "/data/users/felipemello/torchtune/torchtune/training/checkpointing/_utils.py", line 249, in safe_torch_load
    state_dict = torch.load(
                 ^^^^^^^^^^^
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/serialization.py", line 1486, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL torchao.prototype.low_bit_optim.subclass_8bit.OptimState8bit was not an allowed global by default. Please use `torch.serialization.add_safe_globals([OptimState8bit])` or the `torch.serialization.safe_globals([OptimState8bit])` context manager to allowlist this global if you trust this class/function.
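
Until this is handled in torchtune itself, one possible workaround is the one the error message suggests: allowlist the torchao optimizer-state subclass before loading. A minimal sketch (the checkpoint filename here is a placeholder, not torchtune's actual path):

import torch
from torchao.prototype.low_bit_optim.subclass_8bit import OptimState8bit

# Allowlist the 8-bit optimizer-state subclass so the checkpoint can be
# unpickled with weights_only=True. "recipe_state.pt" is a placeholder
# for whatever file holds the optimizer state.
with torch.serialization.safe_globals([OptimState8bit]):
    state_dict = torch.load("recipe_state.pt", map_location="cpu", weights_only=True)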

To reproduce:

tune download meta-llama/Llama-3.2-1B-Instruct --output-dir /tmp/Llama-3.2-1B-Instruct --ignore-patterns "original/consolidated.00.pth"
tune run full_finetune_single_device --config llama3_2/1B_full_single_device epochs=2 max_steps_per_epoch=20 optimizer=torchao.prototype.low_bit_optim.AdamW8bit
tune run full_finetune_single_device --config llama3_2/1B_full_single_device epochs=2 max_steps_per_epoch=20 optimizer=torchao.prototype.low_bit_optim.AdamW8bit resume_from_checkpoint=True checkpointer.checkpoint_files=["epoch_0/model-00001-of-00001.safetensors"] 
@felipemello1 felipemello1 changed the title Torchao opt resuming from ckpt requires weights_only=False? Torchao opt resuming from ckpt requires weights_only=False Mar 13, 2025
@supriyar
Contributor

@gau-nernst any thoughts on what might be the issue?

@gau-nernst
Collaborator

@felipemello1 What is your torchao version? The subclass should have been added to the safe-globals list quite some time ago:

# In torchao's subclass_8bit module: register OptimState8bit as a safe
# global so torch.load with weights_only=True can unpickle it.
if TORCH_VERSION_AT_LEAST_2_5:
    from torch.serialization import add_safe_globals

    add_safe_globals([OptimState8bit])

#1228
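
For reference, a quick way to check the versions in play (the registration above only runs when torch is at least 2.5, and only in torchao releases that include it; assumes torchao exposes __version__, which recent releases do):

import torch
import torchao

# Print both versions to confirm whether the safe-globals registration applies.
print(torch.__version__, torchao.__version__)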
