Torchao opt resuming from ckpt requires weights_only=False #1885

Open
felipemello1 opened this issue Mar 13, 2025 · 2 comments

@felipemello1

In torchtune, you can't resume from a checkpoint when using a torchao low-bit optimizer:

  File "/data/users/felipemello/torchtune/torchtune/training/checkpointing/_utils.py", line 249, in safe_torch_load
    state_dict = torch.load(
                 ^^^^^^^^^^^
  File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/serialization.py", line 1486, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL torchao.prototype.low_bit_optim.subclass_8bit.OptimState8bit was not an allowed global by default. Please use `torch.serialization.add_safe_globals([OptimState8bit])` or the `torch.serialization.safe_globals([OptimState8bit])` context manager to allowlist this global if you trust this class/function.
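
Until this is handled in torchtune itself, one possible workaround is the one the error message suggests: allowlist the torchao optimizer-state subclass before loading. A minimal sketch (the checkpoint filename here is a placeholder, not torchtune's actual path):

import torch
from torchao.prototype.low_bit_optim.subclass_8bit import OptimState8bit

# Allowlist the 8-bit optimizer-state subclass so the checkpoint can be
# unpickled with weights_only=True. "recipe_state.pt" is a placeholder
# for whatever file holds the optimizer state.
with torch.serialization.safe_globals([OptimState8bit]):
    state_dict = torch.load("recipe_state.pt", map_location="cpu", weights_only=True)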

To reproduce:

tune download meta-llama/Llama-3.2-1B-Instruct --output-dir /tmp/Llama-3.2-1B-Instruct --ignore-patterns "original/consolidated.00.pth"
tune run full_finetune_single_device --config llama3_2/1B_full_single_device epochs=2 max_steps_per_epoch=20 optimizer=torchao.prototype.low_bit_optim.AdamW8bit
tune run full_finetune_single_device --config llama3_2/1B_full_single_device epochs=2 max_steps_per_epoch=20 optimizer=torchao.prototype.low_bit_optim.AdamW8bit resume_from_checkpoint=True checkpointer.checkpoint_files=["epoch_0/model-00001-of-00001.safetensors"] 
@felipemello1 felipemello1 changed the title Torchao opt resuming from ckpt requires weights_only=False? Torchao opt resuming from ckpt requires weights_only=False Mar 13, 2025
@supriyar
Contributor

@gau-nernst any thoughts on what might be the issue?

@gau-nernst
Collaborator

@felipemello1 What is your torchao version? The subclass should have been added to the safe-globals list quite some time ago:

# In torchao's subclass_8bit module: register OptimState8bit as a safe
# global so torch.load with weights_only=True can unpickle it.
if TORCH_VERSION_AT_LEAST_2_5:
    from torch.serialization import add_safe_globals

    add_safe_globals([OptimState8bit])

#1228
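
For reference, a quick way to check the versions in play (the registration above only runs when torch is at least 2.5, and only in torchao releases that include it; assumes torchao exposes __version__, which recent releases do):

import torch
import torchao

# Print both versions to confirm whether the safe-globals registration applies.
print(torch.__version__, torchao.__version__)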
