
FullModelHFCheckpointer saved checkpoint isn't compatible with Huggingface transformers model loading #2048

Closed
vancoykendall opened this issue Nov 21, 2024 · 11 comments

@vancoykendall (Contributor)

Loading huggingface transformers models is done with the from_pretrained() method. For pytorch or safetensors checkpoints, this method expects a pytorch_model.bin or model.safetensors file for single file checkpoints. For sharded checkpoints, it expects either a pytorch_model.bin.index.json or model.safetensors.index.json file that maps each weight to the shard file (I think sharded checkpoint files can have arbitrary naming).

Currently, the FullModelHFCheckpointer doesn't name single-file checkpoints pytorch_model.bin or model.safetensors, and it doesn't create an index.json file for sharded checkpoints. Thus, saved checkpoints can't be loaded with from_pretrained().
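For reference, here is a minimal sketch of the index file that from_pretrained() resolves shards through (the weight names and shard filenames below are hypothetical, just to show the shape):

```python
import json

# Hypothetical two-shard checkpoint. transformers reads the index file first,
# then loads only the shard files listed in "weight_map" -- the shard filenames
# themselves can be arbitrary, but the index filename is fixed.
index = {
    "metadata": {"total_size": 0},  # sum of numel * element_size over all weights
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}

# This name (or pytorch_model.bin.index.json for .bin shards) is what
# from_pretrained() looks for in the checkpoint directory.
index_filename = "model.safetensors.index.json"
index_json = json.dumps(index, indent=2)
```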

# write the partitioned state dicts to the right checkpoint file
for cpt_idx, model_state_dict in split_state_dicts.items():
    if not self._safe_serialization:
        output_path = Path.joinpath(
            self._output_dir, f"hf_model_{cpt_idx}_{epoch}"
        ).with_suffix(".pt")
        torch.save(model_state_dict, output_path)
    else:
        output_path = Path.joinpath(
            self._output_dir,
            f"model-0{cpt_idx}-of-0{list(split_state_dicts.keys())[-1]}_{epoch}",
        ).with_suffix(".safetensors")
        save_file(model_state_dict, output_path, metadata={"format": "pt"})
    logger.info(
        "Model checkpoint of size "
        f"{os.path.getsize(output_path) / 1000**3:.2f} GB "
        f"saved to {output_path}"
    )

@vancoykendall (Contributor, Author)

Here are some potentially useful links related to Hugging Face checkpoint creation:
Huggingface documents index file creation here: https://huggingface.co/docs/transformers/v4.46.3/en/big_models#sharded-checkpoints

Code for sharding state dict here: https://github.com/huggingface/huggingface_hub/blob/v0.26.2/src/huggingface_hub/serialization/_torch.py

@felipemello1 (Contributor)

felipemello1 commented Nov 22, 2024

hey @vancoykendall, thank you so much for this issue! I am one of the maintainers and am working on fixing this. @joecummings is also working on refactoring our checkpointing. This will become much easier in the coming days.

Meanwhile, to unblock you, we give an example here: https://pytorch.org/torchtune/main/tutorials/e2e_flow.html#using-torchtune-checkpoints-with-other-libraries

You can manually make it .bin (i know, that's not fun) and select the files to keep in the folder. Then from_pretrained will work.
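That manual workaround can be sketched as below (the hf_model_0001_0.pt source name is an assumption about the checkpointer's default output; adjust it to whatever your run actually produced):

```python
from pathlib import Path


def rename_for_from_pretrained(ckpt_dir: str, src_name: str = "hf_model_0001_0.pt") -> Path:
    """Rename a single-file torchtune checkpoint to the fixed
    pytorch_model.bin name that from_pretrained() looks for."""
    src = Path(ckpt_dir) / src_name
    dst = src.with_name("pytorch_model.bin")
    src.rename(dst)
    return dst
```

After the rename (and with config.json etc. left in place), pointing from_pretrained() at the folder should work.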

@vancoykendall (Contributor, Author)

Nice, thanks! I also just patched the save_checkpoint method locally so I don't have to convert them anymore:

# split the state_dict into separate dicts, one for each output checkpoint file
split_state_dicts: Dict[str, Dict[str, torch.Tensor]] = {}
total_size = 0
for key, weight in state_dict[training.MODEL_KEY].items():
    cpt_idx = self._weight_map[key]
    if cpt_idx not in split_state_dicts:
        split_state_dicts[cpt_idx] = {}
    split_state_dicts[cpt_idx].update({key: weight})
    total_size += weight.numel() * weight.element_size()

# write the partitioned state dicts to the right checkpoint file
num_shards = len(split_state_dicts)
for cpt_idx, model_state_dict in split_state_dicts.items():
    if not self._safe_serialization:
        shard_name = f"pytorch_model-{int(cpt_idx):05d}-of-{num_shards:05d}"
        output_path = Path.joinpath(
            self._output_dir, f"{shard_name}_{epoch}"
        ).with_suffix(".bin")
        torch.save(model_state_dict, output_path)
    else:
        shard_name = f"model-{int(cpt_idx):05d}-of-{num_shards:05d}_{epoch}"
        output_path = Path.joinpath(
            self._output_dir, shard_name
        ).with_suffix(".safetensors")
        save_file(model_state_dict, output_path, metadata={"format": "pt"})
    logger.info(
        "Model checkpoint of size "
        f"{os.path.getsize(output_path) / 1000**3:.2f} GB "
        f"saved to {output_path}"
    )

# Save the appropriate index file based on serialization format
if self._safe_serialization:
    index_path = Path.joinpath(self._output_dir, "model.safetensors.index.json")
    weight_map = {
        k: f"model-{int(v):05d}-of-{num_shards:05d}_{epoch}.safetensors"
        for k, v in self._weight_map.items()
    }
else:
    index_path = Path.joinpath(self._output_dir, "pytorch_model.bin.index.json")
    weight_map = {
        k: f"pytorch_model-{int(v):05d}-of-{num_shards:05d}_{epoch}.bin"
        for k, v in self._weight_map.items()
    }

index_data = {
    "metadata": {"total_size": total_size},
    "weight_map": weight_map,
}
with open(index_path, "w") as f:
    json.dump(index_data, f, indent=2)

@joecummings (Contributor)

@vancoykendall This is awesome! Would you like to open a PR on our repo adding this patch? I think it would definitely benefit our entire community :)

@vancoykendall (Contributor, Author)

Sure, I'd be happy to. Although I've realized this current method would overwrite the index.json file each epoch, since the index file name can't be modified. I could instead save each epoch's checkpoint in a separate subfolder? Any thoughts? @joecummings
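The subfolder idea could be sketched like this (purely illustrative; the epoch_{n} naming is an assumption, not something torchtune does):

```python
from pathlib import Path


def epoch_output_dir(base_dir: str, epoch: int) -> Path:
    """Give each epoch its own folder so the fixed-name index file
    (model.safetensors.index.json) is not overwritten between epochs."""
    out = Path(base_dir) / f"epoch_{epoch}"
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Each epoch folder would then be a self-contained checkpoint that from_pretrained() can load directly.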

@felipemello1 (Contributor)

hey @vancoykendall, I am working on a related issue to handle how we save/load files. Would it be fine if I drive the PR and have you as co-author? I am afraid that my changes may undo/conflict with yours. If so, please send me your email on Discord, and I can add you as a co-author on the commit, to make sure you get credit for it.

adding as co-author: https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors
my discord: whynot9753

@vancoykendall (Contributor, Author)

Cool, thanks. I just sent you my email on Discord.

@gordicaleksa

Hey guys, I'm hitting this as well - any progress on this? :)

@felipemello1 (Contributor)

felipemello1 commented Nov 26, 2024

@gordicaleksa, I have a draft, but it is not ready to be used: https://github.com/pytorch/torchtune/pull/2074/files. Once merged, the output dir should be more organized and ready to be used by vLLM and HF.

TLDR:
Assuming you are doing full finetuning, you can follow the script shared above by @vancoykendall:

  1. create an empty folder

  2. save the checkpoints as .safetensors or .bin

  3. create the model.safetensors.index.json file (or pytorch_model.bin.index.json if you saved .bin)

  4. add the tokenizer to this folder (not sure if this is necessary)

  5. now you can pass the folder to .from_pretrained or vLLM

Sorry you hit this. We are ironing out our integration with HF/vllm.
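Steps 1-3 above can be sketched as a small pure-stdlib helper that writes the index file into a fresh folder (the shard filenames you pass in are whatever you produced in step 2; write_hf_index is a hypothetical name, not a torchtune or transformers API):

```python
import json
from pathlib import Path


def write_hf_index(out_dir: str, shard_contents: dict, total_size: int,
                   safe_serialization: bool = True) -> Path:
    """Write the index file from_pretrained() expects next to the shards.

    shard_contents maps each shard filename to the list of weight names
    stored in it, e.g. {"model-00001-of-00002.safetensors": [...]}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # step 1: the output folder
    name = ("model.safetensors.index.json" if safe_serialization
            else "pytorch_model.bin.index.json")
    # step 3: invert shard -> weights into the weight -> shard map HF reads
    weight_map = {w: shard for shard, weights in shard_contents.items()
                  for w in weights}
    index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
    path = out / name
    path.write_text(json.dumps(index, indent=2))
    return path
```

With the shards, the index file, and (per step 4) the tokenizer files in one folder, that folder should be loadable as a checkpoint directory.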

@gordicaleksa

I think I did manage to work around it.

The snippet from @vancoykendall helped speed it up.

Thanks guys!

I think the tokenizer will also be necessary for vLLM, etc.

@felipemello1 (Contributor)

felipemello1 commented Dec 6, 2024

hey folks, the PR is merged: #2074

Now it should be much easier to use vLLM/Hugging Face. Instructions are in the PR description.

We will update the docs soon. Let us know if you find any issues and thanks for your patience :).
