mplug-owl3 doing full model sft bug report #2158

Closed
goodstudent9 opened this issue Sep 29, 2024 · 6 comments

goodstudent9 commented Sep 29, 2024

Describe the bug
There are two bugs. They seem to occur only when doing full-parameter fine-tuning; when I use LoRA, there are no such bugs.
1. If I train for one epoch and run evaluation during training, then after one validation pass finishes, the subsequent training step fails. The error message is as follows:

[rank0]: Traceback (most recent call last):                                                                                                                                                            
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/cli/sft.py", line 7, in <module>                                                                                                        
[rank0]:     sft_main()                                                                                                                                                                                
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/utils/run_utils.py", line 32, in x_main                                                                                                 
[rank0]:     result = llm_x(args, **kwargs)                                                                                                                                                            
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/sft.py", line 513, in llm_sft                                                                                                       
[rank0]:     return trainer_train(args, model, template, train_dataset, val_dataset, callbacks=callbacks, msg=msg)                                                                                     
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/sft.py", line 462, in trainer_train                                                                                                 
[rank0]:     trainer.train(training_args.resume_from_checkpoint)                                                                                                                                       
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/trainers/mixin.py", line 424, in train                                                                                                  
[rank0]:     res = super().train(resume_from_checkpoint, *args, **kwargs)                                                                                                                              
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in train                                                                     
[rank0]:     return inner_training_loop(                                                                                                    
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 2356, in _inner_training_loop                                   11:14:54 [485/1191]
[rank0]:     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)                                                                                              
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/trainers/mixin.py", line 500, in _maybe_log_save_evaluate                                                                               
[rank0]:     super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)                                                                                                                                
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 2807, in _maybe_log_save_evaluate                                                  
[rank0]:     self._save_checkpoint(model, trial, metrics=metrics)                                                                                                                                      
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/trainers/mixin.py", line 322, in _save_checkpoint                                                                                       
[rank0]:     result = super()._save_checkpoint(model, trial, metrics)                                                                                                                                  
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 2886, in _save_checkpoint                                                          
[rank0]:     self.save_model(output_dir, _internal_call=True)                                                                                                                                          
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 3441, in save_model                                                                
[rank0]:     self._save(output_dir, state_dict=state_dict)                                                                                                                                             
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/trainers/mixin.py", line 285, in _save                                                                                                  
[rank0]:     self.tokenizer.processor.save_pretrained(output_dir)                                                                                                                                      
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/processing_utils.py", line 488, in save_pretrained                                                   
[rank0]:     attribute.save_pretrained(save_directory)                                                                                                                                                 
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/image_processing_base.py", line 257, in save_pretrained                                              
[rank0]:     self.to_json_file(output_image_processor_file)                                                                                                                                            
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/image_processing_base.py", line 496, in to_json_file                                                 
[rank0]:     writer.write(self.to_json_string())                                                                                                                                                       
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/image_processing_base.py", line 485, in to_json_string                                               
[rank0]:     return json.dumps(dictionary, indent=2, sort_keys=True) + "\n"                                                                                                                            
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/__init__.py", line 234, in dumps                                                                                           
[rank0]:     return cls(                                                                                                                                                                               
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 201, in encode                                                                                           
[rank0]:     chunks = list(chunks)                                                                                                                                                                     
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 431, in _iterencode                                                                                      
[rank0]:     yield from _iterencode_dict(o, _current_indent_level)                                                                                                                                     
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict                                                                                 
[rank0]:     yield from chunks                                                                                                                                                                         
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 438, in _iterencode                                                                                      
[rank0]:     o = _default(o)                                                                                                                                                                           
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 179, in default                                                                                          
[rank0]:     raise TypeError(f'Object of type {o.__class__.__name__} '                                                                                                                                 
[rank0]: TypeError: Object of type function is not JSON serializable   
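The traceback shows that save_pretrained eventually calls json.dumps on the image processor's serialized dictionary, and one of the entries in that dictionary is a plain Python function, which the json module cannot encode. A minimal sketch of that failure mode (the keys below are hypothetical, not the actual mPLUG-Owl3 attributes), and of the general pattern of filtering out callables before dumping:

```python
import json

# Hypothetical image-processor dict: a function-valued entry makes json.dumps
# raise exactly the TypeError shown in the traceback above.
config_dict = {
    "image_size": 378,               # hypothetical value
    "crop_fn": lambda image: image,  # a callable that cannot be JSON-encoded
}

try:
    json.dumps(config_dict, indent=2, sort_keys=True)
except TypeError as err:
    print(err)  # Object of type function is not JSON serializable

# Dropping the callables before dumping avoids the crash.
serializable = {k: v for k, v in config_dict.items() if not callable(v)}
print(json.dumps(serializable, indent=2, sort_keys=True))
```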

2. If I resume training from a checkpoint, i.e. pass --resume_from_checkpoint output/mplug-owl3-7b-chat/v34-20240929-110829/checkpoint-2, the following error occurs:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/cli/sft.py", line 7, in <module>                                                                                                        
[rank3]:     sft_main()                                                                                                                                                                                
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/utils/run_utils.py", line 32, in x_main                                                                                                 
[rank3]:     result = llm_x(args, **kwargs)                                                                                                                                                            
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/sft.py", line 511, in llm_sft                                                                                                       
[rank3]:     model, template, callbacks = prepare_model_template_train(args, msg)                                                                                                                      
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/sft.py", line 202, in prepare_model_template_train                                                                                  
[rank3]:     model, tokenizer = get_model_tokenizer(                                                                                                                                                   
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/utils/model.py", line 6635, in get_model_tokenizer                                                                                  
[rank3]:     model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs, load_model, **kwargs)                                                                                               
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/utils/model.py", line 2717, in get_model_tokenizer_mplug_owl3                                                                       
[rank3]:     model, tokenizer = get_model_tokenizer_with_flash_attn(model_dir, torch_dtype, model_kwargs, load_model, **kwargs)                                                                        
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/utils/model.py", line 2699, in get_model_tokenizer_with_flash_attn                                                                  
[rank3]:     return get_model_tokenizer_from_repo(                                                                                                                                                     
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/utils/model.py", line 957, in get_model_tokenizer_from_repo                                                                         
[rank3]:     tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)                                                                                                              
[rank3]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/modelscope/utils/hf_util.py", line 114, in from_pretrained                                                        
[rank3]:     module_obj = module_class.from_pretrained(model_dir, *model_args,                                                                                                                         
[rank3]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 926, in from_pretrained
[rank3]:     raise ValueError(
[rank3]: ValueError: Unrecognized configuration class <class 'transformers_modules.checkpoint-2.configuration_mplugowl3.mPLUGOwl3Config'> to build an AutoTokenizer.
[rank3]: Model type should be one of AlbertConfig, AlignConfig, BarkConfig, BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BlipConfig, Blip2Config, BloomConfig, BridgeTowerConfig, BrosConfig, CamembertConfig, CanineConfig, ChameleonConfig, ChineseCLIPConfig, ClapConfig, CLIPConfig, CLIPSegConfig, ClvpConfig, LlamaConfig, CodeGenConfig, CohereConfig, ConvBertConfig, CpmAntConfig, CTRLConfig, Data2VecAudioConfig, Data2VecTextConfig, DbrxConfig, DebertaConfig, DebertaV2Config, DistilBertConfig, DPRConfig, ElectraConfig, ErnieConfig, ErnieMConfig, EsmConfig, FalconConfig, FastSpeech2ConformerConfig, FlaubertConfig, FNetConfig, FSMTConfig, FunnelConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, GPTSanJapaneseConfig, GroundingDinoConfig, GroupViTConfig, HubertConfig, IBertConfig, IdeficsConfig, Idefics2Config, InstructBlipConfig, InstructBlipVideoConfig, JambaConfig, JetMoeConfig, JukeboxConfig, Kosmos2Config, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LiltConfig, LlamaConfig, LlavaConfig, LlavaNextVideoConfig, LlavaNextConfig, LongformerConfig, LongT5Config, LukeConfig, LxmertConfig, M2M100Config, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MgpstrConfig, MistralConfig, MixtralConfig, MobileBertConfig, MPNetConfig, MptConfig, MraConfig, MT5Config, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NezhaConfig, NllbMoeConfig, NystromformerConfig, OlmoConfig, OneFormerConfig, OpenAIGPTConfig, OPTConfig, Owlv2Config, OwlViTConfig, PaliGemmaConfig, PegasusConfig, PegasusXConfig, PerceiverConfig, PersimmonConfig, PhiConfig, Phi3Config, Pix2StructConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RagConfig, RealmConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RetriBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, SiglipConfig, Speech2TextConfig, Speech2Text2Config, SpeechT5Config, SplinterConfig, SqueezeBertConfig, StableLmConfig, Starcoder2Config, SwitchTransformersConfig, T5Config, TapasConfig, TransfoXLConfig, TvpConfig, UdopConfig, UMT5Config, VideoLlavaConfig, ViltConfig, VipLlavaConfig, VisualBertConfig, VitsConfig, Wav2Vec2Config, Wav2Vec2BertConfig, Wav2Vec2ConformerConfig, WhisperConfig, XCLIPConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YosoConfig.
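This second failure happens while rebuilding the tokenizer from the checkpoint directory: AutoTokenizer.from_pretrained is pointed at checkpoint-2, whose remote-code mPLUGOwl3Config is not in AutoTokenizer's mapping, so the lookup falls through to the ValueError above. A speculative workaround, not taken from this thread, would be to load the tokenizer from the original model directory rather than the checkpoint (the base-model path below is an assumption):

```python
from transformers import AutoTokenizer

# Assumed paths (illustrative): the original model repo vs. the resumed checkpoint.
base_model_dir = "mPLUG/mPLUG-Owl3-7B-240728"  # assumption; substitute your local base model dir
checkpoint_dir = "output/mplug-owl3-7b-chat/v34-20240929-110829/checkpoint-2"

# Loading from checkpoint_dir reproduces the ValueError above, because the
# checkpoint ships only the custom mPLUGOwl3Config, which AutoTokenizer cannot
# map to a tokenizer class. Loading from the base model directory, whose
# tokenizer files are complete, avoids that lookup path.
tokenizer = AutoTokenizer.from_pretrained(base_model_dir, trust_remote_code=True)
```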
goodstudent9 (Author)

download

https://mega.co.nz/#!qq4nATTK!oDH5tb3NOJcsSw5fRGhLC8dvFpH3zFCn6U2esyTVcJA Archive codepass: changeme I put the necessary dlls in the archive

Thank you for your reply!
But I don't understand it; your suggestion doesn't seem to address this bug.

Jintao-Huang (Collaborator) commented Sep 29, 2024

Are you using the model from Modelscope or from Hugging Face?

@modelscope modelscope deleted a comment Sep 29, 2024
goodstudent9 (Author) commented Sep 29, 2024 via email

goodstudent9 (Author) commented Sep 29, 2024 via email

LukeForeverYoung (Contributor)

For the first issue, could you please pull the latest repository from Hugging Face and try again? I have added a function named to_dict to the image processor.
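The actual fix lives in the model's Hugging Face repository; as a rough sketch of the general pattern such a to_dict override can follow (the class name below is hypothetical, modelled on the default ImageProcessingMixin.to_dict behaviour), it serializes only JSON-encodable attributes:

```python
import copy
from transformers.image_processing_utils import BaseImageProcessor

class ImageProcessorToDictSketch(BaseImageProcessor):
    """Hypothetical illustration of a to_dict override; not the upstream patch."""

    def to_dict(self):
        output = copy.deepcopy(self.__dict__)
        # Drop callables (e.g. helper transforms) so that
        # save_pretrained -> to_json_string -> json.dumps can succeed.
        output = {k: v for k, v in output.items() if not callable(v)}
        output["image_processor_type"] = self.__class__.__name__
        return output
```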

goodstudent9 (Author)

For the first issue, could you please pull the latest repository from Hugging Face and try again? I have added a function named to_dict to the image processor.

When I updated image_processing_mplugowl3.py from the latest Hugging Face repository and set USE_HF=1, both problems were solved!

Thank you so much for your quick and valuable help!
Best wishes!
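For reference, USE_HF=1 tells ms-swift to resolve the model and its remote code from Hugging Face rather than ModelScope. A minimal sketch of setting it programmatically (assuming sft_main is importable from swift.llm, as in the swift/cli/sft.py entry point shown in the tracebacks above); the shell equivalent is simply prefixing the launch command with USE_HF=1:

```python
import os

# Switch ms-swift to the Hugging Face hub before any model files are resolved
# (shell equivalent: USE_HF=1 swift sft ...).
os.environ["USE_HF"] = "1"

from swift.llm import sft_main  # entry point used by swift/cli/sft.py

if __name__ == "__main__":
    sft_main()
```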
