mplug-owl3 doing full model sft bug report #2158

Closed
goodstudent9 opened this issue Sep 29, 2024 · 6 comments

goodstudent9 commented Sep 29, 2024

Describe the bug
There are two bugs. They seem to occur only when doing full-parameter fine-tuning; when I use LoRA, there are no such bugs.
1. If I train for one epoch and run evaluation during training, then after one validation pass finishes, the subsequent training step fails. The error message is as follows:

[rank0]: Traceback (most recent call last):                                                                                                                                                            
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/cli/sft.py", line 7, in <module>                                                                                                        
[rank0]:     sft_main()                                                                                                                                                                                
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/utils/run_utils.py", line 32, in x_main                                                                                                 
[rank0]:     result = llm_x(args, **kwargs)                                                                                                                                                            
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/sft.py", line 513, in llm_sft                                                                                                       
[rank0]:     return trainer_train(args, model, template, train_dataset, val_dataset, callbacks=callbacks, msg=msg)                                                                                     
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/sft.py", line 462, in trainer_train                                                                                                 
[rank0]:     trainer.train(training_args.resume_from_checkpoint)                                                                                                                                       
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/trainers/mixin.py", line 424, in train                                                                                                  
[rank0]:     res = super().train(resume_from_checkpoint, *args, **kwargs)                                                                                                                              
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in train                                                                     
[rank0]:     return inner_training_loop(                                                                                                    
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 2356, in _inner_training_loop                                   11:14:54 [485/1191]
[rank0]:     self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)                                                                                              
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/trainers/mixin.py", line 500, in _maybe_log_save_evaluate                                                                               
[rank0]:     super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)                                                                                                                                
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 2807, in _maybe_log_save_evaluate                                                  
[rank0]:     self._save_checkpoint(model, trial, metrics=metrics)                                                                                                                                      
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/trainers/mixin.py", line 322, in _save_checkpoint                                                                                       
[rank0]:     result = super()._save_checkpoint(model, trial, metrics)                                                                                                                                  
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 2886, in _save_checkpoint                                                          
[rank0]:     self.save_model(output_dir, _internal_call=True)                                                                                                                                          
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/trainer.py", line 3441, in save_model                                                                
[rank0]:     self._save(output_dir, state_dict=state_dict)                                                                                                                                             
[rank0]:   File "/data1/myself/Pretrain/ms-swift/swift/trainers/mixin.py", line 285, in _save                                                                                                  
[rank0]:     self.tokenizer.processor.save_pretrained(output_dir)                                                                                                                                      
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/processing_utils.py", line 488, in save_pretrained                                                   
[rank0]:     attribute.save_pretrained(save_directory)                                                                                                                                                 
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/image_processing_base.py", line 257, in save_pretrained                                              
[rank0]:     self.to_json_file(output_image_processor_file)                                                                                                                                            
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/image_processing_base.py", line 496, in to_json_file                                                 
[rank0]:     writer.write(self.to_json_string())                                                                                                                                                       
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/image_processing_base.py", line 485, in to_json_string                                               
[rank0]:     return json.dumps(dictionary, indent=2, sort_keys=True) + "\n"                                                                                                                            
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/__init__.py", line 234, in dumps                                                                                           
[rank0]:     return cls(                                                                                                                                                                               
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 201, in encode                                                                                           
[rank0]:     chunks = list(chunks)                                                                                                                                                                     
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 431, in _iterencode                                                                                      
[rank0]:     yield from _iterencode_dict(o, _current_indent_level)                                                                                                                                     
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 405, in _iterencode_dict                                                                                 
[rank0]:     yield from chunks                                                                                                                                                                         
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 438, in _iterencode                                                                                      
[rank0]:     o = _default(o)                                                                                                                                                                           
[rank0]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/json/encoder.py", line 179, in default                                                                                          
[rank0]:     raise TypeError(f'Object of type {o.__class__.__name__} '                                                                                                                                 
[rank0]: TypeError: Object of type function is not JSON serializable   
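The traceback shows that save_pretrained eventually calls json.dumps on the image processor's serialized dictionary, and one of the entries in that dictionary is a plain Python function, which the json module cannot encode. A minimal sketch of that failure mode (the keys below are hypothetical, not the actual mPLUG-Owl3 attributes), and of the general pattern of filtering out callables before dumping:

```python
import json

# Hypothetical image-processor dict: a function-valued entry makes json.dumps
# raise exactly the TypeError shown in the traceback above.
config_dict = {
    "image_size": 378,               # hypothetical value
    "crop_fn": lambda image: image,  # a callable that cannot be JSON-encoded
}

try:
    json.dumps(config_dict, indent=2, sort_keys=True)
except TypeError as err:
    print(err)  # Object of type function is not JSON serializable

# Dropping the callables before dumping avoids the crash.
serializable = {k: v for k, v in config_dict.items() if not callable(v)}
print(json.dumps(serializable, indent=2, sort_keys=True))
```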

2. If I resume training from a checkpoint, i.e. pass --resume_from_checkpoint output/mplug-owl3-7b-chat/v34-20240929-110829/checkpoint-2, the following error occurs:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/cli/sft.py", line 7, in <module>                                                                                                        
[rank3]:     sft_main()                                                                                                                                                                                
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/utils/run_utils.py", line 32, in x_main                                                                                                 
[rank3]:     result = llm_x(args, **kwargs)                                                                                                                                                            
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/sft.py", line 511, in llm_sft                                                                                                       
[rank3]:     model, template, callbacks = prepare_model_template_train(args, msg)                                                                                                                      
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/sft.py", line 202, in prepare_model_template_train                                                                                  
[rank3]:     model, tokenizer = get_model_tokenizer(                                                                                                                                                   
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/utils/model.py", line 6635, in get_model_tokenizer                                                                                  
[rank3]:     model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs, load_model, **kwargs)                                                                                               
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/utils/model.py", line 2717, in get_model_tokenizer_mplug_owl3                                                                       
[rank3]:     model, tokenizer = get_model_tokenizer_with_flash_attn(model_dir, torch_dtype, model_kwargs, load_model, **kwargs)                                                                        
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/utils/model.py", line 2699, in get_model_tokenizer_with_flash_attn                                                                  
[rank3]:     return get_model_tokenizer_from_repo(                                                                                                                                                     
[rank3]:   File "/data1/myself/Pretrain/ms-swift/swift/llm/utils/model.py", line 957, in get_model_tokenizer_from_repo                                                                         
[rank3]:     tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)                                                                                                              
[rank3]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/modelscope/utils/hf_util.py", line 114, in from_pretrained                                                        
[rank3]:     module_obj = module_class.from_pretrained(model_dir, *model_args,                                                                                                                         
[rank3]:   File "/data1/myself/miniconda3/envs/owl3/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 926, in from_pretrained
[rank3]:     raise ValueError(
[rank3]: ValueError: Unrecognized configuration class <class 'transformers_modules.checkpoint-2.configuration_mplugowl3.mPLUGOwl3Config'> to build an AutoTokenizer.
[rank3]: Model type should be one of AlbertConfig, AlignConfig, BarkConfig, BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BlipConfig, Blip2Config, BloomConfig, BridgeTowerConfig, BrosConfig, CamembertConfig, CanineConfig, ChameleonConfig, ChineseCLIPConfig, ClapConfig, CLIPConfig, CLIPSegConfig, ClvpConfig, LlamaConfig, CodeGenConfig, CohereConfig, ConvBertConfig, CpmAntConfig, CTRLConfig, Data2VecAudioConfig, Data2VecTextConfig, DbrxConfig, DebertaConfig, DebertaV2Config, DistilBertConfig, DPRConfig, ElectraConfig, ErnieConfig, ErnieMConfig, EsmConfig, FalconConfig, FastSpeech2ConformerConfig, FlaubertConfig, FNetConfig, FSMTConfig, FunnelConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, GPTSanJapaneseConfig, GroundingDinoConfig, GroupViTConfig, HubertConfig, IBertConfig, IdeficsConfig, Idefics2Config, InstructBlipConfig, InstructBlipVideoConfig, JambaConfig, JetMoeConfig, JukeboxConfig, Kosmos2Config, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LiltConfig, LlamaConfig, LlavaConfig, LlavaNextVideoConfig, LlavaNextConfig, LongformerConfig, LongT5Config, LukeConfig, LxmertConfig, M2M100Config, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MgpstrConfig, MistralConfig, MixtralConfig, MobileBertConfig, MPNetConfig, MptConfig, MraConfig, MT5Config, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NezhaConfig, NllbMoeConfig, NystromformerConfig, OlmoConfig, OneFormerConfig, OpenAIGPTConfig, OPTConfig, Owlv2Config, OwlViTConfig, PaliGemmaConfig, PegasusConfig, PegasusXConfig, PerceiverConfig, PersimmonConfig, PhiConfig, Phi3Config, Pix2StructConfig, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RagConfig, RealmConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RetriBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, SiglipConfig, Speech2TextConfig, Speech2Text2Config, SpeechT5Config, SplinterConfig, SqueezeBertConfig, StableLmConfig, Starcoder2Config, SwitchTransformersConfig, T5Config, TapasConfig, TransfoXLConfig, TvpConfig, UdopConfig, UMT5Config, VideoLlavaConfig, ViltConfig, VipLlavaConfig, VisualBertConfig, VitsConfig, Wav2Vec2Config, Wav2Vec2BertConfig, Wav2Vec2ConformerConfig, WhisperConfig, XCLIPConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YosoConfig.
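This second failure happens while rebuilding the tokenizer from the checkpoint directory: AutoTokenizer.from_pretrained is pointed at checkpoint-2, whose remote-code mPLUGOwl3Config is not in AutoTokenizer's mapping, so the lookup falls through to the ValueError above. A speculative workaround, not taken from this thread, would be to load the tokenizer from the original model directory rather than the checkpoint (the base-model path below is an assumption):

```python
from transformers import AutoTokenizer

# Assumed paths (illustrative): the original model repo vs. the resumed checkpoint.
base_model_dir = "mPLUG/mPLUG-Owl3-7B-240728"  # assumption; substitute your local base model dir
checkpoint_dir = "output/mplug-owl3-7b-chat/v34-20240929-110829/checkpoint-2"

# Loading from checkpoint_dir reproduces the ValueError above, because the
# checkpoint ships only the custom mPLUGOwl3Config, which AutoTokenizer cannot
# map to a tokenizer class. Loading from the base model directory, whose
# tokenizer files are complete, avoids that lookup path.
tokenizer = AutoTokenizer.from_pretrained(base_model_dir, trust_remote_code=True)
```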
goodstudent9 (Author)

download

https://mega.co.nz/#!qq4nATTK!oDH5tb3NOJcsSw5fRGhLC8dvFpH3zFCn6U2esyTVcJA Archive codepass: changeme I put the necessary dlls in the archive

Thank you for your reply!
But I don't understand it; your suggestion doesn't seem to address this bug.

Jintao-Huang (Collaborator) commented Sep 29, 2024

Are you using the model from Modelscope or from Hugging Face?

@modelscope modelscope deleted a comment Sep 29, 2024
goodstudent9 (Author) commented Sep 29, 2024 via email

goodstudent9 (Author) commented Sep 29, 2024 via email

LukeForeverYoung (Contributor)

For the first issue, could you please pull the latest repository from Hugging Face and try again? I have added a function named to_dict to the image processor.
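The actual fix lives in the model's Hugging Face repository; as a rough sketch of the general pattern such a to_dict override can follow (the class name below is hypothetical, modelled on the default ImageProcessingMixin.to_dict behaviour), it serializes only JSON-encodable attributes:

```python
import copy
from transformers.image_processing_utils import BaseImageProcessor

class ImageProcessorToDictSketch(BaseImageProcessor):
    """Hypothetical illustration of a to_dict override; not the upstream patch."""

    def to_dict(self):
        output = copy.deepcopy(self.__dict__)
        # Drop callables (e.g. helper transforms) so that
        # save_pretrained -> to_json_string -> json.dumps can succeed.
        output = {k: v for k, v in output.items() if not callable(v)}
        output["image_processor_type"] = self.__class__.__name__
        return output
```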

goodstudent9 (Author)

For the first issue, could you please pull the latest repository from Hugging Face and try again? I have added a function named to_dict to the image processor.

When I updated image_processing_mplugowl3.py from the latest Hugging Face repository and set USE_HF=1, both problems were solved!

Thank you so much for your quick and valuable help!
Best wishes!
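For reference, USE_HF=1 tells ms-swift to resolve the model and its remote code from Hugging Face rather than ModelScope. A minimal sketch of setting it programmatically (assuming sft_main is importable from swift.llm, as in the swift/cli/sft.py entry point shown in the tracebacks above); the shell equivalent is simply prefixing the launch command with USE_HF=1:

```python
import os

# Switch ms-swift to the Hugging Face hub before any model files are resolved
# (shell equivalent: USE_HF=1 swift sft ...).
os.environ["USE_HF"] = "1"

from swift.llm import sft_main  # entry point used by swift/cli/sft.py

if __name__ == "__main__":
    sft_main()
```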
