
Making ModelHubMixin.save_pretrained save self.config on demand for integrated libraries #2102

Closed
Natooz opened this issue Mar 8, 2024 · 5 comments


Natooz commented Mar 8, 2024

Hi there 👋

Problem

Since v0.21, ModelHubMixin.save_pretrained automatically saves self.config after calling self._save_pretrained.
This can be problematic if an integrated library subclassing ModelHubMixin already handles saving its self.config in its own _save_pretrained method. It results either in the config being saved twice, or in a crash if self.config is not serializable.

Solution

Integrated libraries can override ModelHubMixin.save_pretrained to remove this part, as currently done in MidiTok (Natooz/MidiTok#150).
This is not an ideal solution, however: libraries should override as little as possible, so that they benefit from future version updates and nothing breaks down the line.

To solve this, ModelHubMixin could implement a protected attribute specifying whether to automatically save self.config, so that integrated libraries can control this part of the process without overriding the parent method.

cc @Wauplin 🤗
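A minimal sketch of the proposed opt-out, assuming a hypothetical `_auto_save_config` class attribute (the attribute name and the simplified `save_pretrained` body are illustrative, not the actual huggingface_hub implementation):

```python
# Illustrative sketch only: `_auto_save_config` and this simplified
# `save_pretrained` are NOT the real huggingface_hub API.
import json
from pathlib import Path


class ModelHubMixin:
    # Subclasses set this to False when their _save_pretrained
    # already persists the config itself.
    _auto_save_config: bool = True

    def save_pretrained(self, save_directory: str) -> None:
        directory = Path(save_directory)
        directory.mkdir(parents=True, exist_ok=True)
        self._save_pretrained(directory)
        # Only write config.json when the subclass opted in.
        if self._auto_save_config and getattr(self, "config", None) is not None:
            (directory / "config.json").write_text(json.dumps(self.config))

    def _save_pretrained(self, save_directory: Path) -> None:
        raise NotImplementedError


class MyTokenizer(ModelHubMixin):
    # The config is bundled inside tokenizer.json, so the parent
    # class should not write a separate config.json.
    _auto_save_config = False

    def __init__(self):
        self.config = {"vocab_size": 100}

    def _save_pretrained(self, save_directory: Path) -> None:
        data = {"config": self.config, "vocab": ["a", "b"]}
        (save_directory / "tokenizer.json").write_text(json.dumps(data))
```

A subclass whose _save_pretrained already bundles the config simply sets the flag to False, and the parent no longer writes a separate config.json.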


Natooz commented Mar 8, 2024

Btw, I assume the PushToHubMixin mixin used for transformers.PreTrainedTokenizerBase will be replaced by ModelHubMixin, or by something from the hub client, in the future?


Wauplin commented Mar 8, 2024

Hi @Natooz, thanks for letting me know, and sorry about the breaking change this caused in MidiTok 😬 Just to be sure I understand: what exactly was the issue you got? A non-serializable config, or an overwrite?

We could fix the first with a try/except.
For the latter, we could check whether config.json already exists and save it only if it is missing.

What do you think?
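Both guards could look roughly like this (an illustrative sketch with a hypothetical `maybe_save_config` helper, not the actual huggingface_hub code):

```python
# Sketch of the two suggested guards; `maybe_save_config` is a
# hypothetical helper, not part of huggingface_hub.
import json
from pathlib import Path


def maybe_save_config(config, save_directory: str) -> bool:
    """Write `config` to config.json unless one already exists or the
    config is not JSON-serializable. Returns True if a file was written."""
    path = Path(save_directory) / "config.json"
    if path.exists():
        # _save_pretrained already wrote a config: don't overwrite it.
        return False
    try:
        payload = json.dumps(config)
    except TypeError:
        # Non-serializable config: skip saving instead of crashing.
        return False
    path.write_text(payload)
    return True
```

The existence check handles the overwrite case, and the try/except handles the non-serializable case, without requiring any change in the subclasses.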


Natooz commented Mar 8, 2024

The main issue in my case is that the child _save_pretrained method already saves self.config within another JSON file that encompasses the whole tokenizer (its config plus other things such as the vocabulary). In short, there is no need for the parent save_pretrained to save self.config in the first place.
Hence adding a condition controlling whether the config is saved would make this "use case" possible.

The reason I mentioned the tokenizers from the transformers lib is that I assume they would want the same behaviour if they were ever to subclass ModelHubMixin (but maybe that isn't planned at all).


Wauplin commented Mar 11, 2024

Thanks for explaining! I opened a PR to fix it: #2105 :)


Wauplin commented Apr 23, 2024

Fixed by #2142.

@Wauplin Wauplin closed this as completed Apr 23, 2024