Upload Model to Kaggle #1512
From what I understand from this PR and #1510 (comment), the goal is to be able to do something like this, right?
tokenizer.save_to_preset(dir)
backbone.save_to_preset(dir)
upload_preset("kaggle://user/model", dir)
Given that
# Case 1: save to local directory + upload it
tokenizer.save_to_preset(dir)
backbone.save_to_preset(dir)
upload_preset("kaggle://user/model", dir)
# Case 2: upload directly (i.e. not saved locally or only to a tmp dir)
tokenizer.save_to_preset("kaggle://user/model")
backbone.save_to_preset("kaggle://user/model")
My suggestion is simply to have an "if starts with prefix, then save to tmp dir + upload directly" in
Thanks for reviewing and sharing your feedback, @Wauplin! Right! Case 1 is how we're planning to implement the model upload: save to a local dir and then upload! I think uploading directly (case 2) is a nice design. However, kagglehub has immutable model versions, so in case 2, when the tokenizer is uploaded, it creates version We need to have all the model components saved before uploading.
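The versioning concern can be illustrated with a toy in-memory hub. This is only a sketch: `ToyHub`, its `upload` method, and `write_file` are hypothetical stand-ins, not the kagglehub API, and the file names are placeholders. The one assumption carried over from the discussion is that every upload creates a new immutable version.

```python
import os
import tempfile


class ToyHub:
    """Hypothetical hub where every upload creates a new immutable version."""

    def __init__(self):
        self.versions = []  # one sorted file list per uploaded version

    def upload(self, directory):
        self.versions.append(sorted(os.listdir(directory)))
        return len(self.versions)  # 1-based version number


def write_file(directory, name):
    # Write a placeholder JSON file standing in for a saved component.
    with open(os.path.join(directory, name), "w") as f:
        f.write("{}")


# Case 2: each component uploads itself, so no single version holds both files.
direct = ToyHub()
for name in ("tokenizer.json", "config.json"):
    with tempfile.TemporaryDirectory() as tmp:
        write_file(tmp, name)
        direct.upload(tmp)

# Case 1: save every component locally first, then upload the directory once.
staged = ToyHub()
with tempfile.TemporaryDirectory() as preset_dir:
    write_file(preset_dir, "tokenizer.json")
    write_file(preset_dir, "config.json")
    staged.upload(preset_dir)

print(direct.versions)
print(staged.versions)
```

In this toy model the direct path ends up with two versions that each contain a single component, while the staged path produces one complete version.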
Understood! Then leaving it as it is is fine I guess :) Thanks for the explanation! Let me know once this is merged and I can contribute the
keras_nlp/models/preprocessor.py
Outdated
@@ -96,6 +97,13 @@ def from_preset(
        )
        return cls(tokenizer=tokenizer, **kwargs)

    def save_to_preset(
I actually think we have a bug here in the format of our one toy classification preset.
https://www.kaggle.com/models/keras/bert/frameworks/keras/variations/bert_tiny_en_uncased_sst2
Basically BertClassifier.from_preset("bert_tiny_en_uncased_sst2").preprocessor is not always the same as BertPreprocessor.from_preset("bert_tiny_en_uncased_sst2"). Or rather, it is currently, but only because the preprocessing layer happens to have all default parameters. If a preprocessor had custom config options when we saved, that would not currently get reflected during preprocessor load.
We need to clean this up as part of our saving flow, whatever we come up with. Save a preprocessor.json? Not sure, but let's discuss.
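A toy illustration of why a separate preprocessor config file would help. This is a minimal sketch, not KerasNLP code: `ToyPreprocessor`, its `sequence_length` option, and the `PREPROCESSOR_CONFIG_FILE` constant are hypothetical names chosen for the example.

```python
import json
import os
import tempfile

# Hypothetical constant; the file name is the one floated in the discussion.
PREPROCESSOR_CONFIG_FILE = "preprocessor.json"


class ToyPreprocessor:
    """Hypothetical preprocessor with a single configurable option."""

    def __init__(self, sequence_length=512):
        self.sequence_length = sequence_length

    def get_config(self):
        return {"sequence_length": self.sequence_length}

    def save_to_preset(self, preset):
        # Persist the non-default options next to the other preset files.
        path = os.path.join(preset, PREPROCESSOR_CONFIG_FILE)
        with open(path, "w") as f:
            json.dump(self.get_config(), f)

    @classmethod
    def from_preset(cls, preset):
        path = os.path.join(preset, PREPROCESSOR_CONFIG_FILE)
        if not os.path.exists(path):
            # No saved config: custom options are silently lost.
            return cls()
        with open(path) as f:
            return cls(**json.load(f))


with tempfile.TemporaryDirectory() as preset:
    without_file = ToyPreprocessor.from_preset(preset)
    ToyPreprocessor(sequence_length=128).save_to_preset(preset)
    restored = ToyPreprocessor.from_preset(preset)

print(without_file.sequence_length, restored.sequence_length)
```

Without the config file the loader can only fall back to defaults; once the file is written, the custom `sequence_length` survives the round trip.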
Thanks for bringing this up! I think adding a preprocessor.json is a good idea because it seems that we need it to be able to support saving/loading the preprocessor separately.
It's been helpful for me to list out some design principles we could follow here:
I think we have those, except for the preprocessor issue above. Also a question, if a user is saving a task preset, and the
Thanks for sharing your thoughts, Matt!
Looks good! Dropped some comments.
    if not os.path.exists(preset):
        raise FileNotFoundError(f"The preset directory {preset} doesn't exist.")

    if uri.startswith(KAGGLE_PREFIX):
Let's factor this into a separate upload_directory call or something like that. That way we have an easy extension point for @Wauplin's PR.
I'm not quite sure if I fully understand this comment. I think it's good to separate validation from upload, so I moved the validation code before this if. Right now, HF can do the following:
if uri.startswith(HF_PREFIX):
    hf_handle = uri.removeprefix(HF_PREFIX)
    hf_hub.model_upload(hf_handle, preset)
Do you mean to put this in a separate function to prevent repetition for different hubs?
Yeah, let's scratch this comment. I was thinking of encapsulating the file download/upload methods in one spot in this file so they are easy to extend and separated from our saving/loading business logic. But I don't think there is a huge difference. Let's leave as is.
if uri.startswith(HF_PREFIX):
    hf_handle = uri.removeprefix(HF_PREFIX)
    huggingface_hub.upload_folder(repo_id=hf_handle, folder_path=preset)
Wonderful if I can add HF support with only this code change! 👍
Yeah, I think that's pretty much it!
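The extension point discussed in this thread could look roughly like the following. It is a sketch under stated assumptions rather than the merged implementation: the prefix values, the `UPLOADERS` table, and both `_upload_to_*` functions are hypothetical, and the uploaders are stubs that only record their arguments instead of calling kagglehub or huggingface_hub.

```python
import os
import tempfile

KAGGLE_PREFIX = "kaggle://"  # assumed value, based on the URIs in this thread
HF_PREFIX = "hf://"  # assumed value

calls = []  # records what the stub uploaders were asked to do


def _upload_to_kaggle(handle, preset):
    # Stub: a real implementation would call into kagglehub here.
    calls.append(("kaggle", handle, preset))


def _upload_to_hf(handle, preset):
    # Stub: a real implementation would call into huggingface_hub here.
    calls.append(("hf", handle, preset))


# Adding a new hub means adding one entry to this table.
UPLOADERS = {KAGGLE_PREFIX: _upload_to_kaggle, HF_PREFIX: _upload_to_hf}


def upload_preset(uri, preset):
    # Validate first, so no hub-specific code runs on a bad directory.
    if not os.path.exists(preset):
        raise FileNotFoundError(f"The preset directory {preset} doesn't exist.")
    for prefix, uploader in UPLOADERS.items():
        if uri.startswith(prefix):
            uploader(uri[len(prefix):], preset)
            return
    raise ValueError(f"Unknown URI scheme in `{uri}`.")


with tempfile.TemporaryDirectory() as preset_dir:
    upload_preset("kaggle://user/model", preset_dir)

print(calls[0][:2])
```

Keeping validation ahead of the dispatch loop mirrors the separation of validation from upload described earlier in the thread.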
And we can do this now that the PR is merged!
@fchollet what do you think about where we expose our new symbol?
Is a preset always model-related? If not, I'd recommend
LGTM! Just a couple last comments.
Made a small review and can confirm adding HF integration will be very straightforward once this PR is merged. Thanks for pinging me :)
keras_nlp/tokenizers/tokenizer.py
Outdated
        Args:
            preset: The path to the local model preset directory.
        """
        save_to_preset(self, preset, config_filename="tokenizer.json")
Small nit but can do
from keras_nlp.utils.preset_utils import save_to_preset, TOKENIZER_CONFIG_FILE
above and then
- save_to_preset(self, preset, config_filename="tokenizer.json")
+ save_to_preset(self, preset, config_filename=TOKENIZER_CONFIG_FILE)
here
Good idea! Thanks! Done!
keras_nlp/utils/preset_utils.py
Outdated
        return
    else:
        raise FileNotFoundError(
            f"`tokenizer.json` is missing from the preset directory `{preset}`. "
Small nit, but I'd reuse the CONFIG_FILE and TOKENIZER_CONFIG_FILE constants in the errors below.
Done!
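Reusing the constants in the error messages might look like the sketch below. The helper name `check_preset_files` and the value of `CONFIG_FILE` are assumptions made for illustration, and the actual validation in the PR may treat some files as optional.

```python
import os
import tempfile

CONFIG_FILE = "config.json"  # assumed value of the constant
TOKENIZER_CONFIG_FILE = "tokenizer.json"


def check_preset_files(preset):
    """Raise FileNotFoundError naming the first required file that is missing."""
    for filename in (CONFIG_FILE, TOKENIZER_CONFIG_FILE):
        if not os.path.exists(os.path.join(preset, filename)):
            # The message is built from the constant, so it cannot drift
            # from the file name that is actually checked.
            raise FileNotFoundError(
                f"`{filename}` is missing from the preset directory `{preset}`."
            )


with tempfile.TemporaryDirectory() as preset:
    # Create only the model config so the tokenizer check fails.
    open(os.path.join(preset, CONFIG_FILE), "w").close()
    try:
        check_preset_files(preset)
        message = ""
    except FileNotFoundError as e:
        message = str(e)

print(message)
```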
* Initial Kaggle upload.
* Address review comments.
* Add upload validations.
* Address review comments.
* Fix init.
* Address review comments.
* Improve error handling.
* Address review comments.
Implement upload_preset() to allow users to upload Model presets to Kaggle.