Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upload Model to Kaggle #1512

Merged
merged 10 commits into from
Mar 25, 2024
Merged

Conversation

SamanehSaadat
Copy link
Member

Implement upload_preset() to allow users to upload Model presets to Kaggle.

@Wauplin
Copy link
Contributor

Wauplin commented Mar 15, 2024

From what I understand from this PR and #1510 (comment), the goal is to be able to do something like this, right?

tokenizer.save_to_preset(dir)
backbone.save_to_preset(dir)
upload_preset("kaggle://user/model", dir)

Given that load_from_preset is able to load from a local dir or a kaggle uri, wouldn't it be nice to also allow both behaviors in save_to_preset? Something like this:

# Case 1.: save to local directory + upload it
tokenizer.save_to_preset(dir)
backbone.save_to_preset(dir)
upload_preset("kaggle://user/model", dir)

# Case 2.: upload directly (i.e. not saved locally or only to a tmp dir)
tokenizer.save_to_preset("kaggle://user/model")
backbone.save_to_preset("kaggle://user/model")

My suggestion is simply to have a "if starts with prefix, then save to tmp dir + upload directly" in save_to_preset. In any case, I'm fine with any solution. From an huggingface_hub point of view it'll be straightforward to implement (similar to the kagglehub.model_upload(kaggle_handle, preset) line).

@SamanehSaadat
Copy link
Member Author

Thanks for reviewing and sharing your feedback, @Wauplin!

Right! Case 1 is how we're planning to implement the model upload: save to a local dir and then upload!

I think uploading directly (case 2) is a nice design. However, kagglehub has immutable model versions so in case 2, when the tokenizer is uploaded, it creates version X of the model and when backbone is uploaded later, it creates version X+1 of the model.

We need to have all the model components saved before uploading.

@Wauplin
Copy link
Contributor

Wauplin commented Mar 18, 2024

I think uploading directly (case 2) is a nice design. However, kagglehub has immutable model versions so in case 2, when the tokenizer is uploaded, it creates version X of the model and when backbone is uploaded later, it creates version X+1 of the model.

Understood! Then leaving it as it is is fine I guess :) Thanks for the explanation!

Let me know once this is merged and I can contribute the hf:// integration right after.

@@ -96,6 +97,13 @@ def from_preset(
)
return cls(tokenizer=tokenizer, **kwargs)

def save_to_preset(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually thing we have a bug here in the format of our one toy classification preset.

https://www.kaggle.com/models/keras/bert/frameworks/keras/variations/bert_tiny_en_uncased_sst2

Basically BertClassifier.from_preset("bert_tiny_en_uncased_sst2").preprocessor is not always the same as BertPreprocessor.from_preset("bert_tiny_en_uncased_sst2"). Or rather, it is currently, but only because the preprocessing layer happens to have all default parameters. If a preprocessor had custom config options when we saved that would not currently get reflected during preprocessor load.

We need to clean this up as part of our saving flow whatever we come up with. Save a preprocessor.json? Not sure, but let's discuss.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up! I think adding a preprocessor.json is a good idea because it seems that we need it to be able to support saving/loading preprocessor separately.

@mattdangerw
Copy link
Member

It's been helpful for me to list out some design principals we could follow here:

  • Idempotency. Save an object into the preset, get the same object out.
  • Able to break down to lower-level core Keras APIs. E.g. load a backbone with keras.saving.deserialize_keras_object(file) + model.load_weights(file).

I think we have those, except for the preprocessor issue above.

Also a question, if a user is saving a task preset, and the preprocessor is None (either it's just a custom function not a layer, or unattached to the task). What do we do? If we want idempotent saving, you save a task with no preprocessing, you will load a task with no preprocessing. But we might want to indicate to users this might hurt the usability of their preset. A warning?

@SamanehSaadat
Copy link
Member Author

Thanks for sharing your thoughts, Matt!
As we discussed in the meeting, we want to be able to support partial model upload, e.g. preprocessor without backbone or backbone without tokenizer. So we just need to build safeguards around it and make sure the user knows what they're doing.

Copy link
Member

@mattdangerw mattdangerw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Dropped some comments.

if not os.path.exists(preset):
raise FileNotFoundError(f"The preset directory {preset} doesn't exist.")

if uri.startswith(KAGGLE_PREFIX):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's factor this into a separate upload_directory call or something like that. That way we have an easy extension point for @Wauplin's PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure if I fully understand this comment. I think it's good to separate validation from upload so I moved the validation code before this if. Right now, HF can do the following:

if uri.startswith(HF_PREFIX):
    hf_handle = uri.removeprefix(HF_PREFIX)
    hf_hub.model_upload(hf_handle, preset)

Do you mean to put this in a separate function to prevent repetition for different hubs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's scratch this comment. I was thinking to encapsulate the file download/upload methods to one spot in this file so it's easy to extend to and separated from our saving loading business logic. But I don't think there is a huge difference. Let's leave as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if uri.startswith(HF_PREFIX):
   hf_handle = uri.removeprefix(HF_PREFIX)
   huggingface_hub.upload_folder(hf_handle, preset)

Wonderful if I can add HF support with only this code change! 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's pretty much it!
And we can do this now that the PR is merged!

@mattdangerw
Copy link
Member

@fchollet what do you think about where we expose our new symbol?

keras_nlp.upload_preset() vs keras_nlp.utils.upload_preset() vs keras_nlp.models.upload_preset()?

@fchollet
Copy link
Collaborator

keras_nlp.upload_preset() vs keras_nlp.utils.upload_preset() vs keras_nlp.models.upload_preset()?

Is a preset always model-related? If not, I'd recommend keras_nlp.upload_preset().

Copy link
Member

@mattdangerw mattdangerw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a couple last comments.

if not os.path.exists(preset):
raise FileNotFoundError(f"The preset directory {preset} doesn't exist.")

if uri.startswith(KAGGLE_PREFIX):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's scratch this comment. I was thinking to encapsulate the file download/upload methods to one spot in this file so it's easy to extend to and separated from our saving loading business logic. But I don't think there is a huge difference. Let's leave as is.

@mattdangerw mattdangerw marked this pull request as ready for review March 23, 2024 00:17
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a small review and can confirm adding HF integration will be very straightforward once this PR is merged. Thanks for pinging me :)

Args:
preset: The path to the local model preset directory.
"""
save_to_preset(self, preset, config_filename="tokenizer.json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit but can do

from keras_nlp.utils.preset_utils import save_to_preset, TOKENIZER_CONFIG_FILE

above and then

Suggested change
save_to_preset(self, preset, config_filename="tokenizer.json")
save_to_preset(self, preset, config_filename=TOKENIZER_CONFIG_FILE)

here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Thanks! Done!

return
else:
raise FileNotFoundError(
f"`tokenizer.json` is missing from the preset directory `{preset}`. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit but I'd reuse the CONFIG_FILE and TOKENIZER_CONFIG_FILE constants in the errors below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

if not os.path.exists(preset):
raise FileNotFoundError(f"The preset directory {preset} doesn't exist.")

if uri.startswith(KAGGLE_PREFIX):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if uri.startswith(HF_PREFIX):
   hf_handle = uri.removeprefix(HF_PREFIX)
   huggingface_hub.upload_folder(hf_handle, preset)

Wonderful if I can add HF support with only this code change! 👍

@SamanehSaadat SamanehSaadat merged commit 0dc383c into keras-team:master Mar 25, 2024
10 checks passed
abuelnasr0 pushed a commit to abuelnasr0/keras-nlp that referenced this pull request Apr 2, 2024
* Initial Kaggle upload.

* Address review comments.

* Add upload valiations.

* Address review comments.

* Fix init.

* Address review comments.

* Improve error handling.

* Address review comments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants