Upload Model to Kaggle #1512

SamanehSaadat · 2024-03-14T01:32:56Z

Implement upload_preset() to allow users to upload Model presets to Kaggle.

Wauplin · 2024-03-15T15:41:54Z

From what I understand from this PR and #1510 (comment), the goal is to be able to do something like this, right?

tokenizer.save_to_preset(dir)
backbone.save_to_preset(dir)
upload_preset("kaggle://user/model", dir)

Given that load_from_preset is able to load from a local dir or a kaggle uri, wouldn't it be nice to also allow both behaviors in save_to_preset? Something like this:

# Case 1.: save to local directory + upload it
tokenizer.save_to_preset(dir)
backbone.save_to_preset(dir)
upload_preset("kaggle://user/model", dir)

# Case 2.: upload directly (i.e. not saved locally or only to a tmp dir)
tokenizer.save_to_preset("kaggle://user/model")
backbone.save_to_preset("kaggle://user/model")

My suggestion is simply to have an "if it starts with the prefix, then save to a tmp dir + upload directly" branch in save_to_preset. In any case, I'm fine with any solution. From a huggingface_hub point of view it'll be straightforward to implement (similar to the kagglehub.model_upload(kaggle_handle, preset) line).
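
The "save to tmp dir + upload directly" suggestion above could look roughly like the following sketch. It is illustrative only and is not the code in this PR: `save_files` and `upload_dir` are hypothetical callables standing in for the real serialization and hub-upload logic, and the value of `KAGGLE_PREFIX` is assumed from the URIs used in this thread.

```python
import os
import tempfile

KAGGLE_PREFIX = "kaggle://"  # assumed value, matching the URIs in this thread


def save_to_preset_sketch(preset, save_files, upload_dir):
    """Save locally, or save to a temp dir and upload if `preset` is a hub URI.

    `save_files(directory)` and `upload_dir(uri, directory)` are hypothetical
    stand-ins for the real saving and uploading code.
    """
    if preset.startswith(KAGGLE_PREFIX):
        # Case 2: nothing is kept locally; the temp dir is removed afterwards.
        with tempfile.TemporaryDirectory() as tmp_dir:
            save_files(tmp_dir)
            upload_dir(preset, tmp_dir)
        return
    # Case 1: plain local save; uploading is a separate, explicit step.
    os.makedirs(preset, exist_ok=True)
    save_files(preset)
```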

SamanehSaadat · 2024-03-15T17:07:25Z

Thanks for reviewing and sharing your feedback, @Wauplin!

Right! Case 1 is how we're planning to implement the model upload: save to a local dir and then upload!

I think uploading directly (case 2) is a nice design. However, kagglehub has immutable model versions so in case 2, when the tokenizer is uploaded, it creates version X of the model and when backbone is uploaded later, it creates version X+1 of the model.

We need to have all the model components saved before uploading.
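
The versioning concern can be illustrated with a toy model. The sketch below does not call kagglehub; `FakeVersionedHub` is a hypothetical stand-in that only mimics the "every upload creates a new immutable version" behavior described above, and the file names are illustrative.

```python
class FakeVersionedHub:
    """Toy stand-in for a hub where each upload creates a new immutable version."""

    def __init__(self):
        self.versions = []  # one frozen set of file names per version

    def upload(self, files):
        self.versions.append(frozenset(files))
        return len(self.versions)  # version number, starting at 1


# Case 2: each component uploads on its own, so the files are split
# across two versions and no single version holds the whole model.
piecewise = FakeVersionedHub()
piecewise.upload(["tokenizer.json"])
piecewise.upload(["config.json", "model.weights.h5"])

# Case 1: stage every component locally, then upload once.
staged = FakeVersionedHub()
staged.upload(["tokenizer.json", "config.json", "model.weights.h5"])
```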

Wauplin · 2024-03-18T14:32:12Z

> I think uploading directly (case 2) is a nice design. However, kagglehub has immutable model versions so in case 2, when the tokenizer is uploaded, it creates version X of the model and when backbone is uploaded later, it creates version X+1 of the model.

Understood! Then leaving it as it is is fine I guess :) Thanks for the explanation!

Let me know once this is merged and I can contribute the hf:// integration right after.

mattdangerw · 2024-03-18T19:38:53Z

keras_nlp/models/preprocessor.py

@@ -96,6 +97,13 @@ def from_preset(
        )
        return cls(tokenizer=tokenizer, **kwargs)

+    def save_to_preset(


I actually think we have a bug here in the format of our one toy classification preset.

https://www.kaggle.com/models/keras/bert/frameworks/keras/variations/bert_tiny_en_uncased_sst2

Basically BertClassifier.from_preset("bert_tiny_en_uncased_sst2").preprocessor is not always the same as BertPreprocessor.from_preset("bert_tiny_en_uncased_sst2"). Or rather, it is currently, but only because the preprocessing layer happens to have all default parameters. If a preprocessor had custom config options when we saved it, they would not currently be reflected when the preprocessor is loaded.

We need to clean this up as part of our saving flow whatever we come up with. Save a preprocessor.json? Not sure, but let's discuss.

Thanks for bringing this up! I think adding a preprocessor.json is a good idea because it seems that we need it to be able to support saving/loading preprocessor separately.
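
A minimal sketch of the preprocessor.json idea, assuming a plain JSON config file: `ToyPreprocessor` and `PREPROCESSOR_CONFIG_FILE` are hypothetical names used only for illustration, not KerasNLP APIs. It shows how a saved config lets custom options survive a save/load round trip, and how loading falls back to defaults when no config was saved (the situation described above).

```python
import json
import os

PREPROCESSOR_CONFIG_FILE = "preprocessor.json"  # hypothetical file name


class ToyPreprocessor:
    """Hypothetical preprocessor with a single configurable option."""

    def __init__(self, sequence_length=512):
        self.sequence_length = sequence_length

    def get_config(self):
        return {"sequence_length": self.sequence_length}

    def save_to_preset(self, preset_dir):
        os.makedirs(preset_dir, exist_ok=True)
        path = os.path.join(preset_dir, PREPROCESSOR_CONFIG_FILE)
        with open(path, "w") as f:
            json.dump(self.get_config(), f)

    @classmethod
    def from_preset(cls, preset_dir):
        path = os.path.join(preset_dir, PREPROCESSOR_CONFIG_FILE)
        if not os.path.exists(path):
            # No saved config: custom options are lost and defaults are used.
            return cls()
        with open(path) as f:
            return cls(**json.load(f))
```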

mattdangerw · 2024-03-19T17:12:32Z

It's been helpful for me to list out some design principles we could follow here:

Idempotency. Save an object into the preset, get the same object out.
Able to break down to lower-level core Keras APIs. E.g. load a backbone with keras.saving.deserialize_keras_object(file) + model.load_weights(file).

I think we have those, except for the preprocessor issue above.

Also a question: if a user is saving a task preset and the preprocessor is None (either because it's just a custom function rather than a layer, or because it's unattached to the task), what do we do? If we want idempotent saving, then when you save a task with no preprocessing, you will load a task with no preprocessing. But we might want to indicate to users that this might hurt the usability of their preset. A warning?
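
One way the "warning" option could look, as a sketch only: `check_task_before_save` is a hypothetical helper, not part of this PR. It warns without failing, so saving stays idempotent (a task saved without preprocessing loads without preprocessing).

```python
import warnings


def check_task_before_save(preprocessor):
    """Warn, but do not fail, when a task is saved without a preprocessor.

    Returns True when a preprocessor is attached, False otherwise.
    """
    if preprocessor is None:
        warnings.warn(
            "Saving a task preset without a preprocessor. The preset will "
            "load without preprocessing, which may make it harder to use.",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True
```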

SamanehSaadat · 2024-03-19T19:02:43Z

Thanks for sharing your thoughts, Matt!
As we discussed in the meeting, we want to be able to support partial model upload, e.g. preprocessor without backbone or backbone without tokenizer. So we just need to build safeguards around it and make sure the user knows what they're doing.

mattdangerw

Looks good! Dropped some comments.

mattdangerw · 2024-03-22T19:27:51Z

keras_nlp/utils/preset_utils.py

+    if not os.path.exists(preset):
+        raise FileNotFoundError(f"The preset directory {preset} doesn't exist.")
+
+    if uri.startswith(KAGGLE_PREFIX):


Let's factor this into a separate upload_directory call or something like that. That way we have an easy extension point for @Wauplin's PR.

I'm not quite sure if I fully understand this comment. I think it's good to separate validation from upload so I moved the validation code before this if. Right now, HF can do the following:

if uri.startswith(HF_PREFIX):
    hf_handle = uri.removeprefix(HF_PREFIX)
    hf_hub.model_upload(hf_handle, preset)

Do you mean to put this in a separate function to prevent repetition for different hubs?

Yeah, let's scratch this comment. I was thinking of encapsulating the file download/upload methods in one spot in this file, so that they are easy to extend and separated from our saving/loading business logic. But I don't think there is a huge difference. Let's leave as is.

if uri.startswith(HF_PREFIX):
    hf_handle = uri.removeprefix(HF_PREFIX)
    huggingface_hub.upload_folder(hf_handle, preset)

Wonderful if I can add HF support with only this code change! 👍

Yeah, I think that's pretty much it!
And we can do this now that the PR is merged!
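
The shape discussed in this thread (validate first, then branch on the URI prefix) can be sketched as below. This is not the merged implementation: `upload_preset_sketch` and the `uploaders` mapping are hypothetical, the prefix values are assumed from the URIs mentioned in the conversation, and the real code would call the hub client libraries instead of injected callables.

```python
import os

KAGGLE_PREFIX = "kaggle://"  # assumed value
HF_PREFIX = "hf://"  # assumed value


def upload_preset_sketch(uri, preset, uploaders):
    """Validate `preset`, then hand it to the uploader matching the URI prefix.

    `uploaders` maps a prefix to a callable `(handle, preset_dir)`. Returns
    the handle (the URI with its prefix stripped).
    """
    # Validation happens once, before any hub-specific branch.
    if not os.path.exists(preset):
        raise FileNotFoundError(f"The preset directory {preset} doesn't exist.")
    for prefix, upload in uploaders.items():
        if uri.startswith(prefix):
            handle = uri[len(prefix):]
            upload(handle, preset)
            return handle
    raise ValueError(
        f"Unknown URI {uri!r}. Expected a prefix in {sorted(uploaders)}."
    )
```

With this layout, supporting another hub amounts to registering one more prefix, which matches the observation above that the Hugging Face branch is only a few lines.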

mattdangerw · 2024-03-22T19:33:02Z

@fchollet what do you think about where we expose our new symbol?

keras_nlp.upload_preset() vs keras_nlp.utils.upload_preset() vs keras_nlp.models.upload_preset()?

fchollet · 2024-03-22T19:43:10Z

> keras_nlp.upload_preset() vs keras_nlp.utils.upload_preset() vs keras_nlp.models.upload_preset()?

Is a preset always model-related? If not, I'd recommend keras_nlp.upload_preset().

mattdangerw

LGTM! Just a couple last comments.

Wauplin

Made a small review and can confirm adding HF integration will be very straightforward once this PR is merged. Thanks for pinging me :)

Wauplin · 2024-03-25T10:31:12Z

keras_nlp/tokenizers/tokenizer.py

+        Args:
+            preset: The path to the local model preset directory.
+        """
+        save_to_preset(self, preset, config_filename="tokenizer.json")


Small nit, but you can do

from keras_nlp.utils.preset_utils import save_to_preset, TOKENIZER_CONFIG_FILE

above and then apply this suggested change here:

-        save_to_preset(self, preset, config_filename="tokenizer.json")
+        save_to_preset(self, preset, config_filename=TOKENIZER_CONFIG_FILE)

Good idea! Thanks! Done!

Wauplin · 2024-03-25T10:32:31Z

keras_nlp/utils/preset_utils.py

+            return
+        else:
+            raise FileNotFoundError(
+                f"`tokenizer.json` is missing from the preset directory `{preset}`. "


small nit but I'd reuse the CONFIG_FILE and TOKENIZER_CONFIG_FILE constants in the errors below
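
A sketch of what reusing the constants in the error messages could look like. `require_file` is a hypothetical helper and the value of `CONFIG_FILE` is assumed; only the "tokenizer.json" name and the error wording come from the diff above.

```python
import os

CONFIG_FILE = "config.json"  # assumed value
TOKENIZER_CONFIG_FILE = "tokenizer.json"


def require_file(preset, filename):
    """Return the path to `filename` inside `preset`, or raise if it is absent.

    The message is built from the constant that is passed in, so the file
    name is never hard-coded in the error text.
    """
    path = os.path.join(preset, filename)
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"`{filename}` is missing from the preset directory `{preset}`."
        )
    return path
```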

* Initial Kaggle upload.
* Address review comments.
* Add upload valiations.
* Address review comments.
* Fix init.
* Address review comments.
* Improve error handling.
* Address review comments.

SamanehSaadat added 2 commits March 13, 2024 21:54

Initial Kaggle upload.

d9eada8

Address review comments.

7b18970

mattdangerw mentioned this pull request Mar 14, 2024

Add from_huggingface method to KerasNLP models #1294

Open

mattdangerw reviewed Mar 18, 2024

Merge branch 'keras-team:master' into kaggle-upload

e578789

SamanehSaadat added 2 commits March 22, 2024 12:04

Merge branch 'keras-team:master' into kaggle-upload

4d5faae

Add upload valiations.

42f43d9

mattdangerw reviewed Mar 22, 2024

SamanehSaadat added 2 commits March 22, 2024 23:18

Address review comments.

5beff57

Fix init.

713a60f

mattdangerw approved these changes Mar 23, 2024

mattdangerw marked this pull request as ready for review March 23, 2024 00:17

SamanehSaadat added 2 commits March 23, 2024 00:51

Address review comments.

a2e8c92

Improve error handling.

ca2992c

Wauplin approved these changes Mar 25, 2024

Address review comments.

b378d27

SamanehSaadat merged commit 0dc383c into keras-team:master Mar 25, 2024
10 checks passed

mattdangerw mentioned this pull request Mar 25, 2024

Allow saving / loading from Huggingface Hub preset #1510

Merged
