
Write a light-weight benchmarking script to quickly evaluate our models #634

Closed
chenmoneygithub opened this issue Jan 4, 2023 · 9 comments
Labels
stat:contributions welcome

Comments

@chenmoneygithub
Contributor

chenmoneygithub commented Jan 4, 2023

The code should go into keras_nlp/benchmarks.

We can use the IMDB sentiment analysis task; guidance can be found here.

One challenging point is that we want this script to evaluate all of our Classifier models without writing custom code. Since every Classifier has a matching Preprocessor, and they follow the unified naming format {model_name}Classifier/{model_name}Preprocessor (e.g., BertClassifier/BertPreprocessor), we should be able to make the code reusable by adding a model_name flag.

Here are the requirements in more detail:

  • example file name: keras_nlp/benchmarks/sentiment_analysis.py
  • example running command:
    python keras_nlp/benchmarks/sentiment_analysis.py \
        --model="bert" \
        --preset="bert_small_en_uncased" \
        --learning_rate=5e-5 \
        --num_epochs=5 \
        --batch_size=32
    
    The --model flag specifies the model name, and --preset specifies the preset under test. --preset may be None, while --model is required. The other flags are common training flags. A sketch of what such a script could look like follows this list.
  • output: print out a few metrics, including
    • validation accuracy/F1 for each epoch.
    • testing accuracy/F1 after training is done.
    • total elapsed time (in seconds).
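
A minimal sketch of the script described above, assuming absl for flag parsing, tensorflow_datasets for the IMDB data, and the {model_name}Classifier resolution mentioned earlier; flag handling, preset names, and metrics here are illustrative rather than prescriptive:

```python
# keras_nlp/benchmarks/sentiment_analysis.py -- illustrative sketch only.
import time

import keras_nlp
import tensorflow_datasets as tfds
from absl import app, flags
from tensorflow import keras

FLAGS = flags.FLAGS
flags.DEFINE_string("model", None, "Model class prefix, e.g. 'Bert'.")
flags.DEFINE_string("preset", None, "Preset under test, e.g. 'bert_small_en_uncased'.")
flags.DEFINE_float("learning_rate", 5e-5, "Learning rate.")
flags.DEFINE_integer("num_epochs", 5, "Number of training epochs.")
flags.DEFINE_integer("batch_size", 32, "Batch size.")


def main(_):
    # Resolve {model_name}Classifier by name so one script covers every model.
    # This assumes --model is the capitalized class prefix ("Bert"); supporting
    # lowercase names like "bert" would need a small lookup or renaming step.
    classifier_cls = getattr(keras_nlp.models, f"{FLAGS.model}Classifier")
    # For simplicity this sketch assumes a preset is always given.
    classifier = classifier_cls.from_preset(FLAGS.preset, num_classes=2)

    # IMDB reviews: raw review text in, binary sentiment label out.
    train_ds, val_ds, test_ds = tfds.load(
        "imdb_reviews",
        split=["train[:90%]", "train[90%:]", "test"],
        as_supervised=True,
        batch_size=FLAGS.batch_size,
    )

    classifier.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=keras.optimizers.Adam(FLAGS.learning_rate),
        metrics=["accuracy"],  # an F1 metric could be added here as well
    )

    start = time.time()
    classifier.fit(train_ds, validation_data=val_ds, epochs=FLAGS.num_epochs)
    elapsed = time.time() - start

    _, test_accuracy = classifier.evaluate(test_ds)
    print(f"Test accuracy: {test_accuracy:.4f}")
    print(f"Total elapsed time: {elapsed:.2f}s")


if __name__ == "__main__":
    app.run(main)
```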
@chenmoneygithub added the stat:contributions welcome label Jan 4, 2023
@jbischof
Contributor

jbischof commented Jan 5, 2023

@chenmoneygithub this isn't enough information to solicit contributions. What do you want to benchmark? What is the desired output? Should this be a direct GCP integration or just a Python script?

@chenmoneygithub
Contributor Author

@jbischof we can add strict requirements on outputs/metrics/logging later. I am keeping this flexible so that any runnable IMDb review sentiment analysis script is welcome, and contributors can specify their own metrics.

I don't think contributors will bother with cloud integration; they don't have access to our GCP project.

@jbischof
Contributor

jbischof commented Jan 5, 2023

@chenmoneygithub that's one reason I'm not sure this is appropriate for contributors. Either way, we need a lot more details!

@chenmoneygithub
Contributor Author

Sure! I don't want the description to read like an article (personally I am discouraged from reading those), so if a contributor expresses interest, I will provide more details to them directly.

@mattdangerw
Member

My take is that it is useful to show the usage we want when we can.

  • For a new API, we can basically write the key docstring examples in the issue description.
  • For a tool like this, we could show the command line invocations we would like to support, and give a little detail on what the output should be.

That will give potential contributors useful information and make sure we get something back that is in line with our expectations.

@snoringpig

Hi! I'm interested in trying this out - still reading the details. To clarify, are all models under this directory (https://github.com/keras-team/keras-nlp/tree/master/keras_nlp/models) "classifier models"? Thanks!

@mattdangerw
Member

mattdangerw commented Jan 11, 2023

@snoringpig only the classes with Classifier in the name are classifiers, e.g. BertClassifier and RobertaClassifier.

The other main modeling classes we have are backbones, like BertBackbone. These are not specialized to a task, so they would be more work (with little gain) to add to our benchmarking suite right now.
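
To make the split concrete, a small illustrative snippet (the preset name is just an example):

```python
import keras_nlp

# Task model: raw strings in, classification logits out (preprocessing included).
classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased", num_classes=2
)

# Backbone: token ids in, embeddings out; no task head, so it cannot be
# benchmarked on IMDB without extra code.
backbone = keras_nlp.models.BertBackbone.from_preset("bert_tiny_en_uncased")
```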

@chenmoneygithub the issue description looks good. I might add a few outputs.

  • Let's print out the train_step time.
  • Let's print out the test hardware via tf.config.list_physical_devices.

Then the output of these benchmarks can be a nice little report we can copy-paste elsewhere. Wdyt?
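
A rough sketch of how those two extra outputs could be collected; the StepTimer callback below is a hypothetical helper, not an existing API:

```python
import time

import tensorflow as tf
from tensorflow import keras


class StepTimer(keras.callbacks.Callback):
    """Hypothetical callback that records the time of each train_step."""

    def on_train_begin(self, logs=None):
        self.times = []

    def on_train_batch_begin(self, batch, logs=None):
        self.start = time.time()

    def on_train_batch_end(self, batch, logs=None):
        self.times.append(time.time() - self.start)


# Report hardware up front, then pass the callback to fit():
#   timer = StepTimer()
#   classifier.fit(train_ds, epochs=..., callbacks=[timer])
#   print(f"Avg train_step time: {sum(timer.times) / len(timer.times):.4f}s")
print("Devices:", tf.config.list_physical_devices())
```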

@jbischof
Contributor

We might need a better way to identify the models than "bert"... how about we pass the class name, like BertClassifier, instead so we don't need to maintain a lookup table?
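
For illustration, what that would look like (class and preset names are just examples):

```python
import keras_nlp

# If --model carries the full class name, e.g. "BertClassifier", resolution is a
# single getattr with no lookup table or capitalization rules.
model_name = "BertClassifier"  # value of --model
classifier_cls = getattr(keras_nlp.models, model_name)
classifier = classifier_cls.from_preset("bert_small_en_uncased", num_classes=2)
```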

@mattdangerw
Member

We definitely need more benchmarking with Keras 3 on the way, but I will close this and reopen a new issue with a better description for the multi-backend world.
