
Allow setting max length #176

Merged: 12 commits into huggingface:main on Dec 7, 2022

Conversation

blakechi
Contributor

This PR resolves #172.

  • Allow users to set max_length when calling SetFitTrainer.train.
  • If max_length exceeds the maximum number of tokens the model body can handle, it is capped at that maximum (a rough sketch follows this list).
  • Add relevant tests.
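
For illustration, a minimal sketch of the capping behaviour described above (the helper name and signature are hypothetical, not the PR's actual code):

from typing import Optional
import warnings


def resolve_max_length(max_length: Optional[int], model_max_length: int) -> int:
    # Hypothetical helper: fall back to the model body's limit when no value
    # is given, and cap values that exceed it, emitting a warning.
    if max_length is None:
        return model_max_length
    if max_length > model_max_length:
        warnings.warn(
            f"max_length={max_length} exceeds the model body's limit of "
            f"{model_max_length} tokens; using {model_max_length} instead."
        )
        return model_max_length
    return max_length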

A usage snippet:

from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer


# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 8 examples per class
num_classes = 2
train_dataset = dataset["train"].shuffle(seed=42).select(range(8 * num_classes))
eval_dataset = dataset["validation"]

# Load a SetFit model from Hub
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    use_differentiable_head=True,
    head_params={"out_features": num_classes},
)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20,
    num_epochs=1,
    column_mapping={"sentence": "text", "label": "label"}
)

# Train the differentiable head only
trainer.unfreeze(keep_body_frozen=True)

# Train with custom `max_length`
trainer.train(
    num_epochs=1,
    batch_size=16,
    body_learning_rate=1e-5,
    learning_rate=1e-2,
    l2_weight=0.0,
    max_length=64,  # set your preferred length here
)
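
As an aside, the limit that an oversized max_length falls back to can be inspected on the model body (assuming SetFitModel exposes its sentence-transformers body as model.model_body, which carries max_seq_length):

# Hedged aside: the model body's maximum sequence length, i.e. the value an
# oversized `max_length` is capped to (attribute path assumed).
print(model.model_body.max_seq_length)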

Member

@lewtun lewtun left a comment

Thanks for this nice quality-of-life improvement, @blakechi 🔥!

The PR is looking great, and I've left a few small comments that would be nice to address before we merge.

Resolved review comments (outdated) on src/setfit/modeling.py and src/setfit/trainer.py.
column_mapping={"text_new": "text", "label_new": "label"},
)
trainer.unfreeze(keep_body_frozen=True)
trainer.train(
Member

Should we think about a way to actually test that the large value has been overwritten with the model's max length?

Contributor Author

You're right, I will modify the test.

Contributor Author

@blakechi blakechi Nov 16, 2022

Just updated it. I tested that the warning is raised correctly and checked the overwritten value in test_modeling.py (roughly along the lines of the sketch below).
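
For context, a rough sketch of what such a check can look like (hypothetical test name and fixture; the actual tests added in this PR live in test_modeling.py):

import pytest


def test_max_length_is_capped_to_model_limit(trainer):
    # Hedged sketch: `trainer` is assumed to be a SetFitTrainer fixture with a
    # differentiable head. An oversized max_length should trigger a warning
    # and be replaced by the model body's maximum sequence length.
    trainer.unfreeze(keep_body_frozen=True)
    with pytest.warns(UserWarning):
        trainer.train(num_epochs=1, batch_size=2, max_length=4096)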

column_mapping={"text_new": "text", "label_new": "label"},
)
trainer.unfreeze(keep_body_frozen=True)
trainer.train(
Member

Same comment here - should we be testing the behaviour explicitly?

Contributor Author

Agree, will push an update later.

Contributor Author

Just updated it!

Member

@lewtun lewtun left a comment

Sorry for the delay and thanks for iterating @blakechi 🤘

I've left one nit and then think we can merge!

Resolved review comment (outdated) on src/setfit/modeling.py.
l2_weight=0.0,
max_length=max_length,
)
self.assertEqual(
Member

nice!

@blakechi
Contributor Author

> Sorry for the delay and thanks for iterating @blakechi 🤘
>
> I've left one nit and then think we can merge!

No worries at all! Thanks for the review 🙏🏻

@@ -279,6 +280,10 @@ def train(
If ignore, will be the same as `learning_rate`.
l2_weight (float, *optional*):
Temporary change the weight of L2 regularization for SetFitModel's differentiable head in logistic regression.
max_length (int, *optional*, defaults to `None`):
The maximum number of tokens for one data sample. Currently only for training the differentiable head.
If`None`, will use the maximum number of tokens the model body can accept.
Contributor

@PhilipMay PhilipMay Nov 30, 2022

After "If" there is a space missing.

            If`None`, will use the maximum number of tokens the model body can accept.

Contributor Author

Thanks for finding that, @PhilipMay!

Member

@lewtun lewtun left a comment

Thanks for iterating - looks great!

@lewtun lewtun merged commit 0f828e4 into huggingface:main Dec 7, 2022