
Trainer - deprecate tokenizer for processing_class #32385

Merged

Conversation

@amyeroberts (Collaborator) commented Aug 1, 2024

For reviewers: most files touched here are just updates to the documentation. Unfortunately, a lot of the diff is just stripping of unnecessary whitespace (which my editor does automatically).

The most important changes are in:

  • src/transformers/trainer.py
  • src/transformers/trainer_callback.py
  • src/transformers/trainer_seq2seq.py

What does this PR do?

At the moment, if we wish for a processing class, e.g. an image processor, to be saved alongside the model (e.g. when pushing to the hub), we have to do the following:

trainer = Trainer(
    model, 
    args, 
    tokenizer=image_processor, 
    ...
)

This causes a lot of confusion for users (an image processor is not a tokenizer).

Previous efforts were made to add individual classes to the trainer such that one could do: trainer = Trainer(model, args, image_processor=image_processor, ...). However, this has some drawbacks:

  • It creates ambiguity when a model has a processor class: should we be able to do Trainer(model, args, processor=processor) or should it be Trainer(model, args, tokenizer=tokenizer, image_processor=image_processor)?
  • It adds new arguments to handle for every processing class, meaning more work if future classes are added

This PR therefore just deprecates tokenizer in favour of a more general processing_class argument, which can be any class with a from_pretrained method.
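After this change, the image-processor example above would be written as follows. This is a sketch of the intended usage based on this PR's description, not code lifted from the diff; model, args and image_processor are placeholders as in the earlier snippet:

# Any processing object with a `from_pretrained` method (here an image
# processor) is passed via the new `processing_class` argument instead of
# the deprecated `tokenizer` argument.
trainer = Trainer(
    model,
    args,
    processing_class=image_processor,
)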

Relevant PRs and discussions:

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SangbumChoi (Contributor) commented Aug 2, 2024

Love this (Always confused about writing tokenizer=image_processor)

@amyeroberts amyeroberts force-pushed the allow-all-processing-classes-trainer branch from 26f0430 to 32f7cc3 on September 9, 2024 11:53
@@ -60,7 +60,9 @@
from .data.data_collator import DataCollator, DataCollatorWithPadding, default_data_collator
from .debug_utils import DebugOption, DebugUnderflowOverflow
from .feature_extraction_sequence_utils import SequenceFeatureExtractor
from .feature_extraction_utils import FeatureExtractionMixin
@amyeroberts (Collaborator, Author) commented:

This file contains the most important logic changes for review

@@ -397,12 +399,12 @@ def on_prediction_step(self, args: TrainingArguments, state: TrainerState, contr
 class CallbackHandler(TrainerCallback):
     """Internal class that just calls the list of callbacks in order."""

-    def __init__(self, callbacks, model, tokenizer, optimizer, lr_scheduler):
+    def __init__(self, callbacks, model, processing_class, optimizer, lr_scheduler):
@amyeroberts (Collaborator, Author) commented:

I just swapped the names directly here, since the docstring says this is just an internal class.
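For illustration only (placeholder values, not code from this PR): after the swap, the internal handler is constructed with a processing_class parameter where tokenizer used to be.

from transformers.trainer_callback import CallbackHandler, DefaultFlowCallback

# Placeholders only: a real Trainer wires in its own model, optimizer, etc.
handler = CallbackHandler(
    callbacks=[DefaultFlowCallback()],
    model=None,
    processing_class=None,  # this parameter was previously named `tokenizer`
    optimizer=None,
    lr_scheduler=None,
)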

@amyeroberts (Collaborator, Author) commented:

@molbap If you have capacity - do you think you could do a first review of this? Arthur and Zack are super busy, so it would be good to have someone who knows the processors well to review in their stead.

@molbap molbap self-requested a review September 17, 2024 14:20
@molbap (Contributor) left a comment:

Checked out the rework, nice - saw that now you can't pass both tokenizer and processing_class; it will rightfully raise an error, much better. LGTM!
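A rough sketch of the kind of argument handling described here, not the actual Trainer code (the helper name and messages are illustrative): passing both arguments is rejected, and the legacy tokenizer argument warns before being used in place of processing_class.

import warnings

def resolve_processing_class(tokenizer=None, processing_class=None):
    # Passing both arguments is rejected, as noted in the review above.
    if tokenizer is not None and processing_class is not None:
        raise ValueError(
            "Cannot pass both `tokenizer` and `processing_class`; use `processing_class`."
        )
    # The deprecated `tokenizer` argument still works for now, with a FutureWarning.
    if tokenizer is not None:
        warnings.warn(
            "`tokenizer` is deprecated, use `processing_class` instead.", FutureWarning
        )
        return tokenizer
    return processing_class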

@amyeroberts (Collaborator, Author) commented Sep 20, 2024

Asking for review from @SunMarc for Trainer as Zach is off. As this is a very public API, I'd like to get a few eyes on it, especially from people who know it well, to make sure it's OK!

@amyeroberts amyeroberts requested a review from SunMarc September 20, 2024 09:34
@ArthurZucker (Collaborator) left a comment:

Clean, nothing to say! Thanks for this!

@SunMarc (Member) left a comment:

Thanks for the PR @amyeroberts! LGTM!

@@ -3741,6 +3744,98 @@ def test_eval_use_gather_object(self):
         _ = trainer.evaluate()
         _ = trainer.predict(eval_dataset)

+    def test_trainer_saves_tokenizer(self):
@SunMarc (Member) commented:

Thanks for adding these nice tests!
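For readers who want a feel for what a test like the one named in the hunk above checks, here is a rough, hypothetical sketch (not the actual test added in this PR; the helper name, model and dataset are placeholders): after saving, the processing class's files should sit next to the model in the output directory.

import os
import tempfile

from transformers import AutoTokenizer, Trainer, TrainingArguments

def check_trainer_saves_processing_class(model, train_dataset):
    # Hypothetical helper illustrating the behaviour this PR tests.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    with tempfile.TemporaryDirectory() as tmp_dir:
        args = TrainingArguments(output_dir=tmp_dir, report_to="none")
        trainer = Trainer(model, args, train_dataset=train_dataset, processing_class=tokenizer)
        trainer.save_model(tmp_dir)
        # The processing class should have been saved alongside the model.
        assert os.path.isfile(os.path.join(tmp_dir, "tokenizer_config.json"))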

@amyeroberts amyeroberts force-pushed the allow-all-processing-classes-trainer branch from e3b3aa9 to 5503bce on October 2, 2024 12:18
@amyeroberts amyeroberts merged commit b7474f2 into huggingface:main Oct 2, 2024
25 checks passed
@amyeroberts amyeroberts deleted the allow-all-processing-classes-trainer branch October 2, 2024 13:08
@muellerzr muellerzr mentioned this pull request Oct 8, 2024
ChanderG added a commit to foundation-model-stack/hf-resource-scanner that referenced this pull request Nov 4, 2024
we use positional args to obtain model and optimizer; however, this
has the unneeded tokenizer argument between them

due to recent changes, the tokenizer arg is now renamed to
processing_class, see:
+ huggingface/trl#2162
+ huggingface/transformers#32385
leading to an unexpected breakdown of the scanner

the line relevant to us is here:
https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L523

since we anyway don't depend on this arg, switch out to using model
and opt from the kwargs
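A hypothetical sketch of the approach described in that commit (the class and hook names are illustrative, not the scanner's actual code): Trainer callbacks also receive model and optimizer by name in kwargs, so reading them there sidesteps the positional slot that was renamed from tokenizer to processing_class.

from transformers import TrainerCallback

class ResourceScannerCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        # Read the objects by name instead of by position, so the
        # tokenizer -> processing_class rename does not matter here.
        model = kwargs.get("model")
        optimizer = kwargs.get("optimizer")
        # ... inspect model/optimizer resource usage here ...
        return control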
@goea-shuhei commented:

_pad_tensors_to_max_len in Seq2SeqTrainer still uses tokenizer instead of processing_class

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Trainer - deprecate tokenizer for processing_class

* Extend change across Seq2Seq trainer and docs

* Add tests

* Update to FutureWarning and add deprecation version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants