Trainer - deprecate tokenizer for processing_class #32385
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Love this (Always confused about writing …)
Force-pushed 26f0430 to 32f7cc3.
@@ -60,7 +60,9 @@
 from .data.data_collator import DataCollator, DataCollatorWithPadding, default_data_collator
 from .debug_utils import DebugOption, DebugUnderflowOverflow
 from .feature_extraction_sequence_utils import SequenceFeatureExtractor
+from .feature_extraction_utils import FeatureExtractionMixin
This file contains the most important logic changes for review
@@ -397,12 +399,12 @@ def on_prediction_step(self, args: TrainingArguments, state: TrainerState, contr
 class CallbackHandler(TrainerCallback):
     """Internal class that just calls the list of callbacks in order."""

-    def __init__(self, callbacks, model, tokenizer, optimizer, lr_scheduler):
+    def __init__(self, callbacks, model, processing_class, optimizer, lr_scheduler):
I just swapped the names directly here, as the docstring says this is only an internal class.
@molbap If you have capacity - do you think you could do a first review of this? Arthur and Zack are super busy, so it would be good to have someone who knows the processors well review in their stead.
Checked out the rework, nice - saw that now you can't pass both `tokenizer` and `processing_class`; it will rightfully raise an error, much better. LGTM!
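For illustration, a minimal sketch of how such a mutual-exclusion guard might look (a hypothetical helper, not the PR's exact code; names and messages are assumptions):

```python
import warnings


def resolve_processing_class(processing_class=None, tokenizer=None):
    """Hypothetical helper mirroring the behaviour described above: `tokenizer`
    is kept only as a deprecated alias of `processing_class`, and passing both
    raises an error."""
    if tokenizer is not None and processing_class is not None:
        raise ValueError(
            "Cannot pass both `tokenizer` and `processing_class`; use `processing_class` only."
        )
    if tokenizer is not None:
        warnings.warn(
            "`tokenizer` is deprecated in favour of `processing_class` and will be "
            "removed in a future version.",
            FutureWarning,
        )
        return tokenizer
    return processing_class
```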
Asking for review from @SunMarc for Trainer as Zach is off. As this is a very public API, I'd like to get a few eyes on it, especially from people who know it well, to make sure it's OK!
Clean, nothing to say! Thanks for this!
Thanks for the PR @amyeroberts! LGTM!
@@ -3741,6 +3744,98 @@ def test_eval_use_gather_object(self):
         _ = trainer.evaluate()
         _ = trainer.predict(eval_dataset)

+    def test_trainer_saves_tokenizer(self):
Thanks for adding these nice tests!
Force-pushed e3b3aa9 to 5503bce.
We use positional args to obtain the model and optimizer; however, this places the unneeded tokenizer argument between them. Due to recent changes, the tokenizer arg is now renamed to processing_class (see huggingface/trl#2162 and huggingface/transformers#32385), leading to an unexpected breakdown of the scanner. The line relevant to us is here: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L523. Since we don't depend on this arg anyway, switch to using model and opt from the kwargs.
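A minimal sketch of the kwargs-based approach described above (a hypothetical callback; the class name and the inspection logic are placeholders):

```python
from transformers import TrainerCallback


class ScannerCallback(TrainerCallback):
    """Hypothetical callback illustrating the fix: take `model` and `optimizer`
    from kwargs rather than by position, so the rename of `tokenizer` to
    `processing_class` in the callback call cannot shift the arguments."""

    def on_step_end(self, args, state, control, **kwargs):
        model = kwargs.get("model")          # robust to argument renames/reordering
        optimizer = kwargs.get("optimizer")
        # ... inspect model / optimizer here ...
        return control
```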
`_pad_tensors_to_max_len` in `Seq2SeqTrainer` still uses `tokenizer` instead of `processing_class`
* Trainer - deprecate tokenizer for processing_class
* Extend change across Seq2Seq trainer and docs
* Add tests
* Update to FutureWarning and add deprecation version
For reviewers: most of the files touched here are just updates to the documentation. Unfortunately, a lot of the diff is just stripping of unnecessary whitespace (which my editor does automatically).
The most important changes are in:
What does this PR do?
At the moment, if we wish for a processing class, e.g. an image processor, to be saved alongside the model, e.g. when pushing to the hub, we have to do the following:
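A minimal sketch of that current pattern (hedged: the checkpoint name, output directory, and omitted datasets are placeholders, not taken from the PR):

```python
from transformers import (
    AutoImageProcessor,
    AutoModelForImageClassification,
    Trainer,
    TrainingArguments,
)

image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

# To have the image processor saved (and pushed) alongside the model, it must be
# passed through the `tokenizer` argument, even though it is not a tokenizer.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vit-finetuned"),
    tokenizer=image_processor,
    # train_dataset=..., eval_dataset=...  (omitted here)
)
```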
This causes a lot of confusion for users (an image processor is not a tokenizer).
Previous efforts were made to add individual classes to the trainer s.t. one could do `trainer = Trainer(model, args, image_processor=image_processor, ...)`. However, this has some drawbacks: should the call be `Trainer(model, args, processor=processor)`, or should it be `Trainer(model, args, tokenizer=tokenizer, image_processor=image_processor)`?
This PR therefore just deprecates `tokenizer` in favour of a more general `processing_class` argument, which can be any class with a `from_pretrained` method.
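Continuing the sketch above, the call under this PR becomes (again an illustration rather than the PR's exact example):

```python
from transformers import Trainer, TrainingArguments

# `model` and `image_processor` as in the earlier sketch; only the argument name
# changes. `tokenizer=image_processor` keeps working for now but emits a
# FutureWarning pointing to `processing_class`.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="vit-finetuned"),
    processing_class=image_processor,
)
```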
Relevant PRs and discussions:
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.