
Trainer - deprecate tokenizer for processing_class #32385

Merged

Conversation

@amyeroberts (Collaborator) commented Aug 1, 2024

For reviewers: most files touched here are just updates to the documentation. Unfortunately, a lot of the diff is just stripping of unnecessary whitespace (which my editor does automatically).

The most important changes are in:

  • src/transformers/trainer.py
  • src/transformers/trainer_callback.py
  • src/transformers/trainer_seq2seq.py

What does this PR do?

At the moment, if we wish for a processing class, e.g. an image processor, to be saved alongside the model (e.g. when pushing to the hub), we have to do the following:

trainer = Trainer(
    model, 
    args, 
    tokenizer=image_processor, 
    ...
)

This causes a lot of confusion for users (an image processor is not a tokenizer).

Previous efforts were made to add individual classes to the trainer such that one could do: trainer = Trainer(model, args, image_processor=image_processor, ...). However, this has some drawbacks:

  • It creates ambiguity when a model has a processor class: should we be able to do Trainer(model, args, processor=processor) or should it be Trainer(model, args, tokenizer=tokenizer, image_processor=image_processor)?
  • It adds new arguments to handle for every processing class, meaning more work if future classes are added

This PR therefore just deprecates tokenizer in favour of a more general processing_class argument, which can be any class with a from_pretrained method.
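After this change, the image-processor example above would be written as follows. This is a sketch of the intended usage based on this PR's description, not code lifted from the diff; model, args and image_processor are placeholders as in the earlier snippet:

# Any processing object with a `from_pretrained` method (here an image
# processor) is passed via the new `processing_class` argument instead of
# the deprecated `tokenizer` argument.
trainer = Trainer(
    model,
    args,
    processing_class=image_processor,
)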

Relevant PRs and discussions:

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SangbumChoi (Contributor) commented Aug 2, 2024

Love this (Always confused about writing tokenizer=image_processor)

@amyeroberts amyeroberts force-pushed the allow-all-processing-classes-trainer branch from 26f0430 to 32f7cc3 on September 9, 2024 11:53
@@ -60,7 +60,9 @@
from .data.data_collator import DataCollator, DataCollatorWithPadding, default_data_collator
from .debug_utils import DebugOption, DebugUnderflowOverflow
from .feature_extraction_sequence_utils import SequenceFeatureExtractor
from .feature_extraction_utils import FeatureExtractionMixin
@amyeroberts (Collaborator, Author) commented:

This file contains the most important logic changes for review

@@ -397,12 +399,12 @@ def on_prediction_step(self, args: TrainingArguments, state: TrainerState, contr
 class CallbackHandler(TrainerCallback):
     """Internal class that just calls the list of callbacks in order."""

-    def __init__(self, callbacks, model, tokenizer, optimizer, lr_scheduler):
+    def __init__(self, callbacks, model, processing_class, optimizer, lr_scheduler):
@amyeroberts (Collaborator, Author) commented:

I just swapped the names directly here, since the docstring says this is just an internal class.
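For illustration only (placeholder values, not code from this PR): after the swap, the internal handler is constructed with a processing_class parameter where tokenizer used to be.

from transformers.trainer_callback import CallbackHandler, DefaultFlowCallback

# Placeholders only: a real Trainer wires in its own model, optimizer, etc.
handler = CallbackHandler(
    callbacks=[DefaultFlowCallback()],
    model=None,
    processing_class=None,  # this parameter was previously named `tokenizer`
    optimizer=None,
    lr_scheduler=None,
)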

@amyeroberts (Collaborator, Author) commented:

@molbap If you have capacity - do you think you could do a first review of this? Arthur and Zack are super busy, so it would be good to have someone who knows the processors well to review in their stead.

@molbap molbap self-requested a review September 17, 2024 14:20
@molbap (Contributor) left a comment:

Checked out the rework, nice - saw that now you can't pass both tokenizer and processing_class; it will rightfully raise an error, much better. LGTM!
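A rough sketch of the kind of argument handling described here, not the actual Trainer code (the helper name and messages are illustrative): passing both arguments is rejected, and the legacy tokenizer argument warns before being used in place of processing_class.

import warnings

def resolve_processing_class(tokenizer=None, processing_class=None):
    # Passing both arguments is rejected, as noted in the review above.
    if tokenizer is not None and processing_class is not None:
        raise ValueError(
            "Cannot pass both `tokenizer` and `processing_class`; use `processing_class`."
        )
    # The deprecated `tokenizer` argument still works for now, with a FutureWarning.
    if tokenizer is not None:
        warnings.warn(
            "`tokenizer` is deprecated, use `processing_class` instead.", FutureWarning
        )
        return tokenizer
    return processing_class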

@amyeroberts (Collaborator, Author) commented Sep 20, 2024

Asking for review from @SunMarc for Trainer as Zach is off. As this is a very public API, I'd like to get a few eyes on it, especially from people who know it well, to make sure it's OK!

@amyeroberts amyeroberts requested a review from SunMarc September 20, 2024 09:34
@ArthurZucker (Collaborator) left a comment:

Clean, nothing to say! Thanks for this!

@SunMarc (Member) left a comment:

Thanks for the PR @amyeroberts! LGTM!

@@ -3741,6 +3744,98 @@ def test_eval_use_gather_object(self):
         _ = trainer.evaluate()
         _ = trainer.predict(eval_dataset)

+    def test_trainer_saves_tokenizer(self):
@SunMarc (Member) commented:

Thanks for adding these nice tests!
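For readers who want a feel for what a test like the one named in the hunk above checks, here is a rough, hypothetical sketch (not the actual test added in this PR; the helper name, model and dataset are placeholders): after saving, the processing class's files should sit next to the model in the output directory.

import os
import tempfile

from transformers import AutoTokenizer, Trainer, TrainingArguments

def check_trainer_saves_processing_class(model, train_dataset):
    # Hypothetical helper illustrating the behaviour this PR tests.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    with tempfile.TemporaryDirectory() as tmp_dir:
        args = TrainingArguments(output_dir=tmp_dir, report_to="none")
        trainer = Trainer(model, args, train_dataset=train_dataset, processing_class=tokenizer)
        trainer.save_model(tmp_dir)
        # The processing class should have been saved alongside the model.
        assert os.path.isfile(os.path.join(tmp_dir, "tokenizer_config.json"))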

@amyeroberts amyeroberts force-pushed the allow-all-processing-classes-trainer branch from e3b3aa9 to 5503bce on October 2, 2024 12:18
@amyeroberts amyeroberts merged commit b7474f2 into huggingface:main Oct 2, 2024
25 checks passed
@amyeroberts amyeroberts deleted the allow-all-processing-classes-trainer branch October 2, 2024 13:08
@muellerzr muellerzr mentioned this pull request Oct 8, 2024
ChanderG added a commit to foundation-model-stack/hf-resource-scanner that referenced this pull request Nov 4, 2024
we use positional args to obtain model and optimizer; however, this
has the unneeded tokenizer argument between them

due to recent changes, the tokenizer arg is now renamed to
processing_class, see:
+ huggingface/trl#2162
+ huggingface/transformers#32385
leading to an unexpected breakdown of the scanner

the line relevant to us is here:
https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L523

since we anyway don't depend on this arg, switch out to using model
and opt from the kwargs
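A hypothetical sketch of the approach described in that commit (the class and hook names are illustrative, not the scanner's actual code): Trainer callbacks also receive model and optimizer by name in kwargs, so reading them there sidesteps the positional slot that was renamed from tokenizer to processing_class.

from transformers import TrainerCallback

class ResourceScannerCallback(TrainerCallback):
    def on_step_end(self, args, state, control, **kwargs):
        # Read the objects by name instead of by position, so the
        # tokenizer -> processing_class rename does not matter here.
        model = kwargs.get("model")
        optimizer = kwargs.get("optimizer")
        # ... inspect model/optimizer resource usage here ...
        return control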
@goea-shuhei commented:

_pad_tensors_to_max_len in Seq2SeqTrainer still uses tokenizer instead of processing_class

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Trainer - deprecate tokenizer for processing_class

* Extend change across Seq2Seq trainer and docs

* Add tests

* Update to FutureWarning and add deprecation version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants