[Don't merge] New design proposition for MAPPINGS in "auto" files #9305
This PR would solve issue #9250, but it should not be used as a solution. Rather, it should just show how the current design of all the `OrderedDict`s called `MAPPINGS_...` is suboptimal: it's impossible to add two values if both values have the same key. We need to be able to add a tokenizer class to `AutoTokenizers` even if the tokenizer does not have its own unique configuration class. We had a similar problem for Rag, since there are `RagForSequenceGeneration` and `RagForTokenGeneration`, which both should be in the same mapping.
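To make the key collision concrete, here is a small self-contained illustration (the class names are made-up stand-ins, not the actual library classes):

```python
from collections import OrderedDict

# Made-up stand-ins for one config class and two tokenizer classes that share it.
class SomeConfig: ...
class SomeTokenizer: ...
class SomeOtherTokenizer: ...  # reuses SomeConfig, has no config class of its own

# With the current design the config class is the dict key, so two tokenizers that
# share a config class cannot coexist: the second entry silently overwrites the first.
TOKENIZER_MAPPING = OrderedDict(
    [
        (SomeConfig, SomeTokenizer),
        (SomeConfig, SomeOtherTokenizer),
    ]
)

print(TOKENIZER_MAPPING[SomeConfig])  # -> SomeOtherTokenizer; SomeTokenizer is lost
```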
IMO, the only way to prevent "key" conflicts 100% is to use "multi-key" to "value" mappings, as follows:

- Tokenizer: (`PretrainedConfig` (the corresponding config class we're using now), `str` (the tokenizer class as a string, sometimes saved under `config.tokenizer_class`)) -> TokenizerClass
- Model: (`PretrainedConfig` (the corresponding config class we're using now), `str` (the model type as a string, sometimes saved under `config.model_type`)) -> ModelClass

Some other "less" important shortcomings of this design:

- To check via `isinstance` whether a config class is in an `OrderedDict`, we need to be very careful about the position of the key in the ordered dict, and we even wrote a test for this (see the sketch after this list): transformers/tests/test_modeling_auto.py, line 221 in 21fc676
- transformers/src/transformers/models/auto/tokenization_auto.py, line 249 in 21fc676
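To illustrate the first shortcoming, here is a simplified, self-contained version of the `isinstance`-based lookup the auto classes perform (the config/model names are made up for the example):

```python
from collections import OrderedDict

# Made-up configs where one subclasses the other, as happens for a few models in the library.
class ParentConfig: ...
class ChildConfig(ParentConfig): ...

class ParentModel: ...
class ChildModel: ...

# If ParentConfig is listed first, the isinstance check below matches it for ChildConfig too.
MODEL_MAPPING = OrderedDict([(ParentConfig, ParentModel), (ChildConfig, ChildModel)])

def model_class_for(config):
    # Simplified sketch of the lookup loop: the first key that matches via isinstance wins,
    # so the order of the entries in the OrderedDict decides the result.
    for config_class, model_class in MODEL_MAPPING.items():
        if isinstance(config, config_class):
            return model_class
    raise ValueError(f"Unrecognized configuration class {config.__class__}")

print(model_class_for(ChildConfig()))  # -> ParentModel, not ChildModel
```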
=> I would propose that we change all the `MAPPING_FOR_...` `OrderedDict`s into classes, making sure that 100% backward compatibility is kept (except that a `MAPPING_FOR_...` is then not an `OrderedDict` anymore, but a class). We could implement a `__getitem__` that takes inputs of different types (a config class for backward compatibility, but maybe also a `str` corresponding to the `"tokenizer_class"` or `"model_type"`). In general, this would give us more flexibility and prevent errors such as the one linked to this PR.

A possible design could look like this:
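(What follows is only a rough sketch to illustrate the idea; the class name `AutoMapping` and its internals are made up here and not meant as the final API.)

```python
from collections import OrderedDict

class AutoMapping:
    """Drop-in replacement for a MAPPING_FOR_... OrderedDict that also accepts string keys."""

    def __init__(self, config_to_class, str_to_class=None):
        # config_to_class: the existing {config class -> model/tokenizer class} mapping
        self._config_mapping = OrderedDict(config_to_class)
        # str_to_class: optional {str -> class} entries, e.g. keyed by config.tokenizer_class
        # or config.model_type, so classes without a unique config class can still be registered
        self._str_mapping = dict(str_to_class or {})

    def __getitem__(self, key):
        # Backward-compatible path: lookup by config class, exactly like the old OrderedDict
        if isinstance(key, type) and key in self._config_mapping:
            return self._config_mapping[key]
        # New path: lookup by string, e.g. config.tokenizer_class or config.model_type
        if isinstance(key, str) and key in self._str_mapping:
            return self._str_mapping[key]
        raise KeyError(key)

    def __contains__(self, key):
        if isinstance(key, type):
            return key in self._config_mapping
        return key in self._str_mapping

    # Keep the OrderedDict-style iteration over config classes for backward compatibility.
    def __iter__(self):
        return iter(self._config_mapping)

    def keys(self):
        return self._config_mapping.keys()

    def values(self):
        return self._config_mapping.values()

    def items(self):
        return self._config_mapping.items()
```

Usage would then stay unchanged for config-class keys (e.g. `MAPPING[BertConfig]`), while also allowing something like `MAPPING[config.tokenizer_class]` or `MAPPING[config.model_type]` for classes that don't have a unique config class.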
Keen to hear your thoughts on this @LysandreJik, @sgugger, @julien-c before opening a PR.