
[Don't merge] New design proposition for MAPPINGS in "auto" files #9305

Closed

Conversation

patrickvonplaten
Contributor

@patrickvonplaten patrickvonplaten commented Dec 25, 2020

This PR would solve issue #9250, but it should not be merged as the actual solution.

The PR is rather meant to show that the current design of the OrderedDicts called MAPPINGS_... is suboptimal. It is impossible to add two values if both values share the same key. We need to be able to add a tokenizer class to AutoTokenizer even if the tokenizer does not have its own unique configuration class. We had a similar problem for Rag, since both RagForSequenceGeneration and RagForTokenGeneration should be in the same mapping. IMO, the only way to rule out "key" conflicts 100% is to use "multi-key" to "value" mappings as follows:

Tokenizer:
(PretrainedConfig (the corresponding config class we use now), str (the tokenizer class as a string, sometimes saved under config.tokenizer_class)) -> TokenizerClass

Model:
(PretrainedConfig (the corresponding config class we use now), str (the model type as a string, sometimes saved under config.model_type)) -> ModelClass
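
To make the key conflict concrete, here is a minimal, self-contained sketch (using stand-in classes rather than the real ones) of how a config-keyed OrderedDict silently drops one of two entries that share the same config:

from collections import OrderedDict

class RagConfig: ...                    # stand-in for the real config class
class RagForSequenceGeneration: ...     # stand-ins for the two model classes
class RagForTokenGeneration: ...

MODEL_MAPPING = OrderedDict([
    (RagConfig, RagForSequenceGeneration),
    (RagConfig, RagForTokenGeneration),  # overwrites the previous entry
])

assert len(MODEL_MAPPING) == 1
assert MODEL_MAPPING[RagConfig] is RagForTokenGeneration  # the first class is lost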

Some other "less" important shortcomings of this design:

  • Because we often check with isinstance whether a config class is in an OrderedDict, we have to be very careful about the position of each key in the ordered dict and even wrote a test for this: def test_parents_and_children_in_mappings(self). This added complexity for such a simple feature is quite unnecessary IMO (see the sketch after this list).
  • Functions like def tokenizer_class_from_name(class_name: str) are more of a hack than a permanent solution IMO.
  • We currently don't document those mappings. I guess we could, but it's just a mapping.
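
To illustrate the ordering issue, here is a minimal sketch (with hypothetical ParentConfig / ChildConfig classes, not the real ones) of how an isinstance-based lookup silently returns the wrong entry when a parent config class is listed before its subclass:

class ParentConfig: ...
class ChildConfig(ParentConfig): ...

# parent listed first, so the child entry below is unreachable via isinstance
MAPPING = [(ParentConfig, "ParentModel"), (ChildConfig, "ChildModel")]

def lookup(config):
    for config_class, model in MAPPING:
        if isinstance(config, config_class):
            return model

assert lookup(ChildConfig()) == "ParentModel"  # "ChildModel" is never returned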

=> I would propose that we change every "MAPPING_FOR_..." OrderedDict into a MAPPING_FOR_... class, making sure that 100% backward compatibility is kept (except that it would no longer be an OrderedDict, but a class).

We can implement a __getitem__ that could take inputs of different types (a config for backward compatibility, but maybe also a str corresponding to the tokenizer_class or model_type). In general, this would give us more flexibility and prevent errors such as the one linked to this PR.

A possible design could look like this:

from collections import OrderedDict
from typing import Any, List, Tuple, Type, Union


class MappingGenerator:

    def __init__(self, keys_to_values: List[Tuple[Type[PretrainedConfig], str, Any]]):
        # (config class, class name) -> class; key conflicts are impossible here
        self.tuple_to_class = OrderedDict(((config, name), cls) for config, name, cls in keys_to_values)
        all_configs = [config for config, _, _ in keys_to_values]
        # config classes that map to more than one class and are therefore ambiguous on their own
        self.duplicated_configs = set(config for config in all_configs if all_configs.count(config) > 1)
        self.config_to_class = OrderedDict((config, cls) for config, _, cls in keys_to_values)
        # class name (str) -> class; key conflicts are not possible here either
        self.str_to_class = OrderedDict((name, cls) for _, name, cls in keys_to_values)

    def __getitem__(self, key: Union[PretrainedConfig, str, Tuple[PretrainedConfig, str]]):
        if isinstance(key, str):
            return self.str_to_class[key]
        elif isinstance(key, PretrainedConfig):
            if type(key) in self.duplicated_configs:
                raise KeyError(f"{type(key).__name__} maps to several classes; query with (config, class name) instead.")
            return self.config_to_class[type(key)]
        elif isinstance(key, tuple):
            config, name = key
            return self.tuple_to_class[(type(config), name)]
        raise KeyError(key)

TOKENIZER_MAPPING = MappingGenerator([
    (BertConfig, "BertTokenizer", BertTokenizer),
    (GPT2Config, "GPT2Tokenizer", GPT2Tokenizer),
    ...,
])
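
A hypothetical lookup with the sketch above could then accept any of the three key types (assuming the BertConfig / BertTokenizer entry registered above):

tokenizer_cls = TOKENIZER_MAPPING["BertTokenizer"]                   # by tokenizer class name
tokenizer_cls = TOKENIZER_MAPPING[BertConfig()]                      # by config, backward compatible
tokenizer_cls = TOKENIZER_MAPPING[(BertConfig(), "BertTokenizer")]   # by (config, class name) pair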

Keen to hear your thoughts on this @LysandreJik, @sgugger, @julien-c before opening a PR.

@patrickvonplaten patrickvonplaten marked this pull request as draft December 25, 2020 10:12
@patrickvonplaten patrickvonplaten changed the title [Don't merge] [Don't merge] New design proposition for MAPPINGS in "auto" files Dec 25, 2020
@patrickvonplaten patrickvonplaten assigned sgugger and unassigned sgugger Dec 25, 2020
@LysandreJik
Member

You're right that the current design is sub-optimal, especially for the tokenizers since we have introduced tokenizers decoupled from models.

  • Adopting this approach would imply modifying most configuration files on the Hub, given that you use the approach

    (config.class, config.tokenizer_class) -> ...

    and most model configurations have no tokenizer class defined.

  • The isinstance should be replaced by type IMO, which would remove the need for such a test.

Overall I'm definitely not against refactoring this part to ensure better compatibility, but let's try to find a way of making sure we don't have to update thousands of configurations on the Hub. Maybe adding a tokenizer_class = XXXTokenizer field in the configurations would prevent this.

@patrickvonplaten
Contributor Author

patrickvonplaten commented Jan 4, 2021

> You're right that the current design is sub-optimal, especially for the tokenizers since we have introduced tokenizers decoupled from models.
>
>   • Adopting this approach would imply modifying most configuration files on the Hub, given that you use the approach
>
>     (config.class, config.tokenizer_class) -> ...
>
>     and most model configurations have no tokenizer class defined.
>
>   • The isinstance should be replaced by type IMO, which would remove the need for such a test.
>
> Overall I'm definitely not against refactoring this part to ensure better compatibility, but let's try to find a way of making sure we don't have to update thousands of configurations on the Hub. Maybe adding a tokenizer_class = XXXTokenizer field in the configurations would prevent this.

Sorry, I think my explanation wasn't very clear above - I have modified the description. I didn't mean to force configs to have a tokenizer_class attribute. The idea was just that TOKENIZER_MAPPING should expose a function that allows one to get the correct tokenizer not only by the config, but also by the tokenizer_class as a string. So we could replace the current TOKENIZER_MAPPING with a class like the one (now) shown above, and that class can then be used in whatever way is best by AutoTokenizer. For example, AutoTokenizer's from_pretrained(...) method could call the TOKENIZER_MAPPING class above as follows:

if hasattr(config, "tokenizer_class"):
    tokenizer = TOKENIZER_MAPPING[(config, config.tokenizer_class)]
else:
    tokenizer = TOKENIZER_MAPPING[config]

@sgugger
Collaborator

sgugger commented Jan 4, 2021

I'd need to see a PoC to be sure, but this looks like an interesting idea to me. There are certainly big limitations in the way those AUTO variables are currently structured.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
