
[Don't merge] New design proposition for MAPPINGS in "auto" files #9305

Closed

Conversation

patrickvonplaten
Contributor

@patrickvonplaten patrickvonplaten commented Dec 25, 2020

This PR would solve issue #9250, but it should not be merged as the actual solution.

The PR is rather meant to show that the current design of the OrderedDicts called MAPPINGS_... is suboptimal. It is impossible to add two values if both values share the same key. We need to be able to add a tokenizer class to AutoTokenizer even if the tokenizer does not have its own unique configuration class. We had a similar problem for Rag, since both RagForSequenceGeneration and RagForTokenGeneration should be in the same mapping. IMO, the only way to rule out "key" conflicts 100% is to use "multi-key" to "value" mappings as follows:

Tokenizer:
(PretrainedConfig (the corresponding config class we use now), str (the tokenizer class as a string, sometimes saved under config.tokenizer_class)) -> TokenizerClass

Model:
(PretrainedConfig (the corresponding config class we use now), str (the model type as a string, sometimes saved under config.model_type)) -> ModelClass
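
To make the key conflict concrete, here is a minimal, self-contained sketch (using stand-in classes rather than the real ones) of how a config-keyed OrderedDict silently drops one of two entries that share the same config:

from collections import OrderedDict

class RagConfig: ...                    # stand-in for the real config class
class RagForSequenceGeneration: ...     # stand-ins for the two model classes
class RagForTokenGeneration: ...

MODEL_MAPPING = OrderedDict([
    (RagConfig, RagForSequenceGeneration),
    (RagConfig, RagForTokenGeneration),  # overwrites the previous entry
])

assert len(MODEL_MAPPING) == 1
assert MODEL_MAPPING[RagConfig] is RagForTokenGeneration  # the first class is lost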

Some other "less" important shortcomings of this design:

  • Because we often check with isinstance whether a config class is in an OrderedDict, we have to be very careful about the position of each key in the ordered dict and even wrote a test for this: def test_parents_and_children_in_mappings(self). This added complexity for such a simple feature is quite unnecessary IMO (see the sketch after this list).
  • Functions like def tokenizer_class_from_name(class_name: str) are more of a hack than a permanent solution IMO.
  • We currently don't document those mappings. I guess we could, but it's just a mapping.
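
To illustrate the ordering issue, here is a minimal sketch (with hypothetical ParentConfig / ChildConfig classes, not the real ones) of how an isinstance-based lookup silently returns the wrong entry when a parent config class is listed before its subclass:

class ParentConfig: ...
class ChildConfig(ParentConfig): ...

# parent listed first, so the child entry below is unreachable via isinstance
MAPPING = [(ParentConfig, "ParentModel"), (ChildConfig, "ChildModel")]

def lookup(config):
    for config_class, model in MAPPING:
        if isinstance(config, config_class):
            return model

assert lookup(ChildConfig()) == "ParentModel"  # "ChildModel" is never returned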

=> I would propose that we change every "MAPPING_FOR_..." OrderedDict into a MAPPING_FOR_... class, making sure that 100% backward compatibility is kept (except that it would no longer be an OrderedDict, but a class).

We can implement a __getitem__ that could take inputs of different types (a config for backward compatibility, but maybe also a str corresponding to the tokenizer_class or model_type). In general, this would give us more flexibility and prevent errors such as the one linked to this PR.

A possible design could look like this:

from collections import OrderedDict
from typing import Any, List, Tuple, Type, Union


class MappingGenerator:

    def __init__(self, keys_to_values: List[Tuple[Type[PretrainedConfig], str, Any]]):
        # (config class, class name) -> class; key conflicts are impossible here
        self.tuple_to_class = OrderedDict(((config, name), cls) for config, name, cls in keys_to_values)
        all_configs = [config for config, _, _ in keys_to_values]
        # config classes that map to more than one class and are therefore ambiguous on their own
        self.duplicated_configs = set(config for config in all_configs if all_configs.count(config) > 1)
        self.config_to_class = OrderedDict((config, cls) for config, _, cls in keys_to_values)
        # class name (str) -> class; key conflicts are not possible here either
        self.str_to_class = OrderedDict((name, cls) for _, name, cls in keys_to_values)

    def __getitem__(self, key: Union[PretrainedConfig, str, Tuple[PretrainedConfig, str]]):
        if isinstance(key, str):
            return self.str_to_class[key]
        elif isinstance(key, PretrainedConfig):
            if type(key) in self.duplicated_configs:
                raise KeyError(f"{type(key).__name__} maps to several classes; query with (config, class name) instead.")
            return self.config_to_class[type(key)]
        elif isinstance(key, tuple):
            config, name = key
            return self.tuple_to_class[(type(config), name)]
        raise KeyError(key)

TOKENIZER_MAPPING = MappingGenerator([
    (BertConfig, "BertTokenizer", BertTokenizer),
    (GPT2Config, "GPT2Tokenizer", GPT2Tokenizer),
    ...,
])
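
A hypothetical lookup with the sketch above could then accept any of the three key types (assuming the BertConfig / BertTokenizer entry registered above):

tokenizer_cls = TOKENIZER_MAPPING["BertTokenizer"]                   # by tokenizer class name
tokenizer_cls = TOKENIZER_MAPPING[BertConfig()]                      # by config, backward compatible
tokenizer_cls = TOKENIZER_MAPPING[(BertConfig(), "BertTokenizer")]   # by (config, class name) pair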

Keen to hear your thoughts on this @LysandreJik, @sgugger, @julien-c before opening a PR.

@patrickvonplaten patrickvonplaten marked this pull request as draft December 25, 2020 10:12
@patrickvonplaten patrickvonplaten changed the title [Don't merge] [Don't merge] New design proposition for MAPPINGS in "auto" files Dec 25, 2020
@patrickvonplaten patrickvonplaten assigned sgugger and unassigned sgugger Dec 25, 2020
@LysandreJik
Member

You're right that the current design is sub-optimal, especially for the tokenizers since we have introduced tokenizers decoupled from models.

  • Adopting this approach would imply modifying most configuration files on the Hub, given that you use the approach

    (config.class, config.tokenizer_class) -> ...

    and most model configurations have no tokenizer class defined.

  • The isinstance should be replaced by type IMO, which would remove the need for such a test.

Overall I'm definitely not against refactoring this part to ensure better compatibility, but let's try to find a way of making sure we don't have to update thousands of configurations on the Hub. Maybe adding a tokenizer_class = XXXTokenizer field in the configurations would prevent this.

@patrickvonplaten
Contributor Author

patrickvonplaten commented Jan 4, 2021

> You're right that the current design is sub-optimal, especially for the tokenizers since we have introduced tokenizers decoupled from models.
>
>   • Adopting this approach would imply modifying most configuration files on the Hub, given that you use the approach
>
>     (config.class, config.tokenizer_class) -> ...
>
>     and most model configurations have no tokenizer class defined.
>
>   • The isinstance should be replaced by type IMO, which would remove the need for such a test.
>
> Overall I'm definitely not against refactoring this part to ensure better compatibility, but let's try to find a way of making sure we don't have to update thousands of configurations on the Hub. Maybe adding a tokenizer_class = XXXTokenizer field in the configurations would prevent this.

Sorry, I think my explanation wasn't very clear above - I have modified the description. I didn't mean to force configs to have a tokenizer_class attribute. The idea was just that TOKENIZER_MAPPING should expose a function that allows one to get the correct tokenizer not only by the config, but also by the tokenizer_class as a string. So we could replace the current TOKENIZER_MAPPING with a class like the one (now) shown above, and that class can then be used in whatever way is best by AutoTokenizer. For example, AutoTokenizer's from_pretrained(...) method could call the TOKENIZER_MAPPING class above as follows:

if hasattr(config, "tokenizer_class"):
    tokenizer = TOKENIZER_MAPPING[(config, config.tokenizer_class)]
else:
    tokenizer = TOKENIZER_MAPPING[config]

@sgugger
Collaborator

sgugger commented Jan 4, 2021

I'd need to see a PoC to be sure, but this looks like an interesting idea to me. There are certainly big limitations in the way those AUTO variables are currently structured.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
